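Bidirectional Category
A property that determines how a character behaves in bidirectional text (strong LTR, strong RTL, weak, or neutral). The Unicode Bidirectional Algorithm uses it to determine display order.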
What Is the Bidirectional Category?
The Bidi Class (formally Bidi_Class, also called bidirectional category) is a Unicode property that controls how characters are positioned in a line of mixed left-to-right (LTR) and right-to-left (RTL) text. It is the primary input to the Unicode Bidirectional Algorithm (UBA, described in Unicode Standard Annex #9), which determines the visual display order of characters in a paragraph.
Every code point is assigned one of 23 Bidi Class values. The algorithm uses these values, including those carried by the explicit embedding, override, and isolate control characters, to resolve the correct rendering order for Arabic mixed with English, Hebrew mixed with numbers, or any other bidirectional combination.
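The count of 23 values can be checked against whatever Unicode database ships with the local Python build. The snippet below is a minimal sketch, not a conformance test: it scans every code point and collects the distinct codes that `unicodedata.bidirectional` reports. Python returns an empty string for code points it has no data for, so those are discarded, and the variable name `classes` is only illustrative.

```python
import sys
import unicodedata

# Collect every distinct Bidi_Class code known to the local Unicode database.
# unicodedata.bidirectional() returns "" when no value is defined for a
# code point, so the empty string is dropped from the result.
classes = {
    unicodedata.bidirectional(chr(cp))
    for cp in range(sys.maxunicode + 1)
}
classes.discard("")

# Report which Unicode version the data comes from, then the codes found.
print("Unicode data version:", unicodedata.unidata_version)
print(len(classes), sorted(classes))
```

The isolate classes (LRI, RLI, FSI, PDI) only appear when the database is Unicode 6.3 or later, so an older runtime would report fewer values.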
The Major Bidi Class Values
| Code | Name | Typical Characters |
|---|---|---|
| L | Left-to-Right | Latin, Greek, and Cyrillic letters, CJK ideographs |
| R | Right-to-Left | Hebrew letters |
| AL | Arabic Letter | Arabic and Thaana letters |
| EN | European Number | 0–9 |
| AN | Arabic Number | Arabic-Indic digits ٠–٩ |
| ES | European Separator | + − |
| ET | European Terminator | $ % ° |
| CS | Common Separator | , . : / |
| ON | Other Neutral | most punctuation |
| BN | Boundary Neutral | Format chars, ZWJ |
| NSM | Non-Spacing Mark | combining marks (inherit from base) |
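| WS | Whitespace | space, U+2003 EM SPACE, form feed |
| B | Paragraph Separator | U+2029, line feed, carriage return |
| S | Segment Separator | tab (U+0009), U+000B, U+001F |
| LRE/RLE/LRO/RLO/PDF | Explicit Embedding and Override | U+202A–U+202E directional embedding and override characters |
| L / R (marks) | Directional Marks | U+200E LEFT-TO-RIGHT MARK has class L, U+200F RIGHT-TO-LEFT MARK has class R; LRM and RLM are not class values of their own |
| LRI/RLI/FSI/PDI | Isolate | U+2066–U+2069, Unicode 6.3+ directional isolates |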
```python
import unicodedata

# Sample characters paired with a human-readable label.
chars = [("A", "Latin"), ("ب", "Arabic"), ("5", "Digit"),
         ("\u200F", "RLM"), ("\u200E", "LRM")]

for char, label in chars:
    # bidirectional() returns the Bidi_Class code as a string.
    bc = unicodedata.bidirectional(char)
    print(f" {label:12} U+{ord(char):04X} Bidi={bc}")

#  Latin        U+0041 Bidi=L
#  Arabic       U+0628 Bidi=AL
#  Digit        U+0035 Bidi=EN
#  RLM          U+200F Bidi=R
#  LRM          U+200E Bidi=L
```
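The class codes fall into the broader types mentioned at the top of this entry: strong, weak, neutral, and explicit formatting. The sketch below groups the 23 codes that way, following the categories listed in UAX #9. The names `BIDI_TYPES` and `bidi_type` are hypothetical helpers introduced for illustration and are not part of `unicodedata`.

```python
import unicodedata

# Grouping of the 23 Bidi_Class codes by type, following UAX #9.
# BIDI_TYPES and bidi_type are illustrative names, not a library API.
BIDI_TYPES = {
    "strong": {"L", "R", "AL"},
    "weak": {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"},
    "neutral": {"B", "S", "WS", "ON"},
    "explicit": {"LRE", "LRO", "RLE", "RLO", "PDF",
                 "LRI", "RLI", "FSI", "PDI"},
}


def bidi_type(char: str) -> str:
    """Return the broad bidi type of a single character."""
    bc = unicodedata.bidirectional(char)
    for name, members in BIDI_TYPES.items():
        if bc in members:
            return name
    # Python reports "" for code points with no data in its database.
    return "unknown"


for sample in ["A", "\u05D0", "5", ",", " ", "\t", "!", "\u2067"]:
    print(f"U+{ord(sample):04X} {bidi_type(sample)}")
```

Strong characters set direction on their own, weak characters (digits, separators, combining marks) take cues from their neighbours, and neutrals are resolved from the surrounding strong text.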
Why It Matters in Practice
Without correct bidi handling, a string like "Hello مرحبا World" can display with the Arabic word in the wrong position or with adjacent punctuation displaced. HTML provides the `dir` attribute and the `<bdi>` and `<bdo>` elements to guide the algorithm, and plain text can use the Unicode characters U+200E (LRM), U+200F (RLM), and the bidi isolate characters (U+2066–U+2069). Web developers working with RTL content must understand that strings are stored in logical order (the order in which they are typed and read), and that the visual order of characters on screen can differ from it.
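Two common plain-text techniques can be sketched in a few lines: guessing a base direction from the first strong character, and wrapping inserted text in an isolate so it cannot reorder its surroundings. This is a simplified sketch, not an implementation of UAX #9: the real paragraph-level rule also skips characters inside isolates, and the helper names `first_strong_direction` and `isolate` are hypothetical.

```python
import unicodedata

FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def first_strong_direction(text: str, default: str = "ltr") -> str:
    """Guess a base direction from the first strong character.

    Simplification: unlike the full algorithm, this does not skip
    characters that sit inside directional isolates.
    """
    for ch in text:
        bc = unicodedata.bidirectional(ch)
        if bc == "L":
            return "ltr"
        if bc in ("R", "AL"):
            return "rtl"
    # No strong character found (for example digits or punctuation only).
    return default


def isolate(text: str) -> str:
    """Wrap text in FSI ... PDI so it is laid out as an isolated run."""
    return f"{FSI}{text}{PDI}"


mixed = "Hello \u0645\u0631\u062D\u0628\u0627 World"
print(first_strong_direction(mixed))          # ltr
print(first_strong_direction(mixed[6:]))      # rtl
print(repr(isolate("\u0645\u0631\u062D\u0628\u0627")))
```

A typical use is building a message such as `"User " + isolate(name) + " posted 3 photos"`, where `name` comes from user input and may be in any script. The string stays in logical order throughout; only the renderer reorders it for display.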
Quick Facts
| Property | Value |
|---|---|
| Unicode property name | Bidi_Class |
| Short alias | bc |
| Number of values | 23 |
| Python function | unicodedata.bidirectional(char) → string code |
| Algorithm spec | Unicode Standard Annex #9 (UAX #9) |
| Key characters | U+200E LRM, U+200F RLM, U+2066–U+2069 isolates |