Unicode Standard Annex (UAX)
Normative or informative documents that are integral parts of the Unicode Standard. UAX#9 (Bidi Algorithm), UAX#11 (East Asian Width), UAX#15 (Normalization Forms) are key examples.
What is a Unicode Standard Annex (UAX)?
A Unicode Standard Annex (UAX) is a normative technical document that is an integral part of the Unicode Standard. Unlike Unicode Technical Reports (UTRs), which are informative recommendations, UAXes define algorithms and properties that implementations may be required to follow for conformance. Each UAX is updated in lockstep with Unicode version releases and carries a normative weight comparable to the core chapters of the Unicode Standard itself.
UAXes cover the most complex and widely implemented aspects of Unicode text processing — topics too detailed to fit in the core Standard chapters but too important to remain merely advisory.
Key UAXes
UAX #9 — Unicode Bidirectional Algorithm
Defines the algorithm that determines the display order of characters in text containing both left-to-right (LTR) and right-to-left (RTL) scripts, such as English mixed with Arabic or Hebrew. The algorithm takes each character's Bidi_Class property as input, resolves embedding levels through a sequence of rules, and reorders the text for display. Web browsers, word processors, and any other software that displays mixed-direction text rely on UAX #9 to order it correctly.
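As a small illustration of the input side of the algorithm, the sketch below looks up the Bidi_Class of a few characters using Python's standard `unicodedata` module. It does not implement the reordering rules of UAX #9. The helper name `bidi_classes` and the sample string are assumptions made for this example.

```python
import unicodedata

def bidi_classes(text):
    """Return (character name, Bidi_Class) pairs for each code point in text."""
    return [(unicodedata.name(ch, "?"), unicodedata.bidirectional(ch)) for ch in text]

# Latin A, Hebrew alef, Arabic alef, a European digit, and a space.
sample = "A\u05D0\u06271 "

for name, bidi_class in bidi_classes(sample):
    print(f"{bidi_class:>3}  {name}")
```

The classes printed here (L, R, AL, EN, WS) are the starting point from which a full implementation would resolve embedding levels.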
UAX #11 — East Asian Width
Assigns each Unicode character one of six width categories (Narrow, Wide, Fullwidth, Halfwidth, Ambiguous, Neutral) for use in fixed-width terminal display and East Asian layout. Critical for wcwidth() implementations and terminal multiplexers.
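A rough sketch of how a column-width estimate might use this property, based on Python's standard `unicodedata.east_asian_width`. The function name `display_width` is hypothetical, and the policy is a simplifying assumption: Wide and Fullwidth characters count as two columns and everything else as one. Combining marks, control characters, and the context-dependent Ambiguous category are not handled the way a real `wcwidth()` would handle them.

```python
import unicodedata

def display_width(text):
    """Estimate terminal columns: W and F count as 2, all other categories as 1.

    Simplification: ignores zero-width combining marks and control characters,
    and treats Ambiguous (A) characters as narrow.
    """
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )

print(unicodedata.east_asian_width("A"))       # Narrow
print(unicodedata.east_asian_width("\uFF21"))  # Fullwidth Latin capital A
print(display_width("abc"), display_width("日本語"))
```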
UAX #14 — Unicode Line Breaking Algorithm
Defines a rule-based algorithm (rules LB1 through LB31) that uses each character's Line_Break property to decide where a line break is allowed, prohibited, or mandatory. This tells text renderers where it is safe to break a line. Essential for text layout engines, from web browsers to PDF generators.
UAX #15 — Unicode Normalization Forms
Specifies the four normalization forms (NFC, NFD, NFKC, NFKD), which convert text to a single representation under either canonical or compatibility equivalence. Normalization is foundational to string comparison, search, and storage in virtually every Unicode-aware system.
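The effect on string comparison can be sketched with Python's standard `unicodedata.normalize`. The helper `nfc_equal` is a hypothetical name chosen for this example, and comparing under NFC is one reasonable choice rather than the only one.

```python
import unicodedata

composed = "\u00E9"      # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

def nfc_equal(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(composed == decomposed)           # raw code point comparison
print(nfc_equal(composed, decomposed))  # comparison under canonical equivalence

# Compatibility forms also fold presentation variants such as the "fi" ligature.
print(unicodedata.normalize("NFKC", "\uFB01"))
```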
UAX #29 — Unicode Text Segmentation
Defines grapheme cluster boundaries (what a user perceives as a single character), word boundaries, and sentence boundaries. Used for cursor movement and text selection in editors, for character counting in user interfaces, and for Unicode-aware word-boundary matching in regex engines that support it. Line breaking is covered separately by UAX #14.
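To show why code points and user-perceived characters differ, here is a deliberately simplified approximation that only attaches combining marks to the preceding character. It is not the UAX #29 algorithm: it ignores Hangul syllable rules, emoji ZWJ sequences, regional indicators, and CR LF handling. The function name `rough_clusters` is hypothetical.

```python
import unicodedata

def rough_clusters(text):
    """Group each base character with the combining marks that follow it.

    Simplification: a code point whose General_Category starts with "M"
    (Mn, Mc, Me) is appended to the previous cluster; everything else
    starts a new cluster.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

word = "cafe\u0301"  # "café" with a decomposed é
print(len(word), len(rough_clusters(word)))
```

A production system would typically rely on a library that implements the full rule set rather than a shortcut like this one.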
UAX #31 — Unicode Identifiers and Syntax
Specifies which Unicode characters may be used in programming language identifiers (variable names, function names, etc.). Python 3, Java, Rust, and many other languages reference UAX #31 for their identifier rules.
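In Python 3, the built-in `str.isidentifier()` method offers a quick way to see these rules in action. The sketch below uses a hypothetical helper, `check_identifiers`, and an arbitrary list of candidate names. Note that `isidentifier()` does not reject reserved keywords.

```python
def check_identifiers(names):
    """Map each candidate name to whether Python accepts it as an identifier."""
    return {name: name.isidentifier() for name in names}

candidates = ["total", "naïve", "变量", "_private", "2fast", "a-b"]

for name, ok in check_identifiers(candidates).items():
    print(f"{name!r}: {ok}")
```

Names that start with a digit or contain a hyphen are rejected, while letters from non-Latin scripts are accepted.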
UAX #44 — Unicode Character Database
Documents the structure, content, and semantics of the Unicode Character Database (UCD), the set of data files that encode all Unicode character properties. It is essentially the specification for how to read and interpret files like UnicodeData.txt, DerivedCoreProperties.txt, and the dozens of other data files in the UCD.
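As a minimal sketch of what reading such a file involves, the code below parses one semicolon-separated record in the UnicodeData.txt format. The sample line is the entry for U+0041. The field labels in `FIELD_NAMES` and the function name `parse_unicode_data_line` are naming choices made for this example, not identifiers defined by UAX #44.

```python
# One record in UnicodeData.txt format: 15 fields separated by semicolons.
SAMPLE_LINE = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"

# Illustrative labels for the 15 fields, in file order.
FIELD_NAMES = [
    "code_point", "name", "general_category", "canonical_combining_class",
    "bidi_class", "decomposition", "decimal_value", "digit_value",
    "numeric_value", "bidi_mirrored", "unicode_1_name", "iso_comment",
    "simple_uppercase_mapping", "simple_lowercase_mapping",
    "simple_titlecase_mapping",
]

def parse_unicode_data_line(line):
    """Split one record into a dict and convert the code point from hex."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(fields)}")
    record = dict(zip(FIELD_NAMES, fields))
    record["code_point"] = int(record["code_point"], 16)
    return record

record = parse_unicode_data_line(SAMPLE_LINE)
print(hex(record["code_point"]), record["name"], record["general_category"])
```

Empty fields stay as empty strings here; a fuller parser would also need to handle the range markers the file uses for large blocks such as CJK ideographs.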
UAX Lifecycle
Research (UTR) → Community feedback → Promoted to UAX → Updated each version
Each revision of a UAX is tied to a specific version of the Unicode Standard. When you cite a particular revision, such as UAX #15 revision 10, you know exactly which normalization rules apply.
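Implementations are pinned to a version in the same way. As one example, Python's standard `unicodedata` module reports which Unicode version its character data follows. The helper name `unicode_version_tuple` is hypothetical.

```python
import unicodedata

def unicode_version_tuple():
    """Return the Unicode version of this runtime's character data as integers."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))

# The value depends on the Python release in use.
print(unicodedata.unidata_version, unicode_version_tuple())
```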
Quick Facts
| Property | Value |
|---|---|
| Normative status | Normative (integral to Unicode Standard) |
| Naming convention | UAX #N (e.g., UAX #15) |
| Update cadence | Every Unicode version release |
| Most implemented | UAX #9 (Bidi), UAX #15 (Normalization), UAX #29 (Segmentation) |
| Identifier rules | UAX #31 (used by Python, Java, Rust, etc.) |
| UCD documentation | UAX #44 |
| Publication URL | unicode.org/reports/ |
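Related Terms
More in Unicode Standard
CJK: the collective term for the unified Han ideograph blocks and related writing systems in Unicode. CJK Unified Ideographs comprise more than 20,992 characters.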
The process of mapping Chinese, Japanese, and Korean ideographs that share a …
The individual consonant and vowel components (jamo) of the Korean Hangul writing …
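The international standard (ISO/IEC 10646) that is synchronized with Unicode. It defines the same character set and code points but does not include Unicode's additional algorithms and properties.
The universal character encoding standard that assigns a unique number (code point) to every character in every writing system. Version 16.0 contains 154,998 assigned characters.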
Informational documents published by the Unicode Consortium covering specific topics like security …
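The collection of machine-readable data files that define all Unicode character properties, including UnicodeData.txt, Blocks.txt, and Scripts.txt.
All code points except the surrogate code points (U+D800–U+DFFF). This is the set of valid values that can represent actual characters, 1,112,064 in total.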
Major releases of the Unicode Standard, each of which adds characters, scripts, and features. Unicode 16.0 was released in September 2024.
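The policy guaranteeing that once a character is assigned, its code point and name never change. Properties may be refined, but assignments are permanent.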