What is a Unicode Technical Report (UTR)?

Informational documents published by the Unicode Consortium covering specific topics like security considerations (UTR #36), the character encoding model (UTR #17), and Unicode support for mathematics (UTR #25).

What is Unicode Normalization?

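The process of converting Unicode text to a standard normalized form. There are four forms: NFC (canonical composition), NFD (canonical decomposition), NFKC (compatibility composition), and NFKD (compatibility decomposition).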

What is the Unicode Bidirectional Algorithm (UBA)?

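An algorithm that determines the display order of mixed-direction text (for example, English combined with Arabic) using each character's bidirectional category together with explicit directional overrides.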

Unicode Standard

Unicode Standard Annex (UAX)

Normative or informative documents that are integral parts of the Unicode Standard. Key examples include UAX #9 (Bidirectional Algorithm), UAX #11 (East Asian Width), and UAX #15 (Normalization Forms).

What is a Unicode Standard Annex (UAX)?

A Unicode Standard Annex (UAX) is a normative technical document that is an integral part of the Unicode Standard. Unlike Unicode Technical Reports (UTRs), which are informative recommendations, UAXes define algorithms and properties that implementations may be required to follow for conformance. Each UAX is updated in lockstep with Unicode version releases and carries a normative weight comparable to the core chapters of the Unicode Standard itself.

UAXes cover the most complex and widely implemented aspects of Unicode text processing — topics too detailed to fit in the core Standard chapters but too important to remain merely advisory.

Key UAXes

UAX #9: Unicode Bidirectional Algorithm

Defines the algorithm that determines the display order of characters in text containing both left-to-right (LTR) and right-to-left (RTL) scripts, such as English mixed with Arabic or Hebrew. The algorithm assigns a directional category to each character and applies a set of rules to determine visual ordering. Software that displays mixed-direction text, such as web browsers and word processors, needs to implement UAX #9 to render it correctly.
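As a small illustration of the starting point of the algorithm, the directional category of each character can be looked up with Python's standard unicodedata module. This is only a sketch of the category lookup, not of the reordering rules, and the helper name bidi_classes is made up for this example.

```python
import unicodedata

def bidi_classes(text):
    """Return (character, Bidi_Class) pairs for every code point in text."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]

# Latin letter, space, Hebrew alef, Arabic alef, European digit
sample = "a \u05d0\u0627" + "1"

for ch, cls in bidi_classes(sample):
    print(f"U+{ord(ch):04X} {cls}")
```

The strong categories (L, R, AL) drive the direction of a run, while weak and neutral categories such as EN and WS are resolved from their context by the later rules of UAX #9.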

UAX #11: East Asian Width

Assigns each Unicode character one of six width categories (Narrow, Wide, Fullwidth, Halfwidth, Ambiguous, Neutral) for use in fixed-width terminal display and East Asian layout. Critical for wcwidth() implementations and terminal multiplexers.
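A rough sketch of how a wcwidth()-style function might use this property, based on Python's unicodedata.east_asian_width(). The function name display_width and the ambiguous_wide flag are assumptions for illustration; a real implementation would also need to handle combining marks, control characters, and emoji sequences.

```python
import unicodedata

def display_width(text, ambiguous_wide=False):
    """Estimate terminal columns: 2 for Wide/Fullwidth characters, 1 otherwise.

    Simplified: every other character, including combining marks and
    control characters, is counted as one column.
    """
    width = 0
    for ch in text:
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F") or (eaw == "A" and ambiguous_wide):
            width += 2
        else:
            width += 1
    return width

print(display_width("abc"))             # 3
print(display_width("日本語"))           # 6
print(display_width("\uff21\uff22"))    # 4 (fullwidth Latin A and B)
```

Ambiguous characters are treated as narrow by default here and as wide only when the caller opts in, since their width depends on the context in which they are displayed.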

UAX #14: Unicode Line Breaking Algorithm

Assigns each character a line breaking class and defines an ordered set of rules that identify line break opportunities between adjacent characters, so that text renderers know where it is safe to break a line. Essential for text layout engines, from web browsers to PDF generators.

UAX #15: Unicode Normalization Forms

Specifies the four normalization forms (NFC, NFD, NFKC, NFKD) that convert text to a canonically equivalent or compatibility-equivalent standard form. Normalization is foundational to string comparison, search, and storage in virtually every Unicode-aware system.
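The effect on string comparison can be seen with Python's unicodedata.normalize(). This is a minimal sketch; the helper name equivalent is an assumption for this example, and which form to use depends on the application.

```python
import unicodedata

composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

def equivalent(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

print(composed == decomposed)              # False: different code points
print(equivalent(composed, decomposed))    # True: canonically equivalent

# Compatibility forms also fold characters such as the "fi" ligature.
ligature = "\ufb01"
print(unicodedata.normalize("NFC", ligature) == ligature)   # True: unchanged
print(unicodedata.normalize("NFKC", ligature))              # fi
```

Normalizing both sides to the same form before comparing, rather than comparing raw strings, is what makes visually identical text compare equal.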

UAX #29: Unicode Text Segmentation

Defines grapheme cluster boundaries (what a user perceives as a single character), word boundaries, and sentence boundaries. Implementations include Unicode-aware word-boundary matching in regex engines, cursor movement in text editors, and layout engines that avoid breaking inside a grapheme cluster.
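To show why code points and user-perceived characters differ, here is a deliberately simplified approximation that only attaches combining marks to the preceding character. It is not the UAX #29 algorithm: it ignores Hangul syllables, emoji ZWJ sequences, regional indicators, and the other rules, and the name rough_clusters is made up for this sketch.

```python
import unicodedata

def rough_clusters(text):
    """Group each combining mark (category M*) with the character before it.

    A simplification for illustration only, not full grapheme segmentation.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

word = "cafe\u0301"               # "café" with a combining acute accent
print(len(word))                  # 5 code points
print(len(rough_clusters(word)))  # 4 user-perceived characters
```

For production use, a library that implements the full UAX #29 rules is the safer choice.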

UAX #31: Unicode Identifiers and Syntax

Specifies which Unicode characters may be used in programming language identifiers (variable names, function names, etc.). Python 3, Java, Rust, and many other languages reference UAX #31 for their identifier rules.
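In Python, the built-in str.isidentifier() method checks a string against the language's identifier syntax, which is based on UAX #31. The following sketch uses a made-up helper, classify, and an arbitrary list of candidate names.

```python
def classify(names):
    """Map each candidate name to whether Python accepts it as an identifier."""
    return {name: name.isidentifier() for name in names}

candidates = ["name", "café", "变量", "_private", "2fast", "a-b"]

for name, ok in classify(candidates).items():
    print(f"{name!r}: {ok}")
```

Note that isidentifier() checks syntax only; reserved words such as "class" also pass and need a separate check with keyword.iskeyword().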

UAX #44: Unicode Character Database

Documents the structure, content, and semantics of the Unicode Character Database (UCD), the set of data files that encode all Unicode character properties. It is essentially the specification for how to read and interpret files like UnicodeData.txt, DerivedCoreProperties.txt, and the other UCD data files.
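As an illustration of what reading such a file involves, the sketch below parses one semicolon-separated UnicodeData.txt record into named fields. The field labels in FIELDS are shorthand chosen for this example, SAMPLE_LINE is the entry for U+0041 reproduced by hand, and parse_unicode_data_line is a hypothetical helper. Special cases such as range markers (First/Last entries) are not handled.

```python
import unicodedata

# Shorthand labels for the 15 fields of a UnicodeData.txt record.
FIELDS = (
    "code", "name", "general_category", "canonical_combining_class",
    "bidi_class", "decomposition", "decimal", "digit", "numeric",
    "bidi_mirrored", "unicode_1_name", "iso_comment",
    "simple_uppercase", "simple_lowercase", "simple_titlecase",
)

# Entry for U+0041, copied by hand for this example.
SAMPLE_LINE = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"

def parse_unicode_data_line(line):
    """Split one record into a dict keyed by FIELDS; code becomes an int."""
    values = line.rstrip("\n").split(";")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    record["code"] = int(record["code"], 16)
    return record

rec = parse_unicode_data_line(SAMPLE_LINE)
ch = chr(rec["code"])

# Cross-check the parsed values against Python's bundled copy of the UCD.
print(rec["name"] == unicodedata.name(ch))
print(rec["general_category"] == unicodedata.category(ch))
```

In practice most programs use a library such as unicodedata rather than parsing the files directly, but the file format is what those libraries are generated from.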

UAX Lifecycle

Research (UTR) → Community feedback → Promoted to UAX → Updated each version

Each revision of a UAX is tied to a specific version of the Unicode Standard. When you reference UAX #15 revision 10, you know exactly which normalization rules apply.

Quick Facts

Property	Value
Normative status	Normative (integral to Unicode Standard)
Naming convention	UAX #N (e.g., UAX #15)
Update cadence	Every Unicode version release
Most implemented	UAX #9 (Bidi), UAX #15 (Normalization), UAX #29 (Segmentation)
Identifier rules	UAX #31 (used by Python, Java, Rust, etc.)
UCD documentation	UAX #44
Publication URL	unicode.org/reports/

More terms in Unicode Standard

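CJK (Han, Kana, Hangul)

Chinese, Japanese, and Korean: a collective term for the unified Han ideograph blocks and related scripts in Unicode. CJK Unified Ideographs include at least 20,992 characters.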

Han Unification

The process of mapping Chinese, Japanese, and Korean ideographs that share a …

Hangul Jamo

The individual consonant and vowel components (jamo) of the Korean Hangul writing …

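ISO 10646 / Universal Character Set

The international standard (ISO/IEC 10646) that is kept synchronized with Unicode. It defines the same character repertoire and code points but does not include Unicode's additional algorithms and properties.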

Unicode

A universal character encoding standard that assigns a unique number (code point) to every character in every writing system. Version 16.0 contains 154,998 assigned characters.

Unicode Technical Report (UTR)

Informational documents published by the Unicode Consortium covering specific topics like security …

Unicode Consortium

The non-profit organization that develops and maintains the Unicode Standard. Its members include many companies, such as Apple, Google, Microsoft, and Meta.

Unicode Scalar Value

Any code point except the surrogate code points (U+D800 to U+DFFF). This is the set of valid values that can represent actual characters, 1,112,064 in total.

Unicode Version

A major release of the Unicode Standard that adds new characters, scripts, and features. The current version is Unicode 16.0 (September 2024).

Unicode Stability Policy

A policy guaranteeing that the code point and name of a character never change once assigned. Properties may be revised, but assignments are permanent.
