Unicode Character Database (UCD)
The collection of machine-readable data files that define the properties of all Unicode characters, including UnicodeData.txt, Blocks.txt, and Scripts.txt.
What is the Unicode Character Database?
The Unicode Character Database (UCD) is the authoritative, machine-readable repository of all
properties for every Unicode code point. Where the Unicode Standard describes characters in
prose and tables, the UCD provides that same information in structured data files that software
libraries can parse and implement automatically. Major Unicode libraries, including ICU, Python's
unicodedata, Java's Character class, and .NET's CharUnicodeInfo, derive their data from the UCD.
The UCD is distributed as a collection of plain-text files published on unicode.org for each
Unicode version. The files follow documented formats (simple tables, two-column mappings, or the
comprehensive UnicodeData.txt) and are freely available for any use.
Core UCD Files
| File | Description |
|---|---|
| UnicodeData.txt | One line per assigned code point (large ranges are given as First/Last pairs): name, category, combining class, bidi class, decomposition, numeric values, case mappings |
| PropList.txt | Boolean properties like White_Space, Dash, Diacritic, Extender |
| DerivedCoreProperties.txt | Derived properties like Alphabetic, Math, ID_Start, ID_Continue |
| Blocks.txt | Block name → code point range mapping |
| Scripts.txt | Script assignment for each code point (Latin, Arabic, Han, etc.) |
| emoji-data.txt | Emoji-specific properties: Emoji, Emoji_Presentation, Emoji_Modifier |
| NameAliases.txt | Formal aliases (corrections, abbreviations, alternate names) |
| CaseFolding.txt | Case-insensitive comparison mappings |
| NormalizationTest.txt | Test vectors for NFC/NFD/NFKC/NFKD implementations |
| CompositionExclusions.txt | Code points excluded from canonical composition |
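Most of the files above other than UnicodeData.txt share a simple layout: a code point or a `first..last` range, a semicolon, a value, and an optional `#` comment. The following is a minimal sketch of a reader for that layout, not a complete UCD parser. The helper name `parse_property_line` is hypothetical, the sample lines only illustrate the shape of Blocks.txt and PropList.txt entries, and the sketch assumes two-field lines (files with extra fields would need more handling).

```python
def parse_property_line(line):
    """Return (first, last, value) for a data line, or None for blank/comment lines."""
    # Everything after '#' is a comment in these files.
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    # Assumption: exactly two fields, "range ; value".
    range_part, value = (part.strip() for part in data.split(";", 1))
    # A single code point has no "..", so partition() leaves `last` empty.
    first, _, last = range_part.partition("..")
    start = int(first, 16)
    end = int(last, 16) if last else start
    return start, end, value


# Illustrative lines in the style of Blocks.txt and PropList.txt.
sample = [
    "# comment line",
    "0000..007F; Basic Latin",
    "0020          ; White_Space # Zs       SPACE",
    "",
]
parsed = [rec for rec in map(parse_property_line, sample) if rec is not None]
for start, end, value in parsed:
    print(f"U+{start:04X}..U+{end:04X}: {value}")
```

Returning inclusive `(first, last)` pairs rather than expanding every code point keeps memory use small for files that cover large ranges.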
UnicodeData.txt Format
The most important UCD file is UnicodeData.txt. Each line has 15 semicolon-delimited fields:
    0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
    │    │                      │  │ │
    │    │                      │  │ └── Bidi class (L = Left-to-right)
    │    │                      │  └──── Canonical combining class (0 = not combining)
    │    │                      └─────── General category (Lu = Uppercase Letter)
    │    └────────────────────────────── Character name
    └─────────────────────────────────── Code point (hex)
The 15 fields in order:

1. Code point (hex)
2. Character name
3. General category
4. Canonical combining class
5. Bidi class
6. Character decomposition
7. Decimal digit value
8. Digit value
9. Numeric value
10. Mirror flag
11. Unicode 1.0 name (legacy)
12. ISO comment (deprecated)
13. Uppercase mapping
14. Lowercase mapping
15. Titlecase mapping
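As a sketch of how the field order above maps onto code, the following splits one UnicodeData.txt line into a named record. The class name `UnicodeDataRecord`, its attribute names, and the function `parse_unicode_data_line` are hypothetical choices for this illustration; only the field order comes from the file format.

```python
from typing import NamedTuple


class UnicodeDataRecord(NamedTuple):
    # Attribute names are illustrative; the order follows the 15 fields.
    code_point: int
    name: str
    general_category: str
    combining_class: int
    bidi_class: str
    decomposition: str
    decimal_value: str
    digit_value: str
    numeric_value: str
    mirrored: bool
    unicode_1_name: str
    iso_comment: str
    uppercase: str
    lowercase: str
    titlecase: str


def parse_unicode_data_line(line: str) -> UnicodeDataRecord:
    fields = line.rstrip("\n").split(";")
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    return UnicodeDataRecord(
        code_point=int(fields[0], 16),    # hex string -> integer
        name=fields[1],
        general_category=fields[2],
        combining_class=int(fields[3]),   # decimal integer
        bidi_class=fields[4],
        decomposition=fields[5],
        decimal_value=fields[6],          # numeric fields kept as raw text
        digit_value=fields[7],
        numeric_value=fields[8],
        mirrored=(fields[9] == "Y"),      # "Y" or "N" in the file
        unicode_1_name=fields[10],
        iso_comment=fields[11],
        uppercase=fields[12],             # case mappings kept as hex strings
        lowercase=fields[13],
        titlecase=fields[14],
    )


record = parse_unicode_data_line("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
print(record.name, record.general_category, record.lowercase)
```

The numeric and case-mapping fields are left as strings here because they are often empty; a fuller parser would convert them as needed.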
Using the UCD in Practice
    import unicodedata

    # Python's unicodedata module is a UCD interface
    char = "A"
    print(unicodedata.name(char))             # LATIN CAPITAL LETTER A
    print(unicodedata.category(char))         # Lu (Uppercase Letter)
    print(unicodedata.bidirectional(char))    # L
    print(unicodedata.combining(char))        # 0 (non-combining)
    print(unicodedata.normalize("NFD", "é"))  # e + U+0301 COMBINING ACUTE ACCENT

    # Reading UnicodeData.txt directly
    import urllib.request

    url = "https://unicode.org/Public/UCD/latest/ucd/UnicodeData.txt"
    with urllib.request.urlopen(url) as f:
        for line in f:
            # Each line is a semicolon-delimited record of 15 fields
            fields = line.decode().strip().split(";")
            cp, name, category = fields[0], fields[1], fields[2]
            if category == "So":  # Other Symbol
                print(f"U+{cp}: {name}")
Common Pitfalls
Name vs Alias: UnicodeData.txt lists the normative name, but some characters also have formal
aliases that correct historical naming errors. For example, U+FE18 is officially named PRESENTATION
FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET (note the typo). Because names never change,
the corrected spelling (BRACKET) is published as an alias in NameAliases.txt.
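The name/alias split can be observed from Python. This short sketch assumes CPython 3.3 or later, where `unicodedata.lookup` also accepts name aliases; `unicodedata.name` still returns the normative (misspelled) name.

```python
import unicodedata

ch = "\uFE18"

# The normative name keeps the historical typo.
official = unicodedata.name(ch)
print(official)

# The corrected spelling is an alias, so lookup() resolves it to the same character.
corrected = "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET"
resolved = unicodedata.lookup(corrected)
print(resolved == ch)
```

Code that matches characters by name should therefore consult both the normative name and the aliases rather than assuming a single spelling.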
Ranges in UnicodeData.txt: CJK Unified Ideographs are not listed individually. Instead, two lines mark the start and end of the range:
    4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
    9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
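A parser that treats every line as a single character will see only two code points for this whole range. The sketch below shows one way to pair the First/Last marker lines into inclusive ranges; the function name `iter_code_point_ranges` is hypothetical, and the sample reuses the lines quoted in this article.

```python
def iter_code_point_ranges(lines):
    """Yield (first, last, label, category); single characters have first == last."""
    pending = None  # (first code point, label) of an open "<..., First>" line
    for line in lines:
        fields = line.strip().split(";")
        cp, name, category = int(fields[0], 16), fields[1], fields[2]
        if name.startswith("<") and name.endswith(", First>"):
            # Remember the start and strip the "<" prefix and ", First>" suffix.
            pending = (cp, name[1:-len(", First>")])
        elif name.startswith("<") and name.endswith(", Last>"):
            if pending is None:
                raise ValueError(f"Last marker without First at U+{cp:04X}")
            first, label = pending
            pending = None
            yield first, cp, label, category
        else:
            # Ordinary line (including "<control>"): one code point.
            yield cp, cp, name, category


sample = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;",
    "9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;",
]
ranges = list(iter_code_point_ranges(sample))
for first, last, label, category in ranges:
    print(f"U+{first:04X}..U+{last:04X} {category} {label}")
```

Every code point inside a yielded range shares the properties on the marker lines, so a lookup table can store the range once instead of one entry per ideograph.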
Quick Facts
| Property | Value |
|---|---|
| Primary file | UnicodeData.txt |
| Total files in UCD | ~60 files |
| URL | unicode.org/Public/UCD/latest/ucd/ |
| License | Unicode License (free, attribution required) |
| Update frequency | Each Unicode version |
| Key consumers | ICU, Python unicodedata, Java Character, .NET CharUnicodeInfo |
| Fields in UnicodeData.txt | 15 per line |