Unicode Character Database (UCD)
Maschinenlesbare Sammlung von Datendateien, die alle Unicode-Zeicheneigenschaften definiert, darunter UnicodeData.txt, Blocks.txt, Scripts.txt und viele weitere.
What is the Unicode Character Database?
The Unicode Character Database (UCD) is the authoritative, machine-readable repository of all
properties for every Unicode code point. Where the Unicode Standard describes characters in
prose and tables, the UCD provides that same information in structured data files that software
libraries can parse and implement automatically. Every Unicode library — ICU, Python's
unicodedata, Java's Character class, .NET's CharUnicodeInfo — is built from the UCD.
The UCD is distributed as a collection of plain-text files published on unicode.org for each
Unicode version. The files follow documented formats (simple tables, two-column mappings, or the
comprehensive UnicodeData.txt) and are freely available for any use.
Core UCD Files
| File | Description |
|---|---|
UnicodeData.txt |
One line per assigned code point: name, category, combining class, bidi class, decomposition, numeric values, case mappings |
PropList.txt |
Boolean properties like White_Space, Dash, Diacritic, Extender |
DerivedCoreProperties.txt |
Derived properties like Alphabetic, Math, ID_Start, ID_Continue |
Blocks.txt |
Block name → code point range mapping |
Scripts.txt |
Script assignment for each code point (Latin, Arabic, Han, etc.) |
EmojiData.txt |
Emoji-specific properties: Emoji, Emoji_Presentation, Emoji_Modifier |
NameAliases.txt |
Formal aliases (corrections, abbreviations, alternate names) |
CaseFolding.txt |
Case-insensitive comparison mappings |
NormalizationTest.txt |
Test vectors for NFC/NFD/NFKC/NFKD implementations |
CompositionExclusions.txt |
Code points excluded from canonical composition |
UnicodeData.txt Format
The most important UCD file is UnicodeData.txt. Each line has 15 semicolon-delimited fields:
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
│ │ │ │ │
│ │ │ │ └── Bidi class (L = Left-to-right)
│ │ │ └──── Canonical combining class (0 = not combining)
│ │ └─────── General category (Lu = Uppercase Letter)
│ └─────────────────────────────── Character name
└──────────────────────────────────── Code point (hex)
The 15 fields in order: 1. Code point (hex) 2. Character name 3. General category 4. Canonical combining class 5. Bidi class 6. Character decomposition 7–9. Numeric values (decimal, digit, numeric) 10. Mirror flag 11. Unicode 1.0 name (legacy) 12. ISO comment (deprecated) 13–15. Uppercase, lowercase, titlecase mappings
Using the UCD in Practice
import unicodedata
# Python's unicodedata module is a UCD interface
char = "A"
print(unicodedata.name(char)) # LATIN CAPITAL LETTER A
print(unicodedata.category(char)) # Lu (Uppercase Letter)
print(unicodedata.bidirectional(char)) # L
print(unicodedata.combining(char)) # 0 (non-combining)
print(unicodedata.normalize("NFD", "é")) # e + combining accent
# Reading UnicodeData.txt directly
import urllib.request
url = "https://unicode.org/Public/UCD/latest/ucd/UnicodeData.txt"
with urllib.request.urlopen(url) as f:
for line in f:
fields = line.decode().strip().split(";")
cp, name, category = fields[0], fields[1], fields[2]
if category == "So": # Other Symbol
print(f"U+{cp}: {name}")
Common Pitfalls
Name vs Alias: UnicodeData.txt lists the normative name, but some characters have aliases
(corrections to historical naming errors). For example, U+FE18 is officially named PRESENTATION
FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET (note the typo), but the alias file provides
the corrected name.
Ranges in UnicodeData.txt: CJK Unified Ideographs are not listed individually. Instead, two lines mark the start and end of the range:
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
Quick Facts
| Property | Value |
|---|---|
| Primary file | UnicodeData.txt |
| Total files in UCD | ~60 files |
| URL | unicode.org/Public/UCD/latest/ucd/ |
| License | Unicode License (free, attribution required) |
| Update frequency | Each Unicode version |
| Key consumer | ICU, Python unicodedata, Java Character, .NET |
| Fields in UnicodeData.txt | 15 per line |
Verwandte Begriffe
Mehr in Unicode-Standard
Eine Informationseinheit zur Organisation, Steuerung oder Darstellung von Textdaten — die konzeptionelle …
Ebene 0 (U+0000–U+FFFF) mit den am häufigsten verwendeten Zeichen, darunter Lateinisch, Griechisch, …
Chinesisch, Japanisch und Koreanisch — der Sammelbegriff für den vereinheitlichten Han-Ideogramm-Block und …
Die kleinste Kodierungseinheit: ein 8-Bit-Byte in UTF-8, ein 16-Bit-Wort in UTF-16, ein …
Ein numerischer Wert im Unicode-Coderaum (U+0000 bis U+10FFFF), geschrieben als U+XXXX. Nicht …
Der vollständige Bereich möglicher Unicode-Codepunkte: U+0000 bis U+10FFFF (insgesamt 1.114.112), aufgeteilt in …
Ein zusammenhängender Block von 65.536 Codepunkten. Unicode hat 17 Ebenen (0–16): Ebene …
Ebenen 1–16 (U+10000–U+10FFFF) mit Emoji, historischen Schriften, CJK-Erweiterungen und Musiknotation. Erfordert Ersatzzeichenpaare …
Codepunkte U+D800–U+DFFF, ausschließlich für UTF-16-Ersatzzeichenpaare reserviert. Keine gültigen Unicode-Skalarwerte und dürfen nie …
The process of mapping Chinese, Japanese, and Korean ideographs that share a …