Unicode Character Database (UCD)
A collection of machine-readable data files defining all properties of Unicode characters, including UnicodeData.txt, Blocks.txt, Scripts.txt, and many others.
What is the Unicode Character Database?
The Unicode Character Database (UCD) is the authoritative, machine-readable repository of all
properties for every Unicode code point. Where the Unicode Standard describes characters in
prose and tables, the UCD provides that same information in structured data files that software
libraries can parse and implement automatically. Major Unicode libraries (ICU, Python's
unicodedata, Java's Character class, .NET's CharUnicodeInfo) derive their data tables from the UCD.
The UCD is distributed as a collection of plain-text files published on unicode.org for each
Unicode version. The files follow documented formats (simple tables, two-column mappings, or the
comprehensive UnicodeData.txt) and are freely available for any use.
Core UCD Files
| File | Description |
|---|---|
| UnicodeData.txt | One line per assigned code point (large ranges such as CJK ideographs appear as First/Last pairs): name, general category, combining class, bidi class, decomposition, numeric values, case mappings |
| PropList.txt | Boolean properties like White_Space, Dash, Diacritic, Extender |
| DerivedCoreProperties.txt | Derived properties like Alphabetic, Math, ID_Start, ID_Continue |
| Blocks.txt | Block name → code point range mapping |
| Scripts.txt | Script assignment for each code point (Latin, Arabic, Han, etc.) |
| emoji/emoji-data.txt | Emoji-specific properties: Emoji, Emoji_Presentation, Emoji_Modifier |
| NameAliases.txt | Formal aliases (corrections, abbreviations, alternate names) |
| CaseFolding.txt | Case-insensitive comparison mappings |
| NormalizationTest.txt | Test vectors for NFC/NFD/NFKC/NFKD implementations |
| CompositionExclusions.txt | Code points excluded from canonical composition |
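Most of the property files above (PropList.txt, Scripts.txt, Blocks.txt and similar) share a simple layout: a code point or `first..last` range, a semicolon, a value, and an optional `#` comment. The following is a minimal sketch of a reader for that layout. The function name `parse_property_lines` and the sample lines are illustrative assumptions, not part of the UCD or of any library.

```python
def parse_property_lines(lines):
    """Yield (first, last, value) tuples from UCD-style property lines."""
    for raw in lines:
        # Everything after '#' is a comment; blank lines carry no data.
        data = raw.split("#", 1)[0].strip()
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        cp_range, value = fields[0], fields[1]
        # A range is written "XXXX..YYYY"; a single code point has no "..".
        first, _, last = cp_range.partition("..")
        yield int(first, 16), int(last or first, 16), value


# Illustrative lines in the style of PropList.txt and Blocks.txt.
sample = [
    "# PropList-style excerpt",
    "0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>",
    "0020          ; White_Space # Zs       SPACE",
    "",
    "0000..007F; Basic Latin",
]

entries = list(parse_property_lines(sample))
for first, last, value in entries:
    print(f"U+{first:04X}..U+{last:04X}: {value}")
```

Storing single code points as one-element ranges keeps later lookups uniform, since every entry can be tested with `first <= cp <= last`.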
UnicodeData.txt Format
The most important UCD file is UnicodeData.txt. Each line has 15 semicolon-delimited fields:
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
│    │                      │  │ │
│    │                      │  │ └── Bidi class (L = Left-to-right)
│    │                      │  └──── Canonical combining class (0 = not combining)
│    │                      └─────── General category (Lu = Uppercase Letter)
│    └────────────────────────────── Character name
└─────────────────────────────────── Code point (hex)
The 15 fields, in order:
1. Code point (hex)
2. Character name
3. General category
4. Canonical combining class
5. Bidi class
6. Decomposition type and mapping
7–9. Numeric values (decimal digit, digit, numeric)
10. Bidi mirrored flag (Y or N)
11. Unicode 1.0 name (legacy)
12. ISO comment (deprecated)
13–15. Simple uppercase, lowercase, and titlecase mappings
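To make the field order concrete, here is a small sketch that splits one UnicodeData.txt line into a dictionary keyed by field. The key names in `FIELD_NAMES` and the helper `parse_unicodedata_line` are labels chosen for this example, not official identifiers.

```python
# Illustrative labels for the 15 fields, in file order.
FIELD_NAMES = [
    "code_point", "name", "general_category", "combining_class",
    "bidi_class", "decomposition", "decimal_value", "digit_value",
    "numeric_value", "bidi_mirrored", "unicode_1_name", "iso_comment",
    "simple_uppercase", "simple_lowercase", "simple_titlecase",
]


def parse_unicodedata_line(line):
    """Split one UnicodeData.txt line into a dict of its 15 fields."""
    values = line.rstrip("\n").split(";")
    if len(values) != len(FIELD_NAMES):
        raise ValueError(f"expected 15 fields, got {len(values)}")
    return dict(zip(FIELD_NAMES, values))


record = parse_unicodedata_line(
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
)
print(record["name"])              # LATIN CAPITAL LETTER A
print(record["general_category"])  # Lu
print(record["simple_lowercase"])  # 0061
```

Empty fields come back as empty strings, so a caller can tell "no mapping" apart from a real value without extra parsing.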
Using the UCD in Practice
import unicodedata
# Python's unicodedata module is a UCD interface
char = "A"
print(unicodedata.name(char)) # LATIN CAPITAL LETTER A
print(unicodedata.category(char)) # Lu (Uppercase Letter)
print(unicodedata.bidirectional(char)) # L
print(unicodedata.combining(char)) # 0 (non-combining)
print(unicodedata.normalize("NFD", "é")) # e + combining accent
# Reading UnicodeData.txt directly
import urllib.request

url = "https://unicode.org/Public/UCD/latest/ucd/UnicodeData.txt"
with urllib.request.urlopen(url) as f:
    for line in f:
        fields = line.decode("utf-8").strip().split(";")
        cp, name, category = fields[0], fields[1], fields[2]
        if category == "So":  # So = Other Symbol
            print(f"U+{cp}: {name}")
Common Pitfalls
Name vs Alias: UnicodeData.txt lists the normative name, which is never changed once published,
so historical naming errors are corrected through aliases rather than renames. For example,
U+FE18 is officially named PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET (note
the typo), and NameAliases.txt provides the corrected name ending in BRACKET.
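The U+FE18 case can be checked from Python. This sketch relies on `unicodedata.lookup` accepting formal name aliases, which the module documentation describes as supported since Python 3.3; the variable names are illustrative.

```python
import unicodedata

ch = "\uFE18"

# name() returns the normative name, including the historical typo.
formal = unicodedata.name(ch)
print(formal)

# The corrected spelling is a formal alias; lookup() resolves it
# back to the same code point.
corrected = formal.replace("BRAKCET", "BRACKET")
resolved = unicodedata.lookup(corrected)
print(f"U+{ord(resolved):04X}")
```

Code that matches characters by name should therefore not assume that `name()` and a human-corrected spelling agree.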
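Ranges in UnicodeData.txt: CJK Unified Ideographs are not listed individually. Instead, two lines mark the start and end of the range (the same First/Last convention is used for other large ranges, such as Hangul syllables and the private-use areas):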
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
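A parser that treats every line as a single code point would see only U+4E00 and U+9FFF and miss everything between them. The sketch below shows one way to fold First/Last pairs into ranges. The helper names `iter_ranges` and `category_of` are hypothetical, and the sample lines stand in for the real file.

```python
def iter_ranges(lines):
    """Yield (first, last, label, category) per record or First/Last pair."""
    pending = None  # (first code point, label) of an open "<..., First>" line
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        cp, name, category = int(fields[0], 16), fields[1], fields[2]
        if name.endswith(", First>"):
            # Strip the leading "<" and the trailing ", First>".
            pending = (cp, name[1:-len(", First>")])
        elif name.endswith(", Last>") and pending is not None:
            first, label = pending
            pending = None
            yield first, cp, label, category
        else:
            yield cp, cp, name, category


def category_of(cp, ranges):
    """Return the general category for cp, or None if cp is not covered."""
    for first, last, _label, category in ranges:
        if first <= cp <= last:
            return category
    return None


sample = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;",
    "9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;",
]

ranges = list(iter_ranges(sample))
print(ranges)
print(category_of(0x5B57, ranges))  # a code point inside the CJK range
```

The linear scan in `category_of` is only meant to show the idea. A real table would more likely sort the ranges and use binary search.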
Quick Facts
| Property | Value |
|---|---|
| Primary file | UnicodeData.txt |
| Total files in UCD | ~60 files |
| URL | unicode.org/Public/UCD/latest/ucd/ |
| License | Unicode License (free, attribution required) |
| Update frequency | Each Unicode version |
| Key consumers | ICU, Python unicodedata, Java Character, .NET CharUnicodeInfo |
| Fields in UnicodeData.txt | 15 per line |