Unicode Character Database (UCD)
A collection of machine-readable data files that define all Unicode character properties, including UnicodeData.txt, Blocks.txt, Scripts.txt, and many more.
What is the Unicode Character Database?
The Unicode Character Database (UCD) is the authoritative, machine-readable repository of all
properties for every Unicode code point. Where the Unicode Standard describes characters in
prose and tables, the UCD provides that same information in structured data files that software
libraries can parse and implement automatically. Major Unicode libraries, including ICU, Python's
unicodedata, Java's Character class, and .NET's CharUnicodeInfo, derive their data from the UCD.
The UCD is distributed as a collection of plain-text files published on unicode.org for each
Unicode version. The files follow documented formats (simple tables, two-column mappings, or the
comprehensive UnicodeData.txt) and are freely available under the Unicode License.
Core UCD Files
| File | Description |
|---|---|
| UnicodeData.txt | One line per assigned code point (large ranges are abbreviated as First/Last pairs): name, category, combining class, bidi class, decomposition, numeric values, case mappings |
| PropList.txt | Boolean properties like White_Space, Dash, Diacritic, Extender |
| DerivedCoreProperties.txt | Derived properties like Alphabetic, Math, ID_Start, ID_Continue |
| Blocks.txt | Block name → code point range mapping |
| Scripts.txt | Script assignment for each code point (Latin, Arabic, Han, etc.) |
| emoji/emoji-data.txt | Emoji-specific properties: Emoji, Emoji_Presentation, Emoji_Modifier |
| NameAliases.txt | Formal aliases (corrections, abbreviations, alternate names) |
| CaseFolding.txt | Case-insensitive comparison mappings |
| NormalizationTest.txt | Test vectors for NFC/NFD/NFKC/NFKD implementations |
| CompositionExclusions.txt | Code points excluded from canonical composition |
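Several of the files in the table above (PropList.txt, DerivedCoreProperties.txt, Scripts.txt, Blocks.txt) share a simple line layout: a code point or `start..end` range, a semicolon, a value, and an optional `#` comment. The following is a minimal sketch of a parser for that layout. The helper name `parse_range_file` and the inline sample lines are illustrative assumptions, not part of any library, and files with more than two fields would need extra handling.

```python
from typing import Iterable, Iterator, Tuple

# A few sample lines in the layout used by PropList.txt (illustrative excerpt).
SAMPLE = """\
# PropList-style sample
0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>
0020          ; White_Space # Zs       SPACE
002D          ; Dash # Pd       HYPHEN-MINUS
"""

def parse_range_file(lines: Iterable[str]) -> Iterator[Tuple[int, int, str]]:
    """Yield (first, last, value) for each data line (hypothetical helper)."""
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue  # skip blank and comment-only lines
        cp_field, value = (part.strip() for part in data.split(";", 1))
        first, _, last = cp_field.partition("..")
        # A single code point has no "..", so it is its own range end.
        yield int(first, 16), int(last or first, 16), value

entries = list(parse_range_file(SAMPLE.splitlines()))
for first, last, value in entries:
    print(f"U+{first:04X}..U+{last:04X} {value}")
```

Keeping each entry as a range rather than expanding it to individual code points keeps memory use small, since some properties cover tens of thousands of code points.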
UnicodeData.txt Format
The most important UCD file is UnicodeData.txt. Each line has 15 semicolon-delimited fields:
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
│    │                      │  │ │
│    │                      │  │ └── Bidi class (L = Left-to-right)
│    │                      │  └──── Canonical combining class (0 = not combining)
│    │                      └─────── General category (Lu = Uppercase Letter)
│    └────────────────────────────── Character name
└─────────────────────────────────── Code point (hex)
The 15 fields in order:
1. Code point (hex)
2. Character name
3. General category
4. Canonical combining class
5. Bidi class
6. Character decomposition
7–9. Numeric values (decimal, digit, numeric)
10. Mirror flag
11. Unicode 1.0 name (legacy)
12. ISO comment (deprecated)
13–15. Uppercase, lowercase, titlecase mappings
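The field order above maps naturally onto a named record. The sketch below splits one UnicodeData.txt line into those 15 fields. The class name `UnicodeDataRecord`, its attribute names, and the function `parse_unicode_data_line` are illustrative assumptions. Only the code point, combining class, and mirror flag are converted. The remaining fields are kept as raw strings.

```python
from typing import NamedTuple

class UnicodeDataRecord(NamedTuple):
    """One UnicodeData.txt line; attribute names are illustrative."""
    code_point: int
    name: str
    general_category: str
    combining_class: int
    bidi_class: str
    decomposition: str
    decimal_value: str
    digit_value: str
    numeric_value: str
    bidi_mirrored: bool
    unicode_1_name: str
    iso_comment: str
    uppercase: str
    lowercase: str
    titlecase: str

def parse_unicode_data_line(line: str) -> UnicodeDataRecord:
    f = line.rstrip("\n").split(";")
    if len(f) != 15:
        raise ValueError(f"expected 15 fields, got {len(f)}")
    return UnicodeDataRecord(
        int(f[0], 16), f[1], f[2], int(f[3]), f[4], f[5],
        f[6], f[7], f[8], f[9] == "Y",  # mirror flag is "Y" or "N"
        f[10], f[11], f[12], f[13], f[14],
    )

record = parse_unicode_data_line("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
print(record.name, record.general_category, record.lowercase)
```

Empty fields stay as empty strings, so a caller can distinguish "no lowercase mapping" from a mapping to some code point.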
Using the UCD in Practice
import unicodedata
import urllib.request
# Python's unicodedata module is an interface to the UCD
char = "A"
print(unicodedata.name(char))  # LATIN CAPITAL LETTER A
print(unicodedata.category(char))  # Lu (Uppercase Letter)
print(unicodedata.bidirectional(char))  # L
print(unicodedata.combining(char))  # 0 (non-combining)
print(unicodedata.normalize("NFD", "é"))  # e + U+0301 COMBINING ACUTE ACCENT
# Reading UnicodeData.txt directly (requires network access)
url = "https://unicode.org/Public/UCD/latest/ucd/UnicodeData.txt"
with urllib.request.urlopen(url) as f:
    for line in f:
        fields = line.decode("utf-8").strip().split(";")
        cp, name, category = fields[0], fields[1], fields[2]
        if category == "So":  # Other Symbol
            print(f"U+{cp}: {name}")
Common Pitfalls
Name vs Alias: UnicodeData.txt lists the normative name, which is never changed once published,
so some characters carry formal aliases that correct historical naming errors. For example,
U+FE18 is officially named PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET (note
the typo), and NameAliases.txt provides the corrected name ending in BRACKET.
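The difference between the normative name and its alias can be seen from Python. This is a small sketch, assuming Python 3.3 or later, where `unicodedata.lookup` accepts formal aliases. The sample alias line follows the `code point;alias;type` layout of NameAliases.txt and is embedded here for illustration rather than read from the file.

```python
import unicodedata

ch = "\uFE18"
normative = unicodedata.name(ch)  # the normative name, typo included
print(normative)

# Sample line in the NameAliases.txt layout: code point; alias; type
alias_line = "FE18;PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET;correction"
cp_hex, alias, alias_type = alias_line.split(";")

# lookup() resolves formal aliases as well as normative names
resolved = unicodedata.lookup(alias)
print(f"U+{ord(resolved):04X} resolved via {alias_type} alias")
```

Note that `unicodedata.name` still returns the normative spelling, so code that displays names to users may want to prefer a correction alias when one exists.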
Ranges in UnicodeData.txt: Large ranges such as CJK Unified Ideographs and Hangul Syllables are not listed individually. Instead, a pair of lines whose names end in First> and Last> marks the start and end of each range:
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
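A parser that treats every line as a single code point will see only the two endpoints of such a range. The sketch below pairs up First/Last lines and yields one entry per range. The helper name `iter_entries` and the embedded sample are illustrative assumptions, and the sketch assumes each First line is immediately followed by its Last line.

```python
import re
from typing import Iterable, Iterator, Tuple

SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
"""

# Matches names like "<CJK Ideograph, First>" but not "<control>".
RANGE_MARKER = re.compile(r"^<(.+), (First|Last)>$")

def iter_entries(lines: Iterable[str]) -> Iterator[Tuple[int, int, str, str]]:
    """Yield (first, last, label, general_category) (hypothetical helper)."""
    pending = None  # (start code point, label) while inside a First/Last pair
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        cp, name, category = int(fields[0], 16), fields[1], fields[2]
        marker = RANGE_MARKER.match(name)
        if marker and marker.group(2) == "First":
            pending = (cp, marker.group(1))
        elif marker and marker.group(2) == "Last" and pending:
            start, label = pending
            yield start, cp, label, category
            pending = None
        else:
            yield cp, cp, name, category

ranges = list(iter_entries(SAMPLE.splitlines()))
for first, last, label, category in ranges:
    count = last - first + 1
    print(f"U+{first:04X}..U+{last:04X} {label} ({category}), {count} code point(s)")
```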
Quick Facts
| Property | Value |
|---|---|
| Primary file | UnicodeData.txt |
| Total files in UCD | ~60 files |
| URL | unicode.org/Public/UCD/latest/ucd/ |
| License | Unicode License (free, attribution required) |
| Update frequency | Each Unicode version |
| Key consumers | ICU, Python unicodedata, Java Character, .NET |
| Fields in UnicodeData.txt | 15 per line |