Unicode Character Database (UCD)
A collection of machine-readable data files that defines all Unicode character properties, including UnicodeData.txt, Blocks.txt, and Scripts.txt.
What is the Unicode Character Database?
The Unicode Character Database (UCD) is the authoritative, machine-readable repository of all
properties for every Unicode code point. Where the Unicode Standard describes characters in
prose and tables, the UCD provides that same information in structured data files that software
libraries can parse and implement automatically. Major Unicode libraries, including ICU, Python's
unicodedata, Java's Character class, and .NET's CharUnicodeInfo, derive their data from the UCD.
The UCD is distributed as a collection of plain-text files published on unicode.org for each
Unicode version. The files follow documented formats (simple tables, two-column mappings, or the
comprehensive UnicodeData.txt) and are freely available for any use.
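Because each Unicode version ships its own UCD snapshot, it helps to know which snapshot a given runtime was built from. The following is a minimal sketch using the standard `unicodedata.unidata_version` attribute; the helper name `ucd_version_tuple` is hypothetical and exists only for illustration.

```python
import unicodedata


def ucd_version_tuple(version: str = unicodedata.unidata_version) -> tuple[int, ...]:
    """Split a dotted UCD version string such as '15.1.0' into integers."""
    return tuple(int(part) for part in version.split("."))


# The version of the UCD that this Python build embeds
print(unicodedata.unidata_version)
print(ucd_version_tuple())
```

Comparing the resulting tuple against a minimum version is one way to guard code that depends on recently added characters.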
Core UCD Files
| File | Description |
|---|---|
| UnicodeData.txt | One line per assigned code point (large ranges are abbreviated to a First/Last pair): name, category, combining class, bidi class, decomposition, numeric values, case mappings |
| PropList.txt | Boolean properties like White_Space, Dash, Diacritic, Extender |
| DerivedCoreProperties.txt | Derived properties like Alphabetic, Math, ID_Start, ID_Continue |
| Blocks.txt | Block name → code point range mapping |
| Scripts.txt | Script assignment for each code point (Latin, Arabic, Han, etc.) |
| emoji/emoji-data.txt | Emoji-specific properties: Emoji, Emoji_Presentation, Emoji_Modifier |
| NameAliases.txt | Formal aliases (corrections, abbreviations, alternate names) |
| CaseFolding.txt | Case-insensitive comparison mappings |
| NormalizationTest.txt | Test vectors for NFC/NFD/NFKC/NFKD implementations |
| CompositionExclusions.txt | Code points excluded from canonical composition |
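Several of these files, such as Blocks.txt, Scripts.txt, and PropList.txt, share a simple layout: a code point or `first..last` range, a semicolon, a value, and an optional `#` comment. The sketch below parses that layout. The function name `parse_range_file` and the inline sample lines are illustrative assumptions, and files with more than two fields would need extra handling.

```python
from typing import Iterable, Iterator


def parse_range_file(lines: Iterable[str]) -> Iterator[tuple[int, int, str]]:
    """Yield (first, last, value) from lines like '0000..007F; Basic Latin'."""
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop trailing comments and whitespace
        if not line:
            continue  # skip blank and comment-only lines
        cp_field, value = (part.strip() for part in line.split(";", 1))
        first, _, last = cp_field.partition("..")
        # A single code point has no '..', so it is its own range end
        yield int(first, 16), int(last or first, 16), value


# Sample lines modelled on Blocks.txt and Scripts.txt (not the full files)
RANGE_SAMPLE = """\
# excerpt
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR
""".splitlines()

for first, last, value in parse_range_file(RANGE_SAMPLE):
    print(f"U+{first:04X}..U+{last:04X}  {value}")
```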
UnicodeData.txt Format
The most important UCD file is UnicodeData.txt. Each line has 15 semicolon-delimited fields:
    0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
    │    │                      │  │ │
    │    │                      │  │ └── Bidi class (L = Left-to-right)
    │    │                      │  └──── Canonical combining class (0 = not combining)
    │    │                      └─────── General category (Lu = Uppercase Letter)
    │    └────────────────────────────── Character name
    └─────────────────────────────────── Code point (hex)
The 15 fields in order:
1. Code point (hex)
2. Character name
3. General category
4. Canonical combining class
5. Bidi class
6. Character decomposition
7–9. Numeric values (decimal, digit, numeric)
10. Mirror flag
11. Unicode 1.0 name (legacy)
12. ISO comment (deprecated)
13–15. Uppercase, lowercase, titlecase mappings
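The field order above maps naturally onto a named record. The following is a rough sketch of one way to parse a single line; the class `UnicodeDataRecord`, its attribute names, and `parse_unicode_data_line` are hypothetical names chosen for this example, not part of any library.

```python
from typing import NamedTuple


class UnicodeDataRecord(NamedTuple):
    code_point: int
    name: str
    general_category: str
    combining_class: int
    bidi_class: str
    decomposition: str
    decimal_value: str
    digit_value: str
    numeric_value: str
    bidi_mirrored: bool
    unicode_1_name: str
    iso_comment: str
    simple_uppercase: str
    simple_lowercase: str
    simple_titlecase: str


def parse_unicode_data_line(line: str) -> UnicodeDataRecord:
    """Parse one UnicodeData.txt line into its 15 fields."""
    f = line.rstrip("\n").split(";")
    if len(f) != 15:
        raise ValueError(f"expected 15 fields, got {len(f)}")
    return UnicodeDataRecord(
        int(f[0], 16), f[1], f[2], int(f[3]), f[4], f[5],
        f[6], f[7], f[8],
        f[9] == "Y",  # mirror flag is stored as Y or N
        f[10], f[11], f[12], f[13], f[14],
    )


record = parse_unicode_data_line("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
print(record.name, record.general_category, record.simple_lowercase)
```

Most fields are kept as strings here because they are often empty; a fuller parser might convert the case mappings to code points as well.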
Using the UCD in Practice
    import unicodedata
    import urllib.request

    # Python's unicodedata module is a UCD interface
    char = "A"
    print(unicodedata.name(char))           # LATIN CAPITAL LETTER A
    print(unicodedata.category(char))       # Lu (Uppercase Letter)
    print(unicodedata.bidirectional(char))  # L
    print(unicodedata.combining(char))      # 0 (non-combining)
    # NFD splits U+00E9 into "e" followed by U+0301 COMBINING ACUTE ACCENT
    print(unicodedata.normalize("NFD", "\u00e9"))

    # Reading UnicodeData.txt directly (requires network access)
    url = "https://unicode.org/Public/UCD/latest/ucd/UnicodeData.txt"
    with urllib.request.urlopen(url) as f:
        for line in f:
            fields = line.decode("utf-8").strip().split(";")
            cp, name, category = fields[0], fields[1], fields[2]
            if category == "So":  # Other Symbol
                print(f"U+{cp}: {name}")
Common Pitfalls
Name vs Alias: UnicodeData.txt lists the normative name, which is never changed once published,
so some characters have formal aliases that correct historical naming errors. For example, U+FE18
is officially named PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET (note the typo),
and NameAliases.txt provides the corrected name ending in BRACKET.
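To make the name/alias distinction concrete, here is a small sketch that reads NameAliases.txt-style lines (code point, alias, and alias type separated by semicolons) and compares them with the normative name reported by `unicodedata.name`. The function name `parse_name_aliases` and the inline excerpt are assumptions for illustration; a real program would read the full file.

```python
import unicodedata


def parse_name_aliases(lines):
    """Map code point -> list of (alias, alias_type) from NameAliases.txt-style lines."""
    aliases = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        cp_hex, alias, alias_type = line.split(";")
        aliases.setdefault(int(cp_hex, 16), []).append((alias, alias_type))
    return aliases


# A few lines in the NameAliases.txt format (excerpt, not the full file)
ALIAS_SAMPLE = [
    "# excerpt",
    "0000;NULL;control",
    "0000;NUL;abbreviation",
    "FE18;PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET;correction",
]

aliases = parse_name_aliases(ALIAS_SAMPLE)
print(unicodedata.name("\uFE18"))  # normative name, typo included
print(aliases[0xFE18])             # corrected alias and its type
```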
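Ranges in UnicodeData.txt: Large ranges such as CJK Unified Ideographs and Hangul Syllables are not listed individually. Instead, two lines mark the start and end of each range, so a naive line-by-line parser sees only two entries for the whole range: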
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
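One way to handle these markers is to merge each First/Last pair into a single range entry while reading the file. The sketch below assumes the `<..., First>` and `<..., Last>` naming shown above; the function name `iter_unicode_data` and the three sample lines are hypothetical.

```python
def iter_unicode_data(lines):
    """Yield (first, last, name, category), merging First/Last marker pairs."""
    pending = None  # (first code point, range label) while inside a range
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        fields = line.split(";")
        cp, name, category = int(fields[0], 16), fields[1], fields[2]
        if name.endswith(", First>"):
            # Remember the start; strip the leading '<' and the ', First>' suffix
            pending = (cp, name[1:-len(", First>")])
        elif name.endswith(", Last>") and pending is not None:
            first, label = pending
            pending = None
            yield first, cp, label, category
        else:
            yield cp, cp, name, category  # ordinary single-code-point line


UNICODE_DATA_SAMPLE = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;",
    "9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;",
]

for first, last, name, category in iter_unicode_data(UNICODE_DATA_SAMPLE):
    count = last - first + 1
    print(f"U+{first:04X}..U+{last:04X} {name} ({category}), {count} code point(s)")
```

Yielding ranges rather than expanding them keeps memory use small; a caller that needs every code point can loop over `range(first, last + 1)`.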
Quick Facts
| Property | Value |
|---|---|
| Primary file | UnicodeData.txt |
| Total files in UCD | ~60 files |
| URL | unicode.org/Public/UCD/latest/ucd/ |
| License | Unicode License (free, attribution required) |
| Update frequency | Each Unicode version |
| Key consumer | ICU, Python unicodedata, Java Character, .NET |
| Fields in UnicodeData.txt | 15 per line |