What is the Unicode Character Database (UCD)?

Machine-readable collection of data files defining all Unicode character properties, including UnicodeData.txt, Blocks.txt, Scripts.txt, and many more.

What is Unicode?

Universal character encoding standard assigning a unique number (code point) to every character in every writing system. Version 16.0 contains 154,998 assigned characters.

What is a Block?

A named contiguous range of code points (e.g., Basic Latin = U+0000–U+007F). Unicode 16.0 defines 338 blocks; blocks never overlap, so a code point belongs to at most one block, and code points outside every block have the Block value No_Block.

What is a Script?

The writing system a character belongs to (e.g., Latin, Cyrillic, Han). Unicode 16.0 defines 168 scripts; the Script property is key for security checks such as mixed-script detection.

What is General Category?

Classification of every code point into one of 30 categories (Lu, Ll, Nd, So, etc.) grouped into 7 major classes: Letter, Mark, Number, Punctuation, Symbol, Separator, Other.

Unicode Character Database (UCD) — Unicode Glossary

What is the Unicode Character Database?

The Unicode Character Database (UCD) is the authoritative, machine-readable repository of properties for every Unicode code point. Where the Unicode Standard describes characters in prose and tables, the UCD provides the same information as structured data files that software can parse automatically. Major Unicode libraries, including ICU, Python's unicodedata, Java's Character class, and .NET's CharUnicodeInfo, derive their character data from the UCD.

The UCD is distributed as a collection of plain-text files published on unicode.org for each Unicode version. The files follow formats documented in Unicode Standard Annex #44 (simple tables, two-column mappings from code point ranges to values, and the comprehensive UnicodeData.txt) and are freely available under the Unicode License.

Core UCD Files

File	Description
`UnicodeData.txt`	One line per assigned code point (large ranges such as CJK ideographs appear as First/Last pairs): name, general category, combining class, bidi class, decomposition, numeric values, simple case mappings
`PropList.txt`	Boolean properties like `White_Space`, `Dash`, `Diacritic`, `Extender`
`DerivedCoreProperties.txt`	Derived properties like `Alphabetic`, `Math`, `ID_Start`, `ID_Continue`
`Blocks.txt`	Block name → code point range mapping
`Scripts.txt`	Script assignment for each code point (Latin, Arabic, Han, etc.)
`emoji-data.txt`	Emoji-specific properties: `Emoji`, `Emoji_Presentation`, `Emoji_Modifier` (published in the `emoji/` subdirectory of the UCD)
`NameAliases.txt`	Formal aliases (corrections, abbreviations, alternate names)
`CaseFolding.txt`	Case-insensitive comparison mappings
`NormalizationTest.txt`	Test vectors for NFC/NFD/NFKC/NFKD implementations
`CompositionExclusions.txt`	Code points excluded from canonical composition
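
Several of the files above (for example Blocks.txt, Scripts.txt, and PropList.txt) share a simple line layout: a code point or `start..end` range, a semicolon, a value, and an optional `#` comment. The following is a minimal sketch of a reader for that layout. The function name `parse_range_file` and the inline sample lines are illustrative assumptions, not part of any library, and files with more than two fields per line would need extra handling.

```python
from typing import Iterable, Iterator, Tuple

def parse_range_file(lines: Iterable[str]) -> Iterator[Tuple[int, int, str]]:
    """Yield (first, last, value) from 'XXXX..YYYY; Value # comment' lines."""
    for raw in lines:
        line = raw.split("#", 1)[0].strip()   # drop any trailing comment
        if not line:
            continue                           # blank or comment-only line
        cp_range, value = (part.strip() for part in line.split(";", 1))
        first, _, last = cp_range.partition("..")
        # A single code point has no '..', so it is its own range end
        yield int(first, 16), int(last or first, 16), value

# Hypothetical sample in the style of Blocks.txt and Scripts.txt
sample = [
    "# Blocks.txt style",
    "0000..007F; Basic Latin",
    "0080..00FF; Latin-1 Supplement",
    "",
    "# Scripts.txt style",
    "00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR",
]
entries = list(parse_range_file(sample))
for first, last, value in entries:
    print(f"U+{first:04X}..U+{last:04X}  {value}")
```

Storing ranges rather than expanding them keeps the result small; a sorted list of such tuples can then be searched with `bisect` if per-character lookups are needed.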

UnicodeData.txt Format

The most important UCD file is UnicodeData.txt. Each line has 15 semicolon-delimited fields:

0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
│    │                       │  │ │
│    │                       │  │ └── Bidi class (L = Left-to-right)
│    │                       │  └──── Canonical combining class (0 = not combining)
│    │                       └─────── General category (Lu = Uppercase Letter)
│    └─────────────────────────────── Character name
└──────────────────────────────────── Code point (hex)

The 15 fields, in order:

1. Code point (hex)
2. Character name
3. General category
4. Canonical combining class
5. Bidi class
6. Character decomposition
7–9. Numeric values (decimal, digit, numeric)
10. Mirror flag
11. Unicode 1.0 name (legacy)
12. ISO comment (deprecated)
13–15. Uppercase, lowercase, titlecase mappings
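
The field order above maps naturally onto a record type. The sketch below splits one UnicodeData.txt line into named fields; the class name `UnicodeDataRecord`, the function `parse_unicode_data_line`, and the attribute names are assumptions chosen for illustration, and the numeric and mapping fields are left as raw strings for simplicity.

```python
from typing import NamedTuple

class UnicodeDataRecord(NamedTuple):
    code_point: int
    name: str
    general_category: str
    combining_class: int
    bidi_class: str
    decomposition: str
    decimal_value: str
    digit_value: str
    numeric_value: str
    bidi_mirrored: bool
    unicode_1_name: str
    iso_comment: str
    simple_uppercase: str
    simple_lowercase: str
    simple_titlecase: str

def parse_unicode_data_line(line: str) -> UnicodeDataRecord:
    """Split one semicolon-delimited line into its 15 fields."""
    f = line.rstrip("\n").split(";")
    if len(f) != 15:
        raise ValueError(f"expected 15 fields, got {len(f)}")
    return UnicodeDataRecord(
        int(f[0], 16), f[1], f[2], int(f[3]), f[4], f[5],
        f[6], f[7], f[8],
        f[9] == "Y",            # mirror flag is 'Y' or 'N'
        f[10], f[11], f[12], f[13], f[14],
    )

record = parse_unicode_data_line("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
print(record.name, record.general_category, record.simple_lowercase)
```

Empty fields come back as empty strings, so a caller can distinguish "no lowercase mapping" (`""`) from a mapping such as `"0061"`.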

Using the UCD in Practice

import unicodedata

# Python's unicodedata module is a UCD interface
char = "A"
print(unicodedata.name(char))           # LATIN CAPITAL LETTER A
print(unicodedata.category(char))       # Lu (Uppercase Letter)
print(unicodedata.bidirectional(char))  # L
print(unicodedata.combining(char))      # 0 (non-combining)
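# Print code points, since the decomposed string looks identical on screen
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", "\u00e9")])  # ['U+0065', 'U+0301']: e + combining acute accent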

# Reading UnicodeData.txt directly
import urllib.request

url = "https://unicode.org/Public/UCD/latest/ucd/UnicodeData.txt"
with urllib.request.urlopen(url) as f:
    for line in f:
        fields = line.decode().strip().split(";")
        cp, name, category = fields[0], fields[1], fields[2]
        if category == "So":  # Other Symbol
            print(f"U+{cp}: {name}")

Common Pitfalls

Name vs Alias: UnicodeData.txt lists the normative name, which can never change once published, so naming errors are corrected through formal aliases instead. For example, U+FE18 is officially named PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET (the misspelling is part of the official name), and NameAliases.txt provides the corrected spelling, ending in BRACKET, as an alias.
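
The difference between the normative name and its alias can be observed from Python. This is a small sketch that assumes Python 3.3 or later, where `unicodedata.lookup` accepts formal aliases; the variable names are illustrative.

```python
import unicodedata

ch = "\uFE18"

# name() returns the normative name from UnicodeData.txt, misspelling included
official = unicodedata.name(ch)
print(official)

# lookup() resolves both the normative name and the corrected alias
corrected = "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET"
print(unicodedata.lookup(official) == ch)
print(unicodedata.lookup(corrected) == ch)
```

Code that matches characters by name should therefore not assume that `name()` and a human-corrected spelling agree.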

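Ranges in UnicodeData.txt: CJK Unified Ideographs are not listed individually, and the same applies to other large ranges such as Hangul Syllables, surrogates, and private-use areas. Instead, two lines mark the start and end of each range, and a parser that expects one line per code point will misread them as two ordinary characters: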

4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
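
One way to handle these marker lines is to pair each `First` line with the `Last` line that follows it. The sketch below does this for an in-memory list of lines; the function name `iter_ranges` and the tuple layout are assumptions for illustration, and it presumes that a `Last` line always directly follows its `First` line.

```python
from typing import Iterable, Iterator, Tuple

def iter_ranges(lines: Iterable[str]) -> Iterator[Tuple[int, int, str, str]]:
    """Yield (first, last, label, category), merging First/Last marker pairs."""
    pending = None  # (first code point, range label) while awaiting the Last line
    for line in lines:
        fields = line.strip().split(";")
        cp, name, category = int(fields[0], 16), fields[1], fields[2]
        if name.endswith(", First>"):
            # Strip the leading '<' and the trailing ', First>'
            pending = (cp, name[1:-len(", First>")])
        elif name.endswith(", Last>") and pending is not None:
            first, label = pending
            pending = None
            yield first, cp, label, category
        else:
            yield cp, cp, name, category   # ordinary single-character line

sample = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;",
    "9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;",
]
ranges = list(iter_ranges(sample))
for first, last, label, category in ranges:
    count = last - first + 1
    print(f"U+{first:04X}..U+{last:04X} {label} ({category}), {count} code point(s)")
```

Every code point inside a merged range shares the properties on the marker lines; the bracketed label is a range marker, not a character name.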

Quick Facts

Property	Value
Primary file	`UnicodeData.txt`
Total files in UCD	~60 files
URL	unicode.org/Public/UCD/latest/ucd/
License	Unicode License (free, attribution required)
Update frequency	Each Unicode version
Key consumers	ICU, Python unicodedata, Java Character, .NET CharUnicodeInfo
Fields in UnicodeData.txt	15 per line
