What is Unicode?

A universal character encoding standard that assigns a unique number (code point) to each character across the world's writing systems. Version 16.0 contains 154,998 assigned characters.

What is a Block?

A named, contiguous range of code points (e.g. Basic Latin = U+0000–U+007F). Unicode 16.0 defines 338 blocks; blocks never overlap, so a code point belongs to at most one block.

What is a Script?

The writing system to which a character belongs (e.g. Latin, Cyrillic, Han). Unicode 16.0 defines 168 scripts; the Script property is essential for security checks and for detecting mixed-script text.

What is General Category?

The classification of each code point into one of 30 categories (Lu, Ll, Nd, So, etc.), grouped into 7 major classes: Letter, Mark, Number, Punctuation, Symbol, Separator, Other.


Unicode Character Database (UCD)

A collection of machine-readable data files that define all properties of Unicode characters, including UnicodeData.txt, Blocks.txt, Scripts.txt, and many others.

2021-07-12 · Updated 2024-03-25

What is the Unicode Character Database?

The Unicode Character Database (UCD) is the authoritative, machine-readable repository of all properties for every Unicode code point. Where the Unicode Standard describes characters in prose and tables, the UCD provides that same information in structured data files that software libraries can parse and implement automatically. Every Unicode library — ICU, Python's unicodedata, Java's Character class, .NET's CharUnicodeInfo — is built from the UCD.

The UCD is distributed as a collection of plain-text files published on unicode.org for each Unicode version. The files follow documented formats (simple tables, two-column mappings, or the comprehensive UnicodeData.txt) and are freely available for any use.

Core UCD Files

File	Description
`UnicodeData.txt`	One line per assigned code point (large ranges are given as First/Last pairs): name, category, combining class, bidi class, decomposition, numeric values, case mappings
`PropList.txt`	Boolean properties like `White_Space`, `Dash`, `Diacritic`, `Extender`
`DerivedCoreProperties.txt`	Derived properties like `Alphabetic`, `Math`, `ID_Start`, `ID_Continue`
`Blocks.txt`	Block name → code point range mapping
`Scripts.txt`	Script assignment for each code point (Latin, Arabic, Han, etc.)
`emoji/emoji-data.txt`	Emoji-specific properties: `Emoji`, `Emoji_Presentation`, `Emoji_Modifier`
`NameAliases.txt`	Formal aliases (corrections, abbreviations, alternate names)
`CaseFolding.txt`	Case-insensitive comparison mappings
`NormalizationTest.txt`	Test vectors for NFC/NFD/NFKC/NFKD implementations
`CompositionExclusions.txt`	Code points excluded from canonical composition
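Several of the files above (Blocks.txt, Scripts.txt, PropList.txt) share a simple layout: a code point or a `first..last` range, a semicolon, a value, and an optional `#` comment. The following is a minimal sketch of a reader for that layout, not a complete UCD parser. The function name `parse_ranges` and the sample lines are illustrative assumptions; files with extra fields would need more handling.

```python
import re
from typing import Iterable, Iterator, Tuple

# "0000..007F; Basic Latin" or a single code point such as "00AA ; Latin"
_LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(.+?)\s*$")

def parse_ranges(lines: Iterable[str]) -> Iterator[Tuple[int, int, str]]:
    """Yield (first, last, value) for each data line (hypothetical helper)."""
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue  # blank line or comment-only line
        m = _LINE.match(line)
        if m is None:
            continue  # a layout this sketch does not cover
        first = int(m.group(1), 16)
        last = int(m.group(2), 16) if m.group(2) else first
        yield first, last, m.group(3)

# Illustrative excerpt in the style of Blocks.txt and Scripts.txt
sample = """\
# Blocks.txt style
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement

# Scripts.txt style, with trailing comments
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR
""".splitlines()

ranges = list(parse_ranges(sample))
for first, last, value in ranges:
    print(f"U+{first:04X}..U+{last:04X}  {value}")
```

Storing each entry as a (first, last, value) triple keeps single code points and ranges in one shape, which makes later lookups by binary search straightforward.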

UnicodeData.txt Format

The most important UCD file is UnicodeData.txt. Each line has 15 semicolon-delimited fields:

0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
│    │                       │  │ │
│    │                       │  │ └── Bidi class (L = Left-to-right)
│    │                       │  └──── Canonical combining class (0 = not combining)
│    │                       └─────── General category (Lu = Uppercase Letter)
│    └─────────────────────────────── Character name
└──────────────────────────────────── Code point (hex)

The 15 fields in order (UAX #44 numbers them 0 to 14, which matches zero-based list indexes after splitting on semicolons):

1. Code point (hex)
2. Character name
3. General category
4. Canonical combining class
5. Bidi class
6. Decomposition type and mapping
7–9. Numeric values (decimal digit, digit, numeric)
10. Bidi mirrored flag (Y or N)
11. Unicode 1.0 name (legacy)
12. ISO comment (deprecated)
13–15. Simple uppercase, lowercase, and titlecase mappings
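To make the field layout concrete, here is a small sketch that splits one UnicodeData.txt line into named fields. The `UnicodeDataRecord` type, its attribute names, and `parse_record` are assumptions chosen for illustration; they are not part of any standard library.

```python
from typing import NamedTuple

class UnicodeDataRecord(NamedTuple):
    """One UnicodeData.txt line; attribute names are illustrative."""
    code_point: int
    name: str
    general_category: str
    combining_class: int
    bidi_class: str
    decomposition: str
    decimal_value: str
    digit_value: str
    numeric_value: str
    bidi_mirrored: bool
    unicode_1_name: str
    iso_comment: str
    uppercase: str
    lowercase: str
    titlecase: str

def parse_record(line: str) -> UnicodeDataRecord:
    fields = line.rstrip("\n").split(";")
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    return UnicodeDataRecord(
        int(fields[0], 16),   # code point, parsed from hex
        fields[1],
        fields[2],
        int(fields[3]),       # combining class is a decimal number
        fields[4],
        fields[5],
        fields[6], fields[7], fields[8],
        fields[9] == "Y",     # mirrored flag is Y or N
        fields[10], fields[11],
        fields[12], fields[13], fields[14],
    )

rec = parse_record("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
print(rec.name, rec.general_category, rec.lowercase)
```

Empty fields are kept as empty strings here; a fuller implementation might convert the case mappings to integers or `None`.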

Using the UCD in Practice

import unicodedata
import urllib.request

# Python's unicodedata module is a UCD interface for the Unicode version
# bundled with the interpreter (see unicodedata.unidata_version)
char = "A"
print(unicodedata.name(char))             # LATIN CAPITAL LETTER A
print(unicodedata.category(char))         # Lu (Uppercase Letter)
print(unicodedata.bidirectional(char))    # L
print(unicodedata.combining(char))        # 0 (non-combining)
print(unicodedata.normalize("NFD", "é"))  # e + U+0301 COMBINING ACUTE ACCENT

# Reading UnicodeData.txt directly (requires network access)
url = "https://unicode.org/Public/UCD/latest/ucd/UnicodeData.txt"
with urllib.request.urlopen(url) as f:
    for line in f:
        # Each line arrives as bytes; decode before splitting into fields
        fields = line.decode("utf-8").strip().split(";")
        cp, name, category = fields[0], fields[1], fields[2]
        if category == "So":  # Other Symbol
            print(f"U+{cp}: {name}")

Common Pitfalls

Name vs Alias: UnicodeData.txt lists the normative name, which is never changed once published, so naming errors cannot be corrected in place. Formal aliases, including corrections, are recorded in NameAliases.txt instead. For example, U+FE18 is officially named PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET (note the typo), and NameAliases.txt supplies the corrected spelling, ending in BRACKET, as a correction alias.
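The name and alias distinction is visible from Python. As a brief sketch: `unicodedata.name` returns the normative name, while `unicodedata.lookup` also accepts formal aliases (supported since Python 3.3). The variable names are illustrative.

```python
import unicodedata

ch = "\uFE18"

# The normative name keeps the historical typo
official = unicodedata.name(ch)
print(official)

# The correction alias from NameAliases.txt resolves to the same character
corrected = "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET"
resolved = unicodedata.lookup(corrected)
print(f"U+{ord(resolved):04X}")
```

Note that the reverse direction is not available in `unicodedata`: there is no function that returns the aliases of a character, so code that needs them has to read NameAliases.txt itself.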

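Ranges in UnicodeData.txt: Large ranges such as CJK Unified Ideographs, Hangul syllables, and the private-use areas are not listed individually. Instead, a pair of lines whose names end in "First>" and "Last>" marks the start and end of each range: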

4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
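A parser that treats every line as a single code point will silently miss the characters inside these ranges. One possible way to handle the First/Last pairs is sketched below. The helper names `iter_ranges` and `category_of` are hypothetical, and the sample holds only three lines for illustration.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Entry = Tuple[int, int, str, str]  # (first, last, label, general category)

def iter_ranges(lines: Iterable[str]) -> Iterator[Entry]:
    """Collapse <..., First>/<..., Last> pairs into one entry each."""
    pending = None  # (first code point, range label) while inside a pair
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 3:
            continue
        cp, name, category = int(fields[0], 16), fields[1], fields[2]
        if name.startswith("<") and name.endswith(", First>"):
            pending = (cp, name[1:-len(", First>")])
        elif name.startswith("<") and name.endswith(", Last>") and pending:
            first, label = pending
            yield first, cp, label, category
            pending = None
        else:
            yield cp, cp, name, category  # ordinary single code point

def category_of(cp: int, entries: List[Entry]) -> Optional[str]:
    """Linear scan; fine for a sketch, too slow for the full file."""
    for first, last, _label, category in entries:
        if first <= cp <= last:
            return category
    return None

sample = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;",
    "9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;",
]

entries = list(iter_ranges(sample))
print(entries)
print(category_of(0x4E2D, entries))  # a code point inside the CJK range
```

Names such as `<control>` also start with an angle bracket but do not end in First or Last, so the sketch passes them through as ordinary single-code-point entries.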

Quick Facts

Property	Value
Primary file	`UnicodeData.txt`
Total files in UCD	~60 files
URL	unicode.org/Public/UCD/latest/ucd/
License	Unicode License (free, attribution required)
Update frequency	Each Unicode version
Key consumers	ICU, Python unicodedata, Java Character, .NET CharUnicodeInfo
Fields in UnicodeData.txt	15 per line

Related terms

Unicode, Block, Script, General Category

