What is Normalization?

The process of converting Unicode text to a standard, normalized form. There are four forms: NFC (composed), NFD (decomposed), NFKC (compatibility composed), and NFKD (compatibility decomposed).

What is NFC (Canonical Composition)?

Normalization Form C: decompose, then recompose canonically, generally producing the shortest equivalent form. Recommended for data storage and interchange; the standard form on the web.

What is Canonical Equivalence?

Two character sequences that are semantically identical and must be treated as equal. Example: é (U+00E9) ≡ e + ◌́ (U+0065 + U+0301).
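
As a rough illustration of canonical equivalence, the sketch below compares two spellings of é by normalizing both sides first. The helper name canonically_equal is hypothetical, chosen only for this example.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    # Two strings are canonically equivalent when their NFD forms are identical.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

print(precomposed == decomposed)                   # False
print(canonically_equal(precomposed, decomposed))  # True
```

Normalizing both sides to NFC would work equally well here; what matters is that both strings go through the same form before the comparison.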

NFD (Canonical Decomposition)

Normalization Form D: full canonical decomposition with no recomposition. Used, in a variant form, by the macOS HFS+ file system. é (U+00E9) → e + ◌́ (U+0065 + U+0301).

2022-06-27 · Updated 2024-05-17

NFD: Decomposing Characters to Their Components

NFD (Normalization Form D, Canonical Decomposition) is the form in which every precomposed character is broken down into its constituent parts: a base character followed by one or more combining marks in canonical order. Unlike NFC, which recomposes after decomposing, NFD leaves everything decomposed.

For é (U+00E9, LATIN SMALL LETTER E WITH ACUTE), NFD yields the two-code-point sequence e (U+0065) + ◌́ (U+0301, COMBINING ACUTE ACCENT). For a string like "über", NFD yields u + combining diaeresis + b + e + r: five code points for a four-character word.
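
The decomposition described above can be inspected directly. This is a minimal sketch using only the standard unicodedata module; the helper name describe is made up for this example, and the inputs are written as escapes so the result does not depend on how an editor saved the source file.

```python
import unicodedata

def describe(text: str) -> list[str]:
    # Return "U+XXXX NAME" for each code point in the NFD form of text.
    nfd = unicodedata.normalize("NFD", text)
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in nfd]

for line in describe("\u00e9"):  # é, precomposed
    print(line)
# U+0065 LATIN SMALL LETTER E
# U+0301 COMBINING ACUTE ACCENT

word = "\u00fcber"  # über, with precomposed ü
print(len(word), len(unicodedata.normalize("NFD", word)))  # 4 5
```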

When NFD is Used

A variant of NFD is the on-disk form used by Apple's HFS+ file system: when macOS writes a filename to an HFS+ volume, it stores it decomposed. (APFS, by contrast, preserves the normalization of the name it is given rather than converting it.) This is why filenames created on a Mac can cause issues when transferred to Linux systems: the file café is stored as 5 code points (NFD) on HFS+, but most Linux applications create and expect the 4-code-point NFC form. Python's os module returns filenames as the file system reports them, so on such volumes you get the NFD name unless you normalize it yourself.
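
One way to sidestep the mismatch is to normalize both sides before comparing filenames. The following is a sketch under that assumption; find_name and the sample listing are hypothetical, and the listing stands in for what os.listdir() might return on a volume that stores decomposed names.

```python
import unicodedata
from typing import Iterable, Optional

def find_name(names: Iterable[str], wanted: str) -> Optional[str]:
    # Compare in NFC so that NFD names from disk match NFC input (and vice versa).
    target = unicodedata.normalize("NFC", wanted)
    for name in names:
        if unicodedata.normalize("NFC", name) == target:
            return name  # returned exactly as the file system reported it
    return None

# Simulated directory listing containing a decomposed name (hypothetical data).
listing = ["cafe\u0301.txt", "notes.txt"]
print(find_name(listing, "caf\u00e9.txt") is not None)  # True
```

In real code the first argument would typically be os.listdir(path). Returning the name as reported, rather than the normalized one, keeps it usable for opening the file afterwards.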

NFD is also useful when you need to strip diacritics:

import unicodedata

def strip_accents(text: str) -> str:
    # Decompose to base + combining marks, then remove all combining marks
    nfd = unicodedata.normalize("NFD", text)
    return "".join(
        c for c in nfd
        if unicodedata.category(c) != "Mn"  # Mn = Mark, Nonspacing
    )

print(strip_accents("naïve"))   # naive
print(strip_accents("résumé"))  # resume
print(strip_accents("über"))    # uber

Combining Mark Order

A subtle but important aspect of NFD: when multiple combining marks follow a base character, Unicode orders them by their Canonical Combining Class (CCC). Within a run of combining marks, those with lower CCC values come first; base characters have CCC=0 and are never reordered, and marks with the same CCC keep their original relative order. The acute accent has CCC=230 and the cedilla has CCC=202, so a cedilla always precedes an acute accent in NFD. Putting marks in this canonical order is what makes canonical equivalence testing a simple string comparison.

import unicodedata

# Check combining class of a character
print(unicodedata.combining("\u0301"))  # 230 (COMBINING ACUTE ACCENT)
print(unicodedata.combining("\u0327"))  # 202 (COMBINING CEDILLA)
print(unicodedata.combining("a"))       # 0   (not a combining mark)
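
To see the reordering in action, the sketch below feeds NFD a sequence whose marks are in non-canonical order. The helper code_points is a hypothetical name used only to make the output readable.

```python
import unicodedata

def code_points(text: str) -> list[str]:
    # Render each code point as U+XXXX for easy inspection.
    return [f"U+{ord(ch):04X}" for ch in text]

# c + acute (CCC 230) + cedilla (CCC 202): marks are not in canonical order.
unordered = "c\u0301\u0327"
reordered = unicodedata.normalize("NFD", unordered)
print(code_points(reordered))  # ['U+0063', 'U+0327', 'U+0301']

# Acute and grave both have CCC 230, so their relative order is preserved.
same_class = "a\u0301\u0300"
print(unicodedata.normalize("NFD", same_class) == same_class)  # True
```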

Quick Facts

Property	Value
Full name	Normalization Form D (Canonical Decomposition)
Algorithm	Recursive canonical decomposition, then CCC sorting
macOS HFS+	Filenames stored in a variant of NFD
Typical use	Diacritic stripping, internal processing, canonical comparison
Python	`unicodedata.normalize("NFD", s)`
Handles compatibility chars?	No
Relation to NFC	NFD then compose = NFC
String length	Equal or longer than NFC (decomposed forms use more code points)
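
A few of the rows above can be checked with a short script. This is only a spot check on a handful of sample strings chosen for illustration, not a proof of the general properties.

```python
import unicodedata
from unicodedata import normalize

# Sample inputs: precomposed é, decomposed é, über, and the "fi" ligature (U+FB01).
samples = ["\u00e9", "e\u0301", "\u00fcber", "\ufb01"]

for s in samples:
    nfd = normalize("NFD", s)
    nfc = normalize("NFC", s)
    # Composing the NFD form gives the same result as NFC of the original.
    assert normalize("NFC", nfd) == nfc
    # NFD is never shorter than NFC, measured in code points.
    assert len(nfd) >= len(nfc)

# NFD leaves compatibility characters alone; NFKD is the form that expands them.
ligature = "\ufb01"
print(normalize("NFD", ligature) == ligature)  # True
print(normalize("NFKD", ligature))             # fi
```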
