NFD (Canonical Decomposition)
Normalization Form D: fully decompose without recomposing. Used by the macOS HFS+ filesystem. é (U+00E9) → e + ◌́ (U+0065 + U+0301).
NFD: Decomposing Characters to Their Components
NFD (Normalization Form D — Canonical Decomposition) is the form where every composed character is broken down into its constituent parts: a base character followed by one or more combining marks in canonical order. Unlike NFC which recomposes after decomposing, NFD leaves everything decomposed.
For é (U+00E9, LATIN SMALL LETTER E WITH ACUTE), NFD yields the two-code-point sequence e (U+0065) + ́ (U+0301, COMBINING ACUTE ACCENT). For a string like "über", NFD yields u + combining diaeresis + b + e + r — five code points for a four-character word.
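These counts can be checked directly with Python's unicodedata module (a quick sketch that just mirrors the text):

```python
import unicodedata

# é (U+00E9) decomposes into base letter + combining acute accent
nfd = unicodedata.normalize("NFD", "\u00e9")
print([f"U+{ord(c):04X}" for c in nfd])  # ['U+0065', 'U+0301']

# "über": 4 code points in NFC, 5 after decomposition
word = "\u00fcber"
print(len(word), len(unicodedata.normalize("NFD", word)))  # 4 5
```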
When NFD is Used
NFD is the internal form used by Apple's HFS+ file system: when macOS writes a filename to an HFS+ volume, it stores it in (a variant of) NFD. APFS, its successor, instead preserves filenames as given but treats canonically equivalent names as the same file. This is why filenames created on a Mac can cause issues when transferred to Linux systems: the file café is stored as 5 code points (NFD) on HFS+, while most Linux applications expect 4 code points (NFC). Python's os module will give you the NFD filename on macOS unless you normalize it.
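A minimal sketch of the resulting mismatch and the usual fix, normalizing names to NFC on the way in (the literal strings here are illustrative stand-ins for what os.listdir() might return on a Mac):

```python
import unicodedata

# "café" as an HFS+ volume would store it: NFD, 5 code points
stored = "cafe\u0301"

# Compared code point by code point, it does not match the NFC spelling...
assert stored != "caf\u00e9"

# ...so normalize filenames to NFC before comparing them with other strings
assert unicodedata.normalize("NFC", stored) == "caf\u00e9"
```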
NFD is also useful when you need to strip diacritics:
import unicodedata

def strip_accents(text: str) -> str:
    # Decompose to base + combining marks, then remove all combining marks
    nfd = unicodedata.normalize("NFD", text)
    return "".join(
        c for c in nfd
        if unicodedata.category(c) != "Mn"  # Mn = Mark, Nonspacing
    )
print(strip_accents("naïve")) # naive
print(strip_accents("résumé")) # resume
print(strip_accents("über")) # uber
Combining Mark Order
A subtle but important aspect of NFD: when multiple combining marks follow a base character, Unicode specifies their order via the Canonical Combining Class (CCC). Marks with lower CCC values come first; base characters and other starters have CCC=0, and marks with equal CCC keep their relative order. The acute accent has CCC=230, the cedilla CCC=202. NFD ensures combining marks are always in this canonical order, which is necessary for correct canonical equivalence testing.
import unicodedata
# Check combining class of a character
print(unicodedata.combining("\u0301")) # 230 (COMBINING ACUTE ACCENT)
print(unicodedata.combining("\u0327")) # 202 (COMBINING CEDILLA)
print(unicodedata.combining("a")) # 0 (not a combining mark)
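The reordering itself can also be observed directly: give NFD a base character with acute before cedilla (CCC 230 before 202), and it puts the lower-class mark first (a sketch using the same code points as above):

```python
import unicodedata

# e + COMBINING ACUTE ACCENT (CCC 230) + COMBINING CEDILLA (CCC 202),
# i.e. the marks are not yet in canonical order
s = "e\u0301\u0327"
nfd = unicodedata.normalize("NFD", s)

# NFD sorts the marks by ascending CCC: cedilla (202) now precedes acute (230)
print([f"U+{ord(c):04X}" for c in nfd])  # ['U+0065', 'U+0327', 'U+0301']
```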
Quick Facts
| Property | Value |
|---|---|
| Full name | Normalization Form Canonical Decomposition |
| Algorithm | Recursive canonical decomposition, then CCC sorting |
| macOS HFS+ | Filenames stored in NFD |
| Typical use | Diacritic stripping, internal processing, canonical comparison |
| Python | unicodedata.normalize("NFD", s) |
| Handles compatibility chars? | No |
| Relation to NFC | NFD then compose = NFC |
| String length | Equal or longer than NFC (decomposed forms use more code points) |
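Two rows of the table, the relation to NFC and the length relation, can be verified in a few lines:

```python
import unicodedata

s = "\u00e9"  # é in NFC
nfd = unicodedata.normalize("NFD", s)

# NFD followed by canonical composition gives the NFC form back
assert unicodedata.normalize("NFC", nfd) == s

# The decomposed form is never shorter than the composed one
assert len(nfd) >= len(s)
```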