What is Normalization?

The process of converting Unicode text into a standardized canonical form. There are four forms: NFC (composed), NFD (decomposed), NFKC (compatibility composed), and NFKD (compatibility decomposed).

What is NFC (Canonical Composition)?

Normalization Form C: decompose, then canonically recompose, which usually produces the shortest equivalent form. Recommended for data storage and interchange; the standard form on the web.

What is Canonical Equivalence?

Two character sequences that are semantically identical and must be treated as equivalent. Example: é (U+00E9) ≡ e + ◌́ (U+0065 + U+0301).

Algorithms

NFD (Canonical Decomposition)

Normalization Form D: full canonical decomposition without recomposition. Used, in a slightly modified variant, by the macOS HFS+ file system. é (U+00E9) → e + ◌́ (U+0065 + U+0301).

2022-06-27 · Updated 2024-05-17

NFD: Decomposing Characters to Their Components

NFD (Normalization Form D, Canonical Decomposition) is the form in which every precomposed character is broken down into its constituent parts: a base character followed by one or more combining marks in canonical order. Unlike NFC, which recomposes after decomposing, NFD leaves everything decomposed.

For é (U+00E9, LATIN SMALL LETTER E WITH ACUTE), NFD yields the two-code-point sequence e (U+0065) + ◌́ (U+0301, COMBINING ACUTE ACCENT). For a string like "über", NFD yields u + combining diaeresis (U+0308) + b + e + r: five code points for a four-character word.
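
The decomposition described above can be inspected directly. The following is a minimal sketch; `code_points` is a hypothetical helper name introduced only for this illustration, and the input is written with an escape so that the source literal is guaranteed to be precomposed.

```python
import unicodedata

def code_points(text: str) -> list[str]:
    # Hypothetical helper: render each code point as U+XXXX for inspection.
    return [f"U+{ord(c):04X}" for c in text]

precomposed = "\u00fcber"  # "über" with a precomposed ü (U+00FC)
decomposed = unicodedata.normalize("NFD", precomposed)

print(code_points(precomposed))  # ['U+00FC', 'U+0062', 'U+0065', 'U+0072']
print(code_points(decomposed))   # ['U+0075', 'U+0308', 'U+0062', 'U+0065', 'U+0072']
print(len(precomposed), len(decomposed))  # 4 5
```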

When NFD is Used

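A variant of NFD is the on-disk form used by Apple's HFS+ file system: when macOS writes a filename to an HFS+ volume, it stores the name decomposed. APFS, by contrast, preserves whatever form the name was created with and, on current macOS versions, compares names in a normalization-insensitive way. This is why filenames created on a Mac can cause issues when transferred to Linux systems: the name café is stored as 5 code points (NFD) on HFS+, but most Linux applications expect 4 code points (NFC), and Linux file systems typically compare names byte for byte. Python's os module can hand you the decomposed filename on macOS, so normalize names before comparing them.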
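
One way to guard against this mismatch is to normalize names before comparing them. The sketch below assumes NFC as the common form; `listdir_nfc` and `same_filename` are hypothetical helper names, not part of the standard library.

```python
import os
import unicodedata

def listdir_nfc(path: str = ".") -> list[str]:
    # Hypothetical helper: list a directory with every name normalized to NFC.
    return [unicodedata.normalize("NFC", name) for name in os.listdir(path)]

def same_filename(a: str, b: str) -> bool:
    # Hypothetical helper: compare names by canonical equivalence,
    # not code point by code point.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

nfd_name = "cafe\u0301.txt"  # decomposed, as an HFS+ volume would store it
nfc_name = "caf\u00e9.txt"   # precomposed, as typically typed on Linux

print(nfd_name == nfc_name)               # False
print(same_filename(nfd_name, nfc_name))  # True
```

Note that the normalized names are only suitable for comparison and display; to open a file, use the name exactly as the file system returned it.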

NFD is also useful when you need to strip diacritics:

import unicodedata

def strip_accents(text: str) -> str:
    # Decompose to base + combining marks, then drop the nonspacing marks (category Mn)
    nfd = unicodedata.normalize("NFD", text)
    return "".join(
        c for c in nfd
        if unicodedata.category(c) != "Mn"  # Mn = Mark, Nonspacing
    )

print(strip_accents("naïve"))   # naive
print(strip_accents("résumé"))  # resume
print(strip_accents("über"))    # uber

Combining Mark Order

A subtle but important aspect of NFD: when multiple combining marks follow a base character, Unicode orders them by their Canonical Combining Class (CCC). Within each run of combining marks, marks with lower CCC values come first, and marks with the same CCC keep their original relative order. Characters with CCC=0 (starters, such as base letters) are never reordered. The acute accent has CCC=230, the cedilla has CCC=202. NFD ensures combining marks are always in this canonical order, which is necessary for correct canonical equivalence testing.

import unicodedata

# Check combining class of a character
print(unicodedata.combining("\u0301"))  # 230 (COMBINING ACUTE ACCENT)
print(unicodedata.combining("\u0327"))  # 202 (COMBINING CEDILLA)
print(unicodedata.combining("a"))       # 0   (not a combining mark)
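
To see why the ordering matters for equivalence testing, consider the same base letter with an acute accent and a cedilla typed in either order. This is a minimal sketch; `canonically_equal` is a hypothetical helper name used only here.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    # Hypothetical helper: two strings are canonically equivalent
    # if their NFD forms are identical.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

acute_first = "e\u0301\u0327"    # e + acute (CCC 230) + cedilla (CCC 202)
cedilla_first = "e\u0327\u0301"  # e + cedilla (CCC 202) + acute (CCC 230)

print(acute_first == cedilla_first)                   # False
print(canonically_equal(acute_first, cedilla_first))  # True

# After NFD, the marks appear in ascending CCC order.
reordered = unicodedata.normalize("NFD", acute_first)
print([unicodedata.combining(c) for c in reordered])  # [0, 202, 230]
```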

Quick Facts

Property	Value
Full name	Normalization Form D (Canonical Decomposition)
Algorithm	Recursive canonical decomposition, then CCC sorting
macOS HFS+	Filenames stored in a variant of NFD
Typical use	Diacritic stripping, internal processing, canonical comparison
Python	`unicodedata.normalize("NFD", s)`
Handles compatibility chars?	No
Relation to NFC	NFD then compose = NFC
String length	Equal or longer than NFC (decomposed forms use more code points)
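
Several rows of the table can be checked with a short script. This is a sketch on a single sample string, not a proof; the sample combines a precomposed é with the ﬁ ligature (U+FB01), which is a compatibility character.

```python
import unicodedata

sample = "caf\u00e9 \ufb01"  # "café" plus the "fi" ligature

nfd = unicodedata.normalize("NFD", sample)
nfc = unicodedata.normalize("NFC", sample)
nfkd = unicodedata.normalize("NFKD", sample)

# Relation to NFC: composing the NFD form matches NFC applied directly.
print(unicodedata.normalize("NFC", nfd) == nfc)  # True

# String length: NFD has at least as many code points as NFC.
print(len(nfd), len(nfc))  # 7 6

# Compatibility characters: NFD keeps the ligature, NFKD expands it to "f" + "i".
print("\ufb01" in nfd, "\ufb01" in nfkd)  # True False
```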
