
NFD (Canonical Decomposition)

Normalization Form D: full decomposition without recomposition. Used, in a variant form, by the macOS HFS+ file system. é (U+00E9) → e + ◌́ (U+0065 + U+0301).


NFD: Decomposing Characters into Their Components

NFD (Normalization Form D, Canonical Decomposition) is the form in which every character that has a canonical decomposition is broken down into its constituent parts: a base character followed by one or more combining marks in canonical order. Unlike NFC, which recomposes after decomposing, NFD leaves everything decomposed.

For é (U+00E9, LATIN SMALL LETTER E WITH ACUTE), NFD yields the two-code-point sequence e (U+0065) + ◌́ (U+0301, COMBINING ACUTE ACCENT). For a string like "über", NFD yields u + combining diaeresis (U+0308) + b + e + r: five code points for a four-character word.
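The decomposition described above can be inspected directly. The following is a minimal sketch using only the standard unicodedata module; the helper name describe is hypothetical and exists only for this illustration.

```python
import unicodedata

def describe(text: str) -> list:
    """Return one 'U+XXXX NAME' string per code point (illustrative helper)."""
    return [f"U+{ord(c):04X} {unicodedata.name(c, '<unnamed>')}" for c in text]

# Precomposed é (U+00E9) decomposes into a base letter plus a combining mark.
decomposed = unicodedata.normalize("NFD", "\u00e9")
for line in describe(decomposed):
    print(line)
# U+0065 LATIN SMALL LETTER E
# U+0301 COMBINING ACUTE ACCENT

# "über" written with precomposed ü (U+00FC) grows from 4 to 5 code points.
word = unicodedata.normalize("NFD", "\u00fcber")
print(len("\u00fcber"), len(word))  # 4 5
```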

When NFD Is Used

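Apple's HFS+ file system stores filenames in a decomposed form that is very close to NFD (Apple's variant is based on an older Unicode version and leaves a few character ranges undecomposed). APFS behaves differently: it preserves the name as it was given and treats names that differ only in normalization as the same file. The HFS+ behavior is why filenames created on a Mac can cause issues when transferred to Linux systems: the name café is stored as 5 code points (decomposed) on HFS+, but most Linux applications expect 4 code points (NFC), and Linux file systems compare names byte by byte. In Python, functions such as os.listdir() return names as the file system reports them, so on an HFS+ volume you get the decomposed form unless you normalize it yourself.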
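One way to guard against this mismatch is to normalize both sides to a single form before comparing names. The sketch below is an assumption-laden illustration rather than a complete solution: the helper names nfc_name and find_by_name are hypothetical, and the directory listing is simulated with a plain list instead of a real os.listdir() call so that the example runs anywhere.

```python
import unicodedata
from typing import Iterable, Optional

def nfc_name(name: str) -> str:
    """Normalize a filename to NFC for comparison (hypothetical helper)."""
    return unicodedata.normalize("NFC", name)

def find_by_name(wanted: str, names: Iterable[str]) -> Optional[str]:
    """Return the first entry canonically equal to `wanted`, or None."""
    target = nfc_name(wanted)
    for name in names:
        if nfc_name(name) == target:
            return name  # return the name exactly as the file system reported it
    return None

# Simulated listing from a volume that stores decomposed names.
listing = ["cafe\u0301.txt", "notes.txt"]

print("caf\u00e9.txt" in listing)                        # False: raw comparison fails
print(find_by_name("caf\u00e9.txt", listing) is not None)  # True: normalized comparison succeeds
```

Returning the name as reported, rather than the normalized one, keeps the result usable for a later open() call on file systems that compare names byte by byte.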

NFD is also useful when you need to strip diacritics:

import unicodedata

def strip_accents(text: str) -> str:
    # Decompose to base + combining marks, then drop the nonspacing marks (category Mn)
    nfd = unicodedata.normalize("NFD", text)
    return "".join(
        c for c in nfd
        if unicodedata.category(c) != "Mn"  # Mn = Mark, Nonspacing
    )

print(strip_accents("naïve"))   # naive
print(strip_accents("résumé"))  # resume
print(strip_accents("über"))    # uber
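This approach only removes marks that NFD actually separates from their base letter. Letters such as ø (U+00F8) and Ł (U+0141) have no canonical decomposition, so they pass through unchanged. The sketch below repeats the strip_accents helper so that it runs on its own and shows that limit.

```python
import unicodedata

def strip_accents(text: str) -> str:
    # Same logic as the helper above: decompose, then drop nonspacing marks.
    nfd = unicodedata.normalize("NFD", text)
    return "".join(c for c in nfd if unicodedata.category(c) != "Mn")

print(strip_accents("\u00c5ngstr\u00f6m"))    # Angstrom
print(strip_accents("sm\u00f8rrebr\u00f8d"))  # smørrebrød (ø has no decomposition)
print(strip_accents("\u0141\u00f3d\u017a"))   # Łodz (only ó and ź decompose)
```

If such letters must also map to ASCII, an explicit replacement table is needed in addition to normalization.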

Combining Mark Order

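A subtle but important aspect of NFD: when multiple combining marks follow a base character, their order is determined by each mark's Canonical Combining Class (CCC). Within a run of combining marks, marks with lower nonzero CCC values are sorted first, and marks with the same CCC keep their original relative order. Characters with CCC=0, such as base letters, are never reordered. The acute accent has CCC=230 and the cedilla has CCC=202, so the cedilla always precedes the acute accent in NFD. This canonical ordering is what makes canonical equivalence testing reliable: two sequences that differ only in the order of such marks normalize to the same string.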

import unicodedata

# Check combining class of a character
print(unicodedata.combining("\u0301"))  # 230 (COMBINING ACUTE ACCENT)
print(unicodedata.combining("\u0327"))  # 202 (COMBINING CEDILLA)
print(unicodedata.combining("a"))       # 0   (not a combining mark)
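To see the reordering in action, the following sketch builds the same letter with its two marks in opposite orders and compares the results. The function name canonically_equal is hypothetical; comparing NFD forms is one common way to test canonical equivalence, not the only one.

```python
import unicodedata

ACUTE = "\u0301"    # COMBINING ACUTE ACCENT, CCC 230
CEDILLA = "\u0327"  # COMBINING CEDILLA, CCC 202

acute_first = "c" + ACUTE + CEDILLA
cedilla_first = "c" + CEDILLA + ACUTE

def canonically_equal(x: str, y: str) -> bool:
    """Compare two strings after NFD normalization (illustrative helper)."""
    return unicodedata.normalize("NFD", x) == unicodedata.normalize("NFD", y)

print(acute_first == cedilla_first)                   # False: code point order differs
print(canonically_equal(acute_first, cedilla_first))  # True: NFD sorts the marks

# After NFD the marks appear in ascending CCC order.
nfd = unicodedata.normalize("NFD", acute_first)
print([unicodedata.combining(ch) for ch in nfd])      # [0, 202, 230]
```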

Quick Facts

Property | Value
Full name | Normalization Form D (Canonical Decomposition)
Algorithm | Recursive canonical decomposition, then stable sorting of combining marks by CCC
macOS HFS+ | Filenames stored in a variant of NFD
Typical use | Diacritic stripping, internal processing, canonical comparison
Python | unicodedata.normalize("NFD", s)
Handles compatibility chars? | No (use NFKD for that)
Relation to NFC | NFD followed by canonical composition = NFC
String length | Equal to or longer than NFC (decomposed forms use more code points)
