📚 Unicode Fundamentals

Unicode Normalization: NFC, NFD, NFKC, NFKD

The same visible character can be represented by multiple different byte sequences in Unicode, which causes silent bugs in string comparison, hashing, and search. This guide explains the four normalization forms — NFC, NFD, NFKC, and NFKD — and when to apply each.


Two Unicode strings can look identical on screen yet compare as unequal in code. A user types "café" and your validator rejects it because the "é" in the database was stored differently than the "é" the user typed. This is not a bug in your comparison logic — it's a consequence of Unicode allowing the same visual character to be encoded in multiple, equally valid ways. Unicode normalization is the process that resolves this ambiguity. Understanding it prevents a whole class of subtle, hard-to-reproduce bugs.

Why Multiple Representations Exist

Unicode assigns code points both to precomposed characters (a single code point representing a base letter plus its accent) and to combining character sequences (the base letter followed by one or more combining marks as separate code points).

For example, the character "é" (LATIN SMALL LETTER E WITH ACUTE) has two valid representations:

  1. Precomposed: U+00E9 — a single code point LATIN SMALL LETTER E WITH ACUTE
  2. Decomposed: U+0065 U+0301 — LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

Both render identically. Both are valid Unicode. But they are different byte sequences and will fail a naive string equality check:

e_precomposed = "\u00e9"          # é as single code point
e_decomposed  = "e\u0301"        # e + combining acute

print(e_precomposed == e_decomposed)   # False
print(len(e_precomposed))             # 1
print(len(e_decomposed))              # 2

This is the problem normalization solves. By converting both strings to the same canonical form before comparison, you get consistent, predictable results.

The Four Normalization Forms

Unicode defines four normalization forms, organized along two axes:

Form | Full Name                                                      | Decomposition | Composition
NFD  | Canonical Decomposition                                        | Canonical     | None
NFC  | Canonical Decomposition, followed by Canonical Composition     | Canonical     | Canonical
NFKD | Compatibility Decomposition                                    | Compatibility | None
NFKC | Compatibility Decomposition, followed by Canonical Composition | Compatibility | Canonical

The two key concepts:

Canonical equivalence: Two sequences that are canonically equivalent look and behave identically. The precomposed "é" and the decomposed "e + combining acute" are canonically equivalent — they are just different serializations of the same abstract character.

Compatibility equivalence: A broader category that also covers characters that are visually similar or semantically related but not necessarily identical in appearance. For example, the ligature "ﬁ" (U+FB01, LATIN SMALL LIGATURE FI) is compatibility-equivalent to the two-character sequence "fi". They look nearly identical but are typographically distinct glyphs.
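The difference between the two equivalence classes is easy to verify directly; a quick sketch using Python's standard unicodedata module:

```python
import unicodedata

# Canonical equivalence: precomposed é and e + combining acute are the
# same abstract character, so NFC maps them to the same sequence.
print(unicodedata.normalize("NFC", "e\u0301") == "\u00e9")   # True

# Compatibility equivalence: the ligature ﬁ (U+FB01) is only unified
# with "fi" by the K forms — canonical normalization leaves it alone.
print(unicodedata.normalize("NFC", "\ufb01") == "fi")    # False
print(unicodedata.normalize("NFKC", "\ufb01") == "fi")   # True
```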

NFD — Canonical Decomposition

NFD converts every precomposed character into its canonical decomposed form. Accented letters are split into their base letter plus combining mark(s). The combining marks are then sorted into a canonical order (by combining class).

import unicodedata

text = "café"
nfd = unicodedata.normalize("NFD", text)

print(ascii(nfd))   # 'cafe\u0301'  (ascii() shows the escape; repr() would render the accent)
print(len(nfd))     # 5  (c, a, f, e, U+0301)

NFD is useful when you need to manipulate base characters and diacritics separately — for example, stripping all accent marks from a string:

import unicodedata

def strip_accents(text: str) -> str:
    nfd = unicodedata.normalize("NFD", text)
    return "".join(c for c in nfd if unicodedata.category(c) != "Mn")

print(strip_accents("Héllo Wörld"))  # Hello World

(Mn is the Unicode category for "Mark, Nonspacing" — i.e., combining diacritical marks.)

NFC — Canonical Decomposition + Composition

NFC first applies NFD (canonical decomposition), then re-composes the resulting sequences back into precomposed characters wherever a precomposed form exists in Unicode.

NFC is the most compact canonical form for languages that use precomposed characters (most Western European languages, Greek, Cyrillic). It is also what most keyboards and input methods produce, and the form the W3C recommends for text on the web.

import unicodedata

decomposed = "e\u0301"   # e + combining acute
nfc = unicodedata.normalize("NFC", decomposed)

print(ascii(nfc))   # '\xe9'  — precomposed é
print(len(nfc))     # 1

NFC is the recommended normalization form for most use cases: database storage, string comparison, key generation (e.g., URL slugs), and API inputs.

def normalize_input(text: str) -> str:
    '''Normalize user input to NFC for consistent storage and comparison.'''
    return unicodedata.normalize("NFC", text)
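As a sketch of the key-generation use case mentioned above, here is a hypothetical make_slug helper (the name and exact cleanup rules are illustrative, not a standard API). It normalizes to NFC first, then uses NFD only as an intermediate step to drop accents before hyphenating:

```python
import re
import unicodedata

def make_slug(title: str) -> str:
    """Hypothetical URL-slug helper: NFC for consistency, then ASCII-fold."""
    nfc = unicodedata.normalize("NFC", title)
    # Decompose so combining marks become separate code points, then drop them
    ascii_ish = "".join(
        c for c in unicodedata.normalize("NFD", nfc)
        if unicodedata.category(c) != "Mn"
    )
    # Lowercase and collapse everything non-alphanumeric into hyphens
    return re.sub(r"[^a-z0-9]+", "-", ascii_ish.lower()).strip("-")

print(make_slug("Café au Lait"))   # cafe-au-lait
```

Note this is intentionally lossy (it discards accents entirely), which is acceptable for slugs but not for stored display text — store the NFC original alongside the slug.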

NFKD and NFKC — Compatibility Forms

The K forms apply compatibility decomposition in addition to canonical decomposition. This maps characters to their compatibility equivalents — resolving ligatures, width variants, circled numbers, and other "compatibility characters" into their base forms.

Examples of compatibility mappings:

Original | Code Point    | NFKD/NFKC Result | Reason
ﬁ        | U+FB01        | fi               | Ligature → components
①        | U+2460        | 1                | Circled digit → digit
²        | U+00B2        | 2                | Superscript → digit
ＡＢＣ   | U+FF21–U+FF23 | ABC              | Fullwidth → ASCII
™        | U+2122        | TM               | Trademark → letters
ﷺ        | U+FDFA        | صلى الله عليه وسلم | Presentation form → sequence

import unicodedata

ligature = "\ufb01le"   # "ﬁle" — ﬁ is U+FB01 LATIN SMALL LIGATURE FI
print(unicodedata.normalize("NFKD", ligature))   # file
print(unicodedata.normalize("NFKC", ligature))   # file
print(unicodedata.normalize("NFC", ligature))    # ﬁle (unchanged — the ligature has no canonical decomposition)

Warning: Compatibility normalization is lossy. Once you apply NFKD or NFKC, you cannot recover the original ligature or special character. Only use it when you explicitly want to flatten all stylistic variants — for example, in search indexing or username normalization.
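As a sketch of the search-indexing and username use cases, a hypothetical search_key helper (the name is illustrative) might combine NFKC with case folding:

```python
import unicodedata

def search_key(text: str) -> str:
    """Hypothetical index key: flatten compatibility variants, then case-fold."""
    return unicodedata.normalize("NFKC", text).casefold()

print(search_key("\ufb01le"))   # file — ligature flattened to its components
print(search_key("ＡＢＣ"))      # abc — fullwidth letters folded to ASCII lowercase
```

Store the key only for matching; keep the user's original string for display, since this transformation cannot be reversed.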

When to Use Each Form

Use Case                      | Recommended Form | Reason
Database storage              | NFC              | Compact, precomposed, matches most text input
String comparison             | NFC              | Compare normalized against normalized
Removing accents              | NFD + filter Mn  | NFD exposes combining marks as separate code points
Search indexing               | NFKC             | Flattens ligatures, width variants, fractions
Username normalization        | NFKC + case fold | Prevents "ＡＢＣ" vs "ABC" conflicts
Security-sensitive comparison | NFKC + case fold | Mitigates homoglyph attacks (partially)
File system (macOS)           | NFD              | HFS+ stores file names in a variant of NFD
File system (Linux/Windows)   | NFC              | ext4 and NTFS treat names as opaque byte sequences

Note: macOS's HFS+ file system stores file names in a variant of NFD (its successor, APFS, preserves names as given but compares them normalization-insensitively), while Linux and Windows treat file names as opaque byte sequences. This can cause issues when transferring files between systems:

# A file named "café" created on macOS may use NFD ("cafe\u0301")
# The same name on Linux is likely NFC ("\u00e9")
# Normalizing to NFC before file operations avoids surprises
import unicodedata, pathlib

def safe_path(name: str) -> pathlib.Path:
    return pathlib.Path(unicodedata.normalize("NFC", name))

Canonical Ordering

NFD and NFKD don't just decompose — they also sort combining marks into canonical order determined by the Canonical Combining Class (CCC) property. Combining marks with lower CCC values come first. This ensures that sequences with the same visual result but different ordering of combining marks are normalized to identical sequences.

For example, a character with both a cedilla (CCC=202) and an acute accent (CCC=230) will always have the cedilla first after NFD, regardless of the input order:

import unicodedata

# Two equivalent sequences with combining marks in different orders
s1 = "\u0041\u0301\u0327"  # A + acute + cedilla
s2 = "\u0041\u0327\u0301"  # A + cedilla + acute

print(s1 == s2)    # False (different byte order)
print(unicodedata.normalize("NFD", s1) == unicodedata.normalize("NFD", s2))   # True
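The CCC values driving this reordering can be inspected with unicodedata.combining, which returns 0 for base characters:

```python
import unicodedata

# Canonical Combining Class determines the sort order of combining marks
print(unicodedata.combining("\u0327"))  # 202  (COMBINING CEDILLA)
print(unicodedata.combining("\u0301"))  # 230  (COMBINING ACUTE ACCENT)
print(unicodedata.combining("A"))       # 0    (base character — never reordered)
```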

Python Examples

import unicodedata

def compare_unicode(a: str, b: str) -> bool:
    '''Compare two Unicode strings regardless of normalization form.'''
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

cafe1 = "caf\u00e9"       # NFC: é as single code point
cafe2 = "cafe\u0301"      # NFD: e + combining acute

print(cafe1 == cafe2)               # False
print(compare_unicode(cafe1, cafe2)) # True

# Check what form a string is in (unicodedata.is_normalized requires Python 3.8+)
def normalization_form(text: str) -> str:
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        if unicodedata.is_normalized(form, text):
            return form
    return "none"

print(normalization_form("café"))   # NFC or NFD depending on input

JavaScript Examples

JavaScript provides String.prototype.normalize():

const precomposed = "\u00e9";       // é
const decomposed  = "e\u0301";     // e + combining acute

console.log(precomposed === decomposed);          // false
console.log(precomposed.normalize("NFC") === decomposed.normalize("NFC")); // true

// NFKC for search normalization
const ligature = "\ufb01le";        // "ﬁle"
console.log(ligature.normalize("NFKC"));   // "file"

// Stripping accents in JS
function stripAccents(str) {
    return str.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}
console.log(stripAccents("Héllo Wörld"));  // "Hello World"

Key Takeaways

  • Unicode allows the same visual character to be represented as a precomposed code point or as a base character + combining mark(s). Without normalization, these compare as unequal.
  • NFC (canonical decomposition + recomposition into precomposed) is the recommended form for storage and comparison — compact and widely expected.
  • NFD (canonical decomposition only) is useful when you need to separate base characters from combining marks, e.g., to strip accents.
  • NFKC and NFKD additionally flatten compatibility variants (ligatures, width variants, superscripts). Use them for search indexing and username normalization, but beware they are lossy.
  • Always normalize to the same form before comparing strings that may come from different sources (user input, file system, database, API).
  • In Python: unicodedata.normalize(form, text). In JavaScript: str.normalize(form).
