Unicode Normalization: NFC, NFD, NFKC, NFKD
The same visible character can be represented by multiple different byte sequences in Unicode, which causes silent bugs in string comparison, hashing, and search. This guide explains the four normalization forms — NFC, NFD, NFKC, and NFKD — and when to apply each.
Two Unicode strings can look identical on screen yet compare as unequal in code. A user types "café" and your validator rejects it because the "é" in the database was stored differently than the "é" the user typed. This is not a bug in your comparison logic — it's a consequence of Unicode allowing the same visual character to be encoded in multiple, equally valid ways. Unicode normalization is the process that resolves this ambiguity. Understanding it prevents a whole class of subtle, hard-to-reproduce bugs.
Why Multiple Representations Exist
Unicode assigns code points both to precomposed characters (a single code point representing a base letter plus its accent) and to combining character sequences (the base letter followed by one or more combining marks as separate code points).
For example, the character "é" (LATIN SMALL LETTER E WITH ACUTE) has two valid representations:
- Precomposed: U+00E9 LATIN SMALL LETTER E WITH ACUTE — a single code point
- Decomposed: U+0065 U+0301 — LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
Both render identically. Both are valid Unicode. But they are different byte sequences and will fail a naive string equality check:
```python
e_precomposed = "\u00e9"  # é as a single code point
e_decomposed = "e\u0301"  # e + combining acute

print(e_precomposed == e_decomposed)  # False
print(len(e_precomposed))             # 1
print(len(e_decomposed))              # 2
```
This is the problem normalization solves. By converting both strings to the same canonical form before comparison, you get consistent, predictable results.
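In Python, for example, normalizing both representations to the same form makes the comparison succeed:

```python
import unicodedata

a = "\u00e9"   # precomposed é
b = "e\u0301"  # decomposed e + combining acute

print(a == b)  # False
print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))  # True
```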
The Four Normalization Forms
Unicode defines four normalization forms, organized along two axes:
| Form | Full Name | Decomposition | Composition |
|---|---|---|---|
| NFD | Canonical Decomposition | Canonical | None |
| NFC | Canonical Decomposition, followed by Canonical Composition | Canonical | Canonical |
| NFKD | Compatibility Decomposition | Compatibility | None |
| NFKC | Compatibility Decomposition, followed by Canonical Composition | Compatibility | Canonical |
The two key concepts:
Canonical equivalence: Two sequences that are canonically equivalent look and behave identically. The precomposed "é" and the decomposed "e + combining acute" are canonically equivalent — they are just different serializations of the same abstract character.
Compatibility equivalence: A broader category that also covers characters that are visually similar or semantically equivalent but not necessarily identical in appearance. For example, the ligature "fi" (U+FB01, LATIN SMALL LIGATURE FI) is compatibility-equivalent to the two-character sequence "fi". They look nearly identical but are arguably different glyphs.
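The difference between the two equivalence classes shows up directly when all four forms are applied to a string containing both a ligature and an accented letter. A quick Python sketch:

```python
import unicodedata

s = "\ufb01\u00e9"  # LATIN SMALL LIGATURE FI + precomposed é ("ﬁé")
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    result = unicodedata.normalize(form, s)
    print(f"{form:5} {ascii(result):24} len={len(result)}")

# NFC   '\ufb01\xe9'       len=2  (ligature kept, é composed)
# NFD   '\ufb01e\u0301'    len=3  (ligature kept, é decomposed)
# NFKC  'fi\xe9'           len=3  (ligature split, é composed)
# NFKD  'fie\u0301'        len=4  (ligature split, é decomposed)
```

The canonical forms (NFC/NFD) leave the ligature alone; only the compatibility forms (NFKC/NFKD) split it.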
NFD — Canonical Decomposition
NFD converts every precomposed character into its canonical decomposed form. Accented letters are split into their base letter plus combining mark(s). The combining marks are then sorted into a canonical order (by combining class).
```python
import unicodedata

text = "café"
nfd = unicodedata.normalize("NFD", text)
print(ascii(nfd))  # 'cafe\u0301'
print(len(nfd))    # 5 (c, a, f, e, U+0301)
```
NFD is useful when you need to manipulate base characters and diacritics separately — for example, stripping all accent marks from a string:
```python
import unicodedata

def strip_accents(text: str) -> str:
    nfd = unicodedata.normalize("NFD", text)
    return "".join(c for c in nfd if unicodedata.category(c) != "Mn")

print(strip_accents("Héllo Wörld"))  # Hello World
```
(Mn is the Unicode category for "Mark, Nonspacing" — i.e., combining diacritical marks.)
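One caveat worth knowing: this technique only removes marks that decompose into combining characters. Letters whose modification is baked into the base character have no canonical decomposition and pass through untouched. A quick check (the strip_accents helper is repeated here to keep the snippet self-contained):

```python
import unicodedata

def strip_accents(text: str) -> str:
    nfd = unicodedata.normalize("NFD", text)
    return "".join(c for c in nfd if unicodedata.category(c) != "Mn")

# ø (U+00F8) and Ł (U+0141) have no canonical decomposition,
# so NFD leaves them intact and nothing is stripped:
print(strip_accents("søster"))  # søster (the stroke in ø is not a combining mark)
print(strip_accents("Łódź"))    # Łodz (Ł survives; ó and ź decompose)
```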
NFC — Canonical Decomposition + Composition
NFC first applies NFD (canonical decomposition), then re-composes the resulting sequences back into precomposed characters wherever a precomposed form exists in Unicode.
NFC is the most compact canonical form for languages that use precomposed characters (most Western European languages, Greek, Cyrillic). It's also the form used by most operating systems for file names and text input.
```python
import unicodedata

decomposed = "e\u0301"  # e + combining acute
nfc = unicodedata.normalize("NFC", decomposed)
print(ascii(nfc))  # '\xe9' — precomposed é
print(len(nfc))    # 1
```
NFC is the recommended normalization form for most use cases: database storage, string comparison, key generation (e.g., URL slugs), and API inputs.
```python
import unicodedata

def normalize_input(text: str) -> str:
    """Normalize user input to NFC for consistent storage and comparison."""
    return unicodedata.normalize("NFC", text)
```
NFKD and NFKC — Compatibility Forms
The K forms apply compatibility decomposition in addition to canonical decomposition. This maps characters to their compatibility equivalents — resolving ligatures, width variants, circled numbers, and other "compatibility characters" into their base forms.
Examples of compatibility mappings:
| Original | Code Point | NFKD/NFKC Result | Reason |
|---|---|---|---|
| ﬁ | U+FB01 | fi | Ligature → components |
| ① | U+2460 | 1 | Circled digit → digit |
| ² | U+00B2 | 2 | Superscript → digit |
| ＡＢＣ | U+FF21–U+FF23 | ABC | Fullwidth → ASCII |
| ™ | U+2122 | TM | Trademark → letters |
| ﷺ | U+FDFA | صلى الله عليه وسلم | Presentation form → sequence |
```python
import unicodedata

ligature = "\ufb01le"  # ﬁle — ﬁ is U+FB01 LATIN SMALL LIGATURE FI
print(unicodedata.normalize("NFKD", ligature))  # file
print(unicodedata.normalize("NFKC", ligature))  # file
print(unicodedata.normalize("NFC", ligature))   # ﬁle (unchanged)
```
Warning: Compatibility normalization is lossy. Once you apply NFKD or NFKC, you cannot recover the original ligature or special character. Only use it when you explicitly want to flatten all stylistic variants — for example, in search indexing or username normalization.
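A short demonstration of that lossiness:

```python
import unicodedata

# Compatibility normalization discards the distinction permanently:
print(unicodedata.normalize("NFKC", "x\u00b2"))  # x2 (superscript meaning lost)
print(unicodedata.normalize("NFKC", "\u2122"))   # TM
# There is no inverse operation: "x2" cannot be turned back into "x²".
```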
When to Use Each Form
| Use Case | Recommended Form | Reason |
|---|---|---|
| Database storage | NFC | Compact, precomposed, standard on macOS/Windows |
| String comparison | NFC | Compare normalized against normalized |
| Removing accents | NFD + filter Mn | NFD puts combining marks as separate code points |
| Search indexing | NFKC | Flatten ligatures, width variants, fractions |
| Username normalization | NFKC + case fold | Prevent "ＡＢＣ" vs "ABC" username conflicts |
| Security-sensitive comparison | NFKC + case fold | Prevent homoglyph attacks (partially) |
| File system (macOS) | NFD | HFS+ uses NFD for file names |
| File system (Linux/Windows) | NFC | ext4 and NTFS treat strings as opaque bytes |
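The username row above can be sketched as a small helper (normalize_username is an illustrative name, not a standard API); note that the homoglyph caveat from the security row still applies:

```python
import unicodedata

def normalize_username(name: str) -> str:
    # NFKC flattens fullwidth and ligature variants; casefold handles
    # case differences, including ß -> ss.
    return unicodedata.normalize("NFKC", name).casefold()

print(normalize_username("\uff21\uff22\uff23"))  # abc (fullwidth ＡＢＣ folded)
print(normalize_username("Straße"))              # strasse
# Caveat: cross-script homoglyphs survive. Cyrillic а (U+0430) is
# untouched by NFKC and still collides visually with Latin a:
print(normalize_username("p\u0430ypal") == "paypal")  # False
```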
Note: macOS's legacy HFS+ file system stores file names in a variant of NFD, while Linux and Windows file systems treat names as opaque byte sequences. This can cause issues when transferring files between systems:
```python
# A file named "café" created on macOS may arrive as NFD ("cafe\u0301")
# The same name created on Linux is more likely NFC ("caf\u00e9")
# Normalizing to NFC before file operations avoids surprises
import unicodedata, pathlib

def safe_path(name: str) -> pathlib.Path:
    return pathlib.Path(unicodedata.normalize("NFC", name))
```
Canonical Ordering
NFD and NFKD don't just decompose — they also sort combining marks into canonical order determined by the Canonical Combining Class (CCC) property. Combining marks with lower CCC values come first. This ensures that sequences with the same visual result but different ordering of combining marks are normalized to identical sequences.
For example, a character with both a cedilla (CCC=202) and an acute accent (CCC=230) will always have the cedilla first after NFD, regardless of the input order:
```python
import unicodedata

# Two equivalent sequences with combining marks in different orders
s1 = "\u0041\u0301\u0327"  # A + acute + cedilla
s2 = "\u0041\u0327\u0301"  # A + cedilla + acute
print(s1 == s2)  # False (different code point order)
print(unicodedata.normalize("NFD", s1) == unicodedata.normalize("NFD", s2))  # True
```
Python Examples
```python
import unicodedata

def compare_unicode(a: str, b: str) -> bool:
    """Compare two Unicode strings regardless of normalization form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

cafe1 = "caf\u00e9"   # NFC: é as a single code point
cafe2 = "cafe\u0301"  # NFD: e + combining acute
print(cafe1 == cafe2)                 # False
print(compare_unicode(cafe1, cafe2))  # True

# Check which form a string is in (is_normalized requires Python 3.8+)
def normalization_form(text: str) -> str:
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        if unicodedata.is_normalized(form, text):
            return form
    return "none"

print(normalization_form("café"))  # NFC or NFD, depending on the input
```
JavaScript Examples
JavaScript provides `String.prototype.normalize()`:

```javascript
const precomposed = "\u00e9"; // é
const decomposed = "e\u0301"; // e + combining acute
console.log(precomposed === decomposed); // false
console.log(precomposed.normalize("NFC") === decomposed.normalize("NFC")); // true

// NFKC for search normalization
const ligature = "\ufb01le"; // ﬁle
console.log(ligature.normalize("NFKC")); // "file"

// Stripping accents in JS (the regex covers the Combining Diacritical
// Marks block, U+0300–U+036F)
function stripAccents(str) {
  return str.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}
console.log(stripAccents("Héllo Wörld")); // "Hello World"
```
Key Takeaways
- Unicode allows the same visual character to be represented as a precomposed code point or as a base character + combining mark(s). Without normalization, these compare as unequal.
- NFC (canonical decomposition + recomposition into precomposed) is the recommended form for storage and comparison — compact and widely expected.
- NFD (canonical decomposition only) is useful when you need to separate base characters from combining marks, e.g., to strip accents.
- NFKC and NFKD additionally flatten compatibility variants (ligatures, width variants, superscripts). Use them for search indexing and username normalization, but beware they are lossy.
- Always normalize to the same form before comparing strings that may come from different sources (user input, file system, database, API).
- In Python: `unicodedata.normalize(form, text)`. In JavaScript: `str.normalize(form)`.