The Unicode Odyssey · Chapter 7
Normalization: When Equal Isn't Equal
Two strings that look identical might not be equal. This chapter explains the four Unicode normalization forms (NFC, NFD, NFKC, NFKD), canonical equivalence, and compatibility decomposition.
Here is a riddle: two strings look identical on screen, produce identical output when printed, and yet compare as unequal using ==. No compiler warning, no runtime error — just silent inequality that causes database uniqueness violations, search failures, and authentication bugs. This is the normalization problem, and it sits at the intersection of Unicode's greatest strength (flexible character composition) and its greatest practical hazard.
The Root of the Problem: Multiple Representations
Unicode allows many characters to be represented in more than one way. The most common case involves precomposed characters (a single codepoint representing a base character plus diacritic) versus decomposed representations (the same character as a base character followed by a combining diacritic).
The letter "é" can be:

- Precomposed: U+00E9 (LATIN SMALL LETTER E WITH ACUTE) — one codepoint
- Decomposed: U+0065 U+0301 (LATIN SMALL LETTER E + COMBINING ACUTE ACCENT) — two codepoints
Both render identically. Both are valid Unicode. But they are byte-for-byte different, and string comparison in most programming languages compares bytes or codepoints, not semantics.
e1 = "\u00e9"   # precomposed é
e2 = "e\u0301"  # decomposed e + combining acute
print(e1)       # é
print(e2)       # é (visually identical)
e1 == e2        # False (!)
len(e1)         # 1
len(e2)         # 2
This problem appears in real systems constantly. A user registers with username "café". They try to log in — and the system says the username doesn't exist, because their keyboard or clipboard produced a different normalization than the registration form. A database receives the same word in different normalizations from different clients and stores what appear to be duplicates.
Canonical Equivalence vs. Compatibility Equivalence
Unicode defines two types of equivalence between character sequences:
Canonical equivalence: Two sequences are canonically equivalent if they represent the same abstract character, just written differently. The precomposed "é" and decomposed "e + combining acute" are canonically equivalent. Canonically equivalent strings should always be treated as identical — they are the same text, just encoded differently.
Compatibility equivalence: A broader equivalence where characters that look similar or have related meanings are considered equivalent, even if they're not the same abstract character. For example:

- The full-width digit "２" (U+FF12) is compatibility-equivalent to "2" (U+0032)
- The ligature "ﬁ" (U+FB01) is compatibility-equivalent to "f" + "i"
- The fraction "½" (U+00BD) is compatibility-equivalent to "1⁄2" (U+0031 U+2044 U+0032, with FRACTION SLASH)
- Mathematical bold capital A "𝐀" (U+1D400) is compatibility-equivalent to "A"
Compatibility equivalence is more aggressive — it normalizes away formatting distinctions. Use it cautiously, as it loses information (you can't recover the original ligature or full-width form after compatibility decomposition).
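The information loss is easy to observe with Python's standard `unicodedata` module; a minimal sketch:

```python
import unicodedata

s = "\ufb01le"  # "file" written with the fi ligature U+FB01
folded = unicodedata.normalize("NFKC", s)
print(folded)  # "file": the ligature is unfolded into plain f + i
# Nothing in the result records that a ligature was ever used,
# so there is no inverse operation to recover the original.
print(unicodedata.normalize("NFKC", "\u00b2"))  # superscript two unfolds to "2"
```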
The Four Normal Forms
Unicode defines four normalization forms, based on two choices: canonical or compatibility equivalence, and whether to compose or decompose:
| Form | Type | Composition |
|---|---|---|
| NFD | Canonical | Decomposed |
| NFC | Canonical | Composed |
| NFKD | Compatibility | Decomposed |
| NFKC | Compatibility | Composed |
NFD: Canonical Decomposition
NFD decomposes all characters to their canonical decomposed forms and orders combining characters using the Canonical Combining Class (CCC) algorithm.
The CCC is a numeric property (0–254) assigned to each combining character. A CCC of 0 means the character is a "starter" (non-combining). Higher values indicate combining characters, and within a sequence, combining characters are sorted by ascending CCC value.
# NFD of "ẅ" (w with diaeresis, U+1E85)
w + combining diaeresis
U+0077 U+0308
# NFD of "ᾷ" (Greek alpha with perispomeni and ypogegrammeni, U+1FB7)
U+1FB7 decomposes to:
α + combining Greek perispomeni + combining Greek ypogegrammeni
Ordered by CCC: the CCC=230 perispomeni comes before the CCC=240 ypogegrammeni (iota subscript)
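The combining classes that drive this ordering can be inspected directly with `unicodedata.combining()`; a sketch using ᾷ (U+1FB7), whose decomposition contains marks of two different classes:

```python
import unicodedata

# U+1FB7: GREEK SMALL LETTER ALPHA WITH PERISPOMENI AND YPOGEGRAMMENI
decomposed = unicodedata.normalize("NFD", "\u1fb7")
for ch in decomposed:
    print(f"U+{ord(ch):04X} ccc={unicodedata.combining(ch)}")
# U+03B1 ccc=0    (alpha, a starter)
# U+0342 ccc=230  (combining perispomeni)
# U+0345 ccc=240  (combining ypogegrammeni)
```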
NFD is useful when you need to manipulate the components of composed characters — for example, stripping all diacritics from Latin text by decomposing and then removing characters in category Mn.
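That recipe is a few lines in Python; `strip_diacritics` here is an illustrative helper, not a library function:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Decompose with NFD, then drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")

print(strip_diacritics("café crème"))  # cafe creme
```

Note that this only handles marks encoded as combining characters; letters like "ø" or "ł", which have no decomposition, pass through unchanged.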
NFC: Canonical Decomposition + Canonical Composition
NFC first applies NFD (canonical decomposition + CCC reordering), then applies canonical composition — combining decomposed sequences back into precomposed characters where a precomposed form exists in Unicode.
NFC is the most common choice for storage and interchange because it:

- Produces the shortest canonical representation in most cases
- Is what macOS, Linux, and most web systems produce by default
- Is the form the W3C Character Model recommends for text on the Web
import unicodedata
e_composed = unicodedata.normalize("NFC", "e\u0301")
# Result: "\u00e9" — precomposed é, one codepoint
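Normalizing both sides before comparing resolves the chapter's opening riddle:

```python
import unicodedata

e1 = "\u00e9"   # precomposed é
e2 = "e\u0301"  # decomposed e + combining acute
print(e1 == e2)  # False
print(unicodedata.normalize("NFC", e1) ==
      unicodedata.normalize("NFC", e2))  # True
```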
NFKD and NFKC: Compatibility Forms
NFKD applies compatibility decomposition (more aggressive than canonical), unfolding ligatures, full-width characters, superscripts, and other "formatted" variants into their base components, then applies CCC reordering.
NFKC applies NFKD then canonical composition.
unicodedata.normalize("NFKC", "\ufb01")  # fi ligature → "fi"
unicodedata.normalize("NFKC", "\uff12")  # full-width ２ → "2"
unicodedata.normalize("NFKC", "\u00bd")  # ½ → "1⁄2" (U+0031 U+2044 U+0032)
# Note: the slash is U+2044 FRACTION SLASH, not ASCII "/"
# Superscript digits unfold too:
unicodedata.normalize("NFKC", "\u00b2")  # ² → "2"
Use NFKC/NFKD for search indexes, case-insensitive comparisons, and username sanitization — anywhere formatting distinctions should be ignored.
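For username uniqueness, NFKC is typically paired with case folding. A sketch of such a canonicalizer (`canonical_username` is a hypothetical helper name, not a standard API):

```python
import unicodedata

def canonical_username(name: str) -> str:
    """NFKC unifies compatibility variants; casefold unifies case."""
    return unicodedata.normalize("NFKC", name).casefold()

# Full-width "Ｃａｆé" collapses to the same key as "café":
print(canonical_username("\uff23\uff41\uff46\u00e9")
      == canonical_username("caf\u00e9"))  # True
```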
Which Form to Use When
| Use Case | Recommended Form | Reason |
|---|---|---|
| Database storage (general) | NFC | Shortest, standard, reversible |
| Filenames from macOS | NFC | HFS+ stores names in a variant of NFD; normalize incoming |
| Web content interchange | NFC | W3C Character Model recommendation |
| Search indexing | NFKC | Unify full-width, ligatures, compatibility variants |
| Username uniqueness | NFKC + case fold | Prevent homograph registrations |
| Diacritic stripping | NFD → remove Mn | Decompose, then filter combining marks |
| Cryptographic signatures | NFC | Canonical, reversible, standard |
| Source code identifiers | NFC | UAX #31 recommends NFC (Python, notably, uses NFKC) |
The Quick_Check Optimization
Computing full normalization is non-trivial — the algorithm involves decomposition lookups, CCC sorting, and composition table lookups. For frequent comparisons, the Unicode standard provides a Quick_Check property for each character and normalization form:
- QC_NFC = Y (Yes): this character is definitely allowed in NFC
- QC_NFC = N (No): this character definitely violates NFC
- QC_NFC = M (Maybe): context must be checked to decide
Libraries use Quick_Check to short-circuit the normalization algorithm: if all characters in a string have QC_NFC=Y, the string is already in NFC and no processing is needed. This makes normalization checking fast for the common case of already-normalized input.
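Python 3.8+ exposes this check directly as `unicodedata.is_normalized()`, which can answer without building a normalized copy when the input is already normalized:

```python
import unicodedata

print(unicodedata.is_normalized("NFC", "caf\u00e9"))   # True: precomposed é
print(unicodedata.is_normalized("NFC", "cafe\u0301"))  # False: decomposed form
```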
Hangul and Algorithmic Normalization
Korean Hangul syllable blocks have an interesting normalization behavior. The 11,172 precomposed syllable blocks (U+AC00–U+D7A3) are the canonical composed forms. The jamo components (individual consonants and vowels) can compose into syllable blocks via NFC, and syllable blocks decompose into jamo sequences under NFD — all computed algorithmically without lookup tables, as described in the writing systems chapter.
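A sketch of that arithmetic, using the constants defined by the Unicode standard's Hangul algorithm (simplified to decompose one precomposed syllable):

```python
# Constants from the Unicode standard's Hangul syllable algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28

def decompose_hangul(syllable: str) -> list:
    """Algorithmic NFD for a single precomposed Hangul syllable block."""
    s = ord(syllable) - S_BASE
    lead  = L_BASE + s // (V_COUNT * T_COUNT)
    vowel = V_BASE + (s % (V_COUNT * T_COUNT)) // T_COUNT
    trail = T_BASE + s % T_COUNT
    # Omit the trailing jamo when the syllable has no final consonant.
    return [lead, vowel] + ([trail] if s % T_COUNT else [])

print([f"U+{cp:04X}" for cp in decompose_hangul("\ud55c")])  # "한"
# ['U+1112', 'U+1161', 'U+11AB']
```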
Normalization and Security
Incorrect normalization handling is a security issue. Consider a web application that:
- Validates a username as alphanumeric using [a-zA-Z0-9]+
- Stores it in the database without normalization
- Another user registers an NFKC-equivalent username using full-width characters

Depending on the regex engine's Unicode handling, the validation might pass, and because the two usernames differ byte-for-byte, the uniqueness check passes too; yet after NFKC normalization they are identical. Or consider an authentication system that normalizes at registration but not at login: a user could be locked out because their password's form changed.
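The collision is easy to demonstrate (the username here is illustrative):

```python
import unicodedata

stored = "user2024"                         # registered earlier
lookalike = "user\uff12\uff10\uff12\uff14"  # full-width digits ２０２４
print(stored == lookalike)                                 # False: passes a naive uniqueness check
print(stored == unicodedata.normalize("NFKC", lookalike))  # True: same name after NFKC
```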
The OWASP recommendation: normalize at input boundaries, before validation, before storage, and before comparison. Pick one form (NFC or NFKC) and apply it consistently throughout your entire data pipeline.
Practical Implementation
Every major language provides normalization:
# Python
import unicodedata
normalized = unicodedata.normalize("NFC", user_input)
// JavaScript
const normalized = userInput.normalize("NFC");
// Forms: "NFC", "NFD", "NFKC", "NFKD"
// Java
import java.text.Normalizer;
String normalized = Normalizer.normalize(input, Normalizer.Form.NFC);
// Rust — using unicode-normalization crate
use unicode_normalization::UnicodeNormalization;
let normalized: String = input.nfc().collect();
The one-line rule: always normalize text at system boundaries — when data enters your system from external sources, when you store it, and when you compare strings for equality. The normalization form you choose matters less than choosing one and applying it consistently. NFC is the safest default for most applications.
Unicode's normalization system is a sophisticated answer to a genuine problem: allowing flexibility in how text is encoded while providing a path to canonical, comparable representations. Mastering it is the difference between a Unicode-aware application and one that silently corrupts user data.