The Unicode Odyssey · Chapter 7

Normalization: When Equal Isn't Equal

Two strings that look identical might not be equal. This chapter explains the four Unicode normalization forms (NFC, NFD, NFKC, NFKD), canonical equivalence, and compatibility decomposition.

~4,500 words · ~18 min read

Here is a riddle: two strings look identical on screen, produce identical output when printed, and yet compare as unequal using ==. No compiler warning, no runtime error — just silent inequality that causes database uniqueness violations, search failures, and authentication bugs. This is the normalization problem, and it sits at the intersection of Unicode's greatest strength (flexible character composition) and its greatest practical hazard.

The Root of the Problem: Multiple Representations

Unicode allows many characters to be represented in more than one way. The most common case involves precomposed characters (a single codepoint representing a base character plus diacritic) versus decomposed representations (the same character as a base character followed by a combining diacritic).

The letter "é" can be:

  • Precomposed: U+00E9 (LATIN SMALL LETTER E WITH ACUTE) — one codepoint
  • Decomposed: U+0065 U+0301 (LATIN SMALL LETTER E + COMBINING ACUTE ACCENT) — two codepoints

Both render identically. Both are valid Unicode. But they are byte-for-byte different, and string comparison in most programming languages compares bytes or codepoints, not semantics.

e1 = "\u00e9"           # precomposed é
e2 = "e\u0301"          # decomposed e + combining acute

print(e1)               # é
print(e2)               # é (visually identical)
e1 == e2                # False (!)
len(e1)                 # 1
len(e2)                 # 2

This problem appears in real systems constantly. A user registers with username "café". They try to log in — and the system says the username doesn't exist, because their keyboard or clipboard produced a different normalization than the registration form. A database receives the same word in different normalizations from different clients and stores what appear to be duplicates.

Canonical Equivalence vs. Compatibility Equivalence

Unicode defines two types of equivalence between character sequences:

Canonical equivalence: Two sequences are canonically equivalent if they represent the same abstract character, just written differently. The precomposed "é" and decomposed "e + combining acute" are canonically equivalent. Canonically equivalent strings should always be treated as identical — they are the same text, just encoded differently.

Compatibility equivalence: A broader equivalence where characters that look similar or have related meanings are considered equivalent, even if they're not the same abstract character. For example:

  • The full-width digit "２" (U+FF12) is compatibility-equivalent to "2" (U+0032)
  • The ligature "ﬁ" (U+FB01) is compatibility-equivalent to "f" + "i"
  • The fraction "½" (U+00BD) is compatibility-equivalent to "1⁄2" (digit one, U+2044 FRACTION SLASH, digit two)
  • Mathematical bold capital A "𝐀" (U+1D400) is compatibility-equivalent to "A"

Compatibility equivalence is more aggressive — it normalizes away formatting distinctions. Use it cautiously, as it loses information (you can't recover the original ligature or full-width form after compatibility decomposition).
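The loss is easy to demonstrate with Python's unicodedata module: once a compatibility variant is folded away, no normalization form restores it.

```python
import unicodedata

# The "fi" ligature unfolds under NFKC; the result carries no trace of it.
flat = unicodedata.normalize("NFKC", "\ufb01le")
print(flat)  # file
# No normalization form maps the plain letters "fi" back to U+FB01.
```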

The Four Normal Forms

Unicode defines four normalization forms, based on two choices: canonical or compatibility equivalence, and whether to compose or decompose:

Form    Equivalence      Composition
NFD     Canonical        Decomposed
NFC     Canonical        Composed
NFKD    Compatibility    Decomposed
NFKC    Compatibility    Composed

NFD: Canonical Decomposition

NFD decomposes all characters to their canonical decomposed forms and orders combining characters using the Canonical Combining Class (CCC) algorithm.

The CCC is a numeric property (0–254) assigned to each combining character. A CCC of 0 means the character is a "starter" (non-combining). Higher values indicate combining characters, and within a sequence, combining characters are sorted by ascending CCC value.
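Python exposes this property directly via unicodedata.combining(), which returns 0 for starters and the CCC value for combining marks:

```python
import unicodedata

# combining() returns a character's Canonical Combining Class:
print(unicodedata.combining("e"))        # 0   (starter)
print(unicodedata.combining("\u0301"))   # 230 (combining acute accent)
print(unicodedata.combining("\u0345"))   # 240 (combining Greek ypogegrammeni)
```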

import unicodedata

# NFD of "ẅ" (U+1E85, w with diaeresis):
unicodedata.normalize("NFD", "\u1e85")   # "w\u0308" (U+0077 + U+0308)

# NFD of "ᾷ" (U+1FB7, Greek alpha with perispomeni and ypogegrammeni):
unicodedata.normalize("NFD", "\u1fb7")   # "\u03b1\u0342\u0345"
# The combining marks are ordered by ascending CCC:
# perispomeni (CCC=230) before ypogegrammeni (CCC=240)

NFD is useful when you need to manipulate the components of composed characters — for example, stripping all diacritics from Latin text by decomposing and then removing characters in category Mn.
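That diacritic-stripping recipe looks like this as a sketch (the function name is ours, not a standard API):

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Decompose with NFD, then drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_diacritics("café naïve"))  # cafe naive
```

Note this only removes combining marks; it does not transliterate characters like "ø" or "ß", which have no decomposition.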

NFC: Canonical Decomposition + Canonical Composition

NFC first applies NFD (canonical decomposition + CCC reordering), then applies canonical composition — combining decomposed sequences back into precomposed characters where a precomposed form exists in Unicode.

NFC is the most common choice for storage and interchange because it:

  • Produces the shortest canonical representation in most cases
  • Is what macOS, Linux, and most web systems produce by default
  • Is the form the W3C Character Model recommends for the Web, including XML

import unicodedata
e_composed = unicodedata.normalize("NFC", "e\u0301")
# Result: "\u00e9" — precomposed é, one codepoint

NFKD and NFKC: Compatibility Forms

NFKD applies compatibility decomposition (more aggressive than canonical), unfolding ligatures, full-width characters, superscripts, and other "formatted" variants into their base components, then applies CCC reordering.

NFKC applies NFKD then canonical composition.

unicodedata.normalize("NFKC", "\ufb01")  # ﬁ ligature → "fi"
unicodedata.normalize("NFKC", "\uff12")  # full-width ２ → "2"
unicodedata.normalize("NFKC", "\u00bd")  # ½ → "1⁄2" (digit one, U+2044 FRACTION SLASH, digit two)
unicodedata.normalize("NFKC", "\u00b2")  # superscript ² → "2"

Use NFKC/NFKD for search indexes, case-insensitive comparisons, and username sanitization — anywhere formatting distinctions should be ignored.
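A minimal sketch of the NFKC-plus-case-fold recipe for usernames (the helper name is ours, not a standard API):

```python
import unicodedata

def username_key(name: str) -> str:
    """Canonical key for uniqueness checks: NFKC, then case fold."""
    return unicodedata.normalize("NFKC", name).casefold()

# Full-width "Ａｄｍｉｎ" collides with plain "admin":
print(username_key("\uff21\uff44\uff4d\uff49\uff4e"))  # admin
print(username_key("Admin"))                           # admin
```

Store the original spelling for display, but enforce uniqueness on the key.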

Which Form to Use When

Use Case                        Recommended Form     Reason
Database storage (general)      NFC                  Shortest, standard, reversible
File names from macOS clients   NFC                  macOS HFS+ stores NFD; normalize incoming
Web content interchange         NFC                  W3C Character Model recommendation
Search indexing                 NFKC                 Unifies full-width forms, ligatures, compatibility variants
Username uniqueness             NFKC + case fold     Prevents homograph registrations
Diacritic stripping             NFD → remove Mn      Decompose, then filter combining marks
Cryptographic signatures        NFC                  Canonical, reversible, standard
Source code identifiers         NFC                  UAX #31 recommends NFC; many language specs follow it

The Quick_Check Optimization

Computing full normalization is non-trivial — the algorithm involves decomposition lookups, CCC sorting, and composition table lookups. For frequent comparisons, the Unicode standard provides a Quick_Check property for each character and normalization form:

  • QC_NFC=Y (Yes): This character sequence is definitely in NFC
  • QC_NFC=N (No): This character sequence definitely violates NFC
  • QC_NFC=M (Maybe): Need to check context to determine

Libraries use Quick_Check to short-circuit the normalization algorithm: if all characters in a string have QC_NFC=Y, the string is already in NFC and no processing is needed. This makes normalization checking fast for the common case of already-normalized input.
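Python 3.8+ exposes this fast path as unicodedata.is_normalized(), which can answer without performing a full normalization:

```python
import unicodedata

# Quick check whether a string is already in the given form:
print(unicodedata.is_normalized("NFC", "caf\u00e9"))   # True  (precomposed é)
print(unicodedata.is_normalized("NFC", "cafe\u0301"))  # False (decomposed é)
```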

Hangul and Algorithmic Normalization

Korean Hangul syllable blocks have an interesting normalization behavior. The 11,172 precomposed syllable blocks (U+AC00–U+D7A3) are the canonical composed forms. The jamo components (individual consonants and vowels) can compose into syllable blocks via NFC, and syllable blocks decompose into jamo sequences under NFD — all computed algorithmically without lookup tables, as described in the writing systems chapter.
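The round trip is visible from Python; the composition is pure arithmetic over the jamo and syllable ranges, with no lookup tables:

```python
import unicodedata

jamo = "\u1112\u1161\u11ab"         # HIEUH + A + NIEUN (conjoining jamo)
syllable = unicodedata.normalize("NFC", jamo)
print(syllable)                                         # 한 (U+D55C)
print(unicodedata.normalize("NFD", syllable) == jamo)   # True
```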

Normalization and Security

Incorrect normalization handling is a security issue. Consider a web application that:

  1. Validates a username as alphanumeric using [a-zA-Z0-9]+
  2. Stores it in the database without normalization
  3. Another user registers with NFKC-equivalent username using full-width characters

If the validator uses a Unicode-aware class like \d or \w (which match full-width digits) rather than the ASCII-only [0-9], the full-width variant passes validation, yet after NFKC normalization it is identical to the existing username. Or consider an authentication system that normalizes at registration but not at login — a user could be locked out because their keyboard or clipboard produced a different form of their password.

The OWASP recommendation: normalize at input boundaries, before validation, before storage, and before comparison. Pick one form (NFC or NFKC) and apply it consistently throughout your entire data pipeline.
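The digit-class trap is easy to reproduce in Python, where re's \w is Unicode-aware by default:

```python
import re
import unicodedata

candidate = "admin\uff11"  # "admin" + FULLWIDTH DIGIT ONE
# \w matches the full-width digit, so a \w-based validator accepts it...
print(bool(re.fullmatch(r"\w+", candidate)))                 # True
# ...but under NFKC it collides with a plain ASCII username:
print(unicodedata.normalize("NFKC", candidate) == "admin1")  # True
```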

Practical Implementation

Every major language provides normalization:

# Python
import unicodedata
normalized = unicodedata.normalize("NFC", user_input)

// JavaScript
const normalized = userInput.normalize("NFC");
// Forms: "NFC", "NFD", "NFKC", "NFKD"

// Java
import java.text.Normalizer;
String normalized = Normalizer.normalize(input, Normalizer.Form.NFC);

// Rust — using the unicode-normalization crate
use unicode_normalization::UnicodeNormalization;
let normalized: String = input.nfc().collect();

The one-line rule: always normalize text at system boundaries — when data enters your system from external sources, when you store it, and when you compare strings for equality. The normalization form you choose matters less than choosing one and applying it consistently. NFC is the safest default for most applications.

Unicode's normalization system is a sophisticated answer to a genuine problem: allowing flexibility in how text is encoded while providing a path to canonical, comparable representations. Mastering it is the difference between a Unicode-aware application and one that silently corrupts user data.