📚 Unicode Fundamentals

Unicode Normalization: NFC, NFD, NFKC, NFKD

The same visible character can be represented by multiple different byte sequences in Unicode, which causes silent bugs in string comparison, hashing, and search. This guide explains the four normalization forms — NFC, NFD, NFKC, and NFKD — and when to apply each.


Two Unicode strings can look identical on screen yet compare as unequal in code. A user types "café" and your validator rejects it because the "é" in the database was stored differently than the "é" the user typed. This is not a bug in your comparison logic — it's a consequence of Unicode allowing the same visual character to be encoded in multiple, equally valid ways. Unicode normalization is the process that resolves this ambiguity. Understanding it prevents a whole class of subtle, hard-to-reproduce bugs.

Why Multiple Representations Exist

Unicode assigns code points both to precomposed characters (a single code point representing a base letter plus its accent) and to combining character sequences (the base letter followed by one or more combining marks as separate code points).

For example, the character "é" (LATIN SMALL LETTER E WITH ACUTE) has two valid representations:

  1. Precomposed: U+00E9 — a single code point LATIN SMALL LETTER E WITH ACUTE
  2. Decomposed: U+0065 U+0301 — LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

Both render identically. Both are valid Unicode. But they are different byte sequences and will fail a naive string equality check:

e_precomposed = "\u00e9"          # é as single code point
e_decomposed  = "e\u0301"        # e + combining acute

print(e_precomposed == e_decomposed)   # False
print(len(e_precomposed))             # 1
print(len(e_decomposed))              # 2

This is the problem normalization solves. By converting both strings to the same canonical form before comparison, you get consistent, predictable results.

The Four Normalization Forms

Unicode defines four normalization forms, organized along two axes:

Form | Full Name                                                      | Decomposition | Composition
NFD  | Canonical Decomposition                                        | Canonical     | None
NFC  | Canonical Decomposition, followed by Canonical Composition     | Canonical     | Canonical
NFKD | Compatibility Decomposition                                    | Compatibility | None
NFKC | Compatibility Decomposition, followed by Canonical Composition | Compatibility | Canonical

The two key concepts:

Canonical equivalence: Two sequences that are canonically equivalent look and behave identically. The precomposed "é" and the decomposed "e + combining acute" are canonically equivalent — they are just different serializations of the same abstract character.

Compatibility equivalence: A broader category that also covers characters that are visually similar or semantically related but not necessarily identical in appearance. For example, the ligature "ﬁ" (U+FB01, LATIN SMALL LIGATURE FI) is compatibility-equivalent to the two-character sequence "fi". They look nearly identical but are typographically distinct glyphs.
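The difference between the two equivalence classes is easy to verify directly; a quick sketch using Python's standard unicodedata module:

```python
import unicodedata

# Canonical equivalence: precomposed é and e + combining acute are the
# same abstract character, so NFC maps them to the same sequence.
print(unicodedata.normalize("NFC", "e\u0301") == "\u00e9")   # True

# Compatibility equivalence: the ligature ﬁ (U+FB01) is only unified
# with "fi" by the K forms — canonical normalization leaves it alone.
print(unicodedata.normalize("NFC", "\ufb01") == "fi")    # False
print(unicodedata.normalize("NFKC", "\ufb01") == "fi")   # True
```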

NFD — Canonical Decomposition

NFD converts every precomposed character into its canonical decomposed form. Accented letters are split into their base letter plus combining mark(s). The combining marks are then sorted into a canonical order (by combining class).

import unicodedata

text = "café"
nfd = unicodedata.normalize("NFD", text)

print(ascii(nfd))   # 'cafe\u0301'  (ascii() shows the escape; repr() would render the accent)
print(len(nfd))     # 5  (c, a, f, e, U+0301)

NFD is useful when you need to manipulate base characters and diacritics separately — for example, stripping all accent marks from a string:

import unicodedata

def strip_accents(text: str) -> str:
    nfd = unicodedata.normalize("NFD", text)
    return "".join(c for c in nfd if unicodedata.category(c) != "Mn")

print(strip_accents("Héllo Wörld"))  # Hello World

(Mn is the Unicode category for "Mark, Nonspacing" — i.e., combining diacritical marks.)

NFC — Canonical Decomposition + Composition

NFC first applies NFD (canonical decomposition), then re-composes the resulting sequences back into precomposed characters wherever a precomposed form exists in Unicode.

NFC is the most compact canonical form for languages that use precomposed characters (most Western European languages, Greek, Cyrillic). It is also what most keyboards and input methods produce, and the form the W3C recommends for text on the web.

import unicodedata

decomposed = "e\u0301"   # e + combining acute
nfc = unicodedata.normalize("NFC", decomposed)

print(ascii(nfc))   # '\xe9'  — precomposed é
print(len(nfc))     # 1

NFC is the recommended normalization form for most use cases: database storage, string comparison, key generation (e.g., URL slugs), and API inputs.

def normalize_input(text: str) -> str:
    '''Normalize user input to NFC for consistent storage and comparison.'''
    return unicodedata.normalize("NFC", text)
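As a sketch of the key-generation use case mentioned above, here is a hypothetical make_slug helper (the name and exact cleanup rules are illustrative, not a standard API). It normalizes to NFC first, then uses NFD only as an intermediate step to drop accents before hyphenating:

```python
import re
import unicodedata

def make_slug(title: str) -> str:
    """Hypothetical URL-slug helper: NFC for consistency, then ASCII-fold."""
    nfc = unicodedata.normalize("NFC", title)
    # Decompose so combining marks become separate code points, then drop them
    ascii_ish = "".join(
        c for c in unicodedata.normalize("NFD", nfc)
        if unicodedata.category(c) != "Mn"
    )
    # Lowercase and collapse everything non-alphanumeric into hyphens
    return re.sub(r"[^a-z0-9]+", "-", ascii_ish.lower()).strip("-")

print(make_slug("Café au Lait"))   # cafe-au-lait
```

Note this is intentionally lossy (it discards accents entirely), which is acceptable for slugs but not for stored display text — store the NFC original alongside the slug.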

NFKD and NFKC — Compatibility Forms

The K forms apply compatibility decomposition in addition to canonical decomposition. This maps characters to their compatibility equivalents — resolving ligatures, width variants, circled numbers, and other "compatibility characters" into their base forms.

Examples of compatibility mappings:

Original | Code Point    | NFKD/NFKC Result | Reason
ﬁ        | U+FB01        | fi               | Ligature → components
①        | U+2460        | 1                | Circled digit → digit
²        | U+00B2        | 2                | Superscript → digit
ＡＢＣ   | U+FF21–U+FF23 | ABC              | Fullwidth → ASCII
™        | U+2122        | TM               | Trademark → letters
ﷺ        | U+FDFA        | صلى الله عليه وسلم | Presentation form → sequence

import unicodedata

ligature = "\ufb01le"   # "ﬁle" — ﬁ is U+FB01 LATIN SMALL LIGATURE FI
print(unicodedata.normalize("NFKD", ligature))   # file
print(unicodedata.normalize("NFKC", ligature))   # file
print(unicodedata.normalize("NFC", ligature))    # ﬁle (unchanged — the ligature has no canonical decomposition)

Warning: Compatibility normalization is lossy. Once you apply NFKD or NFKC, you cannot recover the original ligature or special character. Only use it when you explicitly want to flatten all stylistic variants — for example, in search indexing or username normalization.
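As a sketch of the search-indexing and username use cases, a hypothetical search_key helper (the name is illustrative) might combine NFKC with case folding:

```python
import unicodedata

def search_key(text: str) -> str:
    """Hypothetical index key: flatten compatibility variants, then case-fold."""
    return unicodedata.normalize("NFKC", text).casefold()

print(search_key("\ufb01le"))   # file — ligature flattened to its components
print(search_key("ＡＢＣ"))      # abc — fullwidth letters folded to ASCII lowercase
```

Store the key only for matching; keep the user's original string for display, since this transformation cannot be reversed.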

When to Use Each Form

Use Case                      | Recommended Form | Reason
Database storage              | NFC              | Compact, precomposed, matches most text input
String comparison             | NFC              | Compare normalized against normalized
Removing accents              | NFD + filter Mn  | NFD exposes combining marks as separate code points
Search indexing               | NFKC             | Flattens ligatures, width variants, fractions
Username normalization        | NFKC + case fold | Prevents "ＡＢＣ" vs "ABC" conflicts
Security-sensitive comparison | NFKC + case fold | Mitigates homoglyph attacks (partially)
File system (macOS)           | NFD              | HFS+ stores file names in a variant of NFD
File system (Linux/Windows)   | NFC              | ext4 and NTFS treat names as opaque byte sequences

Note: macOS's HFS+ file system stores file names in a variant of NFD (its successor, APFS, preserves names as given but compares them normalization-insensitively), while Linux and Windows treat file names as opaque byte sequences. This can cause issues when transferring files between systems:

# A file named "café" created on macOS may use NFD ("cafe\u0301")
# The same name on Linux is likely NFC ("\u00e9")
# Normalizing to NFC before file operations avoids surprises
import unicodedata, pathlib

def safe_path(name: str) -> pathlib.Path:
    return pathlib.Path(unicodedata.normalize("NFC", name))

Canonical Ordering

NFD and NFKD don't just decompose — they also sort combining marks into canonical order determined by the Canonical Combining Class (CCC) property. Combining marks with lower CCC values come first. This ensures that sequences with the same visual result but different ordering of combining marks are normalized to identical sequences.

For example, a character with both a cedilla (CCC=202) and an acute accent (CCC=230) will always have the cedilla first after NFD, regardless of the input order:

import unicodedata

# Two equivalent sequences with combining marks in different orders
s1 = "\u0041\u0301\u0327"  # A + acute + cedilla
s2 = "\u0041\u0327\u0301"  # A + cedilla + acute

print(s1 == s2)    # False (different byte order)
print(unicodedata.normalize("NFD", s1) == unicodedata.normalize("NFD", s2))   # True
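The CCC values driving this reordering can be inspected with unicodedata.combining, which returns 0 for base characters:

```python
import unicodedata

# Canonical Combining Class determines the sort order of combining marks
print(unicodedata.combining("\u0327"))  # 202  (COMBINING CEDILLA)
print(unicodedata.combining("\u0301"))  # 230  (COMBINING ACUTE ACCENT)
print(unicodedata.combining("A"))       # 0    (base character — never reordered)
```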

Python Examples

import unicodedata

def compare_unicode(a: str, b: str) -> bool:
    '''Compare two Unicode strings regardless of normalization form.'''
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

cafe1 = "caf\u00e9"       # NFC: é as single code point
cafe2 = "cafe\u0301"      # NFD: e + combining acute

print(cafe1 == cafe2)               # False
print(compare_unicode(cafe1, cafe2)) # True

# Check what form a string is in (unicodedata.is_normalized requires Python 3.8+)
def normalization_form(text: str) -> str:
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        if unicodedata.is_normalized(form, text):
            return form
    return "none"

print(normalization_form("café"))   # NFC or NFD depending on input

JavaScript Examples

JavaScript provides String.prototype.normalize():

const precomposed = "\u00e9";       // é
const decomposed  = "e\u0301";     // e + combining acute

console.log(precomposed === decomposed);          // false
console.log(precomposed.normalize("NFC") === decomposed.normalize("NFC")); // true

// NFKC for search normalization
const ligature = "\ufb01le";        // "ﬁle"
console.log(ligature.normalize("NFKC"));   // "file"

// Stripping accents in JS
function stripAccents(str) {
    return str.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}
console.log(stripAccents("Héllo Wörld"));  // "Hello World"

Key Takeaways

  • Unicode allows the same visual character to be represented as a precomposed code point or as a base character + combining mark(s). Without normalization, these compare as unequal.
  • NFC (canonical decomposition + recomposition into precomposed) is the recommended form for storage and comparison — compact and widely expected.
  • NFD (canonical decomposition only) is useful when you need to separate base characters from combining marks, e.g., to strip accents.
  • NFKC and NFKD additionally flatten compatibility variants (ligatures, width variants, superscripts). Use them for search indexing and username normalization, but beware they are lossy.
  • Always normalize to the same form before comparing strings that may come from different sources (user input, file system, database, API).
  • In Python: unicodedata.normalize(form, text). In JavaScript: str.normalize(form).
