What is Unicode Normalization?

Process of converting Unicode text to a standard canonical form. Four forms: NFC (composed), NFD (decomposed), NFKC (compatibility composed), NFKD (compatibility decomposed).

What is Compatibility Equivalence?

Two character sequences with the same abstract content that may differ in appearance. Broader than canonical equivalence. Example: ﬁ ≈ fi, ² ≈ 2.

Properties

Decomposition

The mapping of a character to its component parts. Canonical decomposition preserves meaning (é → e + ́); compatibility decomposition may change it (ﬁ → fi).

2022-03-02 · Updated 2024-11-25

What Is a Decomposition Mapping?

A decomposition mapping tells you how a Unicode character can be broken down into a sequence of simpler characters. There are two kinds:

Canonical decomposition: the character is identical in meaning and rendering to its decomposed sequence. For example, U+00E9 LATIN SMALL LETTER E WITH ACUTE (é) canonically decomposes to U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT.
Compatibility decomposition: the character is only compatible (semantically similar, possibly different appearance) with its decomposed sequence. For example, the ligature U+FB01 ﬁ (fi) compatibility-decomposes to U+0066 f + U+0069 i, and U+00B2 ² (superscript two) decomposes to U+0032 2.
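The difference between the two kinds can be checked by comparing fully decomposed strings. The following is a minimal sketch; the helper names canonically_equivalent and compatibility_equivalent are illustrative and not part of any library.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """True if both strings have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def compatibility_equivalent(a: str, b: str) -> bool:
    """True if both strings have the same compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)


# U+00E9 has a canonical mapping to e + U+0301.
print(canonically_equivalent("\u00E9", "e\u0301"))    # True
# The fi ligature has only a compatibility mapping, so NFD leaves it alone.
print(canonically_equivalent("\uFB01", "fi"))         # False
print(compatibility_equivalent("\uFB01", "fi"))       # True
print(compatibility_equivalent("\u00B2", "2"))        # True
```

Every canonically equivalent pair is also compatibility equivalent, but not the other way round, which is why the ligature passes only the second check.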

Normalization Forms

The four Unicode Normalization Forms are defined in terms of decomposition and canonical composition:

Form	Decomposition	Composition
NFD	Canonical	No
NFC	Canonical	Yes (canonical)
NFKD	Compatibility	No
NFKC	Compatibility	Yes (canonical)
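As a quick check of the table, the sketch below applies all four forms to one short string that contains both a compatibility character (the fi ligature) and a canonically decomposable one (é). The code_points helper is a hypothetical convenience for display.

```python
import unicodedata


def code_points(s: str) -> list:
    """Render a string as a list of U+XXXX labels."""
    return [f"U+{ord(c):04X}" for c in s]


# U+FB01 (fi ligature) followed by U+00E9 (e with acute)
text = "\uFB01\u00E9"

forms = {
    form: unicodedata.normalize(form, text)
    for form in ("NFD", "NFC", "NFKD", "NFKC")
}

for form, result in forms.items():
    print(f"{form:4} : {code_points(result)}")
```

The canonical forms leave the ligature untouched and differ only in whether é is split; the compatibility forms additionally replace the ligature with f + i.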

import unicodedata

samples = [
    ("\u00E9", "é  e+acute"),        # canonical
    ("\u00C5", "Å  A+ring"),         # canonical
    ("\uFB01", "ﬁ  fi ligature"),    # compatibility
    ("\u00B2", "²  superscript 2"),  # compatibility
    ("\u2126", "Ω  OHM SIGN"),       # canonical → U+03A9 GREEK CAPITAL OMEGA
]

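for char, label in samples:
    # Raw Decomposition_Mapping field; a leading <tag> marks a compatibility mapping.
    raw = unicodedata.decomposition(char)
    # NFD applies canonical mappings only; NFKD also applies compatibility mappings.
    nfd = unicodedata.normalize("NFD", char)
    nfkd = unicodedata.normalize("NFKD", char)
    # Recompose the NFD result; U+2126 comes back as U+03A9, not the original.
    nfc = unicodedata.normalize("NFC", nfd)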
    print(f"  {label}")
    print(f"    decomposition() raw : {raw!r}")
    print(f"    NFD  : {[f'U+{ord(c):04X}' for c in nfd]}")
    print(f"    NFKD : {[f'U+{ord(c):04X}' for c in nfkd]}")
    print(f"    NFC  : {[f'U+{ord(c):04X}' for c in nfc]}")

The unicodedata.decomposition() function returns the character's decomposition field from UnicodeData.txt as a raw string, or an empty string when the character has no mapping. A leading tag in angle brackets, such as <compat>, <font>, <circle>, or <wide>, indicates a compatibility decomposition; no tag means canonical.
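Because the return value is a plain string, it usually needs a small amount of parsing before use. The following sketch splits it into the optional tag and a list of code points; parse_decomposition is a hypothetical helper, not a function provided by unicodedata.

```python
import unicodedata
from typing import List, Optional, Tuple


def parse_decomposition(char: str) -> Tuple[Optional[str], List[int]]:
    """Split the raw decomposition string into (tag, code points).

    tag is None for a canonical mapping or when there is no mapping;
    the code point list is empty when there is no mapping.
    """
    raw = unicodedata.decomposition(char)
    if not raw:
        return None, []
    fields = raw.split()
    tag = None
    if fields[0].startswith("<"):
        # Compatibility mapping: drop the angle brackets around the tag.
        tag = fields[0].strip("<>")
        fields = fields[1:]
    return tag, [int(field, 16) for field in fields]


print(parse_decomposition("\u00E9"))  # canonical: no tag
print(parse_decomposition("\uFB01"))  # compatibility: 'compat' tag
print(parse_decomposition("a"))       # no mapping at all
```

Note that a missing tag alone does not distinguish a canonical mapping from no mapping; the empty code point list does.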

Practical Implications

Search and indexing: NFKC normalization lets you match ﬁle against file or ２ against 2, which is why search pipelines often apply it before indexing.
Security: normalization removes distinctions between equivalent code points that render identically. U+2126 Ω (OHM SIGN) and U+03A9 Ω (GREEK CAPITAL LETTER OMEGA) are canonically equivalent, so an application that compares usernames should normalize both sides first. Normalization does not catch look-alikes that are not equivalent, such as Cyrillic а and Latin a.
Identifiers: Python 3 normalizes identifiers to NFKC while parsing (PEP 3131).
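The search and username cases can be sketched as follows. Both helper names are hypothetical, and adding casefold() to the search key is an assumption made here for case-insensitive matching, not something the normalization forms themselves require.

```python
import unicodedata


def search_key(text: str) -> str:
    """Fold compatibility variants, then case, to build an index key.

    Assumption: NFKC followed by casefold() is good enough for this sketch;
    a production system may need stricter rules.
    """
    return unicodedata.normalize("NFKC", text).casefold()


def same_username(a: str, b: str) -> bool:
    """Compare two usernames after canonical normalization (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Ligature and fullwidth digit match their plain counterparts.
print(search_key("\uFB01le") == search_key("file"))  # True
print(search_key("\uFF12") == search_key("2"))       # True

# OHM SIGN and GREEK CAPITAL LETTER OMEGA differ as raw code points
# but are equal once both sides are normalized.
print("\u2126" == "\u03A9")                          # False
print(same_username("\u2126", "\u03A9"))             # True
```

NFC is used for the username comparison because the Ω pair is canonically equivalent; switching to NFKC would also merge compatibility variants, which may or may not be desired for account names.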

Quick Facts

Property	Value
Unicode property name	`Decomposition_Mapping`
Short alias	`dm`
Types	Canonical, Compatibility (16 tags: `<compat>`, `<font>`, `<circle>`, etc.)
Python function	`unicodedata.decomposition(char)` → raw string
Normalization function	`unicodedata.normalize(form, string)`
Forms	NFD, NFC, NFKD, NFKC
Spec reference	Unicode Standard Annex #15 (UAX #15)

Related Terms

Unicode Normalization · Canonical Equivalence · Compatibility Equivalence
