What is 规范等价?

语义上完全相同、应被视为等价的两个字符序列，例如é（U+00E9）≡ e + ◌́（U+0065 + U+0301）。

What is Unicode 规范化?

将Unicode文本转换为标准规范形式的过程，包含四种形式：NFC（合成）、NFD（分解）、NFKC（兼容合成）、NFKD（兼容分解）。

What is NFKC (Compatibility Composition)?

规范化形式KC：兼容分解后再规范合成，合并视觉上相似的字符（ﬁ→fi、²→2、Ⅳ→IV），用于标识符比较。

What is NFKD (Compatibility Decomposition)?

规范化形式KD：兼容分解而不重新合成，是最激进的规范化方式，会丢失最多的格式信息。

字符属性

兼容等价

具有相同抽象内容但外观可能不同的两个字符序列，比规范等价更宽泛，例如ﬁ ≈ fi，² ≈ 2。

2022-05-16 · Updated 2024-10-01

What Is Compatibility Equivalence?

Two Unicode strings are compatibility equivalent if they represent semantically similar content but may differ in appearance or formatting. Compatibility equivalence is weaker than canonical equivalence: canonically equivalent strings are always compatibility equivalent, but not vice versa.

Common compatibility equivalences include:

The ligature ﬁ (U+FB01, fi LIGATURE) ≈ fi (f + i separately)
The superscript ² (U+00B2) ≈ 2 (U+0032)
The fullwidth Ａ (U+FF21) ≈ A (U+0041)
The fraction ½ (U+00BD) in NFKD → 1 ⁄ 2 (sequence of three characters)
The circled digit ① (U+2460) ≈ 1 (U+0031)

Compatibility Normalization Forms

Form	Description
NFKD	Apply compatibility decomposition; apply canonical ordering
NFKC	Apply NFKD, then canonically compose

import unicodedata

examples = [
    ("\uFB01", "fi ligature"),         # ﬁ
    ("\u00B2", "superscript 2"),       # ²
    ("\uFF21", "fullwidth A"),         # Ａ
    ("\u2460", "circled digit 1"),     # ①
    ("\u00BD", "vulgar fraction 1/2"), # ½
]

for char, label in examples:
    nfc  = unicodedata.normalize("NFC",  char)
    nfkc = unicodedata.normalize("NFKC", char)
    nfd  = unicodedata.normalize("NFD",  char)
    nfkd = unicodedata.normalize("NFKD", char)
    print(f"  {char}  ({label})")
    print(f"    NFC  len={len(nfc)}   NFKC={nfkc!r} len={len(nfkc)}")
    print(f"    NFD  len={len(nfd)}   NFKD={[f'U+{ord(c):04X}' for c in nfkd]}")

# ﬁ  NFC len=1  NFKC='fi' len=2
# ²  NFC len=1  NFKC='2'  len=1
# Ａ  NFC len=1  NFKC='A'  len=1
# ①  NFC len=1  NFKC='1'  len=1

When to Use NFKC vs NFC

Use NFC when you want to preserve formatting distinctions: a superscript 2 and a plain 2 are different in a math formula. Use NFKC when you want semantic comparison, ignoring presentational variants: a search engine should return results for "fi" when the user types "ﬁle". Python uses NFKC for identifier normalization (PEP 3131), so ﬁle and file are the same identifier in Python 3.

Caution: NFKC is lossy. Applying it to 2² produces 22, discarding the superscript meaning. Never apply NFKC to content where formatting carries semantic information.

Quick Facts

Property	Value
Concept	Compatibility equivalence
Normalization forms	NFKD, NFKC
Python function	`unicodedata.normalize("NFKC", s)` / `"NFKD"`
Lossy?	Yes — formatting distinctions are discarded
Python identifier normalization	NFKC (PEP 3131)
Search engine use	NFKC for case-folded token normalization
Spec reference	Unicode Standard Annex #15 (UAX #15)