What is Normalization?

The process of converting Unicode text to a standard normalized form. There are four forms: NFC (composed), NFD (decomposed), NFKC (compatibility composed), and NFKD (compatibility decomposed).

What is Canonical Equivalence?

Two character sequences that have exactly the same meaning and should be treated identically. Example: é (U+00E9) ≡ e + ◌́ (U+0065 + U+0301).

What is Compatibility Equivalence?

Two character sequences that have the same abstract content but may differ in appearance. Broader than canonical equivalence. Examples: ﬁ ≈ fi, ² ≈ 2.
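The two kinds of equivalence can be tested by normalizing both strings and comparing the results. The following is a minimal sketch using Python's standard unicodedata module; the helper names canonically_equivalent and compatibility_equivalent are hypothetical and chosen only for illustration.

```python
import unicodedata

def canonically_equivalent(a, b):
    # Canonically equivalent strings have the same NFD form.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def compatibility_equivalent(a, b):
    # Compatibility-equivalent strings have the same NFKD form.
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)

pairs = [
    ("\u00E9", "e\u0301"),  # precomposed é vs. e + combining acute
    ("\uFB01", "fi"),       # fi ligature vs. the letters f and i
    ("\u00B2", "2"),        # superscript two vs. digit two
]

for a, b in pairs:
    print(f"{a!r} vs {b!r}: "
          f"canonical={canonically_equivalent(a, b)}, "
          f"compatibility={compatibility_equivalent(a, b)}")
```

Every canonically equivalent pair is also compatibility equivalent, but not the other way around, which is why the ligature and superscript pairs only pass the second check.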

Property

Decomposition

The mapping of a character to its constituent parts. Canonical decomposition preserves meaning (é → e + ◌́); compatibility decomposition may change it (ﬁ → fi).

2022-03-02 · Updated 2024-11-25

What Is a Decomposition Mapping?

A decomposition mapping tells you how a Unicode character can be broken down into a sequence of simpler characters. There are two kinds:

Canonical decomposition: the character is identical in meaning and rendering to its decomposed sequence. For example, U+00E9 LATIN SMALL LETTER E WITH ACUTE (é) canonically decomposes to U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT.
Compatibility decomposition: the character is only compatible (semantically similar, possibly different appearance) with its decomposed sequence. For example, the ligature U+FB01 ﬁ (fi) compatibility-decomposes to U+0066 f + U+0069 i, and U+00B2 ² (superscript two) decomposes to U+0032 2.

Normalization Forms

The four Unicode Normalization Forms are defined in terms of decomposition and canonical composition:

Form	Decomposition	Composition
NFD	Canonical	No
NFC	Canonical	Yes (canonical)
NFKD	Compatibility	No
NFKC	Compatibility	Yes (canonical)

import unicodedata

samples = [
    ("\u00E9", "é  e+acute"),        # canonical
    ("\u00C5", "Å  A+ring"),         # canonical
    ("\uFB01", "ﬁ  fi ligature"),    # compatibility
    ("\u00B2", "²  superscript 2"),  # compatibility
    ("\u2126", "Ω  OHM SIGN"),       # canonical → U+03A9 GREEK CAPITAL OMEGA
]

for char, label in samples:
    raw = unicodedata.decomposition(char)
    nfd = unicodedata.normalize("NFD", char)
    nfkd = unicodedata.normalize("NFKD", char)
    nfc = unicodedata.normalize("NFC", nfd)
    print(f"  {label}")
    print(f"    decomposition() raw : {raw!r}")
    print(f"    NFD  : {[f'U+{ord(c):04X}' for c in nfd]}")
    print(f"    NFKD : {[f'U+{ord(c):04X}' for c in nfkd]}")
    print(f"    NFC  : {[f'U+{ord(c):04X}' for c in nfc]}")

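The unicodedata.decomposition() function returns the raw decomposition field from UnicodeData.txt as a string of space-separated hexadecimal code points, or an empty string if the character has no decomposition mapping. A leading tag in angle brackets (<compat>, <font>, <circle>, <wide>, and so on) indicates a compatibility decomposition; no tag means canonical.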
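The raw string is easy to split into its parts. The following is a minimal sketch, assuming only the format described above; parse_decomposition is a hypothetical helper, not part of the unicodedata module.

```python
import unicodedata

def parse_decomposition(char):
    """Return (kind, tag, code_points) for a single character.

    Hypothetical helper: it only interprets the string that
    unicodedata.decomposition() returns.
    """
    raw = unicodedata.decomposition(char)
    if not raw:
        # Empty string: the character has no decomposition mapping.
        return ("none", None, [])
    fields = raw.split()
    if fields[0].startswith("<"):
        # A leading <tag> marks a compatibility decomposition.
        return ("compatibility", fields[0], [int(f, 16) for f in fields[1:]])
    # No tag: the mapping is canonical.
    return ("canonical", None, [int(f, 16) for f in fields])

for ch in ("\u00E9", "\uFB01", "\u00B2", "\u2126", "a"):
    kind, tag, cps = parse_decomposition(ch)
    print(f"U+{ord(ch):04X}", kind, tag, [f"U+{cp:04X}" for cp in cps])
```

Note that the mapping is a single step taken straight from the data file; unicodedata.normalize() is what applies decompositions repeatedly until nothing further decomposes.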

Practical Implications

Search and indexing: NFKC normalization lets you match ﬁle against file or ２ against 2. Many search engines apply NFKC before indexing.
Security: Some distinct code points render identically. U+2126 OHM SIGN (Ω) and U+03A9 GREEK CAPITAL LETTER OMEGA (Ω) look the same and are canonically equivalent, so an application that compares usernames should normalize both sides first.
Identifiers: Python 3 uses NFKC for identifier normalization (PEP 3131).
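The search and username cases can be sketched with the standard library. This is an illustration under stated assumptions, not a complete matching or security policy: search_key and same_username are hypothetical names, and the use of casefold() for case-insensitive matching is an added assumption that the text above does not mention.

```python
import unicodedata

def search_key(text):
    # NFKC folds compatibility variants (ligatures, fullwidth digits);
    # casefold() is an extra, assumed step for case-insensitive matching.
    return unicodedata.normalize("NFKC", text).casefold()

def same_username(a, b):
    # NFC unifies canonically equivalent spellings without folding
    # compatibility characters such as ligatures.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(search_key("\uFB01le") == search_key("file"))  # ligature vs. plain letters
print(search_key("\uFF12") == search_key("2"))       # fullwidth digit vs. ASCII digit
print(same_username("\u2126", "\u03A9"))             # OHM SIGN vs. GREEK CAPITAL OMEGA
print(same_username("\uFB01", "fi"))                 # NFC leaves the ligature alone
```

The first three comparisons succeed and the last one does not, because NFC applies only canonical mappings. Whether usernames should be compared under NFC or NFKC is a design choice that depends on how aggressively look-alike spellings should be merged.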

Quick Facts

Property	Value
Unicode property name	`Decomposition_Mapping`
Short alias	`dm`
Types	Canonical, Compatibility (16 tags: `<compat>`, `<font>`, `<circle>`, etc.)
Python function	`unicodedata.decomposition(char)` → raw string
Normalization function	`unicodedata.normalize(form, string)`
Forms	NFD, NFC, NFKD, NFKC
Spec reference	Unicode Standard Annex #15 (UAX #15)