What is Normalization?

The process of converting Unicode text into a standard canonical form. There are four forms: NFC (composed), NFD (decomposed), NFKC (compatibility composed), and NFKD (compatibility decomposed).

Properties

Decomposition

A mapping from a character to its constituent parts. Canonical decomposition preserves meaning (é → e + ◌́), whereas compatibility decomposition may change it (ﬁ → fi).

2022-03-02 · Updated 2024-11-25

What Is a Decomposition Mapping?

A decomposition mapping tells you how a Unicode character can be broken down into a sequence of simpler characters. There are two kinds:

Canonical decomposition: the character is identical in meaning and rendering to its decomposed sequence. For example, U+00E9 LATIN SMALL LETTER E WITH ACUTE (é) canonically decomposes to U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT.
Compatibility decomposition: the character is only compatible (semantically similar, possibly different appearance) with its decomposed sequence. For example, the ligature U+FB01 ﬁ (fi) compatibility-decomposes to U+0066 f + U+0069 i, and U+00B2 ² (superscript two) decomposes to U+0032 2.

Normalization Forms

The four Unicode Normalization Forms are defined in terms of decomposition and canonical composition:

Form	Decomposition	Composition
NFD	Canonical	No
NFC	Canonical	Yes (canonical)
NFKD	Compatibility	No
NFKC	Compatibility	Yes (canonical)
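
To make the table concrete, here is a small sketch that applies all four forms to one string and compares the results. The sample word and the variable names are illustrative choices, not taken from any specification.

```python
import unicodedata

# "ﬁancé" written with the fi ligature (U+FB01) and precomposed é (U+00E9).
text = "\uFB01anc\u00E9"

forms = {
    form: unicodedata.normalize(form, text)
    for form in ("NFD", "NFC", "NFKD", "NFKC")
}

for form, value in forms.items():
    code_points = " ".join(f"U+{ord(c):04X}" for c in value)
    print(f"{form:4} ({len(value)} code points): {code_points}")

# For this string, composing the NFD result gives back the original text...
canonical_round_trip = unicodedata.normalize("NFC", forms["NFD"]) == text
# ...but composing the NFKD result does not restore the ligature.
compat_round_trip = unicodedata.normalize("NFC", forms["NFKD"]) == text
print(canonical_round_trip, compat_round_trip)
```

The canonical forms keep the ligature as a single code point, while the compatibility forms replace it with plain f and i, and no later composition step brings it back.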

import unicodedata

samples = [
    ("\u00E9", "é  e+acute"),        # canonical
    ("\u00C5", "Å  A+ring"),         # canonical
    ("\uFB01", "ﬁ  fi ligature"),    # compatibility
    ("\u00B2", "²  superscript 2"),  # compatibility
    ("\u2126", "Ω  OHM SIGN"),       # canonical → U+03A9 GREEK CAPITAL OMEGA
]

for char, label in samples:
    raw = unicodedata.decomposition(char)
    nfd = unicodedata.normalize("NFD", char)
    nfkd = unicodedata.normalize("NFKD", char)
    nfc = unicodedata.normalize("NFC", nfd)
    print(f"  {label}")
    print(f"    decomposition() raw : {raw!r}")
    print(f"    NFD  : {[f'U+{ord(c):04X}' for c in nfd]}")
    print(f"    NFKD : {[f'U+{ord(c):04X}' for c in nfkd]}")
    print(f"    NFC  : {[f'U+{ord(c):04X}' for c in nfc]}")

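The unicodedata.decomposition() function returns the raw decomposition field from UnicodeData.txt as a string, or an empty string if the character has no decomposition mapping. A leading tag in angle brackets, such as <compat>, <font>, <circle>, or <wide>, indicates a compatibility decomposition; no tag means the decomposition is canonical.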
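
The raw string is easy to split into its tag and code points. The helper below is a minimal sketch of that parsing; `parse_decomposition` and its return convention are hypothetical, chosen only for illustration.

```python
import unicodedata

def parse_decomposition(char):
    """Hypothetical helper: return (kind, code_points) for one character.

    kind is "canonical", the compatibility tag name (e.g. "compat"),
    or None when the character has no decomposition mapping.
    """
    raw = unicodedata.decomposition(char)
    if not raw:
        return None, []
    parts = raw.split()
    kind = "canonical"
    if parts[0].startswith("<"):
        # A leading <tag> marks a compatibility decomposition.
        kind = parts[0].strip("<>")
        parts = parts[1:]
    return kind, [int(p, 16) for p in parts]

for ch in ("\u00E9", "\uFB01", "\u00B2", "a"):
    kind, cps = parse_decomposition(ch)
    print(f"U+{ord(ch):04X}", kind, [f"U+{cp:04X}" for cp in cps])
```

Note that this reads only the single-level mapping stored for the character; full normalization applies such mappings recursively and then reorders combining marks.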

Practical Implications

Search and indexing: NFKC normalization lets you match ﬁle against file or ２ against 2, which is why search pipelines commonly apply NFKC before indexing.
Security: Normalization collapses characters that are equivalent but encoded differently. U+2126 Ω (OHM SIGN) and U+03A9 Ω (GREEK CAPITAL LETTER OMEGA) look identical and are canonically equivalent, so an application that compares usernames should normalize both sides first.
Identifiers: Python 3 normalizes identifiers to NFKC while parsing (PEP 3131).
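
As a rough illustration of these points, the sketch below builds a loose matching key from NFKC plus case folding and then checks how Python treats a non-normalized identifier. `match_key` is a hypothetical helper, and adding case folding is an assumption of this example; it is not a complete defense against look-alike characters.

```python
import unicodedata

def match_key(text):
    """Hypothetical helper: NFKC-normalize, then case-fold, for loose matching."""
    return unicodedata.normalize("NFKC", text).casefold()

# Search: the ligature and the fullwidth digit match their plain forms.
print(match_key("\uFB01le") == match_key("file"))
print(match_key("\uFF12") == match_key("2"))

# Usernames: OHM SIGN and GREEK CAPITAL LETTER OMEGA produce the same key.
print(match_key("\u2126") == match_key("\u03A9"))

# Identifiers: the parser stores the NFKC form of the name.
namespace = {}
exec("\u2126 = 1", namespace)  # source text uses U+2126 OHM SIGN
print("\u03A9" in namespace, "\u2126" in namespace)
```

Comparing keys rather than raw strings keeps the original text available for display while the normalized form is used only for matching.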

Quick Facts

Property	Value
Unicode property name	`Decomposition_Mapping`
Short alias	`dm`
Types	Canonical, Compatibility (16 tags: `<compat>`, `<font>`, `<circle>`, etc.)
Python function	`unicodedata.decomposition(char)` → raw string
Normalization function	`unicodedata.normalize(form, string)`
Forms	NFD, NFC, NFKD, NFKC
Spec reference	Unicode Standard Annex #15 (UAX #15)

Related Terms

Normalization, Canonical Equivalence, Compatibility Equivalence

More in Properties

East Asian Width

Unicode property (UAX#11) classifying characters as Narrow, Wide, Fullwidth, Halfwidth, Ambiguous, or …

Joining Type

Unicode property controlling how Arabic and Syriac characters connect to adjacent characters. …

Script Extensions

Unicode property listing all scripts that use a character, broader than the …

Grapheme Cluster

A user-perceived character: something that feels like a single unit. It may consist of multiple code points (a base plus combining marks, or an emoji ZWJ sequence). 👩‍💻 = …

Case Mapping

Rules for converting characters between uppercase, lowercase, and titlecase. They can be locale-dependent (the Turkish I problem) and one-to-many (ß → SS).

Combining Class

A numeric value (0–254) that controls the ordering of combining marks during canonical decomposition. It determines which combining marks can be reordered.

Compatibility Equivalence

Two character sequences that have the same abstract content but may differ in appearance. Broader than canonical equivalence. Examples: ﬁ ≈ fi, ² ≈ 2

Canonical Equivalence

Two character sequences that have the same meaning and should be treated as equivalent. Example: é (U+00E9) ≡ e + ◌́ (U+0065 + U+0301)

Mirroring Property

Characters whose glyphs should be mirrored horizontally in RTL contexts. Examples: ( → ), [ → ], { → }, …

Age Property

The Unicode version in which a character was first assigned. Useful for checking character support across different versions of systems and software.
