What is Compatibility Equivalence?

Two character sequences that have the same abstract content but may differ in appearance. Broader than canonical equivalence. Examples: ﬁ ≈ fi, ² ≈ 2.
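As a minimal sketch of the two examples above, the snippet below uses Python's standard unicodedata module to show that canonical normalization (NFC) leaves the ligature alone while compatibility normalization (NFKC) folds it to plain letters. The variable names are illustrative only.

```python
import unicodedata

ligature = "\uFB01"     # ﬁ LATIN SMALL LIGATURE FI
superscript = "\u00B2"  # ² SUPERSCRIPT TWO

# Canonical normalization (NFC) leaves compatibility characters untouched.
nfc_ligature = unicodedata.normalize("NFC", ligature)

# Compatibility normalization (NFKC) folds them to their plain counterparts.
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
nfkc_superscript = unicodedata.normalize("NFKC", superscript)

print(nfc_ligature == ligature)   # True
print(nfkc_ligature == "fi")      # True
print(nfkc_superscript == "2")    # True
```

Because NFKC discards the distinction between ² and 2, it is usually better suited to matching and searching than to storing text that must round-trip unchanged.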

What is Normalization?

The process of converting Unicode text into a standard canonical form. There are four forms: NFC (composed), NFD (decomposed), NFKC (compatibility composed), and NFKD (compatibility decomposed).
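To make the four forms concrete, here is a small sketch that normalizes one sample string (a ligature followed by a precomposed letter, chosen purely for illustration) with each form and prints the resulting code points.

```python
import unicodedata

sample = "\uFB01\u00E9"  # ﬁ (ligature) followed by é (precomposed)

# Normalize the same input with each of the four forms.
forms = {form: unicodedata.normalize(form, sample)
         for form in ("NFC", "NFD", "NFKC", "NFKD")}

for form, text in forms.items():
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(f"{form:4} {code_points}")
# NFC  U+FB01 U+00E9
# NFD  U+FB01 U+0065 U+0301
# NFKC U+0066 U+0069 U+00E9
# NFKD U+0066 U+0069 U+0065 U+0301
```

Only the K forms touch the ligature, and only the D forms split é into a base letter plus a combining mark.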

What is NFC (Canonical Composition)?

Normalization Form C: canonical decomposition followed by canonical recomposition, which usually yields the most compact form. Recommended for data storage and interchange; the form conventionally used on the web.

What is NFD (Canonical Decomposition)?

Normalization Form D: full canonical decomposition with no recomposition. Used by the macOS HFS+ file system. Example: é (U+00E9) → e + ◌́ (U+0065 + U+0301).

Properties

Canonical Equivalence

Two character sequences that are semantically identical and must be treated as the same. Example: é (U+00E9) ≡ e + ◌́ (U+0065 + U+0301).

2022-05-02 · Updated 2024-05-23

What Is Canonical Equivalence?

Two Unicode strings are canonically equivalent if they represent the same abstract character sequence and should be treated as identical in all Unicode-conforming operations. They look the same, are pronounced the same, and have the same semantic value; the only difference is which code points are used to encode them.

The most common example of canonical equivalence is a precomposed character versus a base letter followed by a combining diacritic:

U+00F1 LATIN SMALL LETTER N WITH TILDE (ñ) — a single code point
U+006E LATIN SMALL LETTER N + U+0303 COMBINING TILDE — two code points

These two sequences are canonically equivalent. They must render identically and compare as equal after normalization.

Canonical Normalization Forms

Unicode defines two canonical normalization forms:

Form	Description
NFD (Canonical Decomposition)	Break all precomposed characters into base + combining marks; apply canonical ordering
NFC (Canonical Composition)	Apply NFD, then recompose into precomposed characters where possible

import unicodedata

# Two ways to write Spanish "ñ"
precomposed = "\u00F1"          # ñ as single code point
decomposed  = "\u006E\u0303"    # n + combining tilde

# They look the same:
print(precomposed, decomposed)
# ñ ñ

# But they are NOT equal as raw Python strings:
print(precomposed == decomposed)
# False
print(len(precomposed), len(decomposed))
# 1 2

# After NFC normalization they are equal:
nfc_pre = unicodedata.normalize("NFC", precomposed)
nfc_dec = unicodedata.normalize("NFC", decomposed)
print(nfc_pre == nfc_dec)
# True

# After NFD normalization they are also equal:
nfd_pre = unicodedata.normalize("NFD", precomposed)
nfd_dec = unicodedata.normalize("NFD", decomposed)
print(nfd_pre == nfd_dec)
# True
print(len(nfd_pre), len(nfd_dec))
# 2 2  (both are now decomposed)
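The "canonical ordering" step mentioned in the NFD row of the table can be illustrated with a short sketch. It assumes a base letter carrying two combining marks of different combining classes (dot above, class 230, and dot below, class 220); the example string is chosen for illustration only.

```python
import unicodedata

# q + dot above (class 230) + dot below (class 220), and the same marks swapped
above_first = "q\u0307\u0323"
below_first = "q\u0323\u0307"

# The raw strings differ because the marks are in a different order.
print(above_first == below_first)
# False

# NFD sorts the marks by combining class, so both inputs converge.
nfd_a = unicodedata.normalize("NFD", above_first)
nfd_b = unicodedata.normalize("NFD", below_first)
print(nfd_a == nfd_b)
# True
print([unicodedata.combining(ch) for ch in nfd_a])
# [0, 220, 230]
```

Marks with the same combining class are not reordered, since their relative order can affect rendering.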

Why This Matters

String comparison: Any application that compares user input against stored data must normalize both sides to the same form. Passwords, usernames, and search queries can silently differ due to canonical equivalence. The Python unicodedata.normalize("NFC", s) call is the standard fix.
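One way to apply this is a small comparison helper. The sketch below is not a library API: canonical_equal and the sample usernames are hypothetical names introduced here for illustration.

```python
import unicodedata

def canonical_equal(a: str, b: str) -> bool:
    """Return True if a and b are canonically equivalent (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

stored_username = "jalape\u00F1o"   # precomposed ñ
typed_username = "jalapen\u0303o"   # n + combining tilde

print(stored_username == typed_username)
# False
print(canonical_equal(stored_username, typed_username))
# True
```

In practice it is usually cheaper to normalize once when data is written, so that lookups can use plain equality.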

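Database storage: PostgreSQL stores text as supplied and does not normalize it automatically (version 13 and later provide a normalize() function and an IS NORMALIZED predicate); MySQL's comparison behavior depends on collation. Mixing NFC and NFD strings in a column that is compared code point by code point can cause subtle lookup failures.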

File systems: macOS HFS+ normalizes filenames to NFD; Windows NTFS and Linux ext4 are normalization-agnostic. A file named ñ.txt may be stored differently on different systems, causing sync tools to create duplicates.
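A sync tool could guard against such duplicates by grouping names under a normalized key. The following is a rough sketch under that assumption; find_normalization_collisions and the sample listing are hypothetical and not taken from any particular tool.

```python
import unicodedata
from collections import defaultdict

def find_normalization_collisions(names):
    """Group names that differ as raw strings but share one NFC form."""
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize("NFC", name)].append(name)
    # Keep only groups holding more than one distinct raw spelling.
    return {key: group for key, group in groups.items() if len(set(group)) > 1}

listing = ["\u00F1.txt", "n\u0303.txt", "readme.md"]
collisions = find_normalization_collisions(listing)
print(len(collisions))
# 1
```

The key of each reported group is the NFC spelling, which could serve as the single name to keep.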

Quick Facts

Property	Value
Concept	Canonical equivalence
Normalization forms	NFD, NFC
Python function	`unicodedata.normalize("NFC", s)` / `"NFD"`
Common pitfall	Comparing strings without normalizing first
Related concept	Compatibility equivalence (broader, NFKD/NFKC)
Spec reference	Unicode Standard Annex #15 (UAX #15)

Related Terms

Compatibility Equivalence
Normalization
NFC (Canonical Composition)
NFD (Canonical Decomposition)

More in Properties

Name Alias

Alternative names for a character, since Unicode names cannot be changed under the policy …

Block

A named, contiguous range of code points (for example, Basic Latin = U+0000–U+007F). Unicode …

Default Ignorable

Characters that have no visible effect and can be ignored by processes that …

Decomposition

A mapping from a character to its component parts. Canonical decomposition preserves meaning (é → e …

East Asian Width

Unicode property (UAX#11) classifying characters as Narrow, Wide, Fullwidth, Halfwidth, Ambiguous, or …

Joining Type

Unicode property controlling how Arabic and Syriac characters connect to adjacent characters. …

Bidirectional Category

A property that determines how a character behaves in bidirectional text (LTR, RTL, …

General Category

Classification of every code point into one of 30 categories (Lu, …

Combining Class

A numeric value (0–254) that controls the ordering of combining marks during canonical decomposition, determining …

Compatibility Equivalence

Two character sequences that have the same abstract content but may differ in …
