Canonical Equivalence
Two character sequences that are semantically identical and should be treated as equivalent. Example: é (U+00E9) ≡ e + ◌́ (U+0065 + U+0301).
What Is Canonical Equivalence?
Two Unicode strings are canonically equivalent if they represent the same abstract character sequence and should be treated as identical in all Unicode-conforming operations. They look the same, are pronounced the same, and have the same semantic value—the only difference is how the code points are arranged.
The most common example of canonical equivalence is a precomposed character versus a base letter followed by a combining diacritic:
- U+00F1 LATIN SMALL LETTER N WITH TILDE (ñ) — a single code point
- U+006E LATIN SMALL LETTER N + U+0303 COMBINING TILDE — two code points
These two sequences are canonically equivalent. They must render identically and compare as equal after normalization.
Canonical Normalization Forms
Unicode defines two canonical normalization forms:
| Form | Description |
|---|---|
| NFD (Canonical Decomposition) | Break all precomposed characters into base + combining marks; apply canonical ordering |
| NFC (Canonical Composition) | Apply NFD, then recompose into precomposed characters where possible |
```python
import unicodedata

# Two ways to write Spanish "ñ"
precomposed = "\u00F1"       # ñ as a single code point
decomposed = "\u006E\u0303"  # n + combining tilde

# They look the same:
print(precomposed, decomposed)
# ñ ñ

# But they are NOT equal as raw Python strings:
print(precomposed == decomposed)
# False
print(len(precomposed), len(decomposed))
# 1 2

# After NFC normalization they are equal:
nfc_pre = unicodedata.normalize("NFC", precomposed)
nfc_dec = unicodedata.normalize("NFC", decomposed)
print(nfc_pre == nfc_dec)
# True

# After NFD normalization they are also equal:
nfd_pre = unicodedata.normalize("NFD", precomposed)
nfd_dec = unicodedata.normalize("NFD", decomposed)
print(nfd_pre == nfd_dec)
# True
print(len(nfd_pre), len(nfd_dec))
# 2 2 (both are now decomposed)
```
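The NFD step also applies canonical ordering: when several combining marks attach to one base character, they are sorted by their canonical combining class, so mark order alone cannot break equivalence. A small sketch of this, using two dots on "q" (an example pattern from UAX #15):

```python
import unicodedata

# q + COMBINING DOT ABOVE (U+0307, ccc=230) + COMBINING DOT BELOW (U+0323, ccc=220)
s1 = "q\u0307\u0323"
# q + COMBINING DOT BELOW + COMBINING DOT ABOVE (marks in the other order)
s2 = "q\u0323\u0307"

print(s1 == s2)
# False (different code point order)

# NFD sorts the marks by combining class, so both normalize identically:
print(unicodedata.normalize("NFD", s1) == unicodedata.normalize("NFD", s2))
# True

# The combining classes that drive the ordering:
print(unicodedata.combining("\u0323"), unicodedata.combining("\u0307"))
# 220 230
```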
Why This Matters
String comparison: Any application that compares user input against stored data must normalize both sides to the same form. Passwords, usernames, and search queries can silently differ due to canonical equivalence. The Python unicodedata.normalize("NFC", s) call is the standard fix.
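A minimal helper that applies this fix, normalizing both sides before comparing (the function name nfc_equal is illustrative, not a standard API):

```python
import unicodedata

def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(nfc_equal("\u00F1", "\u006E\u0303"))
# True (precomposed ñ vs. n + combining tilde)
print(nfc_equal("cafe", "caf\u00E9"))
# False (genuinely different strings stay unequal)
```

Either canonical form works here, as long as both sides use the same one; NFC is the common choice because most input arrives already composed.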
Database storage: databases generally store text byte-for-byte without normalizing it. PostgreSQL does not normalize automatically (it exposes a normalize() SQL function since version 13), and MySQL's comparison behavior depends on collation. Mixing NFC and NFD strings in one column can cause subtle lookup failures.
File systems: macOS HFS+ normalizes filenames to NFD; Windows NTFS and Linux ext4 are normalization-agnostic. A file named ñ.txt may be stored differently on different systems, causing sync tools to create duplicates.
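A quick way to spot such mismatches is to check which form a filename is in. A sketch using unicodedata.is_normalized (available since Python 3.8):

```python
import unicodedata

name_nfc = "\u00F1.txt"   # ñ.txt as most Linux tools write it (NFC)
name_nfd = "n\u0303.txt"  # ñ.txt as HFS+ would store it (NFD)

print(name_nfc == name_nfd)
# False: two distinct byte sequences for the "same" filename

print(unicodedata.is_normalized("NFC", name_nfc))
# True
print(unicodedata.is_normalized("NFC", name_nfd))
# False
```

A sync tool that normalizes names to one form before comparing paths avoids creating the duplicate files described above.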
Quick Facts
| Property | Value |
|---|---|
| Concept | Canonical equivalence |
| Normalization forms | NFD, NFC |
| Python function | unicodedata.normalize("NFC", s) / "NFD" |
| Common pitfall | Comparing strings without normalizing first |
| Opposite concept | Compatibility equivalence (looser, NFKD/NFKC) |
| Spec reference | Unicode Standard Annex #15 (UAX #15) |
Related Terms
More terms in Properties
The Unicode version in which a character was first assigned. Useful for determining character support across systems and software versions.
Unicode property (UAX#11) classifying characters as Narrow, Wide, Fullwidth, Halfwidth, Ambiguous, or …
Unicode property controlling how Arabic and Syriac characters connect to adjacent characters. …
Unicode property listing all scripts that use a character, broader than the …
Rules for converting characters to uppercase, lowercase, and titlecase. May be locale-dependent (the Turkish I problem), and some mappings are one-to-many (ß → SS).
The writing system a character belongs to (e.g., Latin, Cyrillic, Han). Unicode 16.0 defines 168 scripts, and the Script property is important for security and mixed-script detection.
Characters that can be ignored with no visible effect by processes that do not support them, including variation selectors, zero-width characters, and language tags.
A named, contiguous range of code points (e.g., Basic Latin = U+0000–U+007F). Unicode 16.0 defines 336 blocks, and every code point belongs to exactly one block.
Characters whose glyphs should be mirrored horizontally in RTL contexts. Examples: ( → ), [ → ], { → }, « → ».
A system classifying every code point into one of 30 categories (Lu, Ll, Nd, So, etc.), grouped into 7 major classes (Letter, Mark, Number, Punctuation, Symbol, Separator, Other).