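Combining Class
A numeric value (0–254) that controls how combining marks are ordered during canonical decomposition and determines which combining marks may be reordered.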
What Is the Canonical Combining Class?
The Canonical Combining Class (CCC) is an integer property assigned to every Unicode character. Its defined range is 0–254, and the highest value currently in use is 240. It specifies how combining marks (characters that attach to a preceding base character) are reordered relative to one another during Unicode normalization. Base characters and other non-combining characters have CCC = 0 (Not_Reordered). Most combining diacritical marks carry non-zero values that determine their canonical order.
The core rule is the Canonical Ordering Algorithm: when two adjacent characters A and B both have non-zero CCC values and the CCC of A is greater than the CCC of B, they are swapped, so within a run of combining marks the lower values end up closer to the base character in the normalized form. Two marks with equal non-zero CCC values are never swapped, so their relative order is preserved, and a character with CCC = 0 acts as a barrier that marks cannot move across.
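The ordering rule can be sketched as a stable sort over each run of non-zero-CCC characters. This is a minimal illustration, not how any normalization library is actually implemented: `canonical_order` is a hypothetical helper name, and the sketch assumes its input is already fully decomposed, so it performs only the reordering step.

```python
import unicodedata


def canonical_order(text: str) -> str:
    """Reorder combining marks by CCC (hypothetical helper; input assumed decomposed)."""
    out = []
    run = []  # pending run of characters with non-zero CCC
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # A CCC=0 character ends the run. sorted() is stable,
            # so marks with equal CCC keep their original order.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)


# e + grave (CCC=230) + cedilla (CCC=202) + macron below (CCC=220)
s = "e\u0300\u0327\u0331"
print([f"U+{ord(c):04X}" for c in canonical_order(s)])
# ['U+0065', 'U+0327', 'U+0331', 'U+0300']
print(canonical_order(s) == unicodedata.normalize("NFD", s))
# True
```

A stable sort gives the same result as repeatedly swapping adjacent out-of-order pairs, which is why equal-CCC marks never change places.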
CCC in Practice
Consider the letter a with two diacritics: a cedilla (CCC=202) and an ogonek (also CCC=202). Because they share the same CCC, normalization keeps them in whatever order they were entered. By contrast, an above-combining mark like combining breve (CCC=230) and a below-combining mark like combining macron below (CCC=220) are sorted by value during normalization, which places the CCC=220 mark before the CCC=230 mark in NFD.
import unicodedata

marks = [
    ("\u0300", "COMBINING GRAVE ACCENT"),           # CCC=230
    ("\u0327", "COMBINING CEDILLA"),                # CCC=202
    ("\u0328", "COMBINING OGONEK"),                 # CCC=202
    ("\u0331", "COMBINING MACRON BELOW"),           # CCC=220
    ("\u0952", "DEVANAGARI STRESS SIGN ANUDATTA"),  # CCC=220
]
for char, name in marks:
    ccc = unicodedata.combining(char)
    print(f" CCC={ccc:3} {name}")
# CCC=230 COMBINING GRAVE ACCENT
# CCC=202 COMBINING CEDILLA
# CCC=202 COMBINING OGONEK
# CCC=220 COMBINING MACRON BELOW
# CCC=220 DEVANAGARI STRESS SIGN ANUDATTA

# Normalization puts the sequence into canonical order:
text = "a\u0328\u0300"  # a + ogonek (CCC=202) + grave (CCC=230)
nfd = unicodedata.normalize("NFD", text)
# NFD preserves the order here because 202 < 230, so the ogonek stays first
print([f"U+{ord(c):04X}" for c in nfd])
# ['U+0061', 'U+0328', 'U+0300']
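The sequence above was already in canonical order, so a case where NFD actually moves a mark may be a useful complement. The following sketch uses only the standard `unicodedata` module; `codepoints` is a small helper introduced here for display.

```python
import unicodedata


def codepoints(s: str) -> list:
    """Format each character as U+XXXX for display."""
    return [f"U+{ord(c):04X}" for c in s]


# Different CCC: breve (230) entered before macron below (220) gets swapped.
swapped = unicodedata.normalize("NFD", "a\u0306\u0331")
print(codepoints(swapped))
# ['U+0061', 'U+0331', 'U+0306']

# Same CCC: cedilla and ogonek (both 202) keep the order they were entered in,
# so the two sequences below do not normalize to the same string.
cedilla_first = unicodedata.normalize("NFD", "a\u0327\u0328")
ogonek_first = unicodedata.normalize("NFD", "a\u0328\u0327")
print(cedilla_first == ogonek_first)
# False
```

The second result is the practical consequence of equal CCC values: the two orderings are treated as distinct text, so comparing strings after normalization will not consider them equal.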
Named CCC Values
Many CCC values have names defined in the standard: 0 (Not_Reordered), 1 (Overlay), 6 (Han_Reading), 7 (Nukta), 8 (Kana_Voicing), and 9 (Virama). Values 10–199 are fixed-position classes, and those in use carry aliases of the form CCC10, CCC11, and so on. Values 200–240 describe where a mark attaches relative to its base, such as Attached_Below (CCC=202), Below (CCC=220), Above (CCC=230), and Double_Below (CCC=233).
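Python's `unicodedata` module returns only the numeric value, so mapping it to a name needs a lookup table. The sketch below uses a small hand-copied subset of the value aliases. Both `CCC_NAMES` and `ccc_name` are hypothetical names for this illustration, not part of any library, and the table is deliberately incomplete.

```python
import unicodedata

# Hand-copied subset of the standard's value aliases (an assumption of this
# sketch; the authoritative list lives in the Unicode Character Database).
CCC_NAMES = {
    0: "Not_Reordered",
    1: "Overlay",
    7: "Nukta",
    8: "Kana_Voicing",
    9: "Virama",
    202: "Attached_Below",
    220: "Below",
    230: "Above",
    233: "Double_Below",
    240: "Iota_Subscript",
}


def ccc_name(ch: str) -> str:
    """Return the alias for ch's CCC, or the bare number if not in the table."""
    value = unicodedata.combining(ch)
    return CCC_NAMES.get(value, str(value))


for ch in ["a", "\u0338", "\u093C", "\u3099", "\u094D", "\u035C", "\u0345"]:
    print(f"U+{ord(ch):04X} {unicodedata.combining(ch):3} {ccc_name(ch)}")
```

Falling back to the bare number keeps the helper honest about values the table does not cover, for example the fixed-position classes in the 10–199 range.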
Quick Facts
| Property | Value |
|---|---|
| Unicode property name | Canonical_Combining_Class |
| Short alias | ccc |
| Range | 0–254 (highest value in use is 240; not all values used) |
| Value 0 | Base characters, non-combining |
| Python function | unicodedata.combining(char) → integer |
| Key use | NFD/NFC canonical ordering during normalization |
| Spec reference | Unicode Standard Section 3.11, UAX #15 |
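Related Terms
More in Character Properties
The Unicode version in which a character was first assigned, useful for judging character support across systems and software versions.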
Unicode property (UAX#11) classifying characters as Narrow, Wide, Fullwidth, Halfwidth, Ambiguous, or …
Unicode property controlling how Arabic and Syriac characters connect to adjacent characters. …
Unicode property listing all scripts that use a character, broader than the …
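A classification that assigns every code point to one of 30 categories (Lu, Ll, Nd, So, etc.), grouped into 7 major classes: Letter, Mark, Number, Punctuation, Symbol, Separator, and Other.
Two character sequences that have the same abstract content but may differ in appearance; broader than canonical equivalence, e.g. ﬁ ≈ fi, ² ≈ 2.
The process of mapping a character to its constituent parts. Canonical decomposition preserves meaning (é → e + ◌́), while compatibility decomposition may change it (ﬁ → fi).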
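A named contiguous range of code points (e.g. Basic Latin = U+0000–U+007F). Unicode 16.0 defines 338 blocks; blocks never overlap, so a code point belongs to at most one block.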
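A property that determines how a character behaves in bidirectional text (LTR, RTL, weak, neutral); the Unicode Bidirectional Algorithm uses it to determine display order.
Because the stability policy forbids changing Unicode character names, characters can be given alternative names that serve as corrections, abbreviations, and aliases.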