What is a combining character?

A character that attaches to the preceding base character to modify it. General categories: Mn (nonspacing), Mc (spacing combining), Me (enclosing). Example: ◌́ (U+0301 COMBINING ACUTE ACCENT)
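
As a minimal sketch, the three General_Category values above can be checked with Python's standard `unicodedata` module. The helper name `is_combining_mark` is an illustrative assumption, not part of any library.

```python
import unicodedata

def is_combining_mark(ch: str) -> bool:
    """Return True if ch has General_Category Mn, Mc, or Me."""
    return unicodedata.category(ch) in {"Mn", "Mc", "Me"}

# U+0301 combining acute, U+093E Devanagari vowel sign AA,
# U+20DD combining enclosing circle, and a plain base letter.
for ch in ["\u0301", "\u093E", "\u20DD", "a"]:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {is_combining_mark(ch)}")
```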

What is normalization?

The process of converting Unicode text into a standard canonical form. There are four forms: NFC (composed), NFD (decomposed), NFKC (compatibility composed), NFKD (compatibility decomposed)
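
The four forms can be compared side by side with `unicodedata.normalize`. This is a small sketch; the helper `all_forms` and the sample string are assumptions chosen for illustration.

```python
import unicodedata

def all_forms(text: str) -> dict:
    """Normalize text under each of the four Unicode normalization forms."""
    return {form: unicodedata.normalize(form, text)
            for form in ("NFC", "NFD", "NFKC", "NFKD")}

# Sample: precomposed é (U+00E9) followed by the ﬁ ligature (U+FB01).
forms = all_forms("\u00E9\uFB01")
for form, result in forms.items():
    print(form, [f"U+{ord(c):04X}" for c in result])
# NFC  ['U+00E9', 'U+FB01']
# NFD  ['U+0065', 'U+0301', 'U+FB01']
# NFKC ['U+00E9', 'U+0066', 'U+0069']
# NFKD ['U+0065', 'U+0301', 'U+0066', 'U+0069']
```

Only the compatibility forms (NFKC, NFKD) replace the ligature with plain `f` and `i`.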

Properties

Combining Class

A numeric value (0–254) that governs the order of combining marks during canonical decomposition. It determines which combining marks can be reordered.

2022-02-21 · Updated 2024-06-11

What Is the Canonical Combining Class?

The Canonical Combining Class (CCC) is an integer property (range 0–254, with 240 the highest value currently assigned) given to every Unicode code point. It specifies how combining marks, characters that attach to a preceding base character, are reordered relative to one another during Unicode normalization. Base characters and most other characters have CCC = 0 (Not_Reordered). Most combining diacritical marks carry non-zero values that determine their position in canonical order.

The core rule is the Canonical Ordering Algorithm: within a run of adjacent combining marks that all have non-zero CCC values, the marks are sorted so that lower values come first in the normalized sequence. The sort is stable: two marks with equal CCC values keep their relative order, because marks in the same class interact visually and their order is meaningful. A character with CCC = 0 ends the run, so marks are never moved across it.
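
The ordering rule can be sketched in a few lines of Python. This is an illustration only, not the reference implementation: it assumes the input is already fully decomposed, and the function name `canonical_order` is hypothetical. It relies on Python's `sorted` being stable, which is what keeps equal-CCC marks in their original order.

```python
import unicodedata

def canonical_order(s: str) -> str:
    """Stable-sort each run of marks with non-zero CCC by CCC value."""
    out = []
    run = []  # current run of characters with CCC != 0
    for ch in s:
        if unicodedata.combining(ch) == 0:
            # A CCC=0 character ends the run; flush it in sorted order.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

# grave (CCC=230) typed before ogonek (CCC=202): the ogonek moves first.
reordered = canonical_order("a\u0300\u0328")
print([f"U+{ord(c):04X}" for c in reordered])
# ['U+0061', 'U+0328', 'U+0300']
```

For input that contains no precomposed characters, the result should match `unicodedata.normalize("NFD", s)`.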

CCC in Practice

Consider the letter a with two diacritics: a cedilla (U+0327, CCC=202) and an ogonek (U+0328, also CCC=202). Because they share the same CCC, normalization keeps them in the order in which they were entered. By contrast, an above mark such as combining breve (U+0306, CCC=230) and a below mark such as combining macron below (U+0331, CCC=220) are sorted by value, so NFD places the CCC=220 mark before the CCC=230 mark.

import unicodedata

marks = [
    ("\u0300", "COMBINING GRAVE ACCENT"),        # CCC=230
    ("\u0327", "COMBINING CEDILLA"),              # CCC=202
    ("\u0328", "COMBINING OGONEK"),               # CCC=202
    ("\u0331", "COMBINING MACRON BELOW"),         # CCC=220
    ("\u0952", "DEVANAGARI STRESS SIGN ANUDATTA"),  # CCC=220
]

for char, name in marks:
    ccc = unicodedata.combining(char)
    print(f"  CCC={ccc:3}  {name}")

# CCC=230  COMBINING GRAVE ACCENT
# CCC=202  COMBINING CEDILLA
# CCC=202  COMBINING OGONEK
# CCC=220  COMBINING MACRON BELOW
# CCC=220  DEVANAGARI STRESS SIGN ANUDATTA

# Normalization puts the sequence into canonical order:
text = "a\u0328\u0300"   # a + ogonek (CCC=202) + grave (CCC=230)
nfd = unicodedata.normalize("NFD", text)
# Already in canonical order (202 < 230), so NFD leaves it unchanged
print([f"U+{ord(c):04X}" for c in nfd])
# ['U+0061', 'U+0328', 'U+0300']

# The reversed input is reordered to the same sequence:
swapped = unicodedata.normalize("NFD", "a\u0300\u0328")
print(swapped == nfd)
# True

Named CCC Values

CCC values in use have symbolic names defined in the standard. The low values are 0 (Not_Reordered), 1 (Overlay), 6 (Han_Reading), 7 (Nukta), 8 (Kana_Voicing), and 9 (Virama). Values 10–199 are fixed-position classes, named by number (CCC10, CCC11, and so on); only some of them are assigned to characters. Values 200–240 describe where a mark sits relative to its base, for example Attached_Below (CCC=202), Below (CCC=220), Above (CCC=230), and Double_Below (CCC=233).
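
Python's `unicodedata` module exposes only the numeric value, not the symbolic name. A lookup can be sketched with a hand-written table; the `CCC_NAMES` dictionary and `ccc_name` helper below are assumptions for illustration and cover only a subset of the named values.

```python
import unicodedata

# Partial, hand-maintained table: numeric CCC value -> symbolic name.
CCC_NAMES = {
    0: "Not_Reordered",
    1: "Overlay",
    6: "Han_Reading",
    7: "Nukta",
    8: "Kana_Voicing",
    9: "Virama",
    220: "Below",
    230: "Above",
    233: "Double_Below",
}

def ccc_name(ch: str) -> str:
    """Return the symbolic CCC name, or the bare number if not in the table."""
    value = unicodedata.combining(ch)
    return CCC_NAMES.get(value, str(value))

for ch in ["a", "\u0334", "\u093C", "\u094D", "\u0331", "\u0301"]:
    print(f"U+{ord(ch):04X} {ccc_name(ch)}")
```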

Quick Facts

| Property | Value |
| --- | --- |
| Unicode property name | `Canonical_Combining_Class` |
| Short alias | `ccc` |
| Range | 0–254 (highest assigned value is 240; not all values are used) |
| Value 0 | Not_Reordered: base characters and other characters that are never reordered |
| Python function | `unicodedata.combining(char)` → integer |
| Key use | Canonical ordering of combining marks in NFD/NFC normalization |
| Spec reference | Unicode Standard Section 3.11, UAX #15 |

Related terms

Combining character, Normalization, Canonical equivalence

More in Properties

East Asian Width

Unicode property (UAX#11) classifying characters as Narrow, Wide, Fullwidth, Halfwidth, Ambiguous, or …

Joining Type

Unicode property controlling how Arabic and Syriac characters connect to adjacent characters. …

Script Extensions

Unicode property listing all scripts that use a character, broader than the …

Grapheme cluster

A user-perceived character: something that feels like a single unit. It may consist of multiple code points (a base plus combining marks, or an emoji ZWJ sequence). 👩‍💻 = …

Case mapping

Rules for converting characters between uppercase, lowercase, and titlecase. Mappings can be locale-dependent (the Turkish I problem) and can be one-to-many (ß → SS)
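
A short sketch of the one-to-many case in Python. Note the assumption that matters here: Python's built-in `str` case methods are locale-independent, so they do not apply the Turkish dotted/dotless I rules.

```python
word = "straße"
upper = word.upper()

print(upper)                  # STRASSE
print(len(word), len(upper))  # 6 7  (ß expands to two letters)

# Locale-independent: always "i", even where Turkish would expect "ı" (U+0131).
print("I".lower())            # i
```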

Decomposition

The mapping of a character to its constituent parts. Canonical decomposition preserves meaning (é → e + ́), whereas compatibility decomposition may change meaning …

Compatibility equivalence

Two character sequences that represent the same abstract content but may differ in appearance. Broader than canonical equivalence. Examples: ﬁ ≈ fi, ² ≈ 2
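
One way to test for compatibility equivalence is to compare NFKD forms. This is a sketch; the helper name `compat_equivalent` is an assumption.

```python
import unicodedata

def compat_equivalent(a: str, b: str) -> bool:
    """True if a and b are identical after compatibility decomposition."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)

print(compat_equivalent("\uFB01", "fi"))  # True  (ﬁ ligature vs. f + i)
print(compat_equivalent("\u00B2", "2"))   # True  (superscript two vs. digit)

# The same pair is not canonically equivalent: NFD leaves the ligature intact.
print(unicodedata.normalize("NFD", "\uFB01") == "fi")  # False
```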

Canonical equivalence

Two character sequences that have the same meaning and should be treated as equal. Example: é (U+00E9) ≡ e + ◌́ (U+0065 + U+0301)
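
Plain string comparison works on code points, so canonically equivalent strings can compare as unequal. A common approach, sketched here with a hypothetical helper `canonically_equivalent`, is to normalize both sides to the same form first.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """True if a and b are identical after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "\u00E9"   # é as one code point
decomposed = "e\u0301"   # e + combining acute accent

print(precomposed == decomposed)                        # False
print(canonically_equivalent(precomposed, decomposed))  # True
```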

Mirroring property

Characters whose glyph should be mirrored horizontally in RTL contexts. Examples: ( → ), [ → ], { → }, …

Version property

The Unicode version in which a character was first assigned. Useful for checking character support across different systems and software versions.
