What is a combining character?

A character that attaches to the preceding base character to modify it. General Category: Mn (nonspacing), Mc (spacing combining), Me (enclosing). Example: ◌́ (U+0301 Combining Acute Accent).

Normalization: the process of converting Unicode text into a standard canonical form. Four forms: NFC (composed), NFD (decomposed), NFKC (compatibility composed), NFKD (compatibility decomposed).

Properties

Combining class

A numeric value (0–254) that controls the ordering of combining marks during canonical decomposition, determining which combining marks can be reordered relative to one another.

2022-02-21 · Updated 2024-06-11

What Is the Canonical Combining Class?

The Canonical Combining Class (CCC) is an integer property assigned to every Unicode character. Values fall in the range 0–254, and the highest value currently assigned is 240. The property specifies how combining marks (characters that attach to a preceding base character) are reordered relative to one another during Unicode Normalization. Most base characters and non-combining characters have CCC = 0 (Not_Reordered). Combining diacritical marks carry non-zero values that determine their canonical order in the character sequence.

The core rule is the Canonical Ordering Algorithm: when two adjacent combining marks both have non-zero CCC values and the first has the higher value, they are swapped, so the mark with the lower value ends up closer to the base character in the normalized form. Two marks with equal non-zero CCC values are never swapped, so their relative order is preserved. A character with CCC = 0 acts as a barrier: marks are never reordered across it.
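The reordering rule can be sketched in a few lines of Python. This is a minimal illustration, not the normalization implementation itself: the helper name `canonical_reorder` is hypothetical, it performs only the reordering step (no decomposition), and it relies on Python's `sorted` being stable so that marks with equal CCC keep their order.

```python
import unicodedata


def canonical_reorder(text: str) -> str:
    """Stable-sort each run of non-zero-CCC characters by CCC value.

    Hypothetical helper: covers only the reordering step, not decomposition.
    """
    out = []
    run = []  # pending characters whose CCC is non-zero
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # A CCC=0 character ends the current run of marks.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    # Flush any marks left at the end of the string.
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)


# Grave (CCC=230) typed before ogonek (CCC=202): the pair is swapped.
swapped = canonical_reorder("a\u0300\u0328")
# Ogonek and cedilla share CCC=202: their order is left alone.
kept = canonical_reorder("a\u0328\u0327")

print([f"U+{ord(c):04X}" for c in swapped])
# ['U+0061', 'U+0328', 'U+0300']
print([f"U+{ord(c):04X}" for c in kept])
# ['U+0061', 'U+0328', 'U+0327']
```

For inputs that contain no decomposable characters, such as the two strings above, the result should match `unicodedata.normalize("NFD", ...)`.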

CCC in Practice

Consider the letter a with two diacritics: a cedilla (CCC=202) and an ogonek (CCC=202 as well). Because they share the same CCC, their order is kept stable. But an above-combining mark like combining breve (CCC=230) and a below-combining mark like combining macron below (CCC=220) are sorted by their values during normalization, which places the CCC=220 mark before the CCC=230 mark in NFD.

import unicodedata

marks = [
    ("\u0300", "COMBINING GRAVE ACCENT"),        # CCC=230
    ("\u0327", "COMBINING CEDILLA"),              # CCC=202
    ("\u0328", "COMBINING OGONEK"),               # CCC=202
    ("\u0331", "COMBINING MACRON BELOW"),         # CCC=220
    ("\u0952", "DEVANAGARI STRESS SIGN ANUDATTA"),  # CCC=220
]

for char, name in marks:
    ccc = unicodedata.combining(char)
    print(f"  CCC={ccc:3}  {name}")

# CCC=230  COMBINING GRAVE ACCENT
# CCC=202  COMBINING CEDILLA
# CCC=202  COMBINING OGONEK
# CCC=220  COMBINING MACRON BELOW
# CCC=220  DEVANAGARI STRESS SIGN ANUDATTA

# Normalization puts the sequence into canonical order:
text = "a\u0328\u0300"   # a + ogonek (CCC=202) + grave (CCC=230)
nfd = unicodedata.normalize("NFD", text)
# NFD preserves order here because 202 < 230, ogonek stays first
print([f"U+{ord(c):04X}" for c in nfd])
# ['U+0061', 'U+0328', 'U+0300']
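
The example above starts from a sequence that is already in canonical order. As a complementary sketch, the snippet below feeds NFD an out-of-order pair (breve before macron below) and a same-class pair (cedilla and ogonek). The helper `codepoints` is a hypothetical convenience for display only.

```python
import unicodedata


def codepoints(s: str) -> list[str]:
    """Hypothetical display helper: format each character as U+XXXX."""
    return [f"U+{ord(c):04X}" for c in s]


# Breve (U+0306, an above mark) typed before macron below (U+0331, CCC=220).
out_of_order = "a\u0306\u0331"
reordered = unicodedata.normalize("NFD", out_of_order)
print(codepoints(reordered))
# ['U+0061', 'U+0331', 'U+0306']

# Both typing orders normalize to the same string.
in_order = "a\u0331\u0306"
same_result = unicodedata.normalize("NFD", in_order) == reordered
print(same_result)
# True

# Cedilla and ogonek share a CCC, so the two orders stay distinct.
cedilla_first = unicodedata.normalize("NFD", "a\u0327\u0328")
ogonek_first = unicodedata.normalize("NFD", "a\u0328\u0327")
print(cedilla_first == ogonek_first)
# False
```

The last comparison is the practical consequence of equal CCC values: the two sequences are not canonically equivalent, so normalization does not unify them.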

Named CCC Values

Several CCC values have names defined in the standard: 0 (Not_Reordered), 1 (Overlay), 6 (Han_Reading), 7 (Nukta), 8 (Kana_Voicing), and 9 (Virama). Values 10 through 199 are fixed position classes named after their number (CCC10 and so on), although not every value in that range is used. Values 200–240 cover positional categories such as Below (CCC=220), Above (CCC=230), and Double_Below (CCC=233).
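
Python's `unicodedata` module returns only the numeric value, so mapping it to a name needs a lookup table. The sketch below is an assumption-laden illustration: `CCC_NAMES` and `ccc_name` are hypothetical, and the table holds only the names listed above rather than the full set defined by the standard.

```python
import unicodedata

# Partial, hand-written table (hypothetical): only the names mentioned above.
CCC_NAMES = {
    0: "Not_Reordered",
    1: "Overlay",
    6: "Han_Reading",
    7: "Nukta",
    8: "Kana_Voicing",
    9: "Virama",
    220: "Below",
    230: "Above",
    233: "Double_Below",
}


def ccc_name(ch: str) -> str:
    """Return the CCC name for ch, or a placeholder if it is not in the table."""
    value = unicodedata.combining(ch)
    return CCC_NAMES.get(value, f"ccc={value} (not in this table)")


samples = ["a", "\u0334", "\u093C", "\u3099", "\u094D", "\u0331", "\u0300", "\u035C", "\u0327"]
for ch in samples:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}: {ccc_name(ch)}")
```

The final sample, COMBINING CEDILLA (CCC=202), falls through to the placeholder because its class name is deliberately left out of the partial table.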

Quick Facts

Property	Value
Unicode property name	`Canonical_Combining_Class`
Short alias	`ccc`
Range	0–254 (highest assigned value is 240; not all values used)
Value 0	Base characters, non-combining
Python function	`unicodedata.combining(char)` → integer
Key use	NFD/NFC canonical ordering during normalization
Spec reference	Unicode Standard Section 3.11, UAX #15

Related Terms

Combining character · Normalization · Canonical equivalence

More in Properties

Name alias

An alternative name for a character, because Unicode names cannot be changed under the policy …

Block

A named range of consecutive code points (for example, Basic Latin = U+0000–U+007F). Unicode …

Default ignorable

A character that has no visual effect and can be ignored by processes that …

Decomposition

A mapping of a character to its component parts. Canonical decomposition preserves meaning (é → e …

East Asian Width

Unicode property (UAX#11) classifying characters as Narrow, Wide, Fullwidth, Halfwidth, Ambiguous, or …

Joining Type

Unicode property controlling how Arabic and Syriac characters connect to adjacent characters. …

Bidirectional category

A property that determines how a character behaves in bidirectional text (LTR, RTL, …

General category

The classification of every code point into one of 30 categories (Lu, …

Canonical equivalence

Two character sequences that are semantically identical and must be treated the same. Example: …

Compatibility equivalence

Two character sequences with the same abstract content that may differ in …
