What is a Combining Character?

A character that attaches to the preceding base character to modify it. General Category: Mn (Nonspacing_Mark), Mc (Spacing_Mark), Me (Enclosing_Mark). Example: ◌́ (U+0301 COMBINING ACUTE ACCENT).
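
As a small illustration of the three mark categories, the sketch below looks up the General Category of one sample character per category with Python's standard `unicodedata` module. The helper name `mark_category` and the choice of sample characters are assumptions made for this example only.

```python
import unicodedata

# One sample per combining-mark category (chosen for illustration).
samples = {
    "\u0301": "COMBINING ACUTE ACCENT",      # expected Mn
    "\u0903": "DEVANAGARI SIGN VISARGA",     # expected Mc
    "\u20DD": "COMBINING ENCLOSING CIRCLE",  # expected Me
}

def mark_category(ch: str) -> str:
    """Return Mn, Mc, or Me for a combining mark, or "" for anything else."""
    cat = unicodedata.category(ch)
    return cat if cat in {"Mn", "Mc", "Me"} else ""

categories = {name: mark_category(ch) for ch, name in samples.items()}
for name, cat in categories.items():
    print(f"{cat}  {name}")
```

A plain letter such as `"a"` has category Ll, so `mark_category("a")` returns an empty string.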

What is Normalization?

The process of converting Unicode text to a standard normalized form. Four forms: NFC (canonical composition), NFD (canonical decomposition), NFKC (compatibility composition), NFKD (compatibility decomposition).

What is Canonical Equivalence?

Two character sequences that represent the same text and must be treated as equal. Example: é (U+00E9) ≡ e + ◌́ (U+0065 + U+0301).
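
One way to test this equivalence in practice is to normalize both strings to the same form before comparing them. The sketch below assumes NFD as the common form; the helper name `canonically_equivalent` is hypothetical.

```python
import unicodedata

precomposed = "\u00E9"   # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

print(precomposed == decomposed)                         # False: different code points
print(canonically_equivalent(precomposed, decomposed))   # True
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

A raw `==` compares code points, so the two spellings only compare equal once both sides have been normalized.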

Properties

Combining Class

Numeric value (0–254) that controls the order of combining marks during canonical decomposition, determining which marks may be reordered.

2022-02-21 · Updated 2024-06-11

What Is the Canonical Combining Class?

The Canonical Combining Class (CCC) is an integer property (range 0–254; the highest value currently assigned is 240) defined for every Unicode character. It specifies how combining marks (characters that attach to a preceding base character) are reordered relative to one another during Unicode Normalization. Most base characters and non-combining characters have CCC = 0 (Not_Reordered). Combining diacritical marks carry non-zero values that determine how they sort relative to one another.

The core rule is the Canonical Ordering Algorithm: two adjacent characters are swapped when the first has a higher CCC than the second and the second's CCC is non-zero. As a result, within a run of combining marks the ones with lower CCC values end up closer to the base character in the normalized form. Marks with equal CCC values are never swapped, so their relative order is preserved, and a character with CCC = 0 stops reordering across it.
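
The rule can be sketched as a simple exchange sort over adjacent pairs. This is a minimal illustration, not the implementation used by any library: the function name `canonical_order` is hypothetical, and the sketch only reorders marks; it does not perform the decomposition step that real NFD also applies.

```python
import unicodedata

def canonical_order(s: str) -> str:
    """Reorder combining marks by repeatedly swapping reorderable pairs."""
    chars = list(s)
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(chars) - 1):
            first = unicodedata.combining(chars[i])
            second = unicodedata.combining(chars[i + 1])
            # Swap only when ccc(first) > ccc(second) > 0.
            # Equal values and CCC = 0 characters are left in place.
            if first > second > 0:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                swapped = True
    return "".join(chars)

text = "a\u0300\u0328"  # a + grave (CCC=230) + ogonek (CCC=202)
ordered = canonical_order(text)
print([f"U+{ord(c):04X}" for c in ordered])
# ['U+0061', 'U+0328', 'U+0300']
```

For this input, which contains no decomposable characters, the result matches `unicodedata.normalize("NFD", text)`.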

CCC in Practice

Consider the letter a with two diacritics: a cedilla (CCC=202) and an ogonek (CCC=202 as well). Because they share the same CCC, their order is kept stable. But an above-combining mark like combining breve (CCC=230) and a below-combining mark like combining macron below (CCC=220) are sorted by their values during normalization, which places the CCC=220 mark before the CCC=230 mark in NFD.

import unicodedata

marks = [
    ("\u0300", "COMBINING GRAVE ACCENT"),        # CCC=230
    ("\u0327", "COMBINING CEDILLA"),              # CCC=202
    ("\u0328", "COMBINING OGONEK"),               # CCC=202
    ("\u0331", "COMBINING MACRON BELOW"),         # CCC=220
    ("\u0952", "DEVANAGARI STRESS SIGN ANUDATTA"),# CCC=220
]

for char, name in marks:
    ccc = unicodedata.combining(char)
    print(f"  CCC={ccc:3}  {name}")

# CCC=230  COMBINING GRAVE ACCENT
# CCC=202  COMBINING CEDILLA
# CCC=202  COMBINING OGONEK
# CCC=220  COMBINING MACRON BELOW
# CCC=220  DEVANAGARI STRESS SIGN ANUDATTA

# Normalization puts the sequence into canonical order:
text = "a\u0328\u0300"   # a + ogonek (CCC=202) + grave (CCC=230)
nfd = unicodedata.normalize("NFD", text)
# NFD preserves order here because 202 < 230, ogonek stays first
print([f"U+{ord(c):04X}" for c in nfd])
# ['U+0061', 'U+0328', 'U+0300']
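
The example above happens to be in canonical order already. As a complementary sketch, the snippet below feeds NFD a sequence that is out of order (breve before macron below) and a pair of marks with equal CCC (cedilla and ogonek). The helper name `codepoints` is introduced here for display only.

```python
import unicodedata

def codepoints(s: str) -> list:
    """Format each character of s as U+XXXX."""
    return [f"U+{ord(c):04X}" for c in s]

# Breve (CCC=230) typed before macron below (CCC=220): NFD swaps them.
reordered = unicodedata.normalize("NFD", "a\u0306\u0331")
print(codepoints(reordered))
# ['U+0061', 'U+0331', 'U+0306']

# Cedilla and ogonek share CCC=202, so the typed order survives
# and the two spellings remain distinct after normalization.
cedilla_first = unicodedata.normalize("NFD", "a\u0327\u0328")
ogonek_first = unicodedata.normalize("NFD", "a\u0328\u0327")
print(cedilla_first == ogonek_first)
# False
```

The second case shows why sequences that differ only in the order of equal-CCC marks are not canonically equivalent.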

Named CCC Values

CCC values have names defined in the standard: 0 (Not_Reordered), 1 (Overlay), 6 (Han_Reading), 7 (Nukta), 8 (Kana_Voicing), and 9 (Virama). Values 10 through 199 are fixed-position classes labeled CCC10 through CCC199, of which only a subset is assigned. Values 200–240 name positional classes such as Below (CCC=220), Above (CCC=230), and Double_Below (CCC=233).
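
Python's `unicodedata` module returns only the numeric value, so mapping it to a name requires a lookup table. The sketch below uses a partial, hand-written table covering a handful of the names; `CCC_NAMES` and `ccc_name` are hypothetical names, and a complete mapping would need to be generated from the Unicode Character Database.

```python
import unicodedata

# Partial, hand-written table (an assumption of this sketch, not a full list).
CCC_NAMES = {
    0: "Not_Reordered",
    1: "Overlay",
    6: "Han_Reading",
    7: "Nukta",
    8: "Kana_Voicing",
    9: "Virama",
    220: "Below",
    230: "Above",
    233: "Double_Below",
}

def ccc_name(ch: str) -> str:
    """Return a readable name for the CCC of ch, using the partial table."""
    value = unicodedata.combining(ch)
    if value in CCC_NAMES:
        return CCC_NAMES[value]
    if 10 <= value <= 199:
        return f"CCC{value}"  # fixed-position classes
    return f"unlisted ({value})"

for ch in ["a", "\u0301", "\u093C", "\u094D", "\u05B0"]:
    print(f"U+{ord(ch):04X}  {ccc_name(ch)}")
```

Values missing from the table fall through to the `unlisted` branch rather than raising, which keeps the helper usable on arbitrary input.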

Quick Facts

Property	Value
Unicode property name	`Canonical_Combining_Class`
Short alias	`ccc`
Range	0–254 (highest assigned value is 240; not all values used)
Value 0	Base characters, non-combining
Python function	`unicodedata.combining(char)` → integer
Key use	NFD/NFC canonical ordering during normalization
Spec reference	Unicode Standard Section 3.11, UAX #15

Related Terms

Combining Character, Normalization, Canonical Equivalence
