Grapheme Cluster
The "character" as perceived by the user: what looks like a single unit. It may consist of several code points (a base plus combining marks, or an emoji ZWJ sequence). 👩‍💻 = 3 code points (U+1F469, U+200D, U+1F4BB), 1 grapheme.
What Is a Grapheme Cluster?
A grapheme cluster is what a user perceives as a single "character" on screen: the unit removed by one press of the delete key, or skipped when the cursor advances by one position. Unicode code points and user-perceived characters are not the same thing, because a single grapheme cluster may consist of multiple code points.
The Unicode Standard defines Extended Grapheme Clusters (EGC) in Unicode Standard Annex #29. The boundary rules specify when adjacent code points form a single cluster:
- A base character plus any number of combining marks (e.g., a + ́ → á)
- Hangul syllable sequences (L + V + T conjoining jamo, e.g., ᄀ U+1100 + ᅡ U+1161 + ᆫ U+11AB → 간)
- Emoji modifier sequences (👋 + skin-tone modifier 🏽 → 👋🏽)
- Flag sequences (regional indicator J + P → 🇯🇵)
- Emoji ZWJ sequences (👨 + ZWJ + 👩 + ZWJ + 👧 → 👨‍👩‍👧)
- Emoji keycap sequences (digit + U+FE0F variation selector + U+20E3 combining enclosing keycap → 1️⃣)
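To make the kinds of sequences listed above concrete, the following standard-library sketch prints the code points behind one sample of each. The `SAMPLES` dictionary and the `describe` helper are names introduced here for illustration only. Character names come from the Unicode database bundled with the running Python version.

```python
import unicodedata

# Each value renders as one user-perceived character but holds several code points.
SAMPLES = {
    "combining mark": "a\u0301",               # a + combining acute accent
    "Hangul jamo": "\u1100\u1161\u11AB",       # L + V + T conjoining jamo
    "emoji modifier": "\U0001F44B\U0001F3FD",  # waving hand + medium skin tone
    "flag": "\U0001F1EF\U0001F1F5",            # regional indicators J + P
    "ZWJ sequence": "\U0001F468\u200D\U0001F469\u200D\U0001F467",
    "keycap": "1\uFE0F\u20E3",                 # digit + VS16 + enclosing keycap
}


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


for label, text in SAMPLES.items():
    print(f"{label}: {len(text)} code points")
    for line in describe(text):
        print("   ", line)
```

Every sample has a `len()` greater than 1, although each is meant to display as a single unit.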
Grapheme Cluster Iteration in Python
# Python's len() counts code points, not grapheme clusters
flag = "\U0001F1EF\U0001F1F5" # 🇯🇵 Japan flag (J + P regional indicators)
print(len(flag)) # 2 code points
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467" # 👨‍👩‍👧 (man + ZWJ + woman + ZWJ + girl)
print(len(family)) # 5 code points
# For grapheme-aware string operations, use the third-party 'grapheme' package
# pip install grapheme
try:
    import grapheme
    print(grapheme.length(flag))    # 1
    print(grapheme.length(family))  # 1
    # Combining marks: "café" stored as c + a + f + e + combining acute accent
    nfd = "cafe\u0301"
    print(grapheme.length(nfd))  # 4 (user sees 4 characters)
    print(len(nfd))              # 5 (5 code points)
    # Grapheme-safe slicing keeps the accent attached to its base letter
    print(grapheme.slice(nfd, 0, 4) == nfd)  # True
except ImportError:
    print("Install 'grapheme' package for EGC support")
# Hangul example: a conjoining jamo sequence and its precomposed syllable
import unicodedata
syllable = "\uAC00"    # 가, precomposed Hangul syllable
jamo = "\u1100\u1161"  # ᄀ + ᅡ, conjoining jamo sequence (same grapheme)
print(unicodedata.normalize("NFC", jamo) == syllable)  # True
Why Grapheme Clusters Matter
Cursor movement and text editing: A text editor must advance the cursor by one grapheme cluster, not one code point. Moving one code point through 👋🏽 (two code points) would split the emoji in half, leaving a broken sequence.
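The cursor rule can be sketched with the standard library alone. This is a deliberately simplified approximation and not the full UAX #29 algorithm. It attaches combining marks, variation selectors, skin-tone modifiers, and ZWJ-joined code points to the preceding base, and pairs regional indicators. It ignores Hangul jamo and the other boundary rules. The names `cursor_stops`, `move_right`, and `is_extender` are hypothetical helpers, not part of any library.

```python
import unicodedata

ZWJ = "\u200D"


def is_regional_indicator(ch):
    return "\U0001F1E6" <= ch <= "\U0001F1FF"


def is_extender(ch):
    """Code points that attach to the preceding one (simplified assumption)."""
    return (
        unicodedata.category(ch) in ("Mn", "Me", "Mc")  # marks, variation selectors
        or "\U0001F3FB" <= ch <= "\U0001F3FF"           # emoji skin-tone modifiers
        or ch == ZWJ
    )


def cursor_stops(text):
    """Return the code point indices where a cursor may rest."""
    stops = [0]
    i, n = 0, len(text)
    while i < n:
        j = i + 1
        # A flag is exactly two consecutive regional indicators.
        if is_regional_indicator(text[i]) and j < n and is_regional_indicator(text[j]):
            j += 1
        # Absorb trailing extenders; a ZWJ also pulls in the code point after it.
        while j < n and is_extender(text[j]):
            j += 2 if text[j] == ZWJ and j + 1 < n else 1
        stops.append(j)
        i = j
    return stops


def move_right(text, pos):
    """Return the next cursor stop after pos, or the end of the text."""
    return next((s for s in cursor_stops(text) if s > pos), len(text))


text = "a\U0001F44B\U0001F3FDb"  # a, waving hand + skin tone, b
print(cursor_stops(text))        # [0, 1, 3, 4]
print(move_right(text, 1))       # 3: the cursor jumps over both emoji code points
```

A production editor would rely on a complete UAX #29 implementation instead; the sketch only shows why cursor positions and code point indices differ.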
String truncation: Truncating a string to 10 "characters" for display must count grapheme clusters. text[:10] in Python counts code points and may split an emoji sequence.
Regular expressions: The regex \X in PCRE2 and the Python regex package matches a single extended grapheme cluster, enabling grapheme-aware patterns.
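As a sketch of how the truncation and regex points above fit together, `\X` from the third-party `regex` package (`pip install regex`) can split a string into clusters before slicing. The helper name `truncate_graphemes` is hypothetical, and the exact cluster boundaries depend on the Unicode version supported by the installed `regex` release.

```python
try:
    import regex  # third-party package, not the standard-library 're'
except ImportError:
    regex = None


def truncate_graphemes(text, limit):
    """Keep at most `limit` extended grapheme clusters of text."""
    if regex is None:
        raise RuntimeError("the third-party 'regex' package is required")
    clusters = regex.findall(r"\X", text)  # one match per grapheme cluster
    return "".join(clusters[:limit])


if regex is not None:
    nfd = "cafe\u0301 au lait"  # "café au lait" with a combining acute accent
    print(repr(nfd[:4]))                     # code point slice: the accent is lost
    print(repr(truncate_graphemes(nfd, 4)))  # the accent stays with its base letter
```

Slicing the list of clusters, not the string itself, is what keeps each user-perceived character intact.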
Quick Facts
| Property | Value |
|---|---|
| Concept | Extended Grapheme Cluster (EGC) |
| Defined by | Unicode Standard Annex #29 (UAX #29) |
| Python built-in | No (use grapheme package or regex \X) |
| Common pitfall | len(s) counts code points, not clusters |
| Emoji clusters | ZWJ sequences, modifier sequences, flag sequences |
| Hangul | Jamo sequences form single grapheme clusters |