Grapheme cluster
A user-perceived character: what feels like a single unit. It may consist of several code points (a base plus combining marks, or an emoji ZWJ sequence). 👩‍💻 = 3 code points, 1 grapheme.
What Is a Grapheme Cluster?
A grapheme cluster is what a user perceives as a single "character" on screen: the unit that one press of the delete key removes, or that one arrow-key press moves the cursor across. Unicode code points and user-perceived characters are not the same thing, because a single grapheme cluster may consist of multiple code points.
The Unicode Standard defines Extended Grapheme Clusters (EGC) in Unicode Standard Annex #29. The boundary rules specify when adjacent code points form a single cluster:
- A base character plus any number of combining marks (e.g., a + ́ → á)
- Hangul syllable sequences (conjoining jamo L + V + T, e.g., ᄀ + ᅡ + ᆫ → 간)
- Emoji modifier sequences (👋 + skin-tone modifier 🏽 → 👋🏽)
- Flag sequences (regional indicator J + P → 🇯🇵)
- Emoji ZWJ sequences (👨 + ZWJ + 👩 + ZWJ + 👧 → 👨‍👩‍👧)
- Emoji keycap sequences (digit + U+FE0F variation selector + U+20E3 combining enclosing keycap → 1️⃣)
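To see what such sequences are made of, the following sketch lists the code points inside a few of the examples above using only Python's standard `unicodedata` module. The helper name `describe` and the `samples` dictionary are chosen here for illustration.

```python
import unicodedata

def describe(text):
    """Return (code point, name) pairs for every code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

# Illustrative samples, written with escapes so the code points stay visible
samples = {
    "combining mark": "a\u0301",                # a + combining acute accent
    "emoji modifier": "\U0001F44B\U0001F3FD",   # waving hand + skin-tone modifier
    "flag": "\U0001F1EF\U0001F1F5",             # regional indicators J + P
    "ZWJ sequence": "\U0001F468\u200D\U0001F469\u200D\U0001F467",
    "keycap": "1\uFE0F\u20E3",                  # digit + VS16 + enclosing keycap
}

for label, text in samples.items():
    print(f"{label}: {len(text)} code points")
    for code_point, name in describe(text):
        print(f"  {code_point}  {name}")
```

Each sample renders as one user-perceived character, yet `len()` reports between two and five code points.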
Grapheme Cluster Iteration in Python
import unicodedata

# Python's len() counts code points, not grapheme clusters
flag = "\U0001F1EF\U0001F1F5"  # 🇯🇵 Japan flag (J + P regional indicators)
print(len(flag))  # 2 code points

family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # 👨‍👩‍👧
print(len(family))  # 5 code points

# For grapheme-aware string operations, use the third-party 'grapheme' package
# pip install grapheme
try:
    import grapheme

    print(grapheme.length(flag))    # 1
    print(grapheme.length(family))  # 1

    # Decomposed (NFD) form of "café": c + a + f + e + combining acute accent
    nfd = "cafe\u0301"
    print(grapheme.length(nfd))  # 4 (user sees 4 characters)
    print(len(nfd))              # 5 (5 code points)
except ImportError:
    print("Install 'grapheme' package for EGC support")

# Hangul example: a conjoining jamo sequence and the precomposed syllable
# are canonically equivalent, and each forms a single grapheme cluster
syllable = "\uAC00"    # 가, precomposed Hangul syllable
jamo = "\u1100\u1161"  # ᄀ + ᅡ, conjoining jamo sequence
print(unicodedata.normalize("NFC", jamo) == syllable)  # True
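When no third-party package is available, iteration can be roughly approximated with the standard library. The sketch below is a deliberately simplified segmenter and not an implementation of the full UAX #29 rules: it attaches combining marks, ZWJ-joined code points, emoji skin-tone modifiers, and regional-indicator pairs to the preceding code point, and it does not handle Hangul jamo, prepended characters, or Indic conjuncts. The names `is_extender`, `is_regional_indicator`, and `iter_clusters` are hypothetical.

```python
import unicodedata

ZWJ = "\u200D"

def is_regional_indicator(ch):
    return "\U0001F1E6" <= ch <= "\U0001F1FF"

def is_extender(ch):
    """True if ch normally attaches to the code point before it."""
    return (
        unicodedata.category(ch).startswith("M")  # combining marks, incl. U+FE0F and U+20E3
        or ch == ZWJ
        or "\U0001F3FB" <= ch <= "\U0001F3FF"      # emoji skin-tone modifiers
    )

def iter_clusters(text):
    """Yield approximate grapheme clusters (simplified, NOT full UAX #29)."""
    cluster = ""
    for ch in text:
        if not cluster:
            cluster = ch
        elif is_extender(ch) or cluster[-1] == ZWJ:
            cluster += ch  # mark, modifier, or ZWJ-joined code point
        elif is_regional_indicator(ch) and len(cluster) == 1 and is_regional_indicator(cluster):
            cluster += ch  # second half of a flag pair
        else:
            yield cluster
            cluster = ch
    if cluster:
        yield cluster

text = "cafe\u0301 \U0001F1EF\U0001F1F5 \U0001F468\u200D\U0001F469\u200D\U0001F467"
clusters = list(iter_clusters(text))
print(len(text), len(clusters))  # 14 code points, 8 approximate clusters
```

For production text handling, a library that tracks the current Unicode data files is the safer choice, since the real boundary rules depend on per-character properties that this approximation ignores.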
Why Grapheme Clusters Matter
Cursor movement and text editing: A text editor must advance the cursor by one grapheme cluster, not one code point. Moving one code point through 👋🏽 (two code points) would split the emoji in half, leaving a broken sequence.
String truncation: Truncating a string to 10 "characters" for display must count grapheme clusters. text[:10] in Python counts code points and may split an emoji sequence.
Regular expressions: The regex \X in PCRE2 and the Python regex package matches a single extended grapheme cluster, enabling grapheme-aware patterns.
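The three cases above can be sketched with the third-party `regex` package (`pip install regex`), using `\X` to split a string into extended grapheme clusters. The helper names `clusters`, `truncate`, and `next_cursor` are hypothetical, and the import is guarded so the snippet still loads when the package is not installed.

```python
try:
    import regex  # third-party package: pip install regex
except ImportError:
    regex = None  # clusters() below then raises a clear error

def clusters(text):
    """Split text into extended grapheme clusters."""
    if regex is None:
        raise RuntimeError("the 'regex' package is required")
    return regex.findall(r"\X", text)

def truncate(text, limit):
    """Keep at most `limit` user-perceived characters."""
    return "".join(clusters(text)[:limit])

def next_cursor(text, index):
    """Return the code point index just past the cluster containing `index`."""
    end = 0
    for cluster in clusters(text):
        end += len(cluster)
        if end > index:
            return end
    return len(text)

if regex is not None:
    wave = "\U0001F44B\U0001F3FD!"        # waving hand + skin-tone modifier + "!"
    print(len(wave))                      # 3 code points
    print(len(clusters(wave)))            # 2 clusters
    print(next_cursor(wave, 0))           # 2: the cursor skips the whole emoji
    print(truncate(wave, 1) == wave[:2])  # True: the modifier stays attached
```

By contrast, `wave[:1]` would keep the waving hand but drop its skin-tone modifier, which is exactly the kind of split the text warns about.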
Quick Facts
| Property | Value |
|---|---|
| Concept | Extended Grapheme Cluster (EGC) |
| Defined by | Unicode Standard Annex #29 (UAX #29) |
| Python built-in | No (use grapheme package or regex \X) |
| Common pitfall | len(s) counts code points, not clusters |
| Emoji clusters | ZWJ sequences, modifier sequences, flag sequences |
| Hangul | Jamo sequences form single grapheme clusters |