عنقود الجرافيم
'الحرف' كما يدركه المستخدم — ما يبدو كوحدة واحدة؛ قد يتكون من نقاط رموز متعددة (قاعدة + علامات مركّبة، أو تسلسلات emoji ZWJ)؛ 👩💻 = 3 نقاط رموز، 1 grapheme.
What Is a Grapheme Cluster?
A grapheme cluster is what a user perceives as a single "character" on screen—what you see when you tap the delete key once or advance the cursor by one position. Unicode code points and user-perceived characters are not the same: a single grapheme cluster may consist of multiple code points.
The Unicode Standard defines Extended Grapheme Clusters (EGC) in Unicode Standard Annex #29. The boundary rules specify when adjacent code points form a single cluster:
- A base character plus any number of combining marks (e.g., a + ́ → á)
- Hangul syllable sequences (L + V + T clusters, e.g., ㄱ + ㅏ + ㄴ → 간)
- Emoji modifier sequences (👋 + skin-tone modifier 🏽 → 👋🏽)
- Flag sequences (regional indicator J + P → 🇯🇵)
- Emoji ZWJ sequences (👨 + ZWJ + 👩 + ZWJ + 👧 → 👨👩👧)
- Emoji presentation sequences (digit + U+FE0F variation selector → 1️⃣)
Grapheme Cluster Iteration in Python
# Python's len() counts code points, not grapheme clusters
flag = "\U0001F1EF\U0001F1F5" # 🇯🇵 Japan flag (J + P regional indicators)
print(len(flag)) # 2 code points
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467" # 👨👩👧
print(len(family)) # 5 code points
# For grapheme-aware string operations, use the 'grapheme' package
# pip install grapheme
try:
import grapheme
print(grapheme.length(flag)) # 1
print(grapheme.length(family)) # 1
# Grapheme-safe slicing
text = "café" # If stored as c + a + f + e + combining acute
nfd = "cafe\u0301"
print(grapheme.length(nfd)) # 4 (user sees 4 characters)
print(len(nfd)) # 5 (5 code points)
except ImportError:
print("Install 'grapheme' package for EGC support")
# Hangul example
hangul = "\u0067\u0041\u002F" # Not Hangul — just example of combining
syllable = "\uAC00" # 가 — precomposed Hangul syllable
jamo = "\u1100\u1161" # ᄀ + ᅡ — jamo sequence = same grapheme
import unicodedata
print(unicodedata.normalize("NFC", jamo) == syllable) # True
Why Grapheme Clusters Matter
Cursor movement and text editing: A text editor must advance the cursor by one grapheme cluster, not one code point. Moving one code point through 👋🏽 (two code points) would split the emoji in half, leaving a broken sequence.
String truncation: Truncating a string to 10 "characters" for display must count grapheme clusters. text[:10] in Python counts code points and may split an emoji sequence.
Regular expressions: The regex \X in PCRE2 and the Python regex package matches a single extended grapheme cluster, enabling grapheme-aware patterns.
Quick Facts
| Property | Value |
|---|---|
| Concept | Extended Grapheme Cluster (EGC) |
| Defined by | Unicode Standard Annex #29 (UAX #29) |
| Python built-in | No (use grapheme package or regex \X) |
| Common pitfall | len(s) counts code points, not clusters |
| Emoji clusters | ZWJ sequences, modifier sequences, flag sequences |
| Hangul | Jamo sequences form single grapheme clusters |
المصطلحات ذات الصلة
المزيد في الخصائص
Unicode property (UAX#11) classifying characters as Narrow, Wide, Fullwidth, Halfwidth, Ambiguous, or …
Unicode property controlling how Arabic and Syriac characters connect to adjacent characters. …
Unicode property listing all scripts that use a character, broader than the …
أسماء بديلة للأحرف، نظرًا لأن أسماء Unicode لا يمكن تغييرها وفقًا لسياسة …
تحويل الحرف إلى مكوناته؛ التفكيك الكنسي يحافظ على المعنى (é → e …
تسلسلان من الأحرف متطابقان دلاليًا ويجب معاملتهما كمتساويين؛ مثال: é (U+00E9) ≡ …
تصنيف كل نقطة رمز إلى واحدة من 30 فئة (Lu, Ll, Nd, …
التفسير الرقمي للحرف إن وُجد: قيمة الرقم (0–9)، قيمة عشرية، أو قيمة …
قواعد تحويل الأحرف بين الأحرف الكبيرة والصغيرة وأحرف العنوان؛ قد تعتمد على …
تسلسلان من الأحرف لهما نفس المحتوى المجرد لكن قد يختلفان في المظهر؛ …