What is a code point?

A numeric value in the Unicode codespace (U+0000 to U+10FFFF), written as U+XXXX. Not all code points are assigned to characters.
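
As a rough illustration of the definition above, the sketch below uses only Python's standard `unicodedata` and `sys` modules. The helper name `describe` is hypothetical and exists only for this example.

```python
import sys
import unicodedata

def describe(cp: int) -> str:
    """Format a code point as U+XXXX followed by its Unicode name, if it has one."""
    label = f"U+{cp:04X}"
    # unicodedata.name() returns the default (None here) for unnamed code points
    name = unicodedata.name(chr(cp), None)
    return f"{label} {name}" if name else f"{label} <no name>"

print(describe(ord("A")))                  # U+0041 LATIN CAPITAL LETTER A
print(describe(0x1F44B))                   # U+1F44B WAVING HAND SIGN
print(unicodedata.category(chr(0x0378)))   # Cn: a code point with no assigned character
print(hex(sys.maxunicode))                 # 0x10ffff, the top of the codespace
```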

What is a combining character?

A character that attaches to the preceding base character to modify it. General category: Mn (Nonspacing Mark), Mc (Spacing Combining Mark), Me (Enclosing Mark). Example: ◌́ (U+0301 Combining Acute Accent).
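
The three general categories can be checked with the standard `unicodedata` module. This is a minimal sketch; `is_combining_mark` is a hypothetical helper name, not a library function.

```python
import unicodedata

def is_combining_mark(ch: str) -> bool:
    """True if the character's general category is Mn, Mc, or Me."""
    return unicodedata.category(ch) in {"Mn", "Mc", "Me"}

acute = "\u0301"                              # COMBINING ACUTE ACCENT
print(unicodedata.category(acute))            # Mn
print(unicodedata.category("\u0903"))         # Mc (DEVANAGARI SIGN VISARGA)
print(unicodedata.category("\u20DD"))         # Me (COMBINING ENCLOSING CIRCLE)
print(is_combining_mark("e"))                 # False: a base letter

# NFC normalization folds base + mark into the precomposed form when one exists
composed = unicodedata.normalize("NFC", "e" + acute)
print(composed == "\u00E9")                   # True
```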

What is a code unit?

The smallest unit of an encoding: an 8-bit byte in UTF-8, a 16-bit word in UTF-16, a 32-bit word in UTF-32. A single character may require multiple code units.
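
To make the difference concrete, the sketch below counts code units per encoding by encoding the string and dividing the byte length by the unit size. The function name `code_units` is an assumption made for this example.

```python
def code_units(text: str) -> dict:
    """Count code points and the code units needed in each encoding form."""
    return {
        "code points": len(text),
        "UTF-8": len(text.encode("utf-8")),            # 1 byte per unit
        "UTF-16": len(text.encode("utf-16-le")) // 2,  # 2 bytes per unit, no BOM
        "UTF-32": len(text.encode("utf-32-le")) // 4,  # 4 bytes per unit, no BOM
    }

for sample in ("A", "\u00E9", "\u20AC", "\U0001F44B"):   # A, é, €, 👋
    print(sample, code_units(sample))
```

The waving-hand emoji is one code point but needs four UTF-8 bytes and two UTF-16 words (a surrogate pair), while UTF-32 always uses one unit per code point.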

Properties

Grapheme cluster

The "character" as the user perceives it: whatever looks like a single unit. It may consist of multiple code points (a base plus combining marks, or an emoji ZWJ sequence); 👩‍💻 = 3 code points, 1 grapheme.

2022-06-13 · Updated 2024-11-07

What Is a Grapheme Cluster?

A grapheme cluster is what a user perceives as a single "character" on screen: the unit removed by one press of the delete key or crossed by one cursor movement. Unicode code points and user-perceived characters are not the same thing, because a single grapheme cluster may consist of multiple code points.

The Unicode Standard defines Extended Grapheme Clusters (EGC) in Unicode Standard Annex #29. The boundary rules specify when adjacent code points form a single cluster:

A base character plus any number of combining marks (e.g., a + ́ → á)
Hangul syllable sequences (L + V + T conjoining jamo, e.g., ᄀ + ᅡ + ᆫ, U+1100 U+1161 U+11AB → 간)
Emoji modifier sequences (👋 + skin-tone modifier 🏽 → 👋🏽)
Flag sequences (regional indicator J + P → 🇯🇵)
Emoji ZWJ sequences (👨 + ZWJ + 👩 + ZWJ + 👧 → 👨‍👩‍👧)
Emoji presentation and keycap sequences (❤ + U+FE0F variation selector → ❤️; 1 + U+FE0F + U+20E3 combining enclosing keycap → 1️⃣)
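
The cases listed above can be approximated with a few lines of standard-library Python. The sketch below is deliberately simplified and is not the UAX #29 algorithm: it treats any Mn/Mc/Me mark, ZWJ, or skin-tone modifier as extending the previous cluster, joins whatever follows a ZWJ, pairs regional indicators, and glues conjoining Hangul vowels and tails onto a preceding Hangul character. All function names are hypothetical; production code should use a library that tracks the Unicode data files.

```python
import unicodedata

ZWJ = "\u200D"

def _is_regional_indicator(ch):
    return "\U0001F1E6" <= ch <= "\U0001F1FF"

def _is_extender(ch):
    # Combining marks (this includes U+FE0F and U+20E3), ZWJ, emoji skin tones
    return (unicodedata.category(ch) in {"Mn", "Mc", "Me"}
            or ch == ZWJ
            or "\U0001F3FB" <= ch <= "\U0001F3FF")

def _is_hangul(ch):
    # Conjoining jamo block or precomposed syllable block
    return "\u1100" <= ch <= "\u11FF" or "\uAC00" <= ch <= "\uD7A3"

def _is_hangul_vowel_or_tail(ch):
    return "\u1160" <= ch <= "\u11FF"

def simple_clusters(text):
    """Split text into approximate grapheme clusters (simplified rules only)."""
    clusters = []
    ri_run = 0  # regional indicators already in the current cluster
    for ch in text:
        prev = clusters[-1][-1] if clusters else ""
        join = bool(prev) and (
            _is_extender(ch)
            or prev == ZWJ
            or (_is_regional_indicator(ch) and _is_regional_indicator(prev)
                and ri_run == 1)
            or (_is_hangul_vowel_or_tail(ch) and _is_hangul(prev))
        )
        if join:
            clusters[-1] += ch
        else:
            clusters.append(ch)
            ri_run = 0
        if _is_regional_indicator(ch):
            ri_run += 1
    return clusters

print(simple_clusters("cafe\u0301"))                      # 4 clusters
print(len(simple_clusters("\U0001F1EF\U0001F1F5")))       # 1 (flag)
print(len(simple_clusters("\u1100\u1161\u11AB")))         # 1 (Hangul jamo)
```

The regional-indicator counter is what keeps two adjacent flags (four indicators) from merging into a single cluster.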

Grapheme Cluster Iteration in Python

# Python's len() counts code points, not grapheme clusters
flag = "\U0001F1EF\U0001F1F5"   # 🇯🇵 Japan flag (J + P regional indicators)
print(len(flag))                 # 2 code points

family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # 👨‍👩‍👧
print(len(family))               # 5 code points

# For grapheme-aware string operations, use the 'grapheme' package
# pip install grapheme
try:
    import grapheme
    print(grapheme.length(flag))    # 1
    print(grapheme.length(family))  # 1

    # Counting a decomposed (NFD) string: c + a + f + e + combining acute
    nfd = "cafe\u0301"
    print(grapheme.length(nfd))     # 4 (user sees 4 characters)
    print(len(nfd))                 # 5 (5 code points)
except ImportError:
    print("Install 'grapheme' package for EGC support")

# Hangul example: a conjoining jamo sequence is one grapheme cluster
import unicodedata
syllable = "\uAC00"             # 가, precomposed Hangul syllable
jamo = "\u1100\u1161"           # ᄀ + ᅡ, jamo sequence = same grapheme
print(unicodedata.normalize("NFC", jamo) == syllable)  # True

Why Grapheme Clusters Matter

Cursor movement and text editing: A text editor must advance the cursor by one grapheme cluster, not one code point. Moving one code point through 👋🏽 (two code points) would split the emoji in half, leaving a broken sequence.

String truncation: Truncating a string to 10 "characters" for display must count grapheme clusters. text[:10] in Python counts code points and may split an emoji sequence.
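
A minimal sketch of cluster-aware truncation follows, assuming the third-party `grapheme` package mentioned earlier is installed. The helper name `truncate_clusters` and its `suffix` parameter are hypothetical; the import is guarded so the code-point comparison still runs without the package.

```python
try:
    import grapheme  # third-party: pip install grapheme
except ImportError:
    grapheme = None

def truncate_clusters(text, limit, suffix="..."):
    """Keep at most `limit` grapheme clusters; append `suffix` if text was cut."""
    if grapheme is None:
        raise RuntimeError("install the 'grapheme' package first")
    if grapheme.length(text) <= limit:
        return text
    return grapheme.slice(text, 0, limit) + suffix

nfd = "cafe\u0301 au lait"
print(nfd[:4])                        # code-point slice: "cafe", the accent is lost
if grapheme is not None:
    print(truncate_clusters(nfd, 4))  # keeps the e and its combining accent together
```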

Regular expressions: The regex \X in PCRE2 and the Python regex package matches a single extended grapheme cluster, enabling grapheme-aware patterns.
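
As a hedged example of `\X` in practice, the sketch below assumes the third-party `regex` package is installed (the standard-library `re` module does not support `\X`). The function name `clusters` is chosen for this example only.

```python
try:
    import regex  # third-party: pip install regex
except ImportError:
    regex = None

def clusters(text):
    """Return the extended grapheme clusters of text as a list of strings."""
    if regex is None:
        raise RuntimeError("install the 'regex' package first")
    return regex.findall(r"\X", text)

if regex is not None:
    print(clusters("cafe\u0301"))                   # the last item is e + combining acute
    print(len(clusters("\U0001F1EF\U0001F1F5")))    # flag counted as one cluster
```

How emoji ZWJ sequences are grouped depends on the Unicode version that the installed `regex` release implements, so results for newer emoji may vary between versions.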

Quick Facts

Property	Value
Concept	Extended Grapheme Cluster (EGC)
Defined by	Unicode Standard Annex #29 (UAX #29)
Python built-in	No (use `grapheme` package or `regex` `\X`)
Common pitfall	`len(s)` counts code points, not clusters
Emoji clusters	ZWJ sequences, modifier sequences, flag sequences
Hangul	Jamo sequences form single grapheme clusters

Related terms

Code point, combining character, code unit