What is a code point?

A numeric value in the Unicode codespace (U+0000 to U+10FFFF), written in the form U+XXXX. Not every code point is assigned to a character.
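As a small illustration of the U+XXXX notation, the sketch below formats code points with Python's built-in ord(). The helper name u_plus is hypothetical, and the choice of U+0378 as an unassigned code point is an assumption that holds for the Unicode data shipped with current Python versions.

```python
import unicodedata

def u_plus(ch):
    """Format a single character's code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

print(u_plus("A"))            # U+0041
print(u_plus("\U0001F44B"))   # U+1F44B (waving hand)

# General category "Cn" means "not assigned" in the Unicode data that
# this Python build ships with (U+0378 is assumed unassigned here).
print(unicodedata.category("\u0378"))  # Cn
```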

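What is a combining character?

A character that attaches to and modifies the preceding base character. General categories: Mn (nonspacing mark), Mc (spacing combining mark), Me (enclosing mark). Example: ◌́ (U+0301 COMBINING ACUTE ACCENT).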
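The three general categories can be inspected with the standard unicodedata module. This is a minimal sketch; the sample characters for Mc and Me (U+0903 and U+20E3) are chosen here for illustration and do not come from the entry above.

```python
import unicodedata

samples = {
    "\u0301": "combining acute accent",
    "\u0903": "Devanagari sign visarga",
    "\u20E3": "combining enclosing keycap",
}

# Map each sample to its general category (Mn, Mc, or Me).
categories = {ch: unicodedata.category(ch) for ch in samples}
for ch, label in samples.items():
    print(f"U+{ord(ch):04X} {label}: {categories[ch]}")

# A base letter plus U+0301 composes to a single code point under NFC.
composed = unicodedata.normalize("NFC", "e\u0301")
print(composed == "\u00E9")  # True
```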

What is a code unit?

The smallest unit of an encoding: an 8-bit byte in UTF-8, a 16-bit unit in UTF-16, and a 32-bit unit in UTF-32. A single character may require multiple code units.
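To make the difference concrete, the sketch below counts code units per encoding by encoding the text and dividing the byte length by the unit size. The function name code_units is hypothetical; the little-endian codecs are used so that no byte order mark is added to the count.

```python
def code_units(text):
    """Return the number of code units needed in each encoding form."""
    return {
        "utf-8": len(text.encode("utf-8")),            # 1 byte per unit
        "utf-16": len(text.encode("utf-16-le")) // 2,  # 2 bytes per unit
        "utf-32": len(text.encode("utf-32-le")) // 4,  # 4 bytes per unit
    }

print(code_units("A"))           # {'utf-8': 1, 'utf-16': 1, 'utf-32': 1}
print(code_units("\u00E9"))      # {'utf-8': 2, 'utf-16': 1, 'utf-32': 1}
print(code_units("\U0001F44B"))  # {'utf-8': 4, 'utf-16': 2, 'utf-32': 1}
```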

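Properties

Extended grapheme cluster

What a user perceives as a "character": something that feels like a single unit. It may consist of multiple code points (a base character plus combining marks, or an emoji ZWJ sequence). 👩‍💻 = 3 code points, 1 grapheme.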
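The three code points behind 👩‍💻 can be listed with the standard library. This is only an inspection sketch; the variable names are illustrative.

```python
import unicodedata

technologist = "\U0001F469\u200D\U0001F4BB"  # 👩‍💻

# Collect the Unicode name of each code point in the sequence.
names = [unicodedata.name(ch) for ch in technologist]
for ch, name in zip(technologist, names):
    print(f"U+{ord(ch):04X} {name}")

print(len(technologist))  # 3 code points, rendered as one grapheme
```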

2022-06-13 · Updated 2024-11-07

What Is a Grapheme Cluster?

A grapheme cluster is what a user perceives as a single "character" on screen: roughly, the unit that one press of the delete key removes or that one cursor movement steps over. Unicode code points and user-perceived characters are not the same thing; a single grapheme cluster may consist of multiple code points.

The Unicode Standard defines Extended Grapheme Clusters (EGC) in Unicode Standard Annex #29. The boundary rules specify when adjacent code points form a single cluster:

A base character plus any number of combining marks (e.g., a + ́ → á)
Hangul syllable sequences (L + V + T conjoining jamo, e.g., U+1100 + U+1161 + U+11AB → 간)
Emoji modifier sequences (👋 + skin-tone modifier 🏽 → 👋🏽)
Flag sequences (regional indicator J + P → 🇯🇵)
Emoji ZWJ sequences (👨 + ZWJ + 👩 + ZWJ + 👧 → 👨‍👩‍👧)
Emoji presentation and keycap sequences (❤ + U+FE0F variation selector → ❤️; digit 1 + U+FE0F + U+20E3 combining enclosing keycap → 1️⃣)
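The rules listed above can be approximated with a deliberately simplified segmenter. This is a sketch under stated assumptions, not an implementation of UAX #29: it attaches combining marks, ZWJ, and skin-tone modifiers to the previous cluster, joins whatever follows a ZWJ, and pairs regional indicators. Hangul jamo sequences and several other rules are omitted, and all function names are hypothetical.

```python
import unicodedata

ZWJ = "\u200D"

def is_extender(ch):
    """True for characters this sketch always attaches to the previous cluster."""
    return (
        unicodedata.category(ch) in ("Mn", "Mc", "Me")  # marks, incl. U+FE0F and U+20E3
        or ch == ZWJ
        or 0x1F3FB <= ord(ch) <= 0x1F3FF                # emoji skin-tone modifiers
    )

def is_regional_indicator(ch):
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF

def simple_clusters(text):
    """Split text into approximate grapheme clusters (simplified rules only)."""
    clusters = []
    for ch in text:
        if clusters:
            prev = clusters[-1]
            joins = (
                is_extender(ch)
                or prev[-1] == ZWJ  # simplification: anything after ZWJ joins
                or (is_regional_indicator(ch)
                    and len(prev) == 1
                    and is_regional_indicator(prev[0]))
            )
            if joins:
                clusters[-1] = prev + ch
                continue
        clusters.append(ch)
    return clusters

print(len(simple_clusters("cafe\u0301")))                                  # 4
print(len(simple_clusters("\U0001F1EF\U0001F1F5\U0001F1EF\U0001F1F5")))    # 2 flags
print(len(simple_clusters("\U0001F468\u200D\U0001F469\u200D\U0001F467")))  # 1 family
```

For production use, a library that tracks the full rule set and current Unicode data is the safer choice; the sketch only shows why adjacent code points end up in one cluster.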

Grapheme Cluster Iteration in Python

# Python's len() counts code points, not grapheme clusters
flag = "\U0001F1EF\U0001F1F5"   # 🇯🇵 Japan flag (J + P regional indicators)
print(len(flag))                 # 2 code points

family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # 👨‍👩‍👧
print(len(family))               # 5 code points

# For grapheme-aware string operations, use the 'grapheme' package
# pip install grapheme
try:
    import grapheme
    print(grapheme.length(flag))    # 1
    print(grapheme.length(family))  # 1

    # Grapheme-aware counting of "café" in decomposed (NFD) form:
    # c + a + f + e + U+0301 combining acute accent
    nfd = "cafe\u0301"
    print(grapheme.length(nfd))     # 4 (user sees 4 characters)
    print(len(nfd))                 # 5 (5 code points)
except ImportError:
    print("Install 'grapheme' package for EGC support")

# Hangul example: a conjoining jamo sequence and its precomposed syllable
syllable = "\uAC00"             # 가, precomposed Hangul syllable
jamo = "\u1100\u1161"           # ᄀ + ᅡ, jamo sequence forming the same grapheme
import unicodedata
print(unicodedata.normalize("NFC", jamo) == syllable)  # True

Why Grapheme Clusters Matter

Cursor movement and text editing: A text editor must advance the cursor by one grapheme cluster, not one code point. Moving one code point through 👋🏽 (two code points) would split the emoji in half, leaving a broken sequence.

String truncation: Truncating a string to 10 "characters" for display must count grapheme clusters. text[:10] in Python counts code points and may split an emoji sequence.
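The truncation pitfall can be reproduced with the standard library alone. The sketch below also tries grapheme.slice from the third-party grapheme package mentioned earlier; that part is optional and is skipped if the package is not installed.

```python
# Nine ASCII characters, then a waving hand with a skin-tone modifier, then "!"
text = "123456789" + "\U0001F44B\U0001F3FD" + "!"

# Code-point slicing keeps the hand but drops its skin-tone modifier.
cut = text[:10]
print(cut.endswith("\U0001F44B"))  # True: the cluster was split

try:
    import grapheme  # third-party: pip install grapheme
    # Slice by grapheme clusters instead of code points.
    safe = grapheme.slice(text, end=10)
    print(safe.endswith("\U0001F44B\U0001F3FD"))
except ImportError:
    safe = None  # package not available; only the pitfall is shown
```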

Regular expressions: \X in PCRE2 and in the third-party Python regex package matches a single extended grapheme cluster, enabling grapheme-aware patterns. Python's built-in re module does not support \X.
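A short sketch of the difference in Python: the built-in re module rejects \X as an unknown escape, while the third-party regex package (assumed installed via pip install regex) accepts it. The variable names are illustrative.

```python
import re

# The standard library treats \X as an invalid escape sequence.
try:
    re.compile(r"\X")
    stdlib_supports_x = True
except re.error:
    stdlib_supports_x = False
print(stdlib_supports_x)  # False

sample = "cafe\u0301 \U0001F44B\U0001F3FD"  # decomposed "café", space, 👋🏽

try:
    import regex  # third-party package with \X support
    clusters = regex.findall(r"\X", sample)
    print(len(clusters), "clusters from", len(sample), "code points")
except ImportError:
    clusters = None  # regex not installed; nothing further to show
```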

Quick Facts

Property	Value
Concept	Extended Grapheme Cluster (EGC)
Defined by	Unicode Standard Annex #29 (UAX #29)
Python built-in	No (use `grapheme` package or `regex` `\X`)
Common pitfall	`len(s)` counts code points, not clusters
Emoji clusters	ZWJ sequences, modifier sequences, flag sequences
Hangul	Jamo sequences form single grapheme clusters