What is a grapheme cluster?

A user-perceived character: something that feels like a single unit. It may consist of several code points (a base plus combining marks, or an emoji ZWJ sequence). 👩‍💻 = 3 code points, 1 grapheme.

What is a sentence boundary?

A position between sentences as defined by the Unicode rules. Finding it is more complex than splitting at every period: the rules have to contend with abbreviations (Mr.), ellipses (...), and decimal points (3.14).

Algorithms

Text Segmentation

Algorithms for finding boundaries in text: grapheme cluster, word, and sentence boundaries. They are essential for cursor movement, text selection, and text processing.

2022-09-26 · Updated 2024-12-02

Boundaries in Unicode Text

Text is not a flat sequence of code points — it has structure. Users think in terms of characters, words, and sentences. But Unicode code points do not map cleanly to these concepts. A single visible character (a grapheme cluster) can span multiple code points. A "word" means different things in English, Japanese, and Arabic. A sentence boundary after a period is ambiguous when periods also appear in abbreviations and numbers.

Unicode Text Segmentation (UAX #29) defines algorithms for finding grapheme cluster boundaries, word boundaries, and sentence boundaries. These algorithms are the foundation for correct cursor movement, text selection, word counting, and spell checking in any Unicode-aware application.
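The sentence-boundary ambiguity described above is easy to reproduce with the standard library alone. The sketch below is only an illustration of two naive strategies and is not part of UAX #29; the variable names are made up for this example.

```python
import re

text = "Don't stop. Dr. Smith arrived at 3.14 PM."

# Naive strategy 1: treat every period as the end of a sentence.
by_period = text.split(".")

# Naive strategy 2: split at whitespace that follows '.', '!' or '?'.
by_punct_space = re.split(r"(?<=[.!?])\s+", text)

print(by_period)
# ["Don't stop", ' Dr', ' Smith arrived at 3', '14 PM', '']
print(by_punct_space)
# ["Don't stop.", 'Dr.', 'Smith arrived at 3.14 PM.']
```

The first strategy cuts the number 3.14 in half, and both cut after the abbreviation "Dr.", which is why sentence segmentation needs rules that look at the characters on both sides of the period.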

The Grapheme Cluster Problem

Python's len() function counts code points, not user-perceived characters:

# Emoji with ZWJ sequence: 1 visible character, 7 code points
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
print(len(family))        # 7 (code points)
# User sees: 👨‍👩‍👧‍👦 (one family emoji)

# Combining characters
cafe = "cafe\u0301"       # e + combining acute = é
print(len(cafe))           # 5 (code points)
print(len("café"))         # 4 (precomposed NFC)
# Both render as "café" — 4 user-perceived characters

# Flag emoji: 2 regional indicator symbols = 1 flag
flag = "\U0001F1FA\U0001F1F8"  # 🇺🇸
print(len(flag))           # 2 (code points)
# User sees: 🇺🇸 (1 flag)

A grapheme cluster is the minimal unit a user thinks of as a single character. UAX #29 defines grapheme cluster boundary rules that handle:

- Base + combining marks
- Hangul syllable sequences (jamo combining rules)
- Regional indicator pairs (flags)
- Zero Width Joiner (ZWJ) sequences (family/profession emoji)
- Extend characters (tags, emoji modifiers)
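To give a feel for how such rules work, here is a deliberately simplified sketch that uses only the standard library. It is an approximation, not the UAX #29 algorithm: it covers combining marks, ZWJ joins, emoji modifiers, and regional indicator pairs, and it ignores Hangul jamo and the real Unicode property tables. The helper names (`is_extender`, `simple_clusters`) are hypothetical.

```python
import unicodedata

ZWJ = "\u200D"


def is_regional_indicator(ch: str) -> bool:
    # Regional indicator symbols occupy U+1F1E6..U+1F1FF.
    return "\U0001F1E6" <= ch <= "\U0001F1FF"


def is_extender(ch: str) -> bool:
    # Rough stand-in for "attaches to the previous character":
    # combining marks (including variation selectors) and
    # emoji skin-tone modifiers U+1F3FB..U+1F3FF.
    return (
        unicodedata.category(ch) in ("Mn", "Me", "Mc")
        or "\U0001F3FB" <= ch <= "\U0001F3FF"
    )


def simple_clusters(text: str) -> list:
    """Group code points into approximate grapheme clusters."""
    clusters = []
    for ch in text:
        if clusters:
            prev = clusters[-1]
            # Flags: join a regional indicator only to an unpaired one.
            pairs_flag = (
                is_regional_indicator(ch)
                and is_regional_indicator(prev[-1])
                and sum(map(is_regional_indicator, prev)) % 2 == 1
            )
            if is_extender(ch) or ch == ZWJ or prev[-1] == ZWJ or pairs_flag:
                clusters[-1] = prev + ch
                continue
        clusters.append(ch)
    return clusters


family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
print(len(simple_clusters(family)))                              # 1
print(len(simple_clusters("cafe\u0301")))                        # 4
print(len(simple_clusters("\U0001F1FA\U0001F1F8\U0001F1EC\U0001F1E7")))  # 2
```

For anything beyond experimentation, a library that implements the full rule set and tracks new Unicode versions is the safer choice.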

Using UAX #29 in Python

The grapheme package provides UAX #29-compliant grapheme cluster segmentation:

# pip install grapheme
import grapheme

family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
print(grapheme.length(family))           # 1
print(list(grapheme.graphemes(family)))  # ['👨‍👩‍👧‍👦']

text = "Hello, 世界! 🌍"
print(grapheme.length(text))             # 12 (user-perceived chars)

# Safe string slicing (by grapheme, not code point)
print(grapheme.slice(text, 0, 5))        # 'Hello'

For industrial-strength segmentation including word and sentence boundaries, use ICU via PyICU:

from icu import BreakIterator, Locale

text = "Don't stop. Dr. Smith arrived at 3.14 PM."
bi = BreakIterator.createSentenceInstance(Locale("en_US"))
bi.setText(text)
start = 0
for end in bi:
    print(repr(text[start:end]))
    start = end
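# Prints "Don't stop. " first; the period in 3.14 does not end a sentence.
# Whether "Dr. " is split off depends on the ICU rules in use: the default
# UAX #29 rules carry no abbreviation list, so a break after it is possible.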
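One caveat when slicing a Python string with ICU boundaries: ICU works on UTF-16 internally, so the offsets it reports are assumed here to be UTF-16 code units rather than code points. For text limited to the Basic Multilingual Plane the two agree, but emoji and other supplementary characters take two code units each. The helper below is a hypothetical standard-library sketch for converting such an offset back to a Python index.

```python
def utf16_offset_to_index(text: str, offset: int) -> int:
    """Convert an offset in UTF-16 code units to a Python str index."""
    units = 0
    for index, ch in enumerate(text):
        if units >= offset:
            return index
        # Characters above U+FFFF need a surrogate pair (2 code units).
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(text)


text = "\U0001F30D Hi. Bye."
# "🌍 Hi. " is 6 code points but 7 UTF-16 code units.
end = utf16_offset_to_index(text, 7)
print(repr(text[:end]))   # '🌍 Hi. '
```

Applying this conversion to each boundary before slicing keeps the segments aligned with what the iterator intended.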

Quick Facts

Property	Value
Specification	Unicode Standard Annex #29 (UAX #29)
Boundary types	Grapheme cluster, word, sentence
Python `len()`	Counts code points, not grapheme clusters
Python package	`grapheme` (pip install grapheme)
Full ICU support	`PyICU`: `BreakIterator.createCharacterInstance()`, `createWordInstance()`, `createSentenceInstance()`
ZWJ sequences	Zero Width Joiner (U+200D) joins emoji into single grapheme cluster
Regional indicators	Two regional indicator letters form a single flag grapheme cluster
Hangul	Jamo sequences (L + V + T) form a single syllable grapheme cluster

Related Terms

Grapheme Cluster, Word Boundary, Sentence Boundary

More in Algorithms

Case Folding

Mapping characters to a common case form for case-insensitive comparison. More comprehensive …

Grapheme Cluster Boundary

Rules (UAX#29) for determining where one user-perceived character ends and another begins. …

NFC (Canonical Composition)

Normalization Form C: canonical decomposition followed by canonical recomposition, which typically yields the most compact form. Recommended for storage and data interchange, and the standard form on the web.

NFD (Canonical Decomposition)

Normalization Form D: full canonical decomposition without recomposition. Used by the macOS HFS+ file system. é (U+00E9) → e + …

NFKC (Compatibility Composition)

Normalization Form KC: compatibility decomposition followed by canonical composition. Folds compatibility variants into their plain equivalents (ﬁ→fi, ²→2, Ⅳ→IV). Used for comparing identifiers.

NFKD (Compatibility Decomposition)

Normalization Form KD: compatibility decomposition without recomposition. The most aggressive normalization form; it loses the most formatting information.

String Comparison

Comparing Unicode strings requires normalization (NFC/NFD) and optionally collation (locale-aware sorting). Binary …

Normalization

The process of converting Unicode text to a standard canonical form. There are four forms: NFC (composed), NFD (decomposed), …

Composition Exclusion

Characters excluded from canonical composition (NFC) to prevent non-starter decompositions and to guarantee the stability of the algorithm. Listed in CompositionExclusions.txt.

Word Boundary

A position between words as defined by the Unicode word-breaking rules. More than splitting on whitespace: the rules also cover CJK text (written without spaces), abbreviations, and numbers.
