Unicode Text Segmentation
Algorithms for finding boundaries in text: grapheme clusters, words, and sentences. These boundaries are essential for cursor movement, text selection, and text processing.
Boundaries in Unicode Text
Text is not a flat sequence of code points — it has structure. Users think in terms of characters, words, and sentences. But Unicode code points do not map cleanly to these concepts. A single visible character (a grapheme cluster) can span multiple code points. A "word" means different things in English, Japanese, and Arabic. A sentence boundary after a period is ambiguous when periods also appear in abbreviations and numbers.
Unicode Text Segmentation (UAX #29) defines algorithms for finding grapheme cluster boundaries, word boundaries, and sentence boundaries. These algorithms are the foundation for correct cursor movement, text selection, word counting, and spell checking in any Unicode-aware application.
The Grapheme Cluster Problem
Python's len() function counts code points, not user-perceived characters:
# Emoji with ZWJ sequence: 1 visible character, 7 code points
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
print(len(family)) # 7 (code points)
# User sees: 👨👩👧👦 (one family emoji)
# Combining characters
cafe = "cafe\u0301" # e + combining acute = é
print(len(cafe)) # 5 (code points)
print(len("café")) # 4 (precomposed NFC)
# Both render as "café" — 4 user-perceived characters
# Flag emoji: 2 regional indicator symbols = 1 flag
flag = "\U0001F1FA\U0001F1F8" # 🇺🇸
print(len(flag)) # 2 (code points)
# User sees: 🇺🇸 (1 flag)
A grapheme cluster is the minimal unit a user thinks of as a single character. UAX #29 defines grapheme cluster boundary rules that handle:
- Base + combining marks
- Hangul syllable sequences (jamo combining rules)
- Regional indicator pairs (flags)
- Zero Width Joiner (ZWJ) sequences (family/profession emoji)
- Extend characters (tags, emoji modifiers)
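To make the first and fourth rules concrete, here is a deliberately simplified segmenter using only the standard library. It is a sketch, not full UAX #29: the function name `simple_graphemes` is hypothetical, and it handles only combining marks and ZWJ joins, not Hangul jamo, regional indicators, or emoji modifiers.

```python
import unicodedata

def simple_graphemes(s):
    # Naive segmentation: attach combining marks and ZWJ-joined
    # characters to the preceding cluster. NOT full UAX #29 --
    # it ignores Hangul jamo, regional indicators, and prepend rules.
    clusters = []
    for ch in s:
        joins = (unicodedata.combining(ch) != 0   # combining mark
                 or ch == "\u200D"                # Zero Width Joiner
                 or (clusters and clusters[-1].endswith("\u200D")))
        if joins and clusters:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

print(len(simple_graphemes("cafe\u0301")))  # 4
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
print(len(simple_graphemes(family)))  # 1
```

Even this toy version fixes the `len()` mismatch for the two most common cases; a real application should still use a UAX #29 implementation.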
Using UAX #29 in Python
The grapheme package provides UAX #29-compliant grapheme cluster segmentation:
# pip install grapheme
import grapheme
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
print(grapheme.length(family)) # 1
print(list(grapheme.graphemes(family))) # ['👨👩👧👦']
text = "Hello, 世界! 🌍"
print(grapheme.length(text)) # 12 (user-perceived characters)
# Safe string slicing (by grapheme, not code point)
print(grapheme.slice(text, 0, 5)) # 'Hello'
For industrial-strength segmentation including word and sentence boundaries, use ICU via PyICU:
from icu import BreakIterator, Locale
text = "Don't stop. Dr. Smith arrived at 3.14 PM."
bi = BreakIterator.createSentenceInstance(Locale("en_US"))
bi.setText(text)
start = 0
for end in bi:
    print(repr(text[start:end]))
    start = end
# "Don't stop. " | "Dr. " | "Smith arrived at 3.14 PM."
# Note: the plain sentence iterator follows the default UAX #29 rules,
# which break after abbreviations like "Dr."; suppressing those breaks
# requires ICU's filtered break iterator with locale suppression data.
# "3.14" stays intact because a period followed by a digit is no break.
Quick Facts
| Property | Value |
|---|---|
| Specification | Unicode Standard Annex #29 (UAX #29) |
| Boundary types | Grapheme cluster, word, sentence |
| Python len() | Counts code points, not grapheme clusters |
| Python package | grapheme (pip install grapheme) |
| Full ICU support | PyICU — BreakIterator.createGraphemeInstance() etc. |
| ZWJ sequences | Zero Width Joiner (U+200D) joins emoji into single grapheme cluster |
| Regional indicators | Two regional indicator letters form a single flag grapheme cluster |
| Hangul | Jamo sequences (L + V + T) form a single syllable grapheme cluster |
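The regional-indicator row above can be sketched directly in code. This is a hypothetical illustration (`count_flags` is not a library function) of UAX #29 rules GB12/GB13, under which consecutive regional indicator symbols pair off two at a time:

```python
def count_flags(s):
    # UAX #29 GB12/GB13: regional indicator symbols (U+1F1E6..U+1F1FF)
    # pair off two at a time; each pair is one flag grapheme cluster.
    flags = 0
    pending = False  # True when one unpaired indicator has been seen
    for ch in s:
        if 0x1F1E6 <= ord(ch) <= 0x1F1FF:
            if pending:
                flags += 1
            pending = not pending
        else:
            pending = False
    return flags

print(count_flags("\U0001F1FA\U0001F1F8\U0001F1EF\U0001F1F5"))  # 2 (US, JP)
```

This pairing rule is why deleting "one character" before a flag must remove two code points at once.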