What is Unicode 文本分割?

查找文本中各类边界的算法：字素簇、词和句子边界，对光标移动、文本选择和文本处理至关重要。

由Unicode词边界规则确定的词间位置，不是简单地按空格分割，而是正确处理CJK（无空格）、缩写和数字。

算法

句子边界

按照Unicode规则确定的句子间位置，比按句号分割更复杂，能正确处理缩写（Mr.）、省略号（...）和小数点（3.14）。

2022-11-14 · Updated 2024-09-15

The Ambiguous Period Problem

Detecting sentence boundaries sounds simple: look for ., !, or ? followed by a space and a capital letter. But this heuristic fails immediately in real text. Consider:

Dr. Smith arrived. — period after "Dr" is an abbreviation, not a sentence end
The price is $3.14. — period in a decimal number, plus a terminal period
U.S.A. was founded... — initials and ellipsis
"Really?" she asked. — period inside quotation marks
He said (see Fig. 3) that... — parenthetical abbreviation

UAX #29 Sentence Boundary rules handle these cases by examining context around potential boundary characters, using character properties and look-ahead/look-behind to disambiguate.

UAX #29 Sentence Boundary Algorithm

The algorithm assigns characters to sentence break properties and applies rules to find boundaries. Key properties:

Property	Examples	Role
STerm	`.` `!` `?`	Potential sentence terminators
ATerm	`.`	Terminator that may be an abbreviation
Upper	A–Z	Uppercase letters (help detect abbreviations)
Lower	a–z	Lowercase letters
Close	`)` `"` `'`	Closing punctuation after terminal
Sp	spaces	Post-terminal spacing
SContinue	`,` `;` `:`	Continuation characters (no break here)
Sep	newlines	Paragraph separators

The critical distinction between STerm and ATerm: ! and ? are STerm only — they almost always end sentences. . is ATerm — it may end a sentence OR be part of an abbreviation or decimal number. The algorithm uses follow-up rules:

If an ATerm (.) is followed by an uppercase letter, it could be an abbreviation (Dr. Smith) or the start of a new sentence — context matters.
If an ATerm is followed by a lowercase letter, it is probably an abbreviation — no break.
If an ATerm is followed by space(s) then an uppercase letter, it is probably a sentence end — boundary.

Python Sentence Segmentation

# Using ICU (most accurate UAX #29 compliance)
from icu import BreakIterator, Locale

text = "Dr. Smith arrived at 3.14 PM. He said hello. U.S.A. was mentioned."
bi = BreakIterator.createSentenceInstance(Locale("en_US"))
bi.setText(text)

start = 0
for end in bi:
    sentence = text[start:end]
    print(repr(sentence))
    start = end
# "Dr. Smith arrived at 3.14 PM. "
# "He said hello. "
# "U.S.A. was mentioned."

# Using nltk for NLP-quality sentence tokenization
import nltk
sentences = nltk.sent_tokenize(text)
# NLTK uses a Punkt model trained on corpus data
# More accurate for abbreviations than pure rule-based UAX #29

NLP Applications

Sentence boundary detection (also called sentence segmentation or SBD) is a prerequisite for many NLP tasks: - Machine translation (translation is sentence-level) - Summarization (sentence scoring) - Named entity recognition (context window) - Sentiment analysis (per-sentence or per-review)

For high-accuracy NLP, corpus-trained models (like NLTK's Punkt or spaCy's sentencizer) typically outperform pure rule-based UAX #29 because they learn abbreviation lists from training data. UAX #29 provides a principled baseline that works without training data.

Quick Facts

Property	Value
Specification	UAX #29, Section 5 (Sentence Boundaries)
Key challenge	Disambiguating period as abbreviation vs. sentence terminator
ATerm	`.` — may or may not end a sentence
STerm	`!` `?` — almost always end a sentence
Python ICU	`BreakIterator.createSentenceInstance(Locale("en_US"))`
NLP alternative	NLTK Punkt, spaCy sentencizer (corpus-trained)
Practical note	No algorithm is 100% accurate — ambiguity is inherent