What is Unicode Text Segmentation?

Algorithms for finding boundaries in text: grapheme cluster, word, and sentence boundaries. Critical for cursor movement, text selection, and text processing.

What is Word Boundary?

The position between words as determined by Unicode word break rules. Not a simple split on spaces — handles CJK (no spaces), contractions, and numbers correctly.

Algorithms

Sentence Boundary

The position between sentences per Unicode rules. More complex than splitting on periods — handles abbreviations (Mr.), ellipsis (...), and decimal points (3.14).

2022-11-14 · Updated 2024-09-15

The Ambiguous Period Problem

Detecting sentence boundaries sounds simple: look for ., !, or ? followed by a space and a capital letter. But this heuristic fails immediately in real text. Consider:

Dr. Smith arrived. — period after "Dr" is an abbreviation, not a sentence end
The price is $3.14. — period in a decimal number, plus a terminal period
U.S.A. was founded... — initials and ellipsis
"Really?" she asked. — period inside quotation marks
He said (see Fig. 3) that... — parenthetical abbreviation

UAX #29 Sentence Boundary rules handle these cases by examining context around potential boundary characters, using character properties and look-ahead/look-behind to disambiguate.

UAX #29 Sentence Boundary Algorithm

The algorithm assigns characters to sentence break properties and applies rules to find boundaries. Key properties:

Property	Examples	Role
STerm	`.` `!` `?`	Potential sentence terminators
ATerm	`.`	Terminator that may be an abbreviation
Upper	A–Z	Uppercase letters (help detect abbreviations)
Lower	a–z	Lowercase letters
Close	`)` `"` `'`	Closing punctuation after terminal
Sp	spaces	Post-terminal spacing
SContinue	`,` `;` `:`	Continuation characters (no break here)
Sep	newlines	Paragraph separators

The critical distinction between STerm and ATerm: ! and ? are STerm only — they almost always end sentences. . is ATerm — it may end a sentence OR be part of an abbreviation or decimal number. The algorithm uses follow-up rules:

If an ATerm (.) is followed by an uppercase letter, it could be an abbreviation (Dr. Smith) or the start of a new sentence — context matters.
If an ATerm is followed by a lowercase letter, it is probably an abbreviation — no break.
If an ATerm is followed by space(s) then an uppercase letter, it is probably a sentence end — boundary.

Python Sentence Segmentation

# Using ICU (most accurate UAX #29 compliance)
from icu import BreakIterator, Locale

text = "Dr. Smith arrived at 3.14 PM. He said hello. U.S.A. was mentioned."
bi = BreakIterator.createSentenceInstance(Locale("en_US"))
bi.setText(text)

start = 0
for end in bi:
    sentence = text[start:end]
    print(repr(sentence))
    start = end
# "Dr. Smith arrived at 3.14 PM. "
# "He said hello. "
# "U.S.A. was mentioned."

# Using nltk for NLP-quality sentence tokenization
import nltk
sentences = nltk.sent_tokenize(text)
# NLTK uses a Punkt model trained on corpus data
# More accurate for abbreviations than pure rule-based UAX #29

NLP Applications

Sentence boundary detection (also called sentence segmentation or SBD) is a prerequisite for many NLP tasks: - Machine translation (translation is sentence-level) - Summarization (sentence scoring) - Named entity recognition (context window) - Sentiment analysis (per-sentence or per-review)

For high-accuracy NLP, corpus-trained models (like NLTK's Punkt or spaCy's sentencizer) typically outperform pure rule-based UAX #29 because they learn abbreviation lists from training data. UAX #29 provides a principled baseline that works without training data.

Quick Facts

Property	Value
Specification	UAX #29, Section 5 (Sentence Boundaries)
Key challenge	Disambiguating period as abbreviation vs. sentence terminator
ATerm	`.` — may or may not end a sentence
STerm	`!` `?` — almost always end a sentence
Python ICU	`BreakIterator.createSentenceInstance(Locale("en_US"))`
NLP alternative	NLTK Punkt, spaCy sentencizer (corpus-trained)
Practical note	No algorithm is 100% accurate — ambiguity is inherent

Related Terms

Unicode Text Segmentation Word Boundary

More in Algorithms

Case Folding

Mapping characters to a common case form for case-insensitive comparison. More comprehensive …

Composition Exclusion

Characters excluded from canonical composition (NFC) to prevent non-starter decomposition and ensure …

Grapheme Cluster Boundary

Rules (UAX#29) for determining where one user-perceived character ends and another begins. …

NFC (Canonical Composition)

Normalization Form C: decompose then recompose canonically, producing the shortest form. Recommended …

NFD (Canonical Decomposition)

Normalization Form D: fully decompose without recomposing. Used by the macOS HFS+ …

NFKC (Compatibility Composition)

Normalization Form KC: compatibility decomposition then canonical composition. Merges visually similar characters …

NFKD (Compatibility Decomposition)

Normalization Form KD: compatibility decomposition without recomposing. The most aggressive normalization, losing …

String Comparison

Comparing Unicode strings requires normalization (NFC/NFD) and optionally collation (locale-aware sorting). Binary …

Unicode Bidirectional Algorithm (UBA)

Algorithm determining display order of characters in mixed-direction text (e.g., English + …

Unicode Collation Algorithm (UCA)

Standard algorithm for comparing and sorting Unicode strings using multi-level comparison: base …

← Back to Glossary