Sentence Boundary
The position between sentences per Unicode rules. More complex than splitting on periods — handles abbreviations (Mr.), ellipsis (...), and decimal points (3.14).
The Ambiguous Period Problem
Detecting sentence boundaries sounds simple: look for ., !, or ? followed by a space and a capital letter. But this heuristic fails immediately in real text. Consider:
Dr. Smith arrived.— period after "Dr" is an abbreviation, not a sentence endThe price is $3.14.— period in a decimal number, plus a terminal periodU.S.A. was founded...— initials and ellipsis"Really?" she asked.— period inside quotation marksHe said (see Fig. 3) that...— parenthetical abbreviation
UAX #29 Sentence Boundary rules handle these cases by examining context around potential boundary characters, using character properties and look-ahead/look-behind to disambiguate.
UAX #29 Sentence Boundary Algorithm
The algorithm assigns characters to sentence break properties and applies rules to find boundaries. Key properties:
| Property | Examples | Role |
|---|---|---|
| STerm | . ! ? |
Potential sentence terminators |
| ATerm | . |
Terminator that may be an abbreviation |
| Upper | A–Z | Uppercase letters (help detect abbreviations) |
| Lower | a–z | Lowercase letters |
| Close | ) " ' |
Closing punctuation after terminal |
| Sp | spaces | Post-terminal spacing |
| SContinue | , ; : |
Continuation characters (no break here) |
| Sep | newlines | Paragraph separators |
The critical distinction between STerm and ATerm: ! and ? are STerm only — they almost always end sentences. . is ATerm — it may end a sentence OR be part of an abbreviation or decimal number. The algorithm uses follow-up rules:
- If an ATerm (
.) is followed by an uppercase letter, it could be an abbreviation (Dr. Smith) or the start of a new sentence — context matters. - If an ATerm is followed by a lowercase letter, it is probably an abbreviation — no break.
- If an ATerm is followed by space(s) then an uppercase letter, it is probably a sentence end — boundary.
Python Sentence Segmentation
# Using ICU (most accurate UAX #29 compliance)
from icu import BreakIterator, Locale
text = "Dr. Smith arrived at 3.14 PM. He said hello. U.S.A. was mentioned."
bi = BreakIterator.createSentenceInstance(Locale("en_US"))
bi.setText(text)
start = 0
for end in bi:
sentence = text[start:end]
print(repr(sentence))
start = end
# "Dr. Smith arrived at 3.14 PM. "
# "He said hello. "
# "U.S.A. was mentioned."
# Using nltk for NLP-quality sentence tokenization
import nltk
sentences = nltk.sent_tokenize(text)
# NLTK uses a Punkt model trained on corpus data
# More accurate for abbreviations than pure rule-based UAX #29
NLP Applications
Sentence boundary detection (also called sentence segmentation or SBD) is a prerequisite for many NLP tasks: - Machine translation (translation is sentence-level) - Summarization (sentence scoring) - Named entity recognition (context window) - Sentiment analysis (per-sentence or per-review)
For high-accuracy NLP, corpus-trained models (like NLTK's Punkt or spaCy's sentencizer) typically outperform pure rule-based UAX #29 because they learn abbreviation lists from training data. UAX #29 provides a principled baseline that works without training data.
Quick Facts
| Property | Value |
|---|---|
| Specification | UAX #29, Section 5 (Sentence Boundaries) |
| Key challenge | Disambiguating period as abbreviation vs. sentence terminator |
| ATerm | . — may or may not end a sentence |
| STerm | ! ? — almost always end a sentence |
| Python ICU | BreakIterator.createSentenceInstance(Locale("en_US")) |
| NLP alternative | NLTK Punkt, spaCy sentencizer (corpus-trained) |
| Practical note | No algorithm is 100% accurate — ambiguity is inherent |
Related Terms
More in Algorithms
Mapping characters to a common case form for case-insensitive comparison. More comprehensive …
Characters excluded from canonical composition (NFC) to prevent non-starter decomposition and ensure …
Rules (UAX#29) for determining where one user-perceived character ends and another begins. …
Normalization Form C: decompose then recompose canonically, producing the shortest form. Recommended …
Normalization Form D: fully decompose without recomposing. Used by the macOS HFS+ …
Normalization Form KC: compatibility decomposition then canonical composition. Merges visually similar characters …
Normalization Form KD: compatibility decomposition without recomposing. The most aggressive normalization, losing …
Comparing Unicode strings requires normalization (NFC/NFD) and optionally collation (locale-aware sorting). Binary …
Algorithm determining display order of characters in mixed-direction text (e.g., English + …
Standard algorithm for comparing and sorting Unicode strings using multi-level comparison: base …