Sentence Boundary
The position between sentences according to the Unicode rules. Finding it is more complex than splitting on periods: the rules must account for abbreviations (Mr.), ellipses (...), and decimal numbers (3.14).
The Ambiguous Period Problem
Detecting sentence boundaries sounds simple: look for ., !, or ? followed by a space and a capital letter. But this heuristic fails immediately in real text. Consider:
- Dr. Smith arrived. (the period after "Dr" marks an abbreviation, not a sentence end)
- The price is $3.14. (one period inside a decimal number, plus a terminal period)
- U.S.A. was founded... (initials and an ellipsis)
- "Really?" she asked. (a question mark inside quotation marks, followed by a lowercase continuation)
- He said (see Fig. 3) that... (an abbreviation inside parentheses)
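The failure of the naive heuristic is easy to reproduce. The sketch below is only an illustration of the "terminator, space, capital letter" rule described above; the names `NAIVE_BOUNDARY` and `naive_split` are made up for this example.

```python
import re

# Naive rule: a boundary is any whitespace run that follows ., ! or ?
# and precedes an ASCII capital letter.
NAIVE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def naive_split(text: str) -> list[str]:
    """Split text with the naive heuristic (illustration only)."""
    return NAIVE_BOUNDARY.split(text)


sample = "Dr. Smith arrived. The price is $3.14. He left."
for piece in naive_split(sample):
    print(repr(piece))
# 'Dr.'
# 'Smith arrived.'
# 'The price is $3.14.'
# 'He left.'
```

The decimal point survives because no space follows it, but the abbreviation "Dr." is cut off from the name it introduces.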
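The UAX #29 sentence boundary rules address many of these cases by examining the context around potential boundary characters, using character properties and look-ahead/look-behind to disambiguate. They carry no abbreviation list, however, so a case like Dr. Smith still needs tailored rules or a trained model.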
UAX #29 Sentence Boundary Algorithm
The algorithm assigns characters to sentence break properties and applies rules to find boundaries. Key properties:
| Property | Examples | Role |
|---|---|---|
| STerm | ! ? | Sentence terminators |
| ATerm | . | Ambiguous terminator: may end a sentence or mark an abbreviation |
| Upper | A–Z | Uppercase letters (help detect abbreviations) |
| Lower | a–z | Lowercase letters |
| Close | ) " ' | Closing punctuation after a terminator |
| Sp | spaces | Post-terminal spacing |
| SContinue | , ; : | Continuation characters (no break here) |
| Sep | newlines | Paragraph separators |
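To make the table concrete, here is a rough classifier that maps a single character to one of the properties above. It is an approximation built on `unicodedata` general categories and the example characters from the table; the authoritative assignments live in the Unicode Character Database file SentenceBreakProperty.txt, and the function name `sentence_break_class` is an assumption made for this sketch.

```python
import unicodedata


def sentence_break_class(ch: str) -> str:
    """Return an approximate sentence-break property for one character."""
    if ch == ".":
        return "ATerm"
    if ch in "!?":
        return "STerm"
    if ch in ",;:":
        return "SContinue"
    if ch in "\u0085\u2028\u2029":
        return "Sep"
    category = unicodedata.category(ch)
    # Closing brackets and quotes; the real property covers more characters.
    if category in ("Pe", "Pf") or ch in "\"'":
        return "Close"
    if category == "Zs":
        return "Sp"
    if category == "Lu":
        return "Upper"
    if category == "Ll":
        return "Lower"
    return "Other"


print([sentence_break_class(c) for c in 'Hi. "Ok?"'])
```

A production segmenter would read the property data file rather than guess from general categories, since the two do not line up exactly.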
The critical distinction between STerm and ATerm: ! and ? are STerm, and they almost always end sentences. The period is ATerm: it may end a sentence or be part of an abbreviation or decimal number. The algorithm resolves a period by looking at what surrounds it:
- If an ATerm (.) is followed directly by a digit, as in 3.14, there is no break.
- If an ATerm sits between a letter and an uppercase letter with no space, as in U.S.A., there is no break.
- If an ATerm is followed, possibly after spaces, by a lowercase letter, it is probably an abbreviation: no break.
- If an ATerm is followed by space(s) then an uppercase letter, it is treated as a sentence end: boundary. The default rules cannot tell that Dr. is an abbreviation, so Dr. Smith is split here unless the rules are tailored.
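The follow-up rules for a period can be sketched as a small decision function. This is a simplified reading of the bullets above, not the full UAX #29 rule set: it ignores Close characters, SContinue, and paragraph separators, and the name `breaks_after_period` is hypothetical.

```python
def breaks_after_period(text: str, i: int) -> bool:
    """Decide whether a sentence boundary follows the '.' at text[i].

    Simplified sketch of the ATerm checks; not a full UAX #29 implementation.
    """
    if text[i] != ".":
        raise ValueError("text[i] must be a period")
    rest = text[i + 1:]
    if not rest:
        return True  # end of text always closes the sentence
    nxt = rest[0]
    if nxt.isdigit():
        return False  # decimal number such as 3.14
    if nxt.isupper() and i > 0 and text[i - 1].isalpha():
        return False  # initials such as U.S.A.
    after_spaces = rest.lstrip(" ")
    if after_spaces and after_spaces[0].islower():
        return False  # lowercase continuation, probably an abbreviation
    # Break only if at least one space separated the period from what follows.
    return len(after_spaces) < len(rest)


text = "Dr. Smith paid 3.14 in the U.S.A. today."
breaks = [i for i, ch in enumerate(text) if ch == "." and breaks_after_period(text, i)]
print(breaks)  # [2, 39]
```

Offset 2 is the period in "Dr.", which shows the limitation: without an abbreviation list, the rule for "period, space, uppercase" fires even though no sentence ends there.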
Python Sentence Segmentation
# Using ICU via PyICU, which implements the UAX #29 rules
from icu import BreakIterator, Locale
text = "Dr. Smith arrived at 3.14 PM. He said hello. U.S.A. was mentioned."
bi = BreakIterator.createSentenceInstance(Locale("en_US"))
bi.setText(text)
start = 0
for end in bi:  # iterating yields each boundary offset in turn
    sentence = text[start:end]
    print(repr(sentence))
    start = end
# The default rules have no abbreviation list, so expect a break after "Dr. " too:
# 'Dr. '
# 'Smith arrived at 3.14 PM. '
# 'He said hello. '
# 'U.S.A. was mentioned.'
# Using NLTK for NLP-quality sentence tokenization
# (needs the Punkt data: nltk.download("punkt"), or "punkt_tab" on newer releases)
import nltk
sentences = nltk.sent_tokenize(text)
# NLTK uses a Punkt model trained on corpus data
# Usually more accurate for abbreviations than pure rule-based UAX #29
NLP Applications
Sentence boundary detection (also called sentence segmentation or SBD) is a prerequisite for many NLP tasks:
- Machine translation (translation is sentence-level)
- Summarization (sentence scoring)
- Named entity recognition (context window)
- Sentiment analysis (per-sentence or per-review)
For high-accuracy NLP, trained models (like NLTK's Punkt or spaCy's trainable senter component) typically outperform pure rule-based UAX #29 because they learn abbreviations from data. spaCy's sentencizer, by contrast, is itself a simple punctuation rule. UAX #29 provides a principled baseline that works without training data.
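The value of an abbreviation list can be shown with a small post-processing step over rule-based output. The sketch below re-joins segments that end in a known abbreviation; the `ABBREVIATIONS` set is a hand-written stand-in for what a trained model would learn, the input segments are hard-coded for illustration, and `merge_abbreviation_splits` is a hypothetical helper rather than part of ICU, NLTK, or spaCy.

```python
# Hypothetical, hand-written list; a trained model would learn these from data.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "fig.", "etc."}


def merge_abbreviation_splits(segments: list[str]) -> list[str]:
    """Re-join segments whose last word is a known abbreviation."""
    merged: list[str] = []
    carry = ""
    for segment in segments:
        candidate = carry + segment
        words = candidate.split()
        if words and words[-1].lower() in ABBREVIATIONS:
            carry = candidate  # false boundary: keep accumulating
        else:
            merged.append(candidate)
            carry = ""
    if carry:
        merged.append(carry)  # the text ended on an abbreviation
    return merged


# Segments as a rule-based segmenter without an abbreviation list might emit them.
raw = ["Dr. ", "Smith arrived at 3.14 PM. ", "He said hello. "]
print(merge_abbreviation_splits(raw))
# ['Dr. Smith arrived at 3.14 PM. ', 'He said hello. ']
```

A fixed list has its own failure mode: a sentence that really does end in "etc." would be glued to the next one, which is the kind of ambiguity that trained models resolve from context.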
Quick Facts
| Property | Value |
|---|---|
| Specification | UAX #29, Section 5 (Sentence Boundaries) |
| Key challenge | Disambiguating period as abbreviation vs. sentence terminator |
| ATerm | . — may or may not end a sentence |
| STerm | ! ? — almost always end a sentence |
| Python ICU | BreakIterator.createSentenceInstance(Locale("en_US")) |
| NLP alternative | NLTK Punkt, spaCy senter (trained on corpus data) |
| Practical note | No algorithm is 100% accurate — ambiguity is inherent |