What is Text Segmentation?

Algorithms for finding boundaries in text: grapheme cluster, word, and sentence boundaries. Essential for cursor movement, text selection, and text processing.

What is Word Boundary?

The position between words according to the Unicode word-break rules. More complex than simply splitting on spaces: it must correctly handle CJK text (which has no spaces), contractions, and numbers.

Algorithms

Sentence Boundary

The position between sentences according to the Unicode rules. More complex than simply splitting on periods: it must account for abbreviations (Mr.), ellipses (...), and decimal numbers (3.14).

2022-11-14 · Updated 2024-09-15

The Ambiguous Period Problem

Detecting sentence boundaries sounds simple: look for ., !, or ? followed by a space and a capital letter. But this heuristic fails immediately in real text. Consider:

Dr. Smith arrived. : the period after "Dr" marks an abbreviation, not a sentence end
The price is $3.14. : one period inside a decimal number, plus a terminal period
U.S.A. was founded... : initials and an ellipsis
"Really?" she asked. : a question mark inside quotation marks, followed by a lowercase word
He said (see Fig. 3) that... : an abbreviation inside parentheses
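To make the failure concrete, here is a minimal sketch of the naive heuristic described above, using only the standard library. The function name `naive_split` is illustrative and not part of any library.

```python
import re

def naive_split(text):
    """Split after '.', '!' or '?' when followed by whitespace and a capital letter."""
    return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)

print(naive_split("Dr. Smith arrived. He paid $3.14."))
# ['Dr.', 'Smith arrived.', 'He paid $3.14.']
```

The decimal number happens to survive because no space follows its period, but the abbreviation "Dr." is wrongly treated as a complete sentence.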

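The UAX #29 Sentence Boundary rules handle many of these cases by examining the context around potential boundary characters, using character properties and look-ahead/look-behind to disambiguate. They do not contain an abbreviation list, however, so an abbreviation followed by a space and a capital letter (Dr. Smith) is still treated as a sentence end unless the rules are tailored.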

UAX #29 Sentence Boundary Algorithm

The algorithm assigns characters to sentence break properties and applies rules to find boundaries. Key properties:

Property	Examples	Role
STerm	`!` `?`	Sentence terminators
ATerm	`.`	Terminator that may instead mark an abbreviation or a decimal point
Upper	A–Z	Uppercase letters (help detect abbreviations)
Lower	a–z	Lowercase letters
Close	`)` `"` `'`	Closing punctuation after terminal
Sp	spaces	Post-terminal spacing
SContinue	`,` `;` `:`	Continuation characters (no break here)
Sep	newlines	Paragraph separators
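As a rough illustration of the table above, the sketch below maps a character to one of these property names. It is a simplified approximation covering only the examples listed here (plus a `Numeric` class for digits, and CR/LF folded into `Sep`); the authoritative assignments come from the Unicode Character Database, and the function name `sb_property` is hypothetical.

```python
def sb_property(ch):
    """Approximate the sentence break property of a single character."""
    if ch == ".":
        return "ATerm"
    if ch in "!?":
        return "STerm"
    if ch in ",;:":
        return "SContinue"
    if ch in ")\"'":
        return "Close"
    # Checked before the generic whitespace test below.
    if ch in "\r\n\u0085\u2028\u2029":
        return "Sep"
    if ch.isspace():
        return "Sp"
    if ch.isdigit():
        return "Numeric"
    if ch.isupper():
        return "Upper"
    if ch.islower():
        return "Lower"
    return "Other"

print([sb_property(c) for c in "Dr. S"])
# ['Upper', 'Lower', 'ATerm', 'Sp', 'Upper']
```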

The critical distinction between STerm and ATerm: ! and ? are STerm only — they almost always end sentences. . is ATerm — it may end a sentence OR be part of an abbreviation or decimal number. The algorithm uses follow-up rules:

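If an ATerm is immediately followed by a digit (3.14), it is part of a number: no break.
If an ATerm sits between letters, with an uppercase letter directly after it (U.S.A.), it is treated as part of an abbreviation: no break.
If an ATerm is followed, after optional closing punctuation and spaces, by a lowercase letter, it is probably an abbreviation: no break.
If an ATerm is followed by space(s) and then an uppercase letter, it is treated as a sentence end: boundary. The default rules therefore also break after "Dr." in "Dr. Smith"; suppressing that break requires an abbreviation list or other tailoring.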
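The follow-up rules can be sketched as a small decision function. This is a simplified, ASCII-oriented approximation rather than a conforming UAX #29 implementation. It also covers the decimal-number and ellipsis cases mentioned earlier, and the name `period_ends_sentence` is hypothetical.

```python
def period_ends_sentence(text, i):
    """Return True if the '.' at text[i] is treated as a sentence end (simplified)."""
    if text[i] != ".":
        raise ValueError("text[i] must be a period")
    j = i + 1
    if j >= len(text):
        return True                      # end of text
    if text[j].isdigit():
        return False                     # decimal number such as 3.14
    if text[j].isupper() and i > 0 and text[i - 1].isalpha():
        return False                     # letter, period, uppercase: U.S.A.
    # Skip closing punctuation, then spaces, before inspecting what follows.
    while j < len(text) and text[j] in ")\"'":
        j += 1
    while j < len(text) and text[j] == " ":
        j += 1
    if j < len(text) and text[j] in ".!?,;:":
        return False                     # ellipsis or continuation punctuation
    if j < len(text) and text[j].islower():
        return False                     # lowercase follows: likely an abbreviation
    return True                          # otherwise treat it as a boundary

sample = "Dr. Smith paid 3.14 in the U.S.A. today. Done."
for i, ch in enumerate(sample):
    if ch == ".":
        print(i, period_ends_sentence(sample, i))
```

In this sketch the period after "Dr" is reported as a sentence end, which mirrors the limitation of rules that carry no abbreviation list.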

Python Sentence Segmentation

# Using ICU (most accurate UAX #29 compliance)
from icu import BreakIterator, Locale

text = "Dr. Smith arrived at 3.14 PM. He said hello. U.S.A. was mentioned."
bi = BreakIterator.createSentenceInstance(Locale("en_US"))
bi.setText(text)

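start = 0
for end in bi:
    sentence = text[start:end]
    print(repr(sentence))
    start = end
# The default rules carry no abbreviation list, so a break after "Dr. " is
# expected; "3.14" and "U.S.A. was" stay intact. Typical output (may vary
# with the ICU version and any locale tailoring):
# 'Dr. '
# 'Smith arrived at 3.14 PM. '
# 'He said hello. '
# 'U.S.A. was mentioned.'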

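# Using NLTK for NLP-quality sentence tokenization
# Requires the Punkt data, e.g. nltk.download("punkt")
# (recent NLTK releases use "punkt_tab" instead)
import nltk
sentences = nltk.sent_tokenize(text)
# NLTK uses a Punkt model trained on corpus data, which typically
# handles abbreviations better than the default UAX #29 rules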

NLP Applications

Sentence boundary detection (also called sentence segmentation or SBD) is a prerequisite for many NLP tasks:

- Machine translation (translation is sentence-level)
- Summarization (sentence scoring)
- Named entity recognition (context window)
- Sentiment analysis (per-sentence or per-review)

For high-accuracy NLP, corpus-trained models (like NLTK's Punkt or spaCy's trained senter and parser components) typically outperform the default UAX #29 rules because they learn abbreviations and other cues from training data. spaCy's sentencizer component, by contrast, is a simple rule-based splitter. UAX #29 provides a principled baseline that works without training data.
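To illustrate why an abbreviation list helps, the sketch below post-processes a naive split by merging any piece that ends in a known abbreviation with the piece that follows it. The list is hand-written and hypothetical. Punkt derives such information from a corpus instead, and this is not how NLTK or spaCy are implemented.

```python
import re

# Hypothetical, hand-written list for illustration only.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "fig.", "vs."}

def split_with_abbreviations(text):
    """Naive split on terminators, then re-join pieces that end in an abbreviation."""
    pieces = re.split(r'(?<=[.!?])\s+(?=[A-Z0-9"(])', text)
    sentences = []
    for piece in pieces:
        previous_words = sentences[-1].split() if sentences else []
        if previous_words and previous_words[-1].lower() in ABBREVIATIONS:
            sentences[-1] += " " + piece   # previous piece ended in an abbreviation
        else:
            sentences.append(piece)
    return sentences

print(split_with_abbreviations("Dr. Smith arrived. See Fig. 3 for details. He left."))
# ['Dr. Smith arrived.', 'See Fig. 3 for details.', 'He left.']
```

A fixed list like this misses abbreviations it has never seen, which is the gap that corpus-trained models aim to close.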

Quick Facts

Property	Value
Specification	UAX #29, Section 5 (Sentence Boundaries)
Key challenge	Disambiguating period as abbreviation vs. sentence terminator
ATerm	`.` — may or may not end a sentence
STerm	`!` `?` — almost always end a sentence
Python ICU	`BreakIterator.createSentenceInstance(Locale("en_US"))`
NLP alternative	NLTK Punkt, spaCy senter or parser (corpus-trained)
Practical note	No algorithm is 100% accurate — ambiguity is inherent

Related Terms

Text Segmentation, Word Boundary
