What is text segmentation?

Algorithms for finding boundaries in text: grapheme cluster, word, and sentence boundaries. Essential for cursor movement, text selection, and text processing.

What is a word boundary?

The position between words as determined by the Unicode word-break rules. It is not a simple split on spaces: it also has to cover CJK text (which has no spaces), contractions, and numbers.

Sentence Boundary

The position between sentences as determined by the Unicode rules. More complex than simply splitting on periods, because of abbreviations (Mr.), ellipses (...), and decimal points (3.14).

2022-11-14 · Updated 2024-09-15

The Ambiguous Period Problem

Detecting sentence boundaries sounds simple: look for ., !, or ? followed by a space and a capital letter. But this heuristic fails immediately in real text. Consider:

Dr. Smith arrived. (the period after "Dr" marks an abbreviation, not a sentence end)
The price is $3.14. (a period inside a decimal number, plus a terminal period)
U.S.A. was founded... (initials and an ellipsis)
"Really?" she asked. (a question mark inside quotation marks, followed by a lowercase continuation)
He said (see Fig. 3) that... (an abbreviation inside a parenthetical)
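
To make the failure concrete, here is a minimal sketch of the naive heuristic as a regular expression. The names NAIVE_BOUNDARY and naive_split are illustrative, not part of any library, and the pattern is only one plausible reading of "terminator, space, capital letter".

```python
import re

# Naive rule: a terminator (. ! ?), then whitespace, then an ASCII uppercase letter.
NAIVE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def naive_split(text: str) -> list[str]:
    """Split text wherever the naive heuristic sees a sentence boundary."""
    return NAIVE_BOUNDARY.split(text)


# False positive: the abbreviation "Dr." is treated as a sentence end.
print(naive_split("Dr. Smith arrived."))
# False negative: the closing quote hides the period from the pattern.
print(naive_split('She said "Stop." Then she left.'))
```

The first call splits after "Dr." and the second returns the text unsplit, so the same simple pattern both over-segments and under-segments.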

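The UAX #29 sentence boundary rules address many of these cases by examining the context around potential boundary characters, using character properties together with look-ahead and look-behind to disambiguate. They do not consult an abbreviation list, so a case such as "Dr. Smith" is still split under the default rules.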

UAX #29 Sentence Boundary Algorithm

The algorithm assigns characters to sentence break properties and applies rules to find boundaries. Key properties:

Property	Examples	Role
STerm	`!` `?`	Sentence terminators that are not ambiguous with abbreviations
ATerm	`.`	Terminator that may instead mark an abbreviation or a decimal point
Upper	A–Z	Uppercase letters (help detect abbreviations)
Lower	a–z	Lowercase letters
Close	`)` `"` `'`	Closing punctuation after terminal
Sp	spaces	Post-terminal spacing
SContinue	`,` `;` `:`	Continuation characters (no break here)
Sep	newlines	Paragraph separators
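
As a rough illustration of the table, the sketch below assigns one of these property names to a single character. It is an approximation built on general categories from Python's unicodedata module; the real assignments come from the Unicode data file for the Sentence_Break property, and the function name sentence_break_property is an assumption made for this example.

```python
import unicodedata


def sentence_break_property(ch: str) -> str:
    """Approximate the sentence break property of one character (illustrative only)."""
    if ch in "\n\r\u0085\u2028\u2029":
        return "Sep"
    if ch == ".":
        return "ATerm"
    if ch in "!?":
        return "STerm"
    if ch in ",;:":
        return "SContinue"
    category = unicodedata.category(ch)
    if category == "Zs" or ch == "\t":
        return "Sp"
    # Approximation: closing brackets, final quotes, and the ASCII quote marks.
    if category in ("Pe", "Pf") or ch in "\"'":
        return "Close"
    if category == "Lu":
        return "Upper"
    if category == "Ll":
        return "Lower"
    return "Other"


print([sentence_break_property(ch) for ch in "Dr. A!"])
```

A production implementation would read the property from the Unicode Character Database (or rely on ICU) rather than hard-coding characters as this sketch does.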

The critical distinction is between STerm and ATerm. The characters ! and ? are STerm: they almost always end sentences. The period (.) is ATerm: it may end a sentence or be part of an abbreviation or decimal number. The algorithm applies follow-up rules to the context around an ATerm:

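If an ATerm (.) is followed directly by a digit, as in 3.14, it is a decimal point: no break.
If an ATerm sits between a letter and an uppercase letter with no space, as in U.S.A., it is treated as part of a run of initials: no break.
If an ATerm is followed by a lowercase letter, even after spaces, it is probably an abbreviation: no break.
If an ATerm is followed by space(s) and then an uppercase letter, it is treated as a sentence end: boundary. This is also why "Dr. Smith" is split, since the rules cannot tell the abbreviation from a real sentence end.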
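
The follow-up rules can be sketched as a small predicate. This is a simplified, ASCII-only reading of the ATerm handling, not the full UAX #29 rule set, and period_ends_sentence is a hypothetical helper name.

```python
def period_ends_sentence(text: str, i: int) -> bool:
    """Return True if the period at text[i] is followed by a sentence boundary.

    Simplified, ASCII-only sketch of the ATerm rules; not the full UAX #29 rule set.
    """
    if text[i] != ".":
        raise ValueError("text[i] must be a period")
    following = text[i + 1 : i + 2]  # empty string at the end of the text
    # Period directly followed by a digit ("3.14"): decimal point, no break.
    if following.isdigit():
        return False
    # Letter, period, uppercase letter with no gap ("U.S.A"): initials, no break.
    if following.isupper() and i > 0 and text[i - 1].isalpha():
        return False
    # Skip closing punctuation and then spaces after the period.
    j = i + 1
    while j < len(text) and text[j] in ")\"'":
        j += 1
    while j < len(text) and text[j] == " ":
        j += 1
    if j == len(text):
        return True  # end of text closes the sentence
    # Lowercase, another terminator, or a continuation mark comes next: no break.
    if text[j].islower() or text[j] in ".!?,;:":
        return False
    return True  # for example: period, space(s), uppercase letter


sample = "He paid 3.14 for it. Dr. Smith arrived. U.S.A. was mentioned."
print([period_ends_sentence(sample, i) for i, ch in enumerate(sample) if ch == "."])
```

Note that the period after "Dr" is reported as a boundary: without an abbreviation list, context alone cannot separate it from a genuine sentence end.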

Python Sentence Segmentation

# Using ICU (most accurate UAX #29 compliance)
from icu import BreakIterator, Locale

text = "Dr. Smith arrived at 3.14 PM. He said hello. U.S.A. was mentioned."
bi = BreakIterator.createSentenceInstance(Locale("en_US"))
bi.setText(text)

start = 0
for end in bi:
    sentence = text[start:end]
    print(repr(sentence))
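    start = end
# Per the default UAX #29 rules, which carry no abbreviation list, expect:
# 'Dr. '
# 'Smith arrived at 3.14 PM. '
# 'He said hello. '
# 'U.S.A. was mentioned.'
# "Dr. " is split off because a period, a space, and an uppercase letter form a boundary.
# ICU can suppress such breaks using CLDR abbreviation data, for example through a
# locale such as "en_US@ss=standard"; check this against your ICU version.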

# Using NLTK for NLP-quality sentence tokenization
import nltk
nltk.download("punkt")  # one-time model download; newer NLTK releases use "punkt_tab"
sentences = nltk.sent_tokenize(text)
# NLTK uses a Punkt model trained on corpus data
# Usually more accurate for abbreviations than pure rule-based UAX #29

NLP Applications

Sentence boundary detection (also called sentence segmentation or SBD) is a prerequisite for many NLP tasks:
- Machine translation (translation is sentence-level)
- Summarization (sentence scoring)
- Named entity recognition (context window)
- Sentiment analysis (per-sentence or per-review)

For high-accuracy NLP, corpus-trained models (like NLTK's Punkt or spaCy's trainable senter component) typically outperform pure rule-based UAX #29 because they learn abbreviations and other boundary cues from training data. spaCy's sentencizer, by contrast, is a simple rule-based splitter. UAX #29 provides a principled baseline that works without training data.
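
One way to see why an abbreviation list helps is to post-process rule-based output by hand. The sketch below rejoins segments that were cut right after a known abbreviation; the list KNOWN_ABBREVIATIONS is a small hypothetical sample written for this example (a trained model such as Punkt derives its own from data), and merge_abbreviation_splits is an illustrative name.

```python
# Hypothetical sample list; a trained model would learn these from a corpus.
KNOWN_ABBREVIATIONS = {"dr.", "mr.", "mrs.", "fig.", "e.g.", "i.e."}


def merge_abbreviation_splits(segments: list[str]) -> list[str]:
    """Rejoin segments that a rule-based splitter cut right after an abbreviation."""
    merged: list[str] = []
    carry = ""
    for segment in segments:
        candidate = carry + segment
        words = candidate.split()
        last_word = words[-1].lower() if words else ""
        if last_word in KNOWN_ABBREVIATIONS:
            carry = candidate  # hold it and glue it to the next segment
        else:
            merged.append(candidate)
            carry = ""
    if carry:
        merged.append(carry)  # trailing abbreviation with nothing after it
    return merged


rule_based = ["Dr. ", "Smith arrived at 3.14 PM. ", "He said hello. "]
print(merge_abbreviation_splits(rule_based))
```

This keeps the rule-based pass as the baseline and layers language-specific knowledge on top, which is roughly the trade-off between UAX #29 and trained models described above.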

Quick Facts

Property	Value
Specification	UAX #29, Section 5 (Sentence Boundaries)
Key challenge	Disambiguating period as abbreviation vs. sentence terminator
ATerm	`.` — may or may not end a sentence
STerm	`!` `?` — almost always end a sentence
Python ICU	`BreakIterator.createSentenceInstance(Locale("en_US"))`
NLP alternative	NLTK Punkt, spaCy senter (corpus-trained)
Practical note	No algorithm is 100% accurate — ambiguity is inherent

Related Terms

Text segmentation, Word boundary
