Algorithms

Sentence Boundary

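A position between sentences as determined by Unicode rules. More involved than splitting on periods: the rules account for ellipses (...), decimal points (3.14), and some abbreviation patterns, although abbreviations such as Mr. generally need tailoring.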


The Ambiguous Period Problem

Detecting sentence boundaries sounds simple: look for ., !, or ? followed by a space and a capital letter. But this heuristic fails immediately in real text. Consider:

  • Dr. Smith arrived. (the period after "Dr" marks an abbreviation, not a sentence end)
  • The price is $3.14. (a period inside a decimal number, plus a terminal period)
  • U.S.A. was founded... (initials and an ellipsis)
  • "Really?" she asked. (a question mark inside quotation marks, followed by a lowercase word, so the sentence continues)
  • He said (see Fig. 3) that... (an abbreviation inside parentheses)
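To make the failure concrete, here is a minimal sketch of the naive heuristic described above. The function name naive_split and the sample sentence are illustrative assumptions, not part of any library.

```python
import re

def naive_split(text):
    """Split after ., ! or ? when whitespace and an ASCII capital follow."""
    # Lookbehind keeps the terminator with its sentence; the whitespace is dropped.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

sample = "Dr. Smith arrived. The price is $3.14. It rose."
print(naive_split(sample))
# ['Dr.', 'Smith arrived.', 'The price is $3.14.', 'It rose.']
```

The decimal point survives only because no whitespace follows it; the abbreviation "Dr." is split off as if it were a sentence of its own.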

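The UAX #29 sentence boundary rules handle many of these cases by examining the context around potential boundary characters, using character properties and look-ahead/look-behind to disambiguate. Abbreviations followed by a capitalized word, such as "Dr. Smith", are the main exception: the default rules break there unless they are tailored.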

UAX #29 Sentence Boundary Algorithm

The algorithm assigns characters to sentence break properties and applies rules to find boundaries. Key properties:

Property    Examples                Role
STerm       ! ?                     Sentence terminators
ATerm       .                       Terminator that may instead mark an abbreviation or decimal point
Upper       A–Z                     Uppercase letters (help detect abbreviations)
Lower       a–z                     Lowercase letters
Close       ) " '                   Closing punctuation after a terminator
Sp          spaces                  Post-terminal spacing
SContinue   , ; :                   Continuation characters (no break here)
Sep         U+0085, U+2028, U+2029  Line and paragraph separators (CR and LF have their own classes)
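Python's standard unicodedata module does not expose the Sentence_Break property, so the sketch below approximates the classes for ASCII input only. The function name sb_class and the character sets are assumptions made for illustration; the real property values come from the Unicode Character Database and cover far more characters.

```python
def sb_class(ch):
    """Approximate Sentence_Break class of one character (ASCII-only sketch)."""
    if ch == ".":
        return "ATerm"       # may or may not end a sentence
    if ch in "!?":
        return "STerm"       # almost always ends a sentence
    if ch in ",;:":
        return "SContinue"   # sentence continues after these
    if ch in ")\"'":
        return "Close"       # closing punctuation after a terminator
    if ch in " \t":
        return "Sp"
    if ch.isdigit():
        return "Numeric"
    if ch.isupper():
        return "Upper"
    if ch.islower():
        return "Lower"
    return "Other"

print([sb_class(c) for c in "Dr. 3!"])
# ['Upper', 'Lower', 'ATerm', 'Sp', 'Numeric', 'STerm']
```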

The critical distinction between STerm and ATerm: ! and ? are STerm only — they almost always end sentences. . is ATerm — it may end a sentence OR be part of an abbreviation or decimal number. The algorithm uses follow-up rules:

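  1. If an ATerm (.) sits between a letter and an immediately following uppercase letter, as in U.S.A., there is no break. The same holds when a digit follows directly, as in 3.14.
  2. If an ATerm is followed, after optional closing punctuation and spaces, by a lowercase letter, it is probably an abbreviation: no break.
  3. If an ATerm is followed by space(s) and then an uppercase letter, the rules treat it as a sentence end and place the boundary after the spaces. This is also why the untailored rules break after "Dr. " in "Dr. Smith".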
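The context checks above can be approximated in a few lines. This is a sketch under stated assumptions, not the full algorithm: it handles ASCII text only, looks only at periods, and is loosely modeled on UAX #29 rules SB6, SB7, SB8, SB8a, and SB11. The function name aterm_breaks is hypothetical.

```python
def aterm_breaks(text):
    """Offsets where a simplified ATerm rule set breaks after a period."""
    breaks = []
    n = len(text)
    for i, ch in enumerate(text):
        if ch != ".":
            continue
        j = i + 1
        # Like SB6: a digit directly after the period (3.14) means no break.
        if j < n and text[j].isdigit():
            continue
        # Like SB7: letter, period, uppercase letter (U.S.) means no break.
        if i > 0 and text[i - 1].isalpha() and j < n and text[j].isupper():
            continue
        # Closing punctuation and spaces stay attached to the sentence.
        while j < n and text[j] in ")\"'":
            j += 1
        while j < n and text[j] == " ":
            j += 1
        # Like SB8a: another terminator or a continuation mark means no break.
        if j < n and text[j] in ".!?,;:":
            continue
        # Like SB8: if the next letter is lowercase, the sentence continues.
        k = j
        while k < n and not text[k].isalpha() and text[k] not in ".!?\n":
            k += 1
        if k < n and text[k].islower():
            continue
        breaks.append(j)  # Like SB11: otherwise break after the spaces.
    return breaks

sample = "Dr. Smith paid 3.14 to U.S. officials. They left."
print(aterm_breaks(sample))
# [4, 39, 49]
```

The first offset, 4, is the unwanted break after "Dr. ": nothing in the character context distinguishes it from a real sentence end, which is why abbreviation handling needs tailoring or a trained model.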

Python Sentence Segmentation

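# Using ICU via PyICU (implements the UAX #29 rules)
from icu import BreakIterator, Locale

text = "Dr. Smith arrived at 3.14 PM. He said hello. U.S.A. was mentioned."
bi = BreakIterator.createSentenceInstance(Locale("en_US"))
bi.setText(text)

# Iterating yields each boundary offset after the start of the text.
# ICU offsets count UTF-16 code units; they match str indices for BMP-only text.
start = 0
for end in bi:
    sentence = text[start:end]
    print(repr(sentence))
    start = end
# With the default (untailored) rules the expected segments are:
# 'Dr. '
# 'Smith arrived at 3.14 PM. '
# 'He said hello. '
# 'U.S.A. was mentioned.'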

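# Using NLTK for NLP-oriented sentence tokenization
import nltk
# Requires the Punkt data; depending on the NLTK version, fetch it once with
# nltk.download("punkt") or nltk.download("punkt_tab").
sentences = nltk.sent_tokenize(text)
# NLTK uses a Punkt model trained on corpus data, which usually handles
# abbreviations such as "Dr." better than the untailored UAX #29 rules.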

NLP Applications

Sentence boundary detection (also called sentence segmentation or SBD) is a prerequisite for many NLP tasks:

  • Machine translation (translation is sentence-level)
  • Summarization (sentence scoring)
  • Named entity recognition (context window)
  • Sentiment analysis (per-sentence or per-review)

For high-accuracy NLP, trained models (such as NLTK's Punkt or spaCy's senter and dependency parser) typically outperform the untailored UAX #29 rules, largely because they learn abbreviations from data. Note that spaCy's sentencizer component, despite the similar name, is rule-based rather than trained. UAX #29 provides a principled baseline that works without training data.
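One lightweight way to narrow the gap without a trained model is to post-process rule-based segments with an abbreviation list. The sketch below is an illustration only: the function name merge_abbreviations and the hand-picked list are assumptions, and a trained model such as Punkt learns its abbreviations from data rather than from a fixed set.

```python
# Hypothetical, hand-picked list; real systems learn or curate far larger ones.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "fig.", "etc."}

def merge_abbreviations(segments, abbreviations=ABBREVIATIONS):
    """Rejoin segments that were split right after a known abbreviation."""
    merged = []
    carry = ""
    for seg in segments:
        carry += seg
        words = carry.split()
        last_word = words[-1].lower() if words else ""
        if last_word in abbreviations:
            continue  # segment ends in an abbreviation: keep accumulating
        merged.append(carry)
        carry = ""
    if carry:
        merged.append(carry)  # trailing text that ended in an abbreviation
    return merged

segments = ["Dr. ", "Smith met Mr. ", "Jones. ", "They left."]
print(merge_abbreviations(segments))
# ['Dr. Smith met Mr. Jones. ', 'They left.']
```

The trade-off is the usual one for fixed lists: a sentence that genuinely ends in "etc." would be merged with the next one.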

Quick Facts

Property         Value
Specification    UAX #29, Section 5 (Sentence Boundaries)
Key challenge    Disambiguating the period as abbreviation vs. sentence terminator
ATerm            . (may or may not end a sentence)
STerm            ! ? (almost always end a sentence)
Python           ICU BreakIterator.createSentenceInstance(Locale("en_US"))
NLP alternative  NLTK Punkt, spaCy senter or dependency parser (trained)
Practical note   No algorithm is 100% accurate; ambiguity is inherent

Related Terms

More in Algorithms

Case Folding

Mapping characters to a common case form for case-insensitive comparison. More comprehensive …

Grapheme Cluster Boundary

Rules (UAX#29) for determining where one user-perceived character ends and another begins. …

NFC (Canonical Composition)

Normalization Form C: decomposition followed by canonical recomposition, yielding the shortest form. Recommended …

NFD (Canonical Decomposition)

Normalization Form D: full decomposition without recomposition. Used by the macOS HFS+ file system. …

NFKC (Compatibility Composition)

Normalization Form KC: compatibility decomposition followed by canonical composition. Unifies visually similar …

NFKD (Compatibility Decomposition)

Normalization Form KD: compatibility decomposition without recomposition. The most aggressive normalization, with maximal …

String Comparison

Comparing Unicode strings requires normalization (NFC/NFD) and optionally collation (locale-aware sorting). Binary …

Line Breaking Algorithm

Rules for determining where text may wrap to the next line, taking character properties into account, …

Collation Algorithm

The standard algorithm for comparing and sorting Unicode strings using multi-level comparison: base character …

Word Boundary

A position between words according to Unicode rules. Not simply splitting on spaces; …