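Sentence Boundary
The position between two sentences, as determined by Unicode rules. Finding it is more complex than splitting on periods: the rules have to deal with abbreviations (Mr.), ellipses (...), decimal points (3.14), and similar cases.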
The Ambiguous Period Problem
Detecting sentence boundaries sounds simple: look for ., !, or ? followed by a space and a capital letter. But this heuristic fails immediately in real text. Consider:
- Dr. Smith arrived. (the first period marks the abbreviation "Dr", not a sentence end)
- The price is $3.14. (one period inside a decimal number, plus a terminal period)
- U.S.A. was founded... (initials and an ellipsis)
- "Really?" she asked. (a question mark inside quotation marks, followed by a continuation of the same sentence)
- He said (see Fig. 3) that... (an abbreviation inside a parenthetical)
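To make the failure concrete, here is a minimal sketch of the naive heuristic as a regular expression. The names `NAIVE_BOUNDARY` and `naive_split` are made up for this illustration and do not come from any library.

```python
import re

# Naive rule: a terminator (., !, ?), then whitespace, then an uppercase letter.
NAIVE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def naive_split(text):
    """Split text wherever the naive heuristic sees a sentence boundary."""
    return NAIVE_BOUNDARY.split(text)


print(naive_split("Dr. Smith arrived. The price is $3.14."))
# ['Dr.', 'Smith arrived.', 'The price is $3.14.']
```

The abbreviation "Dr." is wrongly split off as its own sentence, while the decimal point survives only because no space follows it.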
UAX #29 Sentence Boundary rules handle many of these cases by examining the context around potential boundary characters, using character properties and look-ahead/look-behind to disambiguate.
UAX #29 Sentence Boundary Algorithm
The algorithm assigns characters to sentence break properties and applies rules to find boundaries. Key properties:
| Property | Examples | Role |
|---|---|---|
| STerm | ! ? | Sentence terminators |
| ATerm | . | Terminator that may be an abbreviation |
| Upper | A–Z | Uppercase letters (help detect abbreviations) |
| Lower | a–z | Lowercase letters |
| Close | ) " ' | Closing punctuation after terminal |
| Sp | spaces | Post-terminal spacing |
| SContinue | , ; : | Continuation characters (no break here) |
| Sep | U+0085, U+2028, U+2029 | Line and paragraph separators (CR and LF have their own properties) |
The critical distinction between STerm and ATerm: ! and ? are STerm only — they almost always end sentences. . is ATerm — it may end a sentence OR be part of an abbreviation or decimal number. The algorithm uses follow-up rules:
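- If an ATerm (.) is directly followed by a digit (3.14), or sits between a letter and an uppercase letter (U.S.A.), there is no break.
- If an ATerm is followed by a lowercase letter, optionally after closing punctuation and spaces, it is probably an abbreviation: no break.
- If an ATerm is followed by space(s) then an uppercase letter, it is treated as a sentence end: boundary. The rules alone cannot tell Dr. Smith from a genuine sentence end here, so knowledge beyond UAX #29 (such as an abbreviation list) is needed for that case.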
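The sketch below approximates a few of these rules (modeled loosely on SB6, SB7, SB8, SB8a, and SB11 of UAX #29) for ASCII text only. It is a toy under simplifying assumptions, not the full algorithm: the character sets are hand-picked, and `toy_sentence_split` is a hypothetical name used only for this illustration.

```python
CLOSE = set(")\"'")        # Close: closing punctuation after a terminator
CONTINUE = set(",;:")      # SContinue: the sentence carries on
TERMINATORS = set(".!?")   # ATerm (.) plus STerm (! ?)


def toy_sentence_split(text):
    """Split ASCII text using a simplified subset of the UAX #29 rules."""
    sentences, start, i, n = [], 0, 0, len(text)
    while i < n:
        ch = text[i]
        if ch not in TERMINATORS:
            i += 1
            continue
        nxt = text[i + 1] if i + 1 < n else ""
        if ch == ".":
            # SB6-like: a period directly before a digit never breaks (3.14).
            if nxt.isdigit():
                i += 1
                continue
            # SB7-like: letter, period, uppercase letter never breaks (U.S.A.).
            if nxt.isupper() and i > 0 and text[i - 1].isalpha():
                i += 1
                continue
        # Skip Close* Sp* after the terminator.
        j = i + 1
        while j < n and text[j] in CLOSE:
            j += 1
        while j < n and text[j] == " ":
            j += 1
        follower = text[j] if j < n else ""
        # SB8-like: a period followed by a lowercase letter is not a break.
        if ch == "." and follower.islower():
            i += 1
            continue
        # SB8a-like: no break before a continuation mark or another terminator.
        if follower and (follower in CONTINUE or follower in TERMINATORS):
            i += 1
            continue
        # SB11-like: otherwise break after the terminator and trailing spaces.
        sentences.append(text[start:j])
        start = i = j
    if start < n:
        sentences.append(text[start:])
    return sentences


text = "Dr. Smith arrived at 3.14 PM. He said hello. U.S.A. was mentioned."
for sentence in toy_sentence_split(text):
    print(repr(sentence))
# 'Dr. '
# 'Smith arrived at 3.14 PM. '
# 'He said hello. '
# 'U.S.A. was mentioned.'
```

The decimal and the initials survive, but the sketch still breaks after "Dr. " because nothing in these rules distinguishes a title from a sentence-final word.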
Python Sentence Segmentation
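# Using ICU via the PyICU bindings (implements the UAX #29 rules)
from icu import BreakIterator, Locale

text = "Dr. Smith arrived at 3.14 PM. He said hello. U.S.A. was mentioned."
bi = BreakIterator.createSentenceInstance(Locale("en_US"))
bi.setText(text)

start = 0
for end in bi:  # iterating yields each boundary offset in turn
    sentence = text[start:end]
    print(repr(sentence))
    start = end
# Output depends on the ICU version and locale data. The default rules keep
# "3.14" and "U.S.A. was" intact, but they can still break after "Dr. ",
# because a period, a space, and an uppercase letter look like a sentence end.

# Using nltk for NLP-quality sentence tokenization
import nltk

# Requires the Punkt data to be downloaded first (resource name varies by NLTK version)
sentences = nltk.sent_tokenize(text)
# NLTK uses a Punkt model trained on corpus data
# Often more accurate for abbreviations than pure rule-based UAX #29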
NLP Applications
Sentence boundary detection (also called sentence segmentation or SBD) is a prerequisite for many NLP tasks:
- Machine translation (translation is sentence-level)
- Summarization (sentence scoring)
- Named entity recognition (context window)
- Sentiment analysis (per-sentence or per-review)
For high-accuracy NLP, corpus-trained models (like NLTK's Punkt or spaCy's trained senter component) typically outperform pure rule-based UAX #29 because they learn abbreviations and other boundary cues from training data. spaCy's sentencizer, by contrast, is a simple rule-based splitter. UAX #29 provides a principled baseline that works without training data.
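One way to see why abbreviation knowledge helps is to post-process the output of a rule-based splitter, re-joining any segment that ends in a known abbreviation with the segment that follows it. The sketch below does that with a small hand-written list. The list and the function name `merge_abbreviation_splits` are assumptions for illustration, and this is not how Punkt works internally.

```python
# Hypothetical abbreviation list; a trained model would learn such entries from a corpus.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "fig.", "etc."}


def merge_abbreviation_splits(segments, abbreviations=ABBREVIATIONS):
    """Re-join segments that were split right after a known abbreviation."""
    merged = []
    carry = ""  # text held back because it ended in an abbreviation
    for segment in segments:
        segment = carry + segment
        words = segment.split()
        last_word = words[-1].lower() if words else ""
        if last_word in abbreviations:
            carry = segment  # not a real sentence end; wait for the next segment
        else:
            merged.append(segment)
            carry = ""
    if carry:
        merged.append(carry)  # keep trailing text even if it ends in an abbreviation
    return merged


print(merge_abbreviation_splits(["Dr. ", "Smith arrived. ", "He left."]))
# ['Dr. Smith arrived. ', 'He left.']
```

A fixed list like this misses unseen abbreviations and wrongly merges sentences that really do end in one, which is the gap that corpus-trained models try to close.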
Quick Facts
| Property | Value |
|---|---|
| Specification | UAX #29, Section 5 (Sentence Boundaries) |
| Key challenge | Disambiguating period as abbreviation vs. sentence terminator |
| ATerm | . — may or may not end a sentence |
| STerm | ! ? — almost always end a sentence |
| Python ICU | BreakIterator.createSentenceInstance(Locale("en_US")) |
| NLP alternative | NLTK Punkt, spaCy senter (corpus-trained) |
| Practical note | No algorithm is 100% accurate — ambiguity is inherent |