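Sentence Boundary
A position between sentences as determined by Unicode rules. Finding it is more complex than splitting on periods alone, because the rules must account for abbreviations (Mr.), ellipses (...), decimal points (3.14), and similar cases.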
The Ambiguous Period Problem
Detecting sentence boundaries sounds simple: look for ., !, or ? followed by a space and a capital letter. But this heuristic fails immediately in real text. Consider:
- `Dr. Smith arrived.`: the period after "Dr" marks an abbreviation, not a sentence end
- `The price is $3.14.`: a period inside a decimal number, plus a terminal period
- `U.S.A. was founded...`: initials and an ellipsis
- `"Really?" she asked.`: a question mark inside quotation marks, followed by more of the same sentence
- `He said (see Fig. 3) that...`: an abbreviation inside a parenthetical
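To make the failure concrete, here is a minimal sketch of the naive heuristic described above. The name `naive_split` and the regular expression are illustrative assumptions, not part of any standard.

```python
import re

# Naive heuristic: a sentence ends at ., !, or ? followed by whitespace
# and then an uppercase ASCII letter.
NAIVE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def naive_split(text):
    """Split text at every position the naive heuristic calls a boundary."""
    return NAIVE_BOUNDARY.split(text)

print(naive_split("Dr. Smith arrived. The price is $3.14."))
# ['Dr.', 'Smith arrived.', 'The price is $3.14.']
```

The decimal survives only because no space follows its period; the abbreviation "Dr." is split off as if it were a sentence of its own.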
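UAX #29 sentence boundary rules handle many of these cases by examining the context around potential boundary characters, using character properties and look-ahead/look-behind to disambiguate. An abbreviation followed by a space and a capitalized word (Dr. Smith) stays ambiguous under the rules alone.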
UAX #29 Sentence Boundary Algorithm
The algorithm assigns characters to sentence break properties and applies rules to find boundaries. Key properties:
| Property | Examples | Role |
|---|---|---|
| STerm | ! ? | Sentence terminators |
| ATerm | . | Terminator that may be part of an abbreviation |
| Upper | A–Z | Uppercase letters (help detect abbreviations) |
| Lower | a–z | Lowercase letters |
| Close | ) " ' | Closing punctuation after a terminator |
| Sp | spaces | Post-terminal spacing |
| SContinue | , ; : | Continuation characters (no break here) |
| Sep | newlines | Paragraph separators |
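As a rough illustration of the property assignment, the sketch below maps single characters to the class names in the table. It is an approximation built on `unicodedata.category`; the authoritative values come from the Unicode data file SentenceBreakProperty.txt, and the function name `sentence_break_class` is a hypothetical helper.

```python
import unicodedata

def sentence_break_class(ch):
    """Return an approximate Sentence_Break class name for one character."""
    if ch == ".":
        return "ATerm"
    if ch in "!?":
        return "STerm"
    if ch in ",;:":
        return "SContinue"
    if ch in "\n\r\u0085\u2028\u2029":
        return "Sep"
    category = unicodedata.category(ch)
    if category == "Zs" or ch == "\t":
        return "Sp"
    if category == "Lu":
        return "Upper"
    if category == "Ll":
        return "Lower"
    # Brackets and quotes; ASCII quotes are category Po, so list them explicitly.
    if category in ("Ps", "Pe", "Pi", "Pf") or ch in "\"'":
        return "Close"
    return "Other"

print([sentence_break_class(c) for c in "A.) b"])
# ['Upper', 'ATerm', 'Close', 'Sp', 'Lower']
```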
The critical distinction between STerm and ATerm: ! and ? are STerm only — they almost always end sentences. . is ATerm — it may end a sentence OR be part of an abbreviation or decimal number. The algorithm uses follow-up rules:
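- If an ATerm (.) is immediately followed by an uppercase letter or a digit, with no space in between (U.S.A., 3.14), it is treated as part of an abbreviation or number: no break.
- If an ATerm is followed by a lowercase letter, even after spaces, it is probably an abbreviation: no break.
- If an ATerm is followed by space(s) then an uppercase letter, it is treated as a sentence end: boundary. This is also what happens for "Dr. Smith", so the rules alone cannot tell an abbreviation from a real sentence end here.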
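The follow-up rules can be sketched in plain Python. This is a deliberately simplified approximation, not the full UAX #29 rule set: it handles ASCII terminators only, and `split_sentences`, `TERMINATORS`, and `CLOSERS` are names invented for this example.

```python
TERMINATORS = ".!?"   # "." plays the ATerm role; "!" and "?" the STerm role
CLOSERS = ")\"'"      # closing punctuation that may trail a terminator

def split_sentences(text):
    """Split text using a simplified form of the ATerm follow-up rules."""
    sentences = []
    start = 0
    i = 0
    n = len(text)
    while i < n:
        if text[i] not in TERMINATORS:
            i += 1
            continue
        # Consume a run of terminators ("..." or "?!").
        j = i
        while j < n and text[j] in TERMINATORS:
            j += 1
        # Skip closing punctuation, then spaces, to find the next character.
        m = j
        while m < n and text[m] in CLOSERS:
            m += 1
        while m < n and text[m] == " ":
            m += 1
        if m >= n:
            break  # terminator at end of text; the tail is added below
        nxt = text[m]
        is_aterm = text[j - 1] == "."
        glued = m == j  # nothing between the period and the next character
        # No break inside "3.14" or "U.S.A", or before a lowercase letter.
        if is_aterm and ((glued and nxt.isalnum()) or nxt.islower()):
            i = j
            continue
        sentences.append(text[start:m])
        start = m
        i = m
    if start < n:
        sentences.append(text[start:])
    return sentences

print(split_sentences("The price is $3.14. U.S.A. was founded... Really?"))
# ['The price is $3.14. ', 'U.S.A. was founded... ', 'Really?']
print(split_sentences("Dr. Smith arrived."))
# ['Dr. ', 'Smith arrived.']
```

The second call shows the limitation: a period followed by a space and an uppercase letter is always treated as a boundary, so "Dr." is split off.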
Python Sentence Segmentation
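# Using ICU (PyICU), whose sentence rules closely follow UAX #29
from icu import BreakIterator, Locale

text = "Dr. Smith arrived at 3.14 PM. He said hello. U.S.A. was mentioned."
bi = BreakIterator.createSentenceInstance(Locale("en_US"))
bi.setText(text)

start = 0
for end in bi:  # iterating yields successive boundary offsets
    sentence = text[start:end]
    print(repr(sentence))
    start = end

# The decimal in "3.14" and the initialism "U.S.A." do not cause breaks.
# Plain UAX #29 rules do place a boundary after "Dr. " (period, space,
# uppercase letter), so it may print as its own segment unless abbreviation
# suppression is enabled (e.g. Locale("en_US@ss=standard") in recent ICU).
# Exact output depends on the ICU version and its locale data.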
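# Using NLTK for NLP-oriented sentence tokenization
import nltk

# Needs the Punkt data once: nltk.download("punkt")
# (newer NLTK releases may ask for "punkt_tab" instead)
sentences = nltk.sent_tokenize(text)
# NLTK uses a Punkt model trained on corpus data
# Often more accurate for abbreviations than pure rule-based UAX #29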
NLP Applications
Sentence boundary detection (also called sentence segmentation or SBD) is a prerequisite for many NLP tasks:
- Machine translation (translation is sentence-level)
- Summarization (sentence scoring)
- Named entity recognition (context window)
- Sentiment analysis (per-sentence or per-review)
For high-accuracy NLP, corpus-trained models (like NLTK's Punkt or spaCy's trained senter and dependency parser components) typically outperform pure rule-based UAX #29 because they learn abbreviations and other cues from training data. spaCy's sentencizer, by contrast, is a simple rule-based splitter. UAX #29 provides a principled baseline that works without training data.
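A small sketch shows why abbreviation knowledge matters: given segments from any rule-based splitter, it merges a segment back into the next one when it ends in a known abbreviation. The abbreviation set here is hand-written and hypothetical, standing in for what a trained model would learn from a corpus, and `merge_abbreviation_splits` is an invented name.

```python
# Hypothetical abbreviation list; a trained model would learn this from data.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "fig.", "vs."}

def merge_abbreviation_splits(segments, abbreviations=ABBREVIATIONS):
    """Rejoin segments that a splitter cut right after a known abbreviation."""
    merged = []
    carry = ""
    for segment in segments:
        candidate = carry + segment
        words = candidate.split()
        last_word = words[-1].lower() if words else ""
        if last_word in abbreviations:
            carry = candidate  # hold it and glue the next segment on
        else:
            merged.append(candidate)
            carry = ""
    if carry:
        merged.append(carry)
    return merged

segments = ["Dr. ", "Smith arrived. ", "See Fig. ", "3 for details."]
print(merge_abbreviation_splits(segments))
# ['Dr. Smith arrived. ', 'See Fig. 3 for details.']
```

A fixed list has the opposite failure mode as well: a sentence that really ends in a listed abbreviation is merged with the one after it, which is the kind of case trained models handle by weighing context.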
Quick Facts
| Property | Value |
|---|---|
| Specification | UAX #29, Section 5 (Sentence Boundaries) |
| Key challenge | Disambiguating period as abbreviation vs. sentence terminator |
| ATerm | . — may or may not end a sentence |
| STerm | ! ? — almost always end a sentence |
| Python ICU | BreakIterator.createSentenceInstance(Locale("en_US")) |
| NLP alternative | NLTK Punkt, spaCy senter or dependency parser (corpus-trained) |
| Practical note | No algorithm is 100% accurate — ambiguity is inherent |