문장 경계
Embed This Widget
Add the script tag and a data attribute to embed this widget.
Embed via iframe for maximum compatibility.
<iframe src="https://unicodefyi.com/iframe/glossary/sentence-boundary/" width="420" height="400" frameborder="0" style="border:0;border-radius:10px;max-width:100%" loading="lazy"></iframe>
Paste this URL in WordPress, Medium, or any oEmbed-compatible platform.
https://unicodefyi.com/glossary/sentence-boundary/
Add a dynamic SVG badge to your README or docs.
[](https://unicodefyi.com/glossary/sentence-boundary/)
Use the native HTML custom element.
유니코드 규칙에 따른 문장 사이의 위치. 마침표로만 분리하는 것보다 복잡하며, 약어(Mr.), 생략 부호(...), 소수점(3.14) 등을 처리합니다.
The Ambiguous Period Problem
Detecting sentence boundaries sounds simple: look for ., !, or ? followed by a space and a capital letter. But this heuristic fails immediately in real text. Consider:
Dr. Smith arrived.— period after "Dr" is an abbreviation, not a sentence endThe price is $3.14.— period in a decimal number, plus a terminal periodU.S.A. was founded...— initials and ellipsis"Really?" she asked.— period inside quotation marksHe said (see Fig. 3) that...— parenthetical abbreviation
UAX #29 Sentence Boundary rules handle these cases by examining context around potential boundary characters, using character properties and look-ahead/look-behind to disambiguate.
UAX #29 Sentence Boundary Algorithm
The algorithm assigns characters to sentence break properties and applies rules to find boundaries. Key properties:
| Property | Examples | Role |
|---|---|---|
| STerm | . ! ? |
Potential sentence terminators |
| ATerm | . |
Terminator that may be an abbreviation |
| Upper | A–Z | Uppercase letters (help detect abbreviations) |
| Lower | a–z | Lowercase letters |
| Close | ) " ' |
Closing punctuation after terminal |
| Sp | spaces | Post-terminal spacing |
| SContinue | , ; : |
Continuation characters (no break here) |
| Sep | newlines | Paragraph separators |
The critical distinction between STerm and ATerm: ! and ? are STerm only — they almost always end sentences. . is ATerm — it may end a sentence OR be part of an abbreviation or decimal number. The algorithm uses follow-up rules:
- If an ATerm (
.) is followed by an uppercase letter, it could be an abbreviation (Dr. Smith) or the start of a new sentence — context matters. - If an ATerm is followed by a lowercase letter, it is probably an abbreviation — no break.
- If an ATerm is followed by space(s) then an uppercase letter, it is probably a sentence end — boundary.
Python Sentence Segmentation
# Using ICU (most accurate UAX #29 compliance)
from icu import BreakIterator, Locale
text = "Dr. Smith arrived at 3.14 PM. He said hello. U.S.A. was mentioned."
bi = BreakIterator.createSentenceInstance(Locale("en_US"))
bi.setText(text)
start = 0
for end in bi:
sentence = text[start:end]
print(repr(sentence))
start = end
# "Dr. Smith arrived at 3.14 PM. "
# "He said hello. "
# "U.S.A. was mentioned."
# Using nltk for NLP-quality sentence tokenization
import nltk
sentences = nltk.sent_tokenize(text)
# NLTK uses a Punkt model trained on corpus data
# More accurate for abbreviations than pure rule-based UAX #29
NLP Applications
Sentence boundary detection (also called sentence segmentation or SBD) is a prerequisite for many NLP tasks: - Machine translation (translation is sentence-level) - Summarization (sentence scoring) - Named entity recognition (context window) - Sentiment analysis (per-sentence or per-review)
For high-accuracy NLP, corpus-trained models (like NLTK's Punkt or spaCy's sentencizer) typically outperform pure rule-based UAX #29 because they learn abbreviation lists from training data. UAX #29 provides a principled baseline that works without training data.
Quick Facts
| Property | Value |
|---|---|
| Specification | UAX #29, Section 5 (Sentence Boundaries) |
| Key challenge | Disambiguating period as abbreviation vs. sentence terminator |
| ATerm | . — may or may not end a sentence |
| STerm | ! ? — almost always end a sentence |
| Python ICU | BreakIterator.createSentenceInstance(Locale("en_US")) |
| NLP alternative | NLTK Punkt, spaCy sentencizer (corpus-trained) |
| Practical note | No algorithm is 100% accurate — ambiguity is inherent |
관련 용어
알고리즘의 더 많은 용어
Mapping characters to a common case form for case-insensitive comparison. More comprehensive …
Rules (UAX#29) for determining where one user-perceived character ends and another begins. …
정규화 형식 C: 분해 후 정규 재합성하여 가장 짧은 형식을 생성합니다. 데이터 …
정규화 형식 D: 재합성 없이 완전히 분해합니다. macOS HFS+ 파일 시스템에서 사용됩니다. …
정규화 형식 KC: 호환 분해 후 정규 합성. 시각적으로 유사한 문자를 통합합니다(fi→fi, …
정규화 형식 KD: 재합성 없이 호환 분해. 가장 강력한 정규화 방식으로 서식 …
Comparing Unicode strings requires normalization (NFC/NFD) and optionally collation (locale-aware sorting). Binary …
유니코드 단어 경계 규칙에 따라 결정된 단어 사이의 위치. 단순히 공백으로 분리하는 …
문자 양방향 범주와 명시적 방향 재정의를 사용하여 혼합 방향 텍스트(예: 영어 + …
유니코드 텍스트를 표준 정규 형식으로 변환하는 과정. 네 가지 형식: NFC(합성), NFD(분해), …