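What is Unicode text segmentation?

Algorithms for finding boundaries in text: grapheme cluster, word, and sentence boundaries. They are essential for cursor movement, text selection, and text processing.

What is a word boundary?

A position between words, as determined by the Unicode word boundary rules. Rather than simply splitting on whitespace, the rules account for CJK text (no spaces), contractions, and numbers.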

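Algorithm

Sentence Boundary

A position between sentences, as determined by the Unicode rules. Finding it is more complex than splitting on periods alone, because abbreviations (Mr.), ellipses (...), decimal points (3.14), and similar cases all have to be considered.

2022-11-14 · Updated 2024-09-15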

The Ambiguous Period Problem

Detecting sentence boundaries sounds simple: look for ., !, or ? followed by a space and a capital letter. But this heuristic fails immediately in real text. Consider:

Dr. Smith arrived. (the period after "Dr" marks an abbreviation, not a sentence end)
The price is $3.14. (a period inside a decimal number, plus a terminal period)
U.S.A. was founded... (initials and an ellipsis)
"Really?" she asked. (a question mark inside quotation marks, followed by a lowercase word)
He said (see Fig. 3) that... (an abbreviation inside parentheses)
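The naive heuristic described above can be sketched in a few lines to show where it goes wrong. This is only an illustration: the names `NAIVE_BOUNDARY` and `naive_split` are made up for this example and do not come from any library.

```python
import re

# Naive rule: split on whitespace that follows . ! or ? and precedes a capital letter.
NAIVE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def naive_split(text: str) -> list[str]:
    """Split text with the naive 'terminator + space + capital' heuristic."""
    return NAIVE_BOUNDARY.split(text)

print(naive_split("Dr. Smith arrived. He left."))
# ['Dr.', 'Smith arrived.', 'He left.']
```

The abbreviation "Dr." is cut off as if it were a sentence of its own, which is exactly the failure the examples above describe.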

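The UAX #29 sentence boundary rules resolve several of these cases by examining the context around each potential boundary character, using character properties together with look-ahead and look-behind. Cases that depend on vocabulary, such as "Dr." followed by a capitalized name, are beyond the default rules and need tailoring or an abbreviation list.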

UAX #29 Sentence Boundary Algorithm

The algorithm assigns characters to sentence break properties and applies rules to find boundaries. Key properties:

Property	Examples	Role
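STerm	`!` `?`	Sentence terminators that are rarely ambiguous
ATerm	`.`	Terminator that may instead mark an abbreviation or a decimal point
Numeric	0–9	Digits (a period directly before a digit is not a break)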
Upper	A–Z	Uppercase letters (help detect abbreviations)
Lower	a–z	Lowercase letters
Close	`)` `"` `'`	Closing punctuation after terminal
Sp	spaces	Post-terminal spacing
SContinue	`,` `;` `:`	Continuation characters (no break here)
Sep	newlines	Paragraph separators

The critical distinction is between STerm and ATerm. ! and ? are STerm: they almost always end sentences. . is ATerm: it may end a sentence, or it may belong to an abbreviation or a decimal number. For an ATerm the algorithm applies follow-up rules:

If an ATerm (.) is immediately followed by a digit, there is no break (3.14).
If an ATerm sits directly between a letter and an uppercase letter, with no space, there is no break (U.S.A.).
If an ATerm is followed by optional closing punctuation and spaces and then a lowercase letter, it is probably an abbreviation: no break.
If an ATerm is followed by space(s) and then an uppercase letter, it is treated as a sentence end: boundary. This is why the default rules also break after "Dr." in "Dr. Smith"; recognizing that abbreviation requires extra data.
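The follow-up rules above can be approximated in plain Python. The sketch below is a deliberately simplified, ASCII-only illustration, not the full UAX #29 algorithm: it ignores paragraph separators, combining marks, and most of the property table, and the names `sentence_breaks` and `split_sentences` are hypothetical.

```python
TERMINATORS = ".!?"   # '.' plays the ATerm role, '!' and '?' the STerm role
CLOSE = ")\"'"        # closing punctuation allowed after a terminator
CONTINUE = ",;:-"     # continuation characters: no break before these

def sentence_breaks(text: str) -> list[int]:
    """Return the offsets where a sentence ends (simplified, ASCII only)."""
    breaks = []
    n = len(text)
    for i, ch in enumerate(text):
        if ch not in TERMINATORS:
            continue
        # Skip closing punctuation, then spaces, after the terminator.
        j = i + 1
        while j < n and text[j] in CLOSE:
            j += 1
        while j < n and text[j] == " ":
            j += 1
        if j == n:
            continue  # end of text is added as the final boundary below
        nxt = text[j]
        if nxt in TERMINATORS or nxt in CONTINUE:
            continue  # e.g. "..." or "?!" or ".," stay together
        if ch == ".":
            if j == i + 1 and nxt.isdigit():
                continue  # decimal number such as 3.14
            if j == i + 1 and nxt.isupper() and i > 0 and text[i - 1].isalpha():
                continue  # initials such as U.S.A.
            # Look past non-letters; a lowercase letter means "probably an abbreviation".
            k = j
            while k < n and not (text[k].isalpha() or text[k] in TERMINATORS):
                k += 1
            if k < n and text[k].islower():
                continue
        breaks.append(j)
    breaks.append(n)
    return breaks

def split_sentences(text: str) -> list[str]:
    """Slice the text at the offsets reported by sentence_breaks."""
    start, out = 0, []
    for end in sentence_breaks(text):
        out.append(text[start:end])
        start = end
    return out

print(split_sentences("Dr. Smith arrived at 3.14 PM. He said hello. U.S.A. was mentioned."))
# ['Dr. ', 'Smith arrived at 3.14 PM. ', 'He said hello. ', 'U.S.A. was mentioned.']
```

The decimal point and the initials are kept together, while "Dr. " is still split off, matching the last rule in the list: a period followed by a space and an uppercase letter counts as a sentence end.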

Python Sentence Segmentation

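# Using ICU via PyICU (pip install PyICU), which implements the UAX #29 rules
from icu import BreakIterator, Locale

text = "Dr. Smith arrived at 3.14 PM. He said hello. U.S.A. was mentioned."
bi = BreakIterator.createSentenceInstance(Locale("en_US"))
bi.setText(text)

# Iterating the BreakIterator yields each boundary offset after the start.
# Offsets count UTF-16 code units; they match Python indices for BMP-only text.
start = 0
for end in bi:
    sentence = text[start:end]
    print(repr(sentence))
    start = end
# Expected under the default UAX #29 rules (no abbreviation suppression):
# 'Dr. '
# 'Smith arrived at 3.14 PM. '
# 'He said hello. '
# 'U.S.A. was mentioned.'
# The break after "Dr." comes from the ATerm + space + uppercase rule.

# Using NLTK for NLP-quality sentence tokenization
# Needs the Punkt data once: nltk.download("punkt") ("punkt_tab" on newer NLTK releases)
import nltk
sentences = nltk.sent_tokenize(text)
# NLTK uses a Punkt model trained on corpus data
# It usually handles abbreviations such as "Dr." better than pure rule-based UAX #29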

NLP Applications

Sentence boundary detection (also called sentence segmentation or SBD) is a prerequisite for many NLP tasks:
- Machine translation (translation is sentence-level)
- Summarization (sentence scoring)
- Named entity recognition (context window)
- Sentiment analysis (per-sentence or per-review)

For high-accuracy NLP, corpus-trained models (like NLTK's Punkt or spaCy's trainable senter component) typically outperform pure rule-based UAX #29 because they learn abbreviations and other cues from training data. spaCy's sentencizer, by contrast, is itself a simple rule-based punctuation splitter. UAX #29 provides a principled baseline that works without training data.
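To illustrate why an abbreviation list helps, the sketch below post-processes the output of a rule-based segmenter and re-joins pieces that were split after a known abbreviation. It is a toy example under stated assumptions: the `ABBREVIATIONS` set is hand-written here rather than learned from a corpus, and `merge_abbreviation_splits` is a hypothetical helper, not part of ICU, NLTK, or spaCy.

```python
# Hand-written stand-in for the abbreviation list a trained model would learn.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "fig."}

def merge_abbreviation_splits(segments: list[str]) -> list[str]:
    """Re-join segments when the previous one ends in a known abbreviation."""
    merged: list[str] = []
    for seg in segments:
        words = merged[-1].split() if merged else []
        if words and words[-1].lower() in ABBREVIATIONS:
            merged[-1] += seg  # the "sentence end" was really an abbreviation
        else:
            merged.append(seg)
    return merged

print(merge_abbreviation_splits(["Dr. ", "Smith arrived. ", "He left."]))
# ['Dr. Smith arrived. ', 'He left.']
```

A fixed list like this is brittle: it misses abbreviations it has never seen and wrongly merges real sentences that happen to end in one, which is the gap that corpus-trained models try to close.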

Quick Facts

Property	Value
Specification	UAX #29, Section 5 (Sentence Boundaries)
Key challenge	Disambiguating period as abbreviation vs. sentence terminator
ATerm	`.` — may or may not end a sentence
STerm	`!` `?` — almost always end a sentence
Python ICU	`BreakIterator.createSentenceInstance(Locale("en_US"))`
NLP alternative	NLTK Punkt, spaCy senter (corpus-trained)
Practical note	No algorithm is 100% accurate — ambiguity is inherent