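Unicode Line Breaking Algorithm
Rules that determine where text may wrap to the next line, taking into account character properties, CJK word boundaries, and break opportunities.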
Breaking Text Across Lines
When a paragraph of text is too wide for its container, a rendering engine must decide where to insert line breaks. This sounds straightforward but involves subtle rules that vary by script, language, and context. The Unicode Line Breaking Algorithm (UAX #14) provides a systematic framework for finding valid break opportunities.
The algorithm does not tell renderers where to break — that depends on line width and font metrics. Instead, it classifies each position between characters as a mandatory break, a break opportunity (the renderer may break here), or no break (breaking here is not allowed).
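To make the division of labor concrete, here is a minimal sketch of the renderer's side: given break opportunities that some UAX #14 implementation has already computed, it fills each line greedily. The function name `greedy_wrap` and the use of code-point counts as "width" are assumptions for illustration; a real renderer measures glyph advances from font metrics and usually lets trailing spaces hang past the edge.

```python
def greedy_wrap(text, opportunities, width):
    """Split text at the last allowed position that keeps each line within width.

    opportunities: string indices where a break is allowed (break falls before
    text[index]). width is counted in code points, which is a simplification.
    """
    lines, start = [], 0
    allowed = sorted(set(opportunities))
    while len(text) - start > width:
        # Allowed positions that keep the current line within the width.
        fitting = [p for p in allowed if start < p <= start + width]
        if fitting:
            cut = fitting[-1]
        else:
            # Nothing fits: overflow up to the next opportunity (or the end).
            later = [p for p in allowed if p > start + width]
            cut = later[0] if later else len(text)
        lines.append(text[start:cut])
        start = cut
    if start < len(text):
        lines.append(text[start:])
    return lines


# Opportunities after each space, supplied by hand for this example.
print(greedy_wrap("aaa bbb ccc", [4, 8], width=5))
```

The same list of opportunities yields different lines for different widths, which is why the algorithm itself stays silent about where the break actually lands.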
Line Break Classes
Each Unicode character is assigned a line break class. There are over 40 classes; the most important are:
| Class | Code | Examples | Behavior |
|---|---|---|---|
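| Space | SP | U+0020 SPACE | Break opportunity after |
| Hyphen | HY | U+002D HYPHEN-MINUS | Break opportunity after (except in numeric contexts) |
| Break After | BA | U+2010 HYPHEN, U+00AD SOFT HYPHEN, tab, en space | Break opportunity after |
| Break Before | BB | U+00B4 ACUTE ACCENT, U+02C8 stress mark (dictionary use) | Break opportunity before |
| Break Before and After | B2 | U+2014 EM DASH | Break opportunity before and after |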
| Alphabetic | AL | Latin, Cyrillic letters | No break between letters |
| Ideographic | ID | CJK unified ideographs | Break opportunity after each |
| Nonstarter | NS | Iteration marks (々, ゝ); Japanese small kana (っ, ょ) are class CJ, which is treated as NS by default | Cannot start a new line |
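| Close Punctuation | CL | }, 」, 。 | No break before (attaches to preceding) |
| Close Parenthesis | CP | ), ] | No break before (attaches to preceding) |
| Open Punctuation | OP | (, [, 「 | No break after (attaches to following) |
| Quotation | QU | «, », " | Can open or close; by default no break before or after |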
| Mandatory Break | BK | U+000C FORM FEED, U+2028 LINE SEPARATOR | Break required after |
| Carriage Return | CR | U+000D | Break required after, except between CR and LF |
| Glue | GL | U+00A0 NO-BREAK SPACE | No break before or after |
The key insight for international text: CJK ideographs have a break opportunity after almost every character, because Chinese and Japanese are written without spaces between words. Latin letters have no break opportunities between them; breaks are allowed only after spaces or hyphens.
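The pairwise nature of these rules can be sketched with a toy classifier. This is not the real algorithm: `toy_class` and `toy_breaks` are hypothetical helpers, the class assignments cover only a handful of characters, and only a few pair rules are modeled (no combining marks, punctuation, or numeric contexts). A real implementation reads the classes from the Unicode data files.

```python
def toy_class(ch):
    """Map a character to one of six classes (hand-picked subset, illustration only)."""
    if ch in "\n\x0b\x0c\u2028\u2029":
        return "BK"  # LF has its own class in UAX #14; folded into BK here
    if ch == " ":
        return "SP"
    if ch == "\u00a0":
        return "GL"
    if ch in "\t\u2010":
        return "BA"
    if "\u4e00" <= ch <= "\u9fff":
        return "ID"
    return "AL"  # everything else is treated as alphabetic in this sketch


def toy_breaks(text):
    """Return (index, kind) for positions where a break is required or allowed."""
    found = []
    for i in range(1, len(text)):
        before, after = toy_class(text[i - 1]), toy_class(text[i])
        if before == "BK":
            kind = "mandatory"
        elif after in ("BK", "SP"):
            kind = None  # never break before a hard break or a space
        elif before == "GL":
            kind = None  # glue binds to what follows
        elif after == "GL" and before not in ("SP", "BA"):
            kind = None  # glue binds to what precedes
        elif before == "SP":
            kind = "opportunity"  # break after spaces
        elif after == "BA":
            kind = None  # a break-after character stays with the preceding text
        elif before == "AL" and after == "AL":
            kind = None  # no break inside a run of letters
        else:
            kind = "opportunity"  # default: break allowed (e.g. between ideographs)
        if kind:
            found.append((i, kind))
    return found


print(toy_breaks("ab 漢字"))
print(toy_breaks("a\u00a0b"))
```

In the first string the only positions reported are after the space and between the two ideographs; in the second, the no-break space suppresses every break.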
Connection to CSS
The UAX #14 algorithm is the foundation for CSS line-breaking properties:
/* Allow a break between any two characters, even inside non-CJK words */
word-break: break-all;
/* Break long words that would overflow */
overflow-wrap: break-word;
/* Strict CJK line breaking (no breaks before small kana) */
line-break: strict;
/* Loose CJK line breaking (more break opportunities) */
line-break: loose;
/* Prevent breaking across lines entirely */
white-space: nowrap;
Python and Line Breaking
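Python's textwrap module wraps at ASCII whitespace and after hyphens, which works for space-delimited text but does not implement UAX #14. For UAX #14-based break opportunities, a library such as PyICU (bindings to ICU's line break iterator) is needed. The standard-library behavior looks like this: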
import textwrap
# Basic wrapping (works for Latin text)
text = "The quick brown fox jumps over the lazy dog."
print(textwrap.fill(text, width=30))
# For CJK text, textwrap treats a run of ideographs as one long word:
# it splits the run only when it exceeds the width, and then at an
# arbitrary character. Use a UAX #14 implementation instead
# (PyICU, or browser rendering).
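To see concretely how textwrap handles text without spaces, the following sketch wraps a short Japanese phrase. It uses only the standard library; the sample phrase and the tiny width are chosen purely for illustration.

```python
import textwrap

text = "ちょっと待って"

# The phrase contains no ASCII whitespace, so textwrap sees a single "word".
# With the default break_long_words=True it is cut every `width` code points,
# with no regard for which characters may start a line.
lines = textwrap.wrap(text, width=2)
print(lines)

# With break_long_words=False the run is never split and overflows the width.
unbroken = textwrap.wrap(text, width=2, break_long_words=False)
print(unbroken)
```

The second line produced by the first call begins with the small kana っ, which is exactly the kind of break that UAX #14 nonstarter rules (and CSS `line-break: strict`) are designed to prevent.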
Quick Facts
| Property | Value |
|---|---|
| Specification | Unicode Standard Annex #14 (UAX #14) |
| Number of line break classes | More than 40 (the exact count depends on the Unicode version) |
| CJK default | Break opportunity after every ideograph |
| Latin default | No break except at spaces/hyphens |
| CSS connection | word-break, overflow-wrap, line-break |
| Mandatory break chars | U+000A LF, U+000B VT, U+000C FF, U+000D CR, U+0085 NEL, U+2028 LS, U+2029 PS |
| No-break space | U+00A0 has class GL (Glue) — prevents break at that position |
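As a quick check of the mandatory break characters, Python's built-in `str.splitlines()` can be used as a rough stand-in. This is only an approximation of UAX #14 hard breaks: `splitlines()` recognizes a superset of these characters (it also splits at U+001C to U+001E, for example), and it knows nothing about break opportunities.

```python
# One sample string containing several hard line terminators.
sample = "a\nb\rc\r\nd\x0ce\x85f\u2028g\u2029h"

# CR LF counts as a single terminator, so no empty piece appears between them.
parts = sample.splitlines()
print(parts)
```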