Line Breaking Algorithm
Rules for determining where text may wrap to the next line, taking into account character properties, CJK word boundaries, and break opportunities.
Breaking Text Across Lines
When a paragraph of text is too wide for its container, a rendering engine must decide where to insert line breaks. This sounds straightforward but involves subtle rules that vary by script, language, and context. The Unicode Line Breaking Algorithm (UAX #14) provides a systematic framework for finding valid break opportunities.
The algorithm does not tell renderers where to break — that depends on line width and font metrics. Instead, it classifies each position between characters as a mandatory break, a break opportunity (the renderer may break here), or no break (breaking here is not allowed).
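To illustrate that division of labour, here is a minimal sketch of the renderer's side. It assumes the break opportunities have already been computed and are passed in as string offsets, and it measures width in characters rather than real font metrics. The function name `greedy_wrap` and its parameters are hypothetical, not part of any library.

```python
def greedy_wrap(text: str, opportunities: set[int], width: int) -> list[str]:
    """Greedy line filling: break at the last allowed offset that still fits.

    An offset i in `opportunities` means "a break is allowed before text[i]".
    Width is counted in characters, a stand-in for real font metrics.
    """
    lines = []
    start = 0
    while len(text) - start > width:
        candidates = sorted(i for i in opportunities if start < i < len(text))
        if not candidates:
            break  # no legal break left: the rest overflows as one line
        # Trailing spaces are ignored when measuring, as renderers usually do.
        fitting = [i for i in candidates if len(text[start:i].rstrip()) <= width]
        # If nothing fits, overflow up to the first legal break instead.
        end = fitting[-1] if fitting else candidates[0]
        lines.append(text[start:end].rstrip())
        start = end
    if start < len(text):
        lines.append(text[start:])
    return lines


# Breaks are allowed after each space (offsets 4, 10 and 16).
print(greedy_wrap("The quick brown fox", {4, 10, 16}, width=10))
```

The same set of opportunities yields different lines for different widths, which is why the algorithm reports opportunities rather than final break positions.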
Line Break Classes
Each Unicode character is assigned a line break class. There are over 40 classes; the most important are listed below:
| Class | Code | Examples | Behavior |
|---|---|---|---|
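| Space | SP | U+0020 SPACE | Break opportunity after a run of spaces |
| Hyphen | HY | U+002D HYPHEN-MINUS | Break opportunity after, except in numeric context |
| Break After | BA | U+2010 HYPHEN, U+00AD SOFT HYPHEN, U+2009 THIN SPACE | Break opportunity after |
| Break Before | BB | U+00B4 ACUTE ACCENT, dictionary punctuation | Break opportunity before |
| Alphabetic | AL | Latin, Cyrillic letters | No break between letters |
| Ideographic | ID | CJK unified ideographs | Break opportunity after each |
| Nonstarter | NS | 々, ・ (small kana such as っ, ょ are class CJ, treated as NS by default) | Cannot start a new line |
| Close Punctuation | CL | 。, 、, } (ASCII ) and ] are class CP, which behaves similarly) | No break before (attaches to preceding) |
| Open Punctuation | OP | (, [, { | No break after (attaches to following) |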
| Mandatory Break | BK | U+000C FORM FEED | Must break here |
| Carriage Return | CR | U+000D | Mandatory break (with LF handling) |
| Glue | GL | U+00A0 NO-BREAK SPACE | No break allowed |
The key insight for international text: CJK ideographic characters (Chinese, Japanese, Korean) have a break opportunity after almost every character because those scripts historically had no spaces. Latin letters have no break opportunities between them — only at space or hyphen positions.
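The contrast between Latin and CJK text can be sketched with a toy classifier. This is a deliberately tiny approximation, not an implementation of UAX #14: the `toy_class` lookup is a hypothetical stand-in for the real Line_Break property data, it knows only six classes, and it applies a handful of simplified pair rules.

```python
from enum import Enum


class Break(Enum):
    MANDATORY = "!"
    ALLOWED = "÷"
    PROHIBITED = "×"


def toy_class(ch: str) -> str:
    """Rough stand-in for the Line_Break property (NOT the real data)."""
    if ch in "\n\x0b\x0c\u2028\u2029":
        return "BK"  # LF is folded into BK here for simplicity
    if ch == " ":
        return "SP"
    if ch == "\u00a0":
        return "GL"
    if ch in "、。}":
        return "CL"
    if "\u4e00" <= ch <= "\u9fff":
        return "ID"  # only the main CJK Unified Ideographs block
    return "AL"      # everything else is treated as alphabetic


def classify(text: str) -> list[Break]:
    """One verdict per position between text[i] and text[i + 1]."""
    verdicts = []
    for before, after in zip(text, text[1:]):
        b, a = toy_class(before), toy_class(after)
        if b == "BK":
            verdicts.append(Break.MANDATORY)    # must break after BK
        elif a in ("BK", "SP", "CL"):
            verdicts.append(Break.PROHIBITED)   # never break before these
        elif b == "GL" or (a == "GL" and b != "SP"):
            verdicts.append(Break.PROHIBITED)   # glue sticks to its neighbours
        elif b == "SP":
            verdicts.append(Break.ALLOWED)      # break after spaces
        elif b == "AL" and a == "AL":
            verdicts.append(Break.PROHIBITED)   # keep words together
        else:
            verdicts.append(Break.ALLOWED)      # default, e.g. around ideographs
    return verdicts


print("".join(v.value for v in classify("AB 漢字。C")))
```

In the sample string the two Latin letters stay together, a break is allowed between the two ideographs, and no break is allowed before the ideographic full stop.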
Connection to CSS
The UAX #14 algorithm is the foundation for CSS line-breaking properties:
/* Allow a break between any two characters, even inside non-CJK words */
word-break: break-all;
/* Break long words that would overflow */
overflow-wrap: break-word;
/* Strict CJK line breaking (no breaks before small kana) */
line-break: strict;
/* Loose CJK line breaking (more break opportunities) */
line-break: loose;
/* Prevent breaking across lines entirely */
white-space: nowrap;
Python and Line Breaking
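Python's textwrap module wraps only at ASCII whitespace and after hyphens, which suits space-delimited text but not CJK. For UAX #14 behavior, use a library that implements the algorithm, such as PyICU (which exposes ICU's line-break iterator). uharfbuzz is a text-shaping binding and does not itself perform line breaking. Basic textwrap usage: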
import textwrap
# Basic wrapping (works for Latin text)
text = "The quick brown fox jumps over the lazy dog."
print(textwrap.fill(text, width=30))
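# For CJK text, textwrap treats a run of ideographs as one long word:
# it either leaves the run unbroken or cuts it at exactly `width` characters,
# ignoring UAX #14 rules. Use a proper implementation (for example PyICU)
# or rely on browser rendering.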
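The limitation is easy to reproduce with the standard library alone. The sketch below wraps a Japanese sentence that contains no spaces; the sample sentence and the width of 11 are arbitrary choices made for illustration.

```python
import textwrap

# "Japanese text does not put spaces between words." (23 characters)
cjk = "日本語のテキストは単語の間に空白を入れません。"

# Without spaces or hyphens, textwrap sees the whole string as one "word".
kept_whole = textwrap.wrap(cjk, width=11, break_long_words=False)

# With the default break_long_words=True it cuts every 11 characters,
# so the closing full stop is stranded at the start of the last line.
chopped = textwrap.wrap(cjk, width=11)

print(len(kept_whole))
for line in chopped:
    print(line)
```

A UAX #14 implementation would never start a line with 。 because that character belongs to the Close Punctuation class.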
Quick Facts
| Property | Value |
|---|---|
| Specification | Unicode Standard Annex #14 (UAX #14) |
| Number of line break classes | 43 as of Unicode 15.0 (later versions define more) |
| CJK default | Break opportunity after every ideograph |
| Latin default | No break except at spaces/hyphens |
| CSS connection | word-break, overflow-wrap, line-break |
| Mandatory break chars | U+000A LF, U+000B VT, U+000C FF, U+000D CR, U+0085 NEL, U+2028 LS, U+2029 PS |
| No-break space | U+00A0 has class GL (Glue) — prevents break at that position |
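As a rough cross-check of the mandatory break row, Python's built-in `str.splitlines()` splits on every character listed there. This is only an analogy and not a UAX #14 implementation: `splitlines()` also splits on the information separators U+001C to U+001E, which are not mandatory breaks in UAX #14.

```python
# One sample containing LF, CR, FF, NEL, LS, PS and a CR LF pair.
sample = "a\nb\rc\x0cd\x85e\u2028f\u2029g\r\nh"

parts = sample.splitlines()
print(parts)
```

Note that CR followed by LF counts as a single break, which matches the special CR LF handling mentioned in the class table.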