Line Breaking Algorithm
Rules that determine where text may wrap onto a new line, based on character properties, CJK word boundaries, and break opportunities.
Breaking Text Across Lines
When a paragraph of text is too wide for its container, a rendering engine must decide where to insert line breaks. This sounds straightforward but involves subtle rules that vary by script, language, and context. The Unicode Line Breaking Algorithm (UAX #14) provides a systematic framework for finding valid break opportunities.
The algorithm does not tell renderers where to break — that depends on line width and font metrics. Instead, it classifies each position between characters as a mandatory break, a break opportunity (the renderer may break here), or no break (breaking here is not allowed).
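To illustrate this division of labor, here is a minimal sketch of the renderer's side, assuming the break opportunities have already been computed. The name `greedy_fill` is hypothetical, the opportunities are simply "after each space", and width is counted in characters as a stand-in for real font metrics.

```python
def greedy_fill(text, opportunities, width):
    """Greedily break `text` using only the allowed break offsets.

    `opportunities` holds offsets where a break is permitted; the end of
    the text is always treated as a valid line end.
    """
    ends = sorted(set(opportunities) | {len(text)})
    lines, start = [], 0
    while start < len(text):
        # Allowed offsets after `start` that keep the line within `width`
        # (trailing spaces are not counted).
        fitting = [e for e in ends
                   if e > start and len(text[start:e].rstrip()) <= width]
        # If nothing fits, overflow to the next allowed offset instead of
        # breaking at a forbidden position.
        end = fitting[-1] if fitting else next(e for e in ends if e > start)
        lines.append(text[start:end].rstrip())
        start = end
    return lines


text = "The quick brown fox jumps"
# Assumption for this sketch: a break is allowed after every space.
opportunities = [i + 1 for i, ch in enumerate(text) if ch == " "]

print(greedy_fill(text, opportunities, width=10))
print(greedy_fill(text, opportunities, width=16))
```

The same set of opportunities produces different lines for different widths, which is why the choice of break is left to the renderer.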
Line Break Classes
Each Unicode character is assigned a line break class. There are over 40 classes; the most important are:
| Class | Code | Examples | Behavior |
|---|---|---|---|
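| Space | SP | U+0020 SPACE | Break opportunity after |
| Hyphen | HY | U+002D HYPHEN-MINUS | Break opportunity after, except before digits |
| Break After | BA | U+0009 TAB, U+2010 HYPHEN, en and em spaces | Break opportunity after |
| Break Before | BB | U+00B4 ACUTE ACCENT, some Tibetan and Mongolian marks | Break opportunity before |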
| Alphabetic | AL | Latin, Cyrillic letters | No break between letters |
| Ideographic | ID | CJK unified ideographs | Break opportunity after each |
| Nonstarter | NS | 々, ・, and by default Japanese small kana (っ, ょ; class CJ, resolved to NS) | Cannot start a new line |
| Close Punctuation | CL | }, 。, 」 | No break before (attaches to preceding) |
| Close Parenthesis | CP | ), ] | No break before (attaches to preceding) |
| Open Punctuation | OP | (, [, {, 「 | No break after (attaches to following) |
| Mandatory Break | BK | U+000C FORM FEED | Must break here |
| Carriage Return | CR | U+000D | Mandatory break (with LF handling) |
| Glue | GL | U+00A0 NO-BREAK SPACE | No break allowed |
The key insight for international text: ideographic characters used in Chinese, Japanese, and Korean text have a break opportunity after almost every character, because Chinese and Japanese are traditionally written without spaces between words. Latin letters have no break opportunities between them; breaks occur only at spaces or hyphens.
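The classification can be sketched with a toy pair-based checker. This is not UAX #14: the helper names `toy_class` and `classify_boundaries` are hypothetical, the class table is hand-written for a handful of characters (Python's standard library does not expose the Line_Break property), and only a few of the real rules are approximated.

```python
def toy_class(ch):
    """Rough stand-in for the Line_Break property (NOT real Unicode data)."""
    if ch in "\n\x0b\x0c":
        return "BK"   # line feed is lumped in with BK for simplicity
    if ch == " ":
        return "SP"
    if ch == "\u00a0":
        return "GL"
    if ch in "([{「":
        return "OP"
    if ch in ")]":
        return "CP"
    if ch in "}。、」":
        return "CL"
    # Assumption: treat basic CJK ideographs and all kana as ID.
    if "\u4e00" <= ch <= "\u9fff" or "\u3041" <= ch <= "\u30ff":
        return "ID"
    return "AL"       # everything else is treated as a letter


def classify_boundaries(text):
    """Map each offset i (between text[i-1] and text[i]) to a break kind."""
    kinds = {}
    for i in range(1, len(text)):
        before, after = toy_class(text[i - 1]), toy_class(text[i])
        if before == "BK":
            kind = "mandatory"
        elif after in ("BK", "SP", "GL", "CL", "CP") or before in ("GL", "OP"):
            kind = "no-break"
        elif before == "AL" and after in ("AL", "OP"):
            kind = "no-break"
        elif before == "CP" and after == "AL":
            kind = "no-break"
        else:
            kind = "opportunity"  # e.g. after a space, or between ideographs
        kinds[i] = kind
    return kinds


sample = "Hi (世界)。\nok"
for offset, kind in classify_boundaries(sample).items():
    if kind != "no-break":
        print(offset, kind)
```

In the sample, the only opportunities are after the space and between the two ideographs, while the closing parenthesis and the ideographic full stop stay attached to what precedes them.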
Connection to CSS
The UAX #14 algorithm is the foundation for CSS line-breaking properties:
/* Allow a break between any two characters, including inside Latin words (overrides the UAX #14 default) */
word-break: break-all;
/* Break long words that would overflow */
overflow-wrap: break-word;
/* Strict CJK line breaking (no breaks before small kana) */
line-break: strict;
/* Loose CJK line breaking (more break opportunities) */
line-break: loose;
/* Prevent breaking across lines entirely */
white-space: nowrap;
Python and Line Breaking
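Python's textwrap module wraps at whitespace and hyphens, which is adequate for ASCII-centric text but is not an implementation of UAX #14. For conformant line breaking, use a library that implements the algorithm, such as PyICU (which exposes ICU's line BreakIterator); uharfbuzz handles text shaping, not line breaking. Basic textwrap usage: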
import textwrap
# Basic wrapping (works for Latin text)
text = "The quick brown fox jumps over the lazy dog."
print(textwrap.fill(text, width=30))
# For CJK text, textwrap treats a run of ideographs as one long word and
# only splits it at the width limit, ignoring UAX #14 rules. Use a proper
# UAX #14 implementation instead (PyICU, or browser rendering)
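How textwrap handles unspaced CJK text can be checked with the standard library alone. This small demonstration assumes textwrap's default `break_long_words=True`; the sample sentence is made up for illustration.

```python
import textwrap

# 21 characters, no spaces: textwrap sees a single "word".
text = "日本語の文章は単語の間に空白を入れません。"

# Default behaviour: the long "word" is sliced every `width` code points,
# regardless of which characters end up at the start of a line.
lines = textwrap.wrap(text, width=10)
print(lines)

# With break_long_words=False the text is not broken at all.
unbroken = textwrap.wrap(text, width=10, break_long_words=False)
print(unbroken)
```

With a width of 10, the final line consists of the full stop 。 alone, a position that UAX #14 rules out because closing punctuation must stay attached to the preceding character.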
Quick Facts
| Property | Value |
|---|---|
| Specification | Unicode Standard Annex #14 (UAX #14) |
| Number of line break classes | 43 as of Unicode 15.0 (later versions add more) |
| CJK default | Break opportunity after almost every ideograph |
| Latin default | No break except at spaces/hyphens |
| CSS connection | word-break, overflow-wrap, line-break |
| Mandatory break chars | U+000A LF, U+000B VT, U+000C FF, U+000D CR, U+0085 NEL, U+2028 LS, U+2029 PS |
| No-break space | U+00A0 has class GL (Glue) — prevents break at that position |