Line Breaking Algorithm
Rules for determining line break positions in text, taking into account character properties, CJK word boundaries, and break opportunities.
Breaking Text Across Lines
When a paragraph of text is too wide for its container, a rendering engine must decide where to insert line breaks. This sounds straightforward but involves subtle rules that vary by script, language, and context. The Unicode Line Breaking Algorithm (UAX #14) provides a systematic framework for finding valid break opportunities.
The algorithm does not tell renderers where to break — that depends on line width and font metrics. Instead, it classifies each position between characters as a mandatory break, a break opportunity (the renderer may break here), or no break (breaking here is not allowed).
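The division of labor described above can be sketched in a few lines. This is a minimal illustration, not part of UAX #14: the `Break` enum and the `layout` function are hypothetical names, the break positions are supplied by hand, and width is counted in characters where a real renderer would use font metrics.

```python
from enum import Enum


class Break(Enum):
    MANDATORY = "!"    # the line must end here
    ALLOWED = "÷"      # break opportunity
    PROHIBITED = "×"   # no break allowed


def layout(text, breaks, width):
    """Greedy fill. breaks[i] describes the position just before text[i];
    positions missing from the dict are treated as PROHIBITED."""
    lines = []
    start = 0        # index where the current line begins
    last_ok = None   # most recent break opportunity on the current line
    for pos in range(1, len(text) + 1):
        kind = breaks.get(pos, Break.PROHIBITED)
        # The line would overflow: fall back to the last opportunity, if any.
        if len(text[start:pos].rstrip()) > width and last_ok is not None:
            lines.append(text[start:last_ok].rstrip())
            start, last_ok = last_ok, None
        if kind is Break.MANDATORY or pos == len(text):
            lines.append(text[start:pos].rstrip())
            start, last_ok = pos, None
        elif kind is Break.ALLOWED:
            last_ok = pos
    return lines


text = "one two\nthree four"
# Hand-written classification: opportunities after each space,
# a mandatory break after the newline.
breaks = {4: Break.ALLOWED, 8: Break.MANDATORY, 14: Break.ALLOWED}

print(layout(text, breaks, width=6))
print(layout(text, breaks, width=10))
```

The same classification yields different lines for different widths: the break opportunities are a property of the text, while the breaks actually taken depend on the container.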
Line Break Classes
Each Unicode character is assigned a line break class. There are over 40 classes; the most important:
| Class | Code | Examples | Behavior |
|---|---|---|---|
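| Break After | BA | U+2010 HYPHEN, U+2002 EN SPACE, tab (U+0020 SPACE and U+002D HYPHEN-MINUS have their own classes, SP and HY) | Break opportunity after |
| Break Before | BB | U+00B4 ACUTE ACCENT, dictionary stress marks such as U+02C8 (the em dash is class B2, break before and after) | Break opportunity before |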
| Alphabetic | AL | Latin, Cyrillic letters | No break between letters |
| Ideographic | ID | CJK unified ideographs | Break opportunity after each |
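| Nonstarter | NS | 々, ゝ, ・ (Japanese small kana such as っ and ょ are class CJ, which is treated as NS by default) | Cannot start a new line |
| Close Punctuation | CL | }, 」, 。 (ASCII ) and ] are in the closely related class CP) | No break before (attaches to preceding) |
| Open Punctuation | OP | (, [, {, 「 | No break after (attaches to following) |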
| Mandatory Break | BK | U+000C FORM FEED | Must break here |
| Carriage Return | CR | U+000D | Mandatory break (with LF handling) |
| Glue | GL | U+00A0 NO-BREAK SPACE | No break allowed |
The key insight for international text: CJK ideographs and kana have a break opportunity after almost every character, because Chinese and Japanese are traditionally written without spaces between words. Latin letters have no break opportunities between them; breaks are possible only at spaces or hyphens.
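To make the contrast concrete, here is a deliberately tiny sketch of pair-based break detection. It is not UAX #14: `toy_class` and `break_opportunities` are hypothetical helpers, the classes are guessed from a handful of characters and code point ranges, digits are lumped in with letters, CP is folded into CL, and most of the real rules are omitted.

```python
def toy_class(ch):
    """Very rough stand-in for the Line_Break property (assumption, not real data)."""
    if ch == "\u00a0":
        return "GL"
    if ch == " ":
        return "SP"
    if ch == "-":
        return "HY"
    if ch in "([{「『":
        return "OP"
    if ch in ")]}」』。、":
        return "CL"  # CP characters folded into CL for brevity
    if ch in "ぁぃぅぇぉっゃゅょァィゥェォッャュョ々":
        return "NS"  # small kana are really CJ, resolved to NS by default
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF or 0x3040 <= cp <= 0x30FF:
        return "ID"  # CJK unified ideographs, hiragana, katakana
    return "AL"


def break_opportunities(text):
    """Return the indices i where a break is allowed just before text[i]."""
    result = []
    for i in range(1, len(text)):
        before, after = toy_class(text[i - 1]), toy_class(text[i])
        # Never break before these classes, and never after OP or GL.
        if after in {"SP", "CL", "NS", "GL", "HY"} or before in {"OP", "GL"}:
            continue
        # Letters stick together.
        if before == "AL" and after == "AL":
            continue
        result.append(i)
    return result


print(break_opportunities("Hello world"))        # only after the space
print(break_opportunities("日本語のテキスト"))      # between every pair
print(break_opportunities("行った。"))            # not before っ or 。
print(break_opportunities("10\u00a0km"))         # NBSP glues both sides
```

Even this toy version shows the asymmetry: the Latin sample has a single opportunity, the CJK sample has one between every pair of characters, and the nonstarter and closing-punctuation classes remove opportunities again.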
Connection to CSS
The UAX #14 algorithm is the foundation for CSS line-breaking properties:
```css
/* Allow a break between any two characters, even inside Latin words */
word-break: break-all;
/* Break an otherwise unbreakable word only when it would overflow */
overflow-wrap: break-word;
/* Strict CJK line breaking (no breaks before small kana) */
line-break: strict;
/* Loose CJK line breaking (more break opportunities) */
line-break: loose;
/* Suppress automatic wrapping entirely */
white-space: nowrap;
```
Python and Line Breaking
Python's textwrap module implements a simplified form of line breaking for ASCII-centric text: it breaks only at ASCII whitespace and after hyphens. For UAX #14-based breaking, ICU's line break iterator (exposed in Python by PyICU) is the usual choice; uharfbuzz is a text shaping library and does not compute line breaks. Basic textwrap usage:

```python
import textwrap

# Basic wrapping (works for Latin text)
text = "The quick brown fox jumps over the lazy dog."
print(textwrap.fill(text, width=30))

# For CJK text, textwrap finds no break opportunities between
# ideographs: an unspaced run is treated as one long word.
# Use a UAX #14 implementation instead (PyICU, or browser rendering).
```
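The limitation with CJK text can be observed using only the standard library. The sample sentence and the width of 7 below are arbitrary choices for illustration.

```python
import textwrap

latin = "The quick brown fox jumps over the lazy dog."
cjk = "吾輩は猫である。名前はまだ無い。"  # 16 characters, no spaces

latin_lines = textwrap.wrap(latin, width=20)

# No whitespace to break on: the whole run stays on one overlong line.
cjk_unbroken = textwrap.wrap(cjk, width=7, break_long_words=False)

# With the default break_long_words=True the run is chopped every
# 7 characters, without regard to line break classes.
cjk_chopped = textwrap.wrap(cjk, width=7)

print(latin_lines)
print(cjk_unbroken)
print(cjk_chopped)
```

In the chopped result the second line begins with 。, a closing punctuation mark that UAX #14 never allows at the start of a line. Note also that textwrap counts characters, not display columns, so lines of wide CJK characters occupy roughly twice the width requested.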
Quick Facts
| Property | Value |
|---|---|
| Specification | Unicode Standard Annex #14 (UAX #14) |
| Number of line break classes | 43 in Unicode 15.0; later versions add more |
| CJK default | Break opportunity after every ideograph |
| Latin default | No break except at spaces/hyphens |
| CSS connection | word-break, overflow-wrap, line-break |
| Mandatory break chars | U+000A LF, U+000D CR, U+000B VT, U+000C FF, U+0085 NEL, U+2028 LS, U+2029 PS |
| No-break space | U+00A0 has class GL (Glue) — prevents break at that position |
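Two of these facts have rough analogues in Python's standard library, shown here as an illustration rather than as a UAX #14 implementation. `str.splitlines` splits on several of the mandatory-break characters listed above (and on a few additional control characters), and textwrap deliberately does not treat U+00A0 as breakable whitespace.

```python
import textwrap

# Several of the mandatory-break characters from the table.
sample = "a\nb\rc\x0cd\x85e\u2028f\u2029g"
pieces = sample.splitlines()
print(pieces)

# U+00A0 keeps "10" and "km" together; an ordinary space does not.
with_nbsp = textwrap.wrap("10\u00a0km away", width=4, break_long_words=False)
with_space = textwrap.wrap("10 km away", width=4, break_long_words=False)
print(with_nbsp)
print(with_space)
```

With the no-break space, "10" and "km" stay on one line even though that line exceeds the requested width; with a regular space they are separated.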