Line Breaking Algorithm
Rules that determine where text may wrap to the next line, taking into account character properties, CJK word boundaries, and break opportunities.
Breaking Text Across Lines
When a paragraph of text is too wide for its container, a rendering engine must decide where to insert line breaks. This sounds straightforward but involves subtle rules that vary by script, language, and context. The Unicode Line Breaking Algorithm (UAX #14) provides a systematic framework for finding valid break opportunities.
The algorithm does not tell renderers where to break — that depends on line width and font metrics. Instead, it classifies each position between characters as a mandatory break, a break opportunity (the renderer may break here), or no break (breaking here is not allowed).
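To make that division of labour concrete, here is a minimal sketch of the renderer's side: a greedy line filler that consumes break positions computed elsewhere. The function name `fill_lines`, the `(offset, mandatory)` tuple format, and the use of character count as a stand-in for measured width are illustrative assumptions, not part of UAX #14.

```python
def fill_lines(text, breaks, max_width):
    """Greedy line filling over precomputed break positions.

    breaks: sorted (offset, mandatory) pairs, where offset is an index into
    text at which a line may end (opportunity) or must end (mandatory).
    The end of the text is always treated as a mandatory break.
    Width is approximated by character count (an assumption; a real
    renderer would measure glyph advances from font metrics).
    """
    lines = []
    start = 0        # index where the current line begins
    last_ok = None   # latest opportunity that still fits on this line
    points = [b for b in breaks if 0 < b[0] < len(text)] + [(len(text), True)]
    i = 0
    while i < len(points):
        offset, mandatory = points[i]
        fits = len(text[start:offset].rstrip()) <= max_width
        if not fits and last_ok is not None:
            # Overflow: end the line at the last opportunity that fit,
            # then re-examine the same point against the new line.
            lines.append(text[start:last_ok].rstrip())
            start, last_ok = last_ok, None
            continue
        if mandatory:
            lines.append(text[start:offset].rstrip())
            start, last_ok = offset, None
        else:
            last_ok = offset
        i += 1
    return lines


text = "The quick brown fox"
opportunities = [(4, False), (10, False), (16, False)]  # after each space
print(fill_lines(text, opportunities, max_width=10))
```

The same list of opportunities yields different lines for different widths, which is why the break analysis and the layout decision are kept separate. A segment with no usable opportunity is allowed to overflow rather than being split.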
Line Break Classes
Each Unicode character is assigned a line break class. There are over 40 classes; the most important:
| Class | Code | Examples | Behavior |
|---|---|---|---|
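| Break After | BA | U+2010 HYPHEN, U+2013 EN DASH, U+0009 TAB (U+0020 SPACE is class SP; U+002D HYPHEN-MINUS is class HY) | Break opportunity after |
| Break Before | BB | U+00B4 ACUTE ACCENT, Tibetan head marks | Break opportunity before |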
| Alphabetic | AL | Latin, Cyrillic letters | No break between letters |
| Ideographic | ID | CJK unified ideographs | Break opportunity after each |
| Nonstarter | NS | 々, ゝ, ・ (small kana such as っ, ょ are class CJ, treated as NS by default) | Cannot start a new line |
| Close Punctuation | CL | }, 。, 、, 」 | No break before (attaches to preceding) |
| Open Punctuation | OP | (, [, {, 「 | No break after (attaches to following) |
| Mandatory Break | BK | U+000C FORM FEED | Mandatory break after |
| Carriage Return | CR | U+000D | Mandatory break after (unless followed by LF) |
| Glue | GL | U+00A0 NO-BREAK SPACE | No break before or after |
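The key insight for international text: CJK ideographs and kana have a break opportunity after almost every character, because Chinese and Japanese are traditionally written without spaces between words. Korean Hangul syllables have their own classes; modern Korean uses spaces, although the default rules still permit breaks between syllables. Latin letters have no break opportunities between them; breaks occur only after spaces or hyphens.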
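The pairwise nature of these rules can be illustrated with a toy classifier. This is a deliberately tiny sketch, not UAX #14: the helper names `toy_class` and `break_opportunities` are hypothetical, the class assignments are hand-written for a few characters instead of being read from the Unicode `LineBreak.txt` data, and only a few rules are modelled.

```python
def toy_class(ch):
    """Rough stand-in for the Line_Break property (NOT the real data)."""
    if ch == "\u00a0":
        return "GL"                      # no-break space
    if ch == " ":
        return "SP"
    if ch in "([{「『":
        return "OP"
    if ch in ")]}」』、。":
        return "CL"                      # toy: all closers lumped together
    if ch in "々ゝっょゃゅ":
        return "NS"                      # small kana treated as nonstarters
    if "\u4e00" <= ch <= "\u9fff" or "\u3040" <= ch <= "\u30ff":
        return "ID"                      # ideographs and kana
    return "AL"                          # everything else, including digits


def break_opportunities(text):
    """Return offsets i where a line may break before text[i]."""
    out = []
    for i in range(1, len(text)):
        before, after = toy_class(text[i - 1]), toy_class(text[i])
        if "GL" in (before, after):
            continue                     # glue forbids a break on either side
        if after in ("CL", "NS", "SP"):
            continue                     # these may not start a line
        if before == "OP":
            continue                     # an opener may not end a line
        if before == "SP" or "ID" in (before, after):
            out.append(i)                # after a space, or next to an ideograph
    return out


print(break_opportunities("Hello world"))   # break only after the space
print(break_opportunities("日本語"))          # break between every ideograph
print(break_opportunities("ちょっと"))        # small kana stay attached
```

A conformant implementation applies the full ordered rule list (LB1 through LB31) to the real property data; the sketch only shows why Latin and CJK text end up with such different sets of opportunities.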
Connection to CSS
The UAX #14 algorithm is the foundation for CSS line-breaking properties:
```css
/* Allow a break between any two characters, so non-CJK text
   also breaks mid-word (relaxes the UAX #14 letter rules) */
word-break: break-all;
/* Break long words that would otherwise overflow */
overflow-wrap: break-word;
/* Strict CJK line breaking (no breaks before small kana) */
line-break: strict;
/* Loose CJK line breaking (more break opportunities) */
line-break: loose;
/* Prevent wrapping entirely */
white-space: nowrap;
```
Python and Line Breaking
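Python's textwrap module wraps only at whitespace and after hyphens, which is adequate for space-delimited text such as English but does not implement UAX #14. For conformant line breaking, use an ICU binding such as PyICU, whose `BreakIterator.createLineInstance` exposes ICU's line break iterator. (uharfbuzz is a text shaping library and does not perform line breaking.) The snippet below shows textwrap's basic behaviour: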
```python
import textwrap

# Basic wrapping (works for space-delimited text such as English)
text = "The quick brown fox jumps over the lazy dog."
print(textwrap.fill(text, width=30))

# textwrap treats a run of CJK ideographs as one long word. It splits such
# a run only when it overflows, at an arbitrary character, ignoring UAX #14.
# For CJK text, use a proper implementation (ICU via PyICU, or browser rendering).
```
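The limitation is easy to observe with the standard library alone. The following sketch wraps a short Japanese sentence (chosen here purely for illustration) and shows two failure modes: with default settings the text is chopped at a fixed character count, and with `break_long_words=False` it is not split at all.

```python
import textwrap

text = "これは日本語のテキストです。"  # 14 characters, no spaces

# Default settings: the whole run is one over-long "word", so textwrap
# cuts it after exactly `width` characters regardless of what follows.
chopped = textwrap.wrap(text, width=13)
print(chopped)

# The ideographic full stop ends up alone at the start of the second line.
# UAX #14 forbids a break before closing punctuation such as "。".
print(chopped[1] == "。")

# With break_long_words=False the run is never split, so the line overflows.
unbroken = textwrap.wrap(text, width=13, break_long_words=False)
print(len(unbroken), len(unbroken[0]))
```

Note also that `width` counts characters, while CJK glyphs usually occupy two columns in a terminal, so even a lucky split rarely matches the visual width.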
Quick Facts
| Property | Value |
|---|---|
| Specification | Unicode Standard Annex #14 (UAX #14) |
| Number of line break classes | 43 as of Unicode 15.0 (later versions add more) |
| CJK default | Break opportunity after every ideograph |
| Latin default | No break except at spaces/hyphens |
| CSS connection | word-break, overflow-wrap, line-break |
| Mandatory break chars | U+000A LF, U+000B VT, U+000C FF, U+000D CR, U+0085 NEL, U+2028 LS, U+2029 PS |
| No-break space | U+00A0 has class GL (Glue), which prevents a break before or after it |
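As a small illustration of the mandatory break characters, the sketch below splits a string at each of them with a regular expression. The function name `split_at_mandatory_breaks` is hypothetical, and the character set is limited to the code points with classes BK, CR, LF, and NL; it covers hard breaks only and does not find break opportunities.

```python
import re

# CR LF is listed first so the pair counts as a single break.
MANDATORY = re.compile("\r\n|[\n\r\x0b\x0c\x85\u2028\u2029]")


def split_at_mandatory_breaks(text):
    """Split text into segments at every mandatory line break."""
    return MANDATORY.split(text)


sample = "first\r\nsecond\u2028third\x0cfourth"
print(split_at_mandatory_breaks(sample))
```

Python's built-in `str.splitlines()` behaves similarly but additionally splits on U+001C, U+001D, and U+001E, which UAX #14 does not treat as mandatory breaks.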