What is Word Joiner?

U+2060. A zero-width character that prevents a line break. The modern replacement for U+FEFF (BOM) in its role as a zero width no-break space.
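As a rough illustration of that replacement, the sketch below keeps a leading U+FEFF (where it acts as a byte order mark) and swaps any later occurrence for U+2060. The helper name `modernize_zwnbsp` is hypothetical, not part of any library.

```python
import unicodedata

WORD_JOINER = "\u2060"
ZWNBSP = "\ufeff"  # the same code point doubles as the byte order mark (BOM)


def modernize_zwnbsp(text: str) -> str:
    """Keep the first character as-is; replace later U+FEFF with U+2060."""
    if not text:
        return text
    head, tail = text[0], text[1:]
    return head + tail.replace(ZWNBSP, WORD_JOINER)


sample = "\ufeffkey\ufeffvalue"
fixed = modernize_zwnbsp(sample)

# Show the names of the invisible characters that remain in the string.
print([unicodedata.name(ch) for ch in fixed if not ch.isalnum()])
```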

What is No-Break Space?

U+00A0. A space that prevents a line break at its position. HTML: `&nbsp;`. Used between numbers and units (100 km), in proper names (Mr. Smith), and after abbreviations.
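A small sketch of the effect, using Python's `textwrap` as a stand-in for a renderer. It relies on `textwrap` breaking only at ASCII whitespace, so U+00A0 stays inside its "word". The sample sentence and width are arbitrary.

```python
import textwrap

NBSP = "\u00a0"

plain = "The distance is 100 km today"
glued = f"The distance is 100{NBSP}km today"

# With an ordinary space the number and its unit can be separated;
# with U+00A0 they move to the next line together.
wrapped_plain = textwrap.wrap(plain, width=20)
wrapped_glued = textwrap.wrap(glued, width=20)

print(wrapped_plain)
print(wrapped_glued)
```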

Algorithms

Line Breaking Algorithm

Rules determining where text may wrap to the next line, taking into account character properties, CJK word boundaries, and break opportunities.

2022-09-12 · Updated 2024-06-05

Breaking Text Across Lines

When a paragraph of text is too wide for its container, a rendering engine must decide where to insert line breaks. This sounds straightforward but involves subtle rules that vary by script, language, and context. The Unicode Line Breaking Algorithm (UAX #14) provides a systematic framework for finding valid break opportunities.

The algorithm does not tell renderers where to break — that depends on line width and font metrics. Instead, it classifies each position between characters as a mandatory break, a break opportunity (the renderer may break here), or no break (breaking here is not allowed).
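The three outcomes can be sketched with a toy classifier. This is not UAX #14 itself: it assumes only two rules (break after a hard line break, allow a break after a space) and does not handle CR LF pairs. The symbols `!`, `÷`, and `×` follow the notation UAX #14 uses; the names `Break` and `classify_positions` are made up for this example.

```python
from enum import Enum


class Break(Enum):
    MANDATORY = "!"
    ALLOWED = "÷"
    PROHIBITED = "×"


# Small illustrative subset of hard line break characters.
HARD_BREAKS = {"\n", "\u000c", "\u2028", "\u2029"}


def classify_positions(text: str) -> list[Break]:
    """Classify each boundary between text[i] and text[i + 1]."""
    result = []
    for prev, nxt in zip(text, text[1:]):
        if prev in HARD_BREAKS:
            result.append(Break.MANDATORY)
        elif nxt in HARD_BREAKS or nxt == " ":
            # Never break before a hard break or before a space.
            result.append(Break.PROHIBITED)
        elif prev == " ":
            result.append(Break.ALLOWED)
        else:
            result.append(Break.PROHIBITED)
    return result


marks = "".join(b.value for b in classify_positions("Hi there\nok"))
print(marks)
```

A renderer would then pick among the `÷` positions according to the available width, and always break at `!`.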

Line Break Classes

Each Unicode character is assigned a line break class. There are over 40 classes; the most important are:

Class	Code	Examples	Behavior
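Space	SP	U+0020 SPACE	Break opportunity after a run of spaces
Break After	BA	tab, en space, U+2010 HYPHEN, soft hyphen	Break opportunity after
Hyphen	HY	U+002D HYPHEN-MINUS	Break opportunity after, except before digits
Break Before	BB	U+00B4 ACUTE ACCENT, dictionary stress marks (ˈ, ˌ)	Break opportunity before
Break Before and After	B2	U+2014 EM DASH	Break opportunity before and after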
Alphabetic	AL	Latin, Cyrillic letters	No break between letters
Ideographic	ID	CJK unified ideographs	Break opportunity after each
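Nonstarter	NS	`々`, `・`, `‼` (small kana such as っ, ょ are class CJ, treated as NS by default)	Cannot start a new line
Close Punctuation	CL	`}`, `」`, `。`	No break before (attaches to preceding)
Close Parenthesis	CP	`)`, `]`	No break before (attaches to preceding)
Open Punctuation	OP	`(`, `[`, `「`	No break after (attaches to following)
Quotation	QU	`«`, `»`, `"`	Can act as opening or closing; by default no break on either side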
Mandatory Break	BK	U+000C FORM FEED	Must break here
Carriage Return	CR	U+000D	Mandatory break (with LF handling)
Glue	GL	U+00A0 NO-BREAK SPACE	No break before or after

The key insight for international text: CJK ideographs have a break opportunity after almost every character, because Chinese and Japanese are traditionally written without spaces between words. Latin letters have no break opportunities between them; breaks occur only at space or hyphen positions.
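The contrast can be sketched in a few lines. The code below is a deliberate simplification: it treats only two CJK Unified Ideographs blocks as ideographic, knows nothing about punctuation classes, and uses hypothetical helper names (`is_ideograph`, `break_offsets`).

```python
def is_ideograph(ch: str) -> bool:
    """True for CJK Unified Ideographs and Extension A (a partial check)."""
    cp = ord(ch)
    return 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF


def break_offsets(text: str) -> list[int]:
    """Return each index i where a line may start at text[i]."""
    offsets = []
    for i in range(1, len(text)):
        prev, nxt = text[i - 1], text[i]
        if nxt == " ":
            continue  # never break before a space
        if prev == " " or is_ideograph(prev) or is_ideograph(nxt):
            offsets.append(i)
    return offsets


latin = break_offsets("line break")
cjk = break_offsets("自動改行処理")
print(latin, cjk)
```

The Latin phrase yields a single opportunity (after the space), while the six-ideograph string yields one between every pair of characters.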

Connection to CSS

The UAX #14 algorithm is the foundation for CSS line-breaking properties:

/* Allow a break between any two characters, even inside Latin words */
word-break: break-all;

/* Break long words that would overflow */
overflow-wrap: break-word;

/* Strict CJK line breaking (no breaks before small kana) */
line-break: strict;

/* Loose CJK line breaking (more break opportunities) */
line-break: loose;

/* Prevent breaking across lines entirely */
white-space: nowrap;

Python and Line Breaking

Python's textwrap module implements a simplified form of line breaking: it breaks at ASCII whitespace and after hyphens, which suits space-delimited text but is not UAX #14. For conformant line breaking, an ICU binding such as PyICU provides a line BreakIterator that follows UAX #14:

import textwrap

# Basic wrapping (works for Latin text)
text = "The quick brown fox jumps over the lazy dog."
print(textwrap.fill(text, width=30))

# For unspaced CJK text, textwrap.fill sees one long "word" and cuts it
# purely by character count, ignoring UAX #14 rules and double-width
# characters. Use a proper UAX #14 implementation (PyICU, or browser rendering)
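To make the limitation concrete, the sketch below compares `textwrap.wrap` with a minimal width-aware wrapper. The wrapper is an illustration only, not a UAX #14 implementation: it counts wide characters as two columns via `unicodedata.east_asian_width`, and it lets a small hand-picked set of closing punctuation overflow the line rather than start a new one. The names `columns`, `wrap_cjk`, and `NO_LINE_START` are assumptions made for this example.

```python
import textwrap
import unicodedata

NO_LINE_START = set("。、」』）")  # small illustrative subset


def columns(ch: str) -> int:
    """Display width: 2 for wide/fullwidth characters, else 1."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def wrap_cjk(text: str, width: int) -> list[str]:
    lines, current, used = [], "", 0
    for ch in text:
        w = columns(ch)
        # Start a new line when full, unless ch must not begin a line.
        if current and used + w > width and ch not in NO_LINE_START:
            lines.append(current)
            current, used = "", 0
        current += ch
        used += w
    if current:
        lines.append(current)
    return lines


text = "これは長い文です。次の文。"
naive = textwrap.wrap(text, width=8)   # 8 characters per line
aware = wrap_cjk(text, width=16)       # 16 columns per line
print(naive)
print(aware)
```

Here `textwrap` leaves the full stop at the start of the second line, which the default rules prohibit, while the sketch keeps it attached to the preceding sentence.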

Quick Facts

Property	Value
Specification	Unicode Standard Annex #14 (UAX #14)
Number of line break classes	43 (Unicode 9.0 through 15.0; later versions add more)
CJK default	Break opportunity after every ideograph
Latin default	No break except at spaces/hyphens
CSS connection	`word-break`, `overflow-wrap`, `line-break`
Mandatory break chars	U+000A LF, U+000D CR, U+000C FF, U+0085 NEL, U+2028 LS, U+2029 PS
No-break space	U+00A0 has class GL (Glue) — prevents break at that position
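The mandatory break characters listed above can be turned into a simple splitter. This is a sketch under one added assumption: U+000B (vertical tab), which UAX #14 also assigns to class BK, is included alongside the six characters in the table. The name `split_hard_lines` is hypothetical.

```python
import re

# CR LF is listed first so the pair counts as a single break.
# U+000B is an addition to the table above (also class BK in UAX #14).
MANDATORY_BREAK = re.compile("\r\n|[\n\r\x0b\x0c\x85\u2028\u2029]")


def split_hard_lines(text: str) -> list[str]:
    """Split text at mandatory line breaks only."""
    return MANDATORY_BREAK.split(text)


parts = split_hard_lines("a\r\nb\u2028c\x85d")
print(parts)

# str.splitlines() is close but broader: it also splits on the
# information separators U+001C..U+001E, which are not mandatory breaks.
print("a\x1cb".splitlines(), split_hard_lines("a\x1cb"))
```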

Related terms

Word Joiner, No-Break Space

More in Algorithms

Bidirectional Algorithm

Algorithm determining the display order of characters in mixed-direction text …

Collation Algorithm

Standard algorithm for comparing and sorting Unicode strings via a …

Case Folding

Mapping characters to a common case form for case-insensitive comparison. More comprehensive …

Composition Exclusion

Characters excluded from canonical composition (NFC) to prevent the decomposition of …

Word Boundary

The position between words according to the word-breaking rules …

Sentence Boundary

The position between sentences according to the Unicode rules. More complex than a …

Grapheme Cluster Boundary

Rules (UAX#29) for determining where one user-perceived character ends and another begins. …

NFC (Canonical Composition)

Normalization Form C: decompose, then recompose canonically, producing the form …

NFD (Canonical Decomposition)

Normalization Form D: full decomposition without recomposition. Used by the …

NFKC (Compatibility Composition)

Normalization Form KC: compatibility decomposition followed by canonical composition. Merges …
