Algorithms

Bidirectional Algorithm

An algorithm for determining the display order of characters in mixed-direction text (for example, English plus Arabic), using bidi categories and explicit directional overrides.

The Challenge of Mixed-Direction Text

English reads left-to-right (LTR). Arabic and Hebrew read right-to-left (RTL). When you have both in the same paragraph — a common situation in multilingual documents, URLs in Arabic text, or numbers in Hebrew — the rendering engine needs a precise set of rules to determine the visual order of characters. That set of rules is the Unicode Bidirectional Algorithm (UBA), specified in Unicode Standard Annex #9.

The UBA operates on logical order (the order characters are stored) and produces visual order (the order glyphs are rendered on screen). Most of the time this is invisible to users — text just displays correctly. But when it goes wrong, entire paragraphs can appear mirrored, or security-relevant filenames can be displayed in a different order than they are stored.

Implicit vs. Explicit Directionality

The UBA assigns every character a bidi category taken from its Bidi_Class property in the Unicode Character Database. Common categories:

| Category | Abbr | Examples |
|---|---|---|
| Left-to-Right | L | Latin, Cyrillic, CJK |
| Right-to-Left | R | Hebrew |
| Arabic Letter | AL | Arabic, Thaana |
| European Number | EN | 0–9 |
| Common Separator | CS | , . |
| Paragraph Separator | B | newline |
| Boundary Neutral | BN | formatting characters |
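
The categories in the table can be inspected directly. The following is a minimal sketch using Python's standard `unicodedata.bidirectional()` function; the sample characters and the `bidi_class` helper name are chosen here for illustration and are not part of UAX #9.

```python
import unicodedata

# Illustrative sample characters (an assumption of this sketch, not a
# complete list): one representative per category from the table above.
samples = {
    "A": "Latin letter",
    "\u044F": "Cyrillic letter ya",
    "\u05D0": "Hebrew letter alef",
    "\u0627": "Arabic letter alef",
    "7": "ASCII digit",
    ",": "comma",
    "\n": "line feed",
}


def bidi_class(ch: str) -> str:
    """Return the Bidi_Class abbreviation (e.g. 'L', 'R', 'AL') of one character."""
    return unicodedata.bidirectional(ch)


for ch, label in samples.items():
    print(f"U+{ord(ch):04X} {label}: {bidi_class(ch)}")
```

The abbreviations printed are the same ones used in the Abbr column, so the lookup is a quick way to check why a particular character is ordered the way it is.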

Using these categories, the algorithm assigns embedding levels (even = LTR, odd = RTL) and resolves the visual order automatically. This implicit handling covers the vast majority of cases.
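
As a rough illustration of how levels start out, the paragraph's base embedding level can be derived from its first strong character. The sketch below is a simplification, not a full UAX #9 implementation: it ignores isolate sequences, which the real rule skips over, and the function name `paragraph_base_level` and its `default` parameter are hypothetical.

```python
import unicodedata

STRONG_LTR = {"L"}
STRONG_RTL = {"R", "AL"}


def paragraph_base_level(text: str, default: int = 0) -> int:
    """Return 0 (even, LTR) or 1 (odd, RTL) based on the first strong character.

    Simplified sketch: characters inside isolates are NOT skipped here,
    although the full algorithm does skip them.
    """
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls in STRONG_LTR:
            return 0
        if cls in STRONG_RTL:
            return 1
    # No strong character found (e.g. only digits and punctuation).
    return default


# Digits (EN) and spaces are not strong, so the Arabic letter decides.
print(paragraph_base_level("123 \u0645\u0631\u062D\u0628\u0627"))
print(paragraph_base_level("Hello \u05E2\u05D5\u05DC\u05DD"))
```

Resolving the levels of the individual runs inside the paragraph involves many more rules; this sketch covers only the starting point.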

Explicit Directional Formatting Characters

When implicit resolution produces the wrong order, Unicode provides directional formatting characters to override it:

| Character | Code point | Name | Purpose |
|---|---|---|---|
| LRE | U+202A | Left-to-Right Embedding | Start LTR embedded text |
| RLE | U+202B | Right-to-Left Embedding | Start RTL embedded text |
| LRO | U+202D | Left-to-Right Override | Force LTR regardless of characters |
| RLO | U+202E | Right-to-Left Override | Force RTL regardless of characters |
| PDF | U+202C | Pop Directional Formatting | End embedding/override |
| LRI | U+2066 | Left-to-Right Isolate | Isolate LTR run (Unicode 6.3+) |
| RLI | U+2067 | Right-to-Left Isolate | Isolate RTL run (Unicode 6.3+) |
| FSI | U+2068 | First Strong Isolate | Auto-detect direction |
| PDI | U+2069 | Pop Directional Isolate | End isolate |

The LRI/RLI/FSI/PDI isolate controls (added in Unicode 6.3) are preferred over the older embedding controls because isolates do not affect the surrounding text's bidi resolution — they are fully contained.
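
One common use of isolates is wrapping a fragment of unknown direction (a user name, a file name) before inserting it into a larger sentence. The following is a minimal sketch under that assumption; the helper names `isolate` and `comment_summary` are hypothetical, and the sketch does not handle a fragment that itself contains an unmatched PDI, which would close the isolate early.

```python
FSI = "\u2068"  # First Strong Isolate: direction taken from the fragment itself
PDI = "\u2069"  # Pop Directional Isolate: ends the isolate


def isolate(fragment: str) -> str:
    """Wrap a fragment in FSI ... PDI so it cannot reorder the surrounding text."""
    return f"{FSI}{fragment}{PDI}"


def comment_summary(username: str, count: int) -> str:
    """Build a sentence around a user-supplied name of unknown direction."""
    return f"User {isolate(username)} posted {count} comments"


# The Hebrew name is isolated, so the following digits stay with the
# English sentence rather than being pulled into the RTL run.
message = comment_summary("\u05D3\u05E0\u05D9", 3)
print([f"U+{ord(ch):04X}" for ch in message if ch in (FSI, PDI)])
```

FSI is used here rather than LRI or RLI because the caller does not know the fragment's direction in advance.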

Security: The Bidi Trojan Source Attack

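The RLO character (U+202E) can be used maliciously to display a filename or code string in a different order than it is stored. A file stored as innocent<U+202E>fdp.exe is displayed as innocentexe.pdf: everything after the RLO is rendered right-to-left, so the real .exe extension appears to be .pdf. The same mechanism underlies the "Trojan Source" attack (CVE-2021-42574), in which bidi controls hidden in comments and string literals make source code display in a different order than the compiler reads it. It affected code editors and review tools that rendered bidi formatting in source files without showing the controls. Mitigation: strip or escape U+202A–U+202E and U+2066–U+2069 in user-supplied text displayed in security contexts.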
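
The mitigation above can be sketched as a small filter. This is an illustrative example rather than a vetted security control: the function names are hypothetical, and the set covers only the two code point ranges named in this section.

```python
# The explicit directional formatting characters named above:
# U+202A..U+202E (embeddings, overrides, PDF) and U+2066..U+2069 (isolates, PDI).
BIDI_CONTROLS = {chr(cp) for cp in range(0x202A, 0x202F)} | {
    chr(cp) for cp in range(0x2066, 0x206A)
}


def has_bidi_controls(text: str) -> bool:
    """Report whether the text contains any explicit bidi formatting character."""
    return any(ch in BIDI_CONTROLS for ch in text)


def escape_bidi_controls(text: str) -> str:
    """Replace each bidi control with a visible \\uXXXX escape."""
    return "".join(
        f"\\u{ord(ch):04X}" if ch in BIDI_CONTROLS else ch for ch in text
    )


def strip_bidi_controls(text: str) -> str:
    """Remove bidi controls entirely."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)


stored_name = "innocent\u202Efdp.exe"
print(has_bidi_controls(stored_name))
print(escape_bidi_controls(stored_name))
print(strip_bidi_controls(stored_name))
```

Escaping keeps the evidence visible to a reviewer, while stripping is simpler when the text only needs to be displayed safely; which one fits depends on the context.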

Quick Facts

| Property | Value |
|---|---|
| Specification | Unicode Standard Annex #9 (UAX #9) |
| Also known as | UBA, Bidi Algorithm |
| Paragraph base direction | Determined by the first strong character, unless a higher-level protocol (such as the HTML dir attribute) sets it explicitly |
| CSS property | direction: rtl/ltr, unicode-bidi: embed/bidi-override/isolate |
| HTML attribute | dir="rtl", dir="ltr", dir="auto" |
| Security risk | RLO spoofing (Trojan Source, CVE-2021-42574) |
| Preferred controls | Isolates (LRI/RLI/FSI/PDI) over legacy embeddings |

Related Terms

More in Algorithms

Case Folding

Mapping characters to a common case form for case-insensitive comparison. More comprehensive …

Grapheme Cluster Boundary

Rules (UAX#29) for determining where one user-perceived character ends and another begins. …

NFC (Canonical Composition)

Normalization Form C: decomposition followed by canonical recomposition, yielding the shortest form. Recommended …

NFD (Canonical Decomposition)

Normalization Form D: full decomposition without recomposition. Used by the macOS HFS+ file system. …

NFKC (Compatibility Composition)

Normalization Form KC: compatibility decomposition followed by canonical composition. Merges visually similar …

NFKD (Compatibility Decomposition)

Normalization Form KD: compatibility decomposition without recomposition. The most aggressive normalization, with maximal …

String Comparison

Comparing Unicode strings requires normalization (NFC/NFD) and optionally collation (locale-aware sorting). Binary …

Line Breaking Algorithm

Rules for determining where text may wrap to the next line, based on character properties, …

Collation Algorithm

The standard algorithm for comparing and sorting Unicode strings, using multi-level comparison: base character …

Sentence Boundary

A position between sentences according to Unicode rules. More complex than splitting on periods; it takes into account …