The Unicode Bidirectional Algorithm
The Unicode Bidirectional Algorithm (UBA) determines how text containing a mix of left-to-right and right-to-left scripts is displayed on screen. This guide explains how the algorithm works and the security risks that arise from abusing bidirectional control characters.
Most writing systems flow in one direction — English runs left-to-right (LTR), Arabic and Hebrew run right-to-left (RTL). But real-world text is rarely that simple. An Arabic sentence that mentions a product name in English, a Hebrew paragraph containing a JavaScript code snippet, or an English UI label with an Arabic user name — these all mix scripts with opposite directionalities. The Unicode Bidirectional Algorithm (UBA, commonly called "Bidi") is the specification that determines how such mixed text is ordered for display.
Why Bidi Is Hard
Consider this logical string stored in memory:
"The word مرحبا means hello"
In memory, the characters are stored in logical order — the order you would type them. But on screen, the Arabic word must render right-to-left while the English words render left-to-right. The display engine must reorder the characters into visual order without any explicit instructions from the author.
The UBA handles this automatically in most cases. But when it guesses wrong — or when the text contains numbers, punctuation, or nested direction changes — the result can be confusing or even dangerous.
Key Terminology
| Term | Meaning |
|---|---|
| LTR | Left-to-right (English, Latin, Cyrillic, Greek, CJK) |
| RTL | Right-to-left (Arabic, Hebrew, Thaana, Syriac, N'Ko) |
| Bidi | Bidirectional — text that contains both LTR and RTL runs |
| Logical order | The order characters are stored in memory (typing order) |
| Visual order | The order characters appear on screen after reordering |
| Run | A maximal contiguous sequence of characters with the same direction |
| Embedding level | A number (0–125) assigned to each character indicating its nesting depth and direction |
| Base direction | The paragraph's overall direction (determines alignment and default behavior) |
How the Algorithm Works
The UBA, defined in Unicode Standard Annex #9, operates in four phases:
Phase 1: Determine the Paragraph Embedding Level
The algorithm scans the paragraph for the first character with a strong directional type (a letter, not a number or punctuation). If that character is LTR (e.g., Latin, Greek, CJK), the paragraph embedding level is 0 (LTR). If it is RTL (e.g., Arabic, Hebrew), the level is 1 (RTL).
This paragraph level determines text alignment (left for LTR, right for RTL) and how ambiguous characters (like spaces and punctuation) are positioned.
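The first-strong scan can be sketched in a few lines of Python. This is a minimal illustration rather than a full UAX #9 implementation: `paragraph_level` is a hypothetical helper name, and the input is assumed to be a single paragraph. The one refinement beyond the description above is that characters inside an isolate (between LRI/RLI/FSI and the matching PDI) are skipped, as rule P2 of UAX #9 requires.

```python
import unicodedata

ISOLATE_INITIATORS = {"\u2066", "\u2067", "\u2068"}  # LRI, RLI, FSI
PDI = "\u2069"


def paragraph_level(text: str) -> int:
    """Return 0 (LTR) or 1 (RTL) from the first strong character.

    Assumes `text` is one paragraph. Characters inside isolates are
    ignored; an unmatched isolate initiator hides the rest of the text.
    """
    depth = 0  # number of currently open isolates
    for ch in text:
        if ch in ISOLATE_INITIATORS:
            depth += 1
        elif ch == PDI:
            if depth > 0:
                depth -= 1
        elif depth == 0:
            bidi_class = unicodedata.bidirectional(ch)
            if bidi_class == "L":
                return 0
            if bidi_class in ("R", "AL"):
                return 1
    return 0  # no strong character found: default to LTR


print(paragraph_level("Hey \u0645\u0631\u062d\u0628 OK"))
print(paragraph_level("\u0645\u0631\u062d\u0628\u0627 hello"))
```

A paragraph with no strong characters (only digits and punctuation, say) falls back to level 0 in this sketch, which matches the algorithm's default when nothing else specifies a direction.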
Phase 2: Determine Explicit Embedding Levels
Explicit bidi control characters (like LRE, RLE, LRO, RLO, PDF, and the newer LRI, RLI, FSI, PDI) override or isolate the direction of text spans. The algorithm processes these to assign an embedding level to every character:
- Even levels (0, 2, 4, ...) = LTR
- Odd levels (1, 3, 5, ...) = RTL
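The parity rule also explains how a new embedding or isolate picks its level: UAX #9 uses the least odd level (for RTL) or the least even level (for LTR) that is greater than the current one, up to a maximum depth of 125. The sketch below shows only that arithmetic. The function names are hypothetical, and the directional status stack and overflow counters of the real algorithm are deliberately left out.

```python
from typing import Optional

MAX_DEPTH = 125  # highest embedding level permitted by UAX #9


def level_direction(level: int) -> str:
    """Even levels are LTR, odd levels are RTL."""
    return "LTR" if level % 2 == 0 else "RTL"


def next_level(current: int, rtl: bool) -> Optional[int]:
    """Least level above `current` with the requested parity.

    Returns None when the result would exceed MAX_DEPTH (an overflow,
    which the full algorithm tracks separately).
    """
    candidate = current + 1
    if (candidate % 2 == 1) != rtl:
        candidate += 1
    return candidate if candidate <= MAX_DEPTH else None


# An RTL isolate inside an LTR paragraph, then an LTR isolate inside that.
outer = next_level(0, rtl=True)
inner = next_level(outer, rtl=False)
print(outer, level_direction(outer))
print(inner, level_direction(inner))
```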
Phase 3: Resolve Weak and Neutral Types
Many characters do not have an inherent direction:
| Type | Examples | Behavior |
|---|---|---|
| Strong | Letters (A, ا, א) | Direction is intrinsic |
| Weak | Numbers (0–9), number separators (. , + -), currency signs, combining marks | Direction depends on context |
| Neutral | Spaces, most other punctuation (! ? @), paragraph separators | Direction inherited from surrounding characters |
The algorithm resolves weak types first (for example, European digits that follow Latin letters are treated as part of the LTR text, while those that follow Arabic letters are treated as Arabic numbers) and then neutral types (a space between two RTL words becomes RTL; a space between an RTL and an LTR word takes the direction of the embedding level).
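The three-way split can be checked with Python's `unicodedata` module. The grouping of Bidi_Class values into strong, weak, and neutral below follows UAX #9. The function name `bidi_category` and the catch-all label "other" (explicit formatting characters and unassigned code points) are choices made for this sketch.

```python
import unicodedata

STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}
NEUTRAL = {"B", "S", "WS", "ON"}


def bidi_category(ch: str) -> str:
    """Map one character to 'strong', 'weak', 'neutral', or 'other'."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class in STRONG:
        return "strong"
    if bidi_class in WEAK:
        return "weak"
    if bidi_class in NEUTRAL:
        return "neutral"
    return "other"  # explicit controls such as RLO, or unassigned


for ch in ["A", "\u0627", "5", "$", " ", "!", "\u202e"]:
    print(f"U+{ord(ch):04X}", bidi_category(ch))
```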
Phase 4: Reorder for Display
Finally, the algorithm reverses each RTL run in place, producing the visual order. Even-level runs stay in logical order; odd-level runs are reversed.
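A small sketch of the reordering step, assuming the embedding levels have already been resolved. It follows rule L2 of UAX #9: working from the highest level down to the lowest odd level, reverse every contiguous sequence of characters at that level or higher. The function name `reorder` is hypothetical, and the sketch ignores line breaking, mirroring of paired punctuation, and combining marks.

```python
from typing import List


def reorder(text: str, levels: List[int]) -> str:
    """Turn logical order into visual order given per-character levels."""
    if len(text) != len(levels):
        raise ValueError("need exactly one level per character")
    chars = list(text)
    odd_levels = [lv for lv in levels if lv % 2 == 1]
    if not odd_levels:
        return text  # pure LTR: nothing to reverse
    for level in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(chars):
            if levels[i] >= level:
                j = i
                while j < len(chars) and levels[j] >= level:
                    j += 1
                # Reverse the run occupying positions i..j-1.
                chars[i:j] = reversed(chars[i:j])
                i = j
            else:
                i += 1
    return "".join(chars)


# Latin at level 0, four Arabic letters at level 1, Latin at level 0.
logical = "Hey \u0645\u0631\u062d\u0628 OK"
levels = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
visual = reorder(logical, levels)
print([f"U+{ord(c):04X}" for c in visual])
```

The runs are located using the original level positions throughout. That is safe because reversing a run only moves characters among positions that already belong to that run.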
A Worked Example
Logical order in memory (L = Latin letter, A = Arabic letter):
L L L A A A A L L
H e y م ر ح ب O K
Step 1: First strong character is 'H' (Latin) → paragraph level = 0 (LTR).
Step 2: No explicit embedding controls → default levels apply:
- H, e, y: level 0 (LTR)
- Arabic letters: level 1 (RTL)
- O, K: level 0 (LTR)
- Spaces: resolved to adjacent levels
Step 3: Spaces between "Hey" and Arabic → neutral, resolved to level 0 (LTR context). Space between Arabic and "OK" → same.
Step 4: Reverse level-1 runs. The Arabic letters are reversed for display:
Visual: H e y ب ح ر م O K
The Arabic word, which was stored as mem-ra-ha-ba in logical order, is displayed as ba-ha-ra-mem (right-to-left) — which is correct for Arabic readers.
Bidi Character Types
Every Unicode character has a Bidi_Class property. The most important classes:
| Class | Name | Examples |
|---|---|---|
| L | Left-to-Right | Latin, Greek, Cyrillic, CJK, Thai |
| R | Right-to-Left | Hebrew, N'Ko, Mandaic |
| AL | Arabic Letter | Arabic, Syriac, Thaana |
| EN | European Number | 0 1 2 3 4 5 6 7 8 9 |
| AN | Arabic Number | Arabic-Indic digits: ٠ ١ ٢ ٣ |
| ES | European Separator | + - |
| CS | Common Separator | , . : / |
| ET | European Terminator | $ € % # |
| ON | Other Neutral | ! ? @ { } |
| WS | Whitespace | Space, form feed, line separator (tab is class S, Segment Separator) |
| B | Paragraph Separator | Line feed, paragraph separator |
| BN | Boundary Neutral | Most control characters, BOM |
You can look up any character's Bidi_Class in the Unicode Character Database or on UnicodeFYI.com character pages.
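For a quick look at a whole string rather than one character at a time, Python's standard `unicodedata` module exposes the same property. The helper name `bidi_classes` is made up for this sketch, and the results reflect whichever Unicode version ships with the interpreter.

```python
import unicodedata
from typing import List, Tuple


def bidi_classes(text: str) -> List[Tuple[str, str]]:
    """Return (code point, Bidi_Class) for every character in `text`."""
    return [(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch)) for ch in text]


# One sample from several rows of the table above.
for code_point, bidi_class in bidi_classes("A\u05d0\u06275\u0665+/$! \n"):
    print(code_point, bidi_class)
```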
Explicit Bidi Control Characters
When the automatic algorithm produces wrong results, you can insert invisible control characters to correct the direction:
Marks (Zero-Width)
| Character | Code Point | Purpose |
|---|---|---|
| LRM (Left-to-Right Mark) | U+200E | Forces LTR context at insertion point |
| RLM (Right-to-Left Mark) | U+200F | Forces RTL context at insertion point |
| ALM (Arabic Letter Mark) | U+061C | Forces Arabic Letter context (Unicode 6.3+) |
Isolates (Recommended, Unicode 6.3+)
| Character | Code Point | Purpose |
|---|---|---|
| LRI (Left-to-Right Isolate) | U+2066 | Start an LTR-isolated span |
| RLI (Right-to-Left Isolate) | U+2067 | Start an RTL-isolated span |
| FSI (First Strong Isolate) | U+2068 | Auto-detect direction for the span |
| PDI (Pop Directional Isolate) | U+2069 | End the most recent isolate |
Legacy Embeddings and Overrides (Discouraged)
| Character | Code Point | Purpose |
|---|---|---|
| LRE (Left-to-Right Embedding) | U+202A | Start LTR embedding (discouraged) |
| RLE (Right-to-Left Embedding) | U+202B | Start RTL embedding (discouraged) |
| LRO (Left-to-Right Override) | U+202D | Force LTR for all characters (discouraged) |
| RLO (Right-to-Left Override) | U+202E | Force RTL for all characters (discouraged) |
| PDF (Pop Directional Formatting) | U+202C | End the most recent embedding/override |
Always prefer isolates over embeddings. Isolates were introduced in Unicode 6.3 specifically because embeddings have a well-known flaw: they leak directional influence into surrounding text. Isolates create a clean boundary.
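As a rough illustration of the isolate approach in plain text (outside HTML), the helper below wraps a string in the matching initiator and PDI. The name `isolate` and its `direction` parameter are assumptions made for this sketch, not part of any library.

```python
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"

_INITIATORS = {"ltr": LRI, "rtl": RLI, "auto": FSI}


def isolate(text: str, direction: str = "auto") -> str:
    """Wrap `text` in an isolate; 'auto' uses FSI (first strong)."""
    try:
        initiator = _INITIATORS[direction]
    except KeyError:
        raise ValueError("direction must be 'ltr', 'rtl', or 'auto'")
    return f"{initiator}{text}{PDI}"


label = "Owner: " + isolate("\u05d3\u05e0\u05d9") + " (3 files)"
print(ascii(label))
```

If the wrapped text already contains a stray PDI, that character closes the isolate early, so untrusted input may need its bidi controls removed first.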
Common Bidi Bugs and Fixes
Bug 1: Punctuation Sticks to the Wrong Side
English text: "see مقال (article)"
The parentheses and the word "article" should be LTR, but the bidi algorithm may pull the opening parenthesis toward the Arabic text, displaying something like:
see مقال) article(
Fix: Insert a Left-to-Right Mark (LRM, U+200E) after the Arabic text:
see مقال (article)
^ LRM here (invisible)
In HTML: see مقال&lrm; (article)
Bug 2: Numbers in RTL Context
Arabic text with a number:
Logical: سعر ١٢٣ دولار
European digits (0–9) and Arabic-Indic digits (٠–٩) have different bidi classes. European digits are "European Number" (EN), which the algorithm treats as LTR in many contexts. This can cause numbers to appear on the wrong side of adjacent text.
Fix: Use FSI/PDI isolates around the number, or ensure the paragraph direction is set correctly with the dir attribute in HTML.
Bug 3: File Paths and URLs in RTL Pages
A URL like https://example.com/path/to/page embedded in an RTL paragraph may render with the slashes in the wrong positions:
Displayed: https://example.com/path/to/page ← might become garbled
Fix: Wrap the URL in an LTR isolate:
<a href="..." dir="ltr">https://example.com/path/to/page</a>
Or use the <bdi> element, which automatically isolates its content:
<bdi>https://example.com/path/to/page</bdi>
Bug 4: User-Generated Content Injection
When displaying a username from user input inside a sentence, the username's direction can corrupt the surrounding layout:
<!-- Dangerous: user-supplied name can flip the sentence -->
<p>Logged in as USERNAME.</p>
If USERNAME contains RTL characters, the period and surrounding text may reorder unexpectedly.
Fix: Always isolate user-generated content:
<p>Logged in as <bdi>USERNAME</bdi>.</p>
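On the server side, the same fix might look like the sketch below, which escapes the user-supplied name and wraps it in `<bdi>` before interpolating it into the sentence. The function name `logged_in_banner` is hypothetical; `html.escape` is from the Python standard library.

```python
from html import escape


def logged_in_banner(username: str) -> str:
    """Build the banner with the username escaped and bidi-isolated."""
    return f"<p>Logged in as <bdi>{escape(username)}</bdi>.</p>"


print(logged_in_banner("\u0645\u062d\u0645\u062f"))
print(logged_in_banner("<script>"))
```

Escaping and isolation address different problems: escaping stops markup injection, while `<bdi>` stops the name's direction from reordering the surrounding sentence.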
HTML and CSS Bidi Controls
The dir Attribute
The dir attribute on HTML elements is the primary mechanism for controlling text direction:
<html dir="rtl" lang="ar"> <!-- RTL page -->
<p dir="ltr">English paragraph</p> <!-- LTR override within RTL page -->
<p dir="auto">User content</p> <!-- Auto-detect from first strong char -->
The dir="auto" value is particularly useful for user-generated content: it examines the first strong character and sets the direction accordingly.
The <bdo> and <bdi> Elements
<!-- bdo: Override direction (force all characters to this direction) -->
<bdo dir="rtl">Hello</bdo> <!-- Renders: olleH -->
<!-- bdi: Isolate content (prevent it from affecting surroundings) -->
<p>User <bdi>محمد</bdi> posted 3 comments.</p>
CSS Properties
/* Set direction on an element */
.rtl-block {
  direction: rtl;
  unicode-bidi: isolate; /* Recommended: isolate this element */
}
/* unicode-bidi values:
normal — default, no special behavior
embed — legacy, opens an embedding (avoid)
isolate — recommended, creates an isolate
bidi-override — forces direction on all content
isolate-override — isolate + override combined
plaintext — determines direction from content, ignoring parent
*/
Important: The CSS direction property should only be used for document structure and layout. For inline text direction, prefer HTML dir attributes and Unicode bidi characters.
Bidi and Security: The Trojan Source Attack
In 2021, researchers at Cambridge described the Trojan Source attack (CVE-2021-42574), which exploits bidi override characters to make source code appear different from what it actually does.
Consider this Python code:
access_level = "user\u202e \u2066# Check if admin\u2069 \u2066"
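The escapes are written out here for readability. In a real attack the raw, invisible characters U+202E (RLO), U+2066 (LRI), and U+2069 (PDI) sit in the source file itself. An editor that applies the bidi algorithm can then display part of the string literal as if it were a trailing comment, while the compiler or interpreter still sees a string with a different value.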
Defenses:
- Lint for bidi controls: Reject source files containing U+202A–U+202E or U+2066–U+2069 in string literals and comments. Compilers such as rustc and GCC now flag these characters by default.
- Render bidi controls visibly: IDEs like VS Code show bidi control characters as visible glyphs.
- Code review: Review diffs in an environment that reveals invisible characters.
- GitHub: GitHub now highlights bidi characters in source code views.
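A lint of this kind can be quite small. The sketch below scans text for the two code point ranges named above and reports where each control appears. The function name `find_bidi_controls` and the report format are illustrative assumptions; wiring it into a pre-commit hook or CI job is left to the surrounding tooling.

```python
import unicodedata
from typing import List, Tuple

# U+202A..U+202E (embeddings, overrides, PDF) and U+2066..U+2069 (isolates).
BIDI_CONTROLS = {
    chr(cp) for cp in (*range(0x202A, 0x202F), *range(0x2066, 0x206A))
}


def find_bidi_controls(source: str) -> List[Tuple[int, int, str, str]]:
    """Return (line, column, code point, name) for each bidi control."""
    hits = []
    for lineno, line in enumerate(source.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append(
                    (lineno, col, f"U+{ord(ch):04X}", unicodedata.name(ch))
                )
    return hits


sample = 'level = "user\u202e \u2066# Check if admin\u2069 \u2066"\n'
for hit in find_bidi_controls(sample):
    print(hit)
```

Lines and columns are 1-based and counted in code points, so they may not match an editor that counts columns in UTF-16 units or bytes.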
Bidi in Programming Languages
Python
import unicodedata
# Check bidi class of a character
unicodedata.bidirectional("A") # 'L' (Left-to-Right)
unicodedata.bidirectional("\u0627") # 'AL' (Arabic Letter) — alef
unicodedata.bidirectional("5") # 'EN' (European Number)
unicodedata.bidirectional(" ") # 'WS' (Whitespace)
unicodedata.bidirectional("\u200E") # 'L' (LRM acts as strong LTR)
unicodedata.bidirectional("\u200F") # 'R' (RLM acts as strong RTL)
JavaScript
// Insert bidi marks programmatically
const LRM = "\u200E";
const RLM = "\u200F";
function isolateBidi(text) {
// Wrap in First Strong Isolate / Pop Directional Isolate
return "\u2068" + text + "\u2069";
}
const username = "\u0645\u062D\u0645\u062F"; // محمد
const msg = `Logged in as ${isolateBidi(username)}.`;
Java
// Check bidi class
Character.getDirectionality('A');
// → Character.DIRECTIONALITY_LEFT_TO_RIGHT (0)
Character.getDirectionality('\u0627');
// → Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC (2)
// java.text.Bidi class for paragraph-level analysis
java.text.Bidi bidi = new java.text.Bidi("Hello \u0645\u0631\u062D\u0628\u0627", java.text.Bidi.DIRECTION_LEFT_TO_RIGHT);
bidi.getBaseLevel(); // 0 (LTR paragraph)
bidi.getRunCount(); // 2 (LTR run "Hello ", then the RTL Arabic run)
bidi.isLeftToRight(); // false (mixed)
bidi.isMixed(); // true
Testing Bidi Behavior
Use these test strings to verify that your application handles bidirectional text correctly:
| Test Case | Logical String | Expected Visual |
|---|---|---|
| Simple RTL | مرحبا | Right-to-left word |
| Mixed LTR-RTL | Hello مرحبا World | "Hello" LTR, Arabic RTL, "World" LTR |
| Number in RTL | سعر 100 دولار | Number between RTL words |
| Nested directions | English عربي English عربي English | Alternating direction runs |
| Paren in RTL | see (مقال) here | Parentheses should stay with content |
| URL in RTL | انظر https://x.com/ هنا | URL should remain LTR |
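One way to keep these strings close to a test suite is a small fixture, sketched below. The dictionary keys and the `strong_directions` helper are hypothetical names. The helper only reports which strong directions a string contains, which is a sanity check on the fixture itself and not a substitute for checking the rendered output.

```python
import unicodedata
from typing import Set

BIDI_TEST_STRINGS = {
    "simple_rtl": "مرحبا",
    "mixed_ltr_rtl": "Hello مرحبا World",
    "number_in_rtl": "سعر 100 دولار",
    "nested_directions": "English عربي English عربي English",
    "paren_in_rtl": "see (مقال) here",
    "url_in_rtl": "انظر https://x.com/ هنا",
}


def strong_directions(text: str) -> Set[str]:
    """Return the set of strong directions ('ltr', 'rtl') found in text."""
    found = set()
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            found.add("ltr")
        elif bidi_class in ("R", "AL"):
            found.add("rtl")
    return found


for name, text in BIDI_TEST_STRINGS.items():
    print(name, sorted(strong_directions(text)))
```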
Best Practices
- Always set dir on your <html> element: Declare the base direction explicitly rather than relying on browser defaults.
- Use dir="auto" for user-generated content: Let the browser detect the direction from the first strong character. Wrap in <bdi> when the content is inline within a sentence.
- Prefer isolates over embeddings: Use LRI/RLI/FSI with PDI (Unicode 6.3+) or the HTML <bdi> element. Avoid the legacy LRE/RLE/LRO/RLO characters.
- Test with real RTL text: Use actual Arabic or Hebrew strings in your test suite, not just LTR text with dir="rtl".
- Scan source code for bidi controls: Add a pre-commit hook or CI check that flags U+202A–U+202E and U+2066–U+2069 in source files to prevent Trojan Source attacks.
- Use CSS unicode-bidi: isolate: When setting direction in CSS, always pair it with unicode-bidi: isolate to prevent directional leakage.
- Handle numbers carefully: European digits and Arabic-Indic digits have different bidi classes. Test formatting of prices, dates, and phone numbers in both LTR and RTL contexts.
- Mind the invisible characters: Bidi controls are zero-width and invisible. When debugging bidi issues, use a tool that reveals them, such as the UnicodeFYI character analyzer or a hex editor.