The Unicode Bidirectional Algorithm

The Unicode Bidirectional Algorithm (UBA) determines how text containing a mix of left-to-right and right-to-left scripts is displayed on screen. This guide explains how the algorithm works and the security risks that arise from abusing bidirectional control characters.

Published 2021-08-02 · Updated 2024-11-25

Most writing systems flow in one direction — English runs left-to-right (LTR), Arabic and Hebrew run right-to-left (RTL). But real-world text is rarely that simple. An Arabic sentence that mentions a product name in English, a Hebrew paragraph containing a JavaScript code snippet, or an English UI label with an Arabic user name — these all mix scripts with opposite directionalities. The Unicode Bidirectional Algorithm (UBA, commonly called "Bidi") is the specification that determines how such mixed text is ordered for display.

Why Bidi Is Hard

Consider this logical string stored in memory:

"The word مرحبا means hello"

In memory, the characters are stored in logical order — the order you would type them. But on screen, the Arabic word must render right-to-left while the English words render left-to-right. The display engine must reorder the characters into visual order without any explicit instructions from the author.

The UBA handles this automatically in most cases. But when it guesses wrong — or when the text contains numbers, punctuation, or nested direction changes — the result can be confusing or even dangerous.
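To make the gap between storage and display concrete, it can help to print a string in logical order along with each character's directional class. The sketch below relies only on Python's standard unicodedata module; the helper name describe_logical_order and the sample string are illustrative assumptions.

```python
import unicodedata


def describe_logical_order(text):
    """Return (code point, Bidi_Class) pairs in logical (storage) order."""
    return [(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch)) for ch in text]


# Latin 'a', space, Arabic meem + reh, space, Latin 'b' (logical order)
sample = "a \u0645\u0631 b"
for code_point, bidi_class in describe_logical_order(sample):
    print(code_point, bidi_class)
```

Nothing in the stored sequence says where each character will appear on screen; that is decided later by the display engine.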

Key Terminology

Term	Meaning
LTR	Left-to-right (English, Latin, Cyrillic, Greek, CJK)
RTL	Right-to-left (Arabic, Hebrew, Thaana, Syriac, N'Ko)
Bidi	Bidirectional — text that contains both LTR and RTL runs
Logical order	The order characters are stored in memory (typing order)
Visual order	The order characters appear on screen after reordering
Run	A maximal contiguous sequence of characters with the same embedding level (and therefore the same direction)
Embedding level	A number (0–125) assigned to each character indicating its nesting depth and direction
Base direction	The paragraph's overall direction (determines alignment and default behavior)

How the Algorithm Works

The UBA, defined in Unicode Standard Annex #9, operates in four phases:

Phase 1: Determine the Paragraph Embedding Level

The algorithm scans the paragraph for the first character with a strong directional type (a letter, not a number or punctuation). If that character is LTR (e.g., Latin, Greek, CJK), the paragraph embedding level is 0 (LTR). If it is RTL (e.g., Arabic, Hebrew), the level is 1 (RTL).

This paragraph level determines text alignment (left for LTR, right for RTL) and how ambiguous characters (like spaces and punctuation) are positioned.
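A simplified version of this first-strong scan can be sketched in a few lines. It assumes a single paragraph with no isolate controls (the full rule in UAX #9 skips over isolated spans), and the function name paragraph_level is hypothetical.

```python
import unicodedata


def paragraph_level(text):
    """Simplified first-strong heuristic: 0 for LTR, 1 for RTL.

    Assumption: no isolate controls are present. Falls back to 0 (LTR)
    when the text contains no strong character at all.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return 0
        if bidi_class in ("R", "AL"):
            return 1
    return 0


print(paragraph_level("Hello \u0645\u0631\u062D\u0628\u0627"))  # Latin first
print(paragraph_level("123 \u05D0 abc"))  # digits are skipped, Hebrew alef is first strong
```

Note that leading digits and punctuation do not count, so a paragraph that starts with a number takes its direction from whatever letter follows.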

Phase 2: Determine Explicit Embedding Levels

Explicit bidi control characters (LRE, RLE, LRO, RLO, PDF, and the newer LRI, RLI, FSI, PDI) embed, override, or isolate the direction of text spans. The algorithm processes these to assign an embedding level to every character:

Even levels (0, 2, 4, ...) = LTR
Odd levels (1, 3, 5, ...) = RTL
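The parity rule, and the way an explicit control picks its level, can be sketched as follows. This is a minimal illustration, assuming the UAX #9 behavior that an RTL control moves to the next higher odd level and an LTR control to the next higher even level, with 125 as the cap; direction_of and next_level are hypothetical names.

```python
MAX_DEPTH = 125  # highest explicit embedding level (matches the 0-125 range above)


def direction_of(level):
    """Even levels are LTR, odd levels are RTL."""
    return "LTR" if level % 2 == 0 else "RTL"


def next_level(current, want_rtl):
    """Smallest level above `current` with the requested parity.

    Returns None when the result would exceed MAX_DEPTH, in which case
    the control is treated as an overflow and has no effect.
    """
    candidate = current + 1
    if (candidate % 2 == 1) != want_rtl:
        candidate += 1
    return candidate if candidate <= MAX_DEPTH else None


print(next_level(0, want_rtl=True))   # RTL span inside an LTR paragraph
print(next_level(1, want_rtl=True))   # RTL span nested inside RTL text
print(next_level(1, want_rtl=False))  # LTR span inside RTL text
```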

Phase 3: Resolve Weak and Neutral Types

Many characters do not have an inherent direction:

Type	Examples	Behavior
Strong	Letters (A, ا, א)	Direction is intrinsic
Weak	Numbers (0–9), currency signs, combining marks	Direction depends on context
Neutral	Spaces, most punctuation (! ? ;), paragraph separators	Direction inherited from surrounding characters

The algorithm resolves weak types first (for example, a European number that follows an LTR letter is treated as LTR, while a number inside RTL text keeps its digits in left-to-right order) and then neutral types (a space between two RTL words becomes RTL; a space between an RTL word and an LTR word takes the direction of the embedding level).
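The three groups can be derived from the Bidi_Class value that Python's unicodedata module reports. The grouping below follows the class lists in UAX #9 as I understand them; the function name bidi_category and the "explicit" and "unassigned" labels are assumptions made for this sketch.

```python
import unicodedata

STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}
NEUTRAL = {"B", "S", "WS", "ON"}
EXPLICIT = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}


def bidi_category(ch):
    """Map a single character to strong / weak / neutral / explicit."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class in STRONG:
        return "strong"
    if bidi_class in WEAK:
        return "weak"
    if bidi_class in NEUTRAL:
        return "neutral"
    if bidi_class in EXPLICIT:
        return "explicit"
    return "unassigned"  # unicodedata returns "" for unassigned code points


for ch in ["A", "\u0627", "5", "$", " ", "!"]:
    print(repr(ch), bidi_category(ch))
```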

Phase 4: Reorder for Display

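Finally, working from the highest embedding level down to the lowest odd level, the algorithm reverses every contiguous sequence of characters at that level or higher, producing the visual order. In the simple case, even-level runs stay in logical order and odd-level runs are reversed.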
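The reordering step can be sketched on its own, given levels that have already been resolved. The code below follows rule L2 of UAX #9 as I read it: starting at the highest level and stepping down to the lowest odd level, reverse each contiguous stretch of characters at that level or higher. The function name reorder is hypothetical, and uppercase letters stand in for RTL letters so the result is easy to read.

```python
def reorder(chars, levels):
    """Return `chars` in visual order for one line, given resolved levels."""
    odd_levels = [lv for lv in levels if lv % 2 == 1]
    if not odd_levels:
        return list(chars)  # nothing is RTL, so nothing moves

    order = list(range(len(chars)))
    for level in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(order):
            if levels[order[i]] >= level:
                j = i
                while j < len(order) and levels[order[j]] >= level:
                    j += 1
                order[i:j] = reversed(order[i:j])  # flip this stretch
                i = j
            else:
                i += 1
    return [chars[k] for k in order]


# LTR paragraph with one RTL word (XYZ stands in for RTL letters)
print("".join(reorder("ab XYZ cd", [0, 0, 0, 1, 1, 1, 0, 0, 0])))
# RTL paragraph containing a number at level 2
print("".join(reorder("AB12CD", [1, 1, 2, 2, 1, 1])))
```

In the second call the digits are flipped once at level 2 and again at level 1, which is why a number inside RTL text still reads left to right.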

A Worked Example

Logical order in memory (L = Latin letter, A = Arabic letter):

L L L   A A A A   L L
H e y   م ر ح ب   O K

Step 1: First strong character is 'H' (Latin) → paragraph level = 0 (LTR).

Step 2: No explicit embedding controls → default levels apply:
H, e, y: level 0 (LTR)
Arabic letters: level 1 (RTL)
O, K: level 0 (LTR)
Spaces: neutral, resolved in Step 3

Step 3: Spaces between "Hey" and Arabic → neutral, resolved to level 0 (LTR context). Space between Arabic and "OK" → same.

Step 4: Reverse level-1 runs. The Arabic letters are reversed for display:

Visual: H e y   ب ح ر م   O K

The Arabic word, stored as meem-reh-hah-beh in logical order, is displayed with its first letter (meem) on the right and its last letter (beh) on the left, which is the correct reading order for Arabic.

Bidi Character Types

Every Unicode character has a Bidi_Class property. The most important classes:

Class	Name	Examples
L	Left-to-Right	Latin, Greek, Cyrillic, CJK, Thai
R	Right-to-Left	Hebrew, N'Ko, Mandaic
AL	Arabic Letter	Arabic, Syriac, Thaana
EN	European Number	0 1 2 3 4 5 6 7 8 9
AN	Arabic Number	Arabic-Indic digits: ٠ ١ ٢ ٣
ES	European Separator	+ -
CS	Common Separator	, . : /
ET	European Terminator	$ € % #
ON	Other Neutral	! ? @ { }
WS	Whitespace	Space, form feed, line separator
B	Paragraph Separator	Line feed, paragraph separator
BN	Boundary Neutral	Most control characters, BOM

You can look up any character's Bidi_Class in the Unicode Character Database or on UnicodeFYI.com character pages.

Explicit Bidi Control Characters

When the automatic algorithm produces wrong results, you can insert invisible control characters to correct the direction:

Marks (Zero-Width)

Character	Code Point	Purpose
LRM (Left-to-Right Mark)	U+200E	Forces LTR context at insertion point
RLM (Right-to-Left Mark)	U+200F	Forces RTL context at insertion point
ALM (Arabic Letter Mark)	U+061C	Forces Arabic Letter context (Unicode 6.3+)

Isolates (Recommended, Unicode 6.3+)

Character	Code Point	Purpose
LRI (Left-to-Right Isolate)	U+2066	Start an LTR-isolated span
RLI (Right-to-Left Isolate)	U+2067	Start an RTL-isolated span
FSI (First Strong Isolate)	U+2068	Auto-detect direction for the span
PDI (Pop Directional Isolate)	U+2069	End the most recent isolate

Legacy Embeddings and Overrides (Discouraged)

Character	Code Point	Purpose
LRE (Left-to-Right Embedding)	U+202A	Start LTR embedding (discouraged)
RLE (Right-to-Left Embedding)	U+202B	Start RTL embedding (discouraged)
LRO (Left-to-Right Override)	U+202D	Force LTR for all characters (discouraged)
RLO (Right-to-Left Override)	U+202E	Force RTL for all characters (discouraged)
PDF (Pop Directional Formatting)	U+202C	End the most recent embedding/override

Always prefer isolates over embeddings. Isolates were introduced in Unicode 6.3 specifically because embeddings have a well-known flaw: they leak directional influence into surrounding text. Isolates create a clean boundary.
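Wrapping a string in an isolate pair is a one-line operation in most languages. The Python sketch below is illustrative only: the isolate helper and its direction argument are assumptions, while the code points come from the isolates table above.

```python
LRI = "\u2066"  # Left-to-Right Isolate
RLI = "\u2067"  # Right-to-Left Isolate
FSI = "\u2068"  # First Strong Isolate
PDI = "\u2069"  # Pop Directional Isolate


def isolate(text, direction="auto"):
    """Wrap `text` in an isolate pair: 'ltr', 'rtl', or 'auto' (FSI)."""
    opener = {"ltr": LRI, "rtl": RLI, "auto": FSI}[direction]
    return f"{opener}{text}{PDI}"


username = "\u0645\u062D\u0645\u062F"  # محمد
message = f"Logged in as {isolate(username)}."
print(len(message) - len("Logged in as .") - len(username))  # number of added controls
```

Using "auto" is usually the safer default for text whose direction is not known in advance, since FSI picks the direction from the content itself.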

Common Bidi Bugs and Fixes

Bug 1: Punctuation Sticks to the Wrong Side

English text: "see مقال (article)"

The parentheses and the word "article" should be LTR, but the bidi algorithm may pull the opening parenthesis toward the Arabic text, displaying something like:

see مقال) article(

Fix: Insert a Left-to-Right Mark (LRM, U+200E) after the Arabic text:

see مقال‎ (article)
         ^ LRM here (invisible)

In HTML: see مقال&lrm; (article)

Bug 2: Numbers in RTL Context

Arabic text with a number:

Logical: سعر ١٢٣ دولار

European digits (0–9) and Arabic-Indic digits (٠–٩) have different bidi classes. European digits are "European Number" (EN), which the algorithm treats as LTR in many contexts. This can cause numbers to appear on the wrong side of adjacent text.

Fix: Use FSI/PDI isolates around the number, or ensure the paragraph direction is set correctly with the dir attribute in HTML.
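The class difference between the two digit sets is easy to confirm with Python's unicodedata module. The helper name digit_classes is an assumption made for this sketch.

```python
import unicodedata


def digit_classes(text):
    """Return (digit, Bidi_Class) for every decimal digit in `text`."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text if ch.isdigit()]


print(digit_classes("123"))                    # European digits
print(digit_classes("\u0661\u0662\u0663"))     # Arabic-Indic digits ١٢٣
```

When a price or phone number lands on the wrong side of its label, checking which digit set the data actually contains is a reasonable first debugging step.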

Bug 3: File Paths and URLs in RTL Pages

A URL like https://example.com/path/to/page embedded in an RTL paragraph may render with the slashes in wrong positions:

Displayed: https://example.com/path/to/page  ← might become garbled

Fix: Wrap the URL in an LTR isolate:

<a href="..." dir="ltr">https://example.com/path/to/page</a>

Or use the <bdi> element, which automatically isolates its content:

<bdi>https://example.com/path/to/page</bdi>

Bug 4: User-Generated Content Injection

When displaying a username from user input inside a sentence, the username's direction can corrupt the surrounding layout:

<!-- Dangerous: user-supplied name can flip the sentence -->
<p>Logged in as USERNAME.</p>

If USERNAME contains RTL characters, the period and surrounding text may reorder unexpectedly.

Fix: Always isolate user-generated content:

<p>Logged in as <bdi>USERNAME</bdi>.</p>
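When the markup is generated on the server, the same fix can be applied at the point where the name is interpolated. The following is a minimal sketch, assuming a plain string template rather than any particular framework; render_login_banner is a hypothetical helper, and html.escape comes from the Python standard library.

```python
from html import escape


def render_login_banner(username):
    """Build the banner with the user-supplied name escaped and isolated."""
    # escape() guards against markup injection; <bdi> guards against
    # the name's direction leaking into the rest of the sentence.
    return f"<p>Logged in as <bdi>{escape(username)}</bdi>.</p>"


print(render_login_banner("\u0645\u062D\u0645\u062F"))
print(render_login_banner("<script>"))
```

Escaping and isolating solve different problems, so both are applied here.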

HTML and CSS Bidi Controls

The `dir` Attribute

The dir attribute on HTML elements is the primary mechanism for controlling text direction:

<html dir="rtl" lang="ar">        <!-- RTL page -->
<p dir="ltr">English paragraph</p> <!-- LTR override within RTL page -->
<p dir="auto">User content</p>     <!-- Auto-detect from first strong char -->

The dir="auto" value is particularly useful for user-generated content — it examines the first strong character and sets the direction accordingly.

The `<bdo>` and `<bdi>` Elements

<!-- bdo: Override direction (force all characters to this direction) -->
<bdo dir="rtl">Hello</bdo>   <!-- Renders: olleH -->

<!-- bdi: Isolate content (prevent it from affecting surroundings) -->
<p>User <bdi>محمد</bdi> posted 3 comments.</p>

CSS Properties

/* Set direction on an element */
.rtl-block {
    direction: rtl;
    unicode-bidi: isolate;      /* Recommended: isolate this element */
}

/* unicode-bidi values:
   normal         — default, no special behavior
   embed          — legacy, opens an embedding (avoid)
   isolate        — recommended, creates an isolate
   bidi-override  — forces direction on all content
   isolate-override — isolate + override combined
   plaintext      — determines direction from content, ignoring parent
*/

Important: The CSS direction property should only be used for document structure and layout. For inline text direction, prefer HTML dir attributes and Unicode bidi characters.

Bidi and Security: The Trojan Source Attack

In 2021, researchers at Cambridge described the Trojan Source attack (CVE-2021-42574), which exploits bidi override characters to make source code appear different from what it actually does.

Consider this Python code:

access_level = "user\u202e \u2066# Check if admin\u2069 \u2066"

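The escapes are written out here so the controls are visible; in a real attack they are embedded as raw, invisible characters. The RLO (U+202E), LRI (U+2066), and PDI (U+2069) characters then make the tail of the string literal render as if it were a trailing comment in some editors, while the interpreter still sees a string whose value is not "user".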

Defenses:

Lint for bidi controls: Reject source files containing U+202A–U+202E or U+2066–U+2069 in string literals and comments. Compilers such as rustc and GCC now flag these characters by default.
Render bidi controls visibly: IDEs like VS Code show bidi control characters as visible glyphs.
Code review: Review diffs in an environment that reveals invisible characters.
GitHub: GitHub now highlights bidi characters in source code views.
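A basic lint check along these lines fits in a few lines of Python. The sketch below is an assumption about how such a check might look, not the implementation used by any of the tools named above: find_bidi_controls is a hypothetical function that scans the whole text rather than only literals and comments.

```python
BIDI_CONTROLS = {
    "\u202A": "LRE", "\u202B": "RLE", "\u202C": "PDF",
    "\u202D": "LRO", "\u202E": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}


def find_bidi_controls(source):
    """Return (line, column, name) for each bidi control, both 1-based."""
    findings = []
    for line_no, line in enumerate(source.split("\n"), start=1):
        for column, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                findings.append((line_no, column, BIDI_CONTROLS[ch]))
    return findings


suspicious = 'level = "user\u202E \u2066# Check if admin\u2069 \u2066"\nprint(level)'
for line_no, column, name in find_bidi_controls(suspicious):
    print(f"line {line_no}, column {column}: {name}")
```

A pre-commit hook or CI job could read each changed file, call a function like this, and fail when the returned list is non-empty.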

Bidi in Programming Languages

Python

import unicodedata

# Check bidi class of a character
unicodedata.bidirectional("A")    # 'L'  (Left-to-Right)
unicodedata.bidirectional("\u0627")  # 'AL' (Arabic Letter) — alef
unicodedata.bidirectional("5")    # 'EN' (European Number)
unicodedata.bidirectional(" ")    # 'WS' (Whitespace)
unicodedata.bidirectional("\u200E")  # 'L'  (LRM acts as strong LTR)
unicodedata.bidirectional("\u200F")  # 'R'  (RLM acts as strong RTL)

JavaScript

// Insert bidi marks programmatically
const LRM = "\u200E";
const RLM = "\u200F";

function isolateBidi(text) {
    // Wrap in First Strong Isolate / Pop Directional Isolate
    return "\u2068" + text + "\u2069";
}

const username = "\u0645\u062D\u0645\u062F";  // محمد
const msg = `Logged in as ${isolateBidi(username)}.`;

Java

// Check bidi class
Character.getDirectionality('A');
// → Character.DIRECTIONALITY_LEFT_TO_RIGHT (0)

Character.getDirectionality('\u0627');
// → Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC (2)

// java.text.Bidi class for paragraph-level analysis
java.text.Bidi bidi = new java.text.Bidi(
    "Hello \u0645\u0631\u062D\u0628\u0627", java.text.Bidi.DIRECTION_LEFT_TO_RIGHT);
bidi.getBaseLevel();     // 0 (LTR paragraph)
bidi.getRunCount();      // 2 ("Hello " at level 0, then the Arabic word at level 1)
bidi.isLeftToRight();    // false (mixed)
bidi.isMixed();          // true

Testing Bidi Behavior

Use these test strings to verify that your application handles bidirectional text correctly:

Test Case	Logical String	Expected Visual
Simple RTL	`مرحبا`	Right-to-left word
Mixed LTR-RTL	`Hello مرحبا World`	"Hello" LTR, Arabic RTL, "World" LTR
Number in RTL	`سعر 100 دولار`	Number between RTL words
Nested directions	`English عربي English عربي English`	Alternating direction runs
Paren in RTL	`see (مقال) here`	Parentheses should stay with content
URL in RTL	`انظر https://x.com/ هنا`	URL should remain LTR

Best Practices

Always set dir on your <html> element: Declare the base direction explicitly rather than relying on browser defaults.
Use dir="auto" for user-generated content: Let the browser detect the direction from the first strong character. Wrap in <bdi> when the content is inline within a sentence.
Prefer isolates over embeddings: Use LRI/RLI/FSI with PDI (Unicode 6.3+) or the HTML <bdi> element. Avoid the legacy LRE/RLE/LRO/RLO characters.
Test with real RTL text: Use actual Arabic or Hebrew strings in your test suite, not just LTR text with dir="rtl".
Scan source code for bidi controls: Add a pre-commit hook or CI check that flags U+202A–U+202E and U+2066–U+2069 in source files to prevent Trojan Source attacks.
Use CSS unicode-bidi: isolate: When setting direction in CSS, always pair it with unicode-bidi: isolate to prevent directional leakage.
Handle numbers carefully: European digits and Arabic-Indic digits have different bidi classes. Test formatting of prices, dates, and phone numbers in both LTR and RTL contexts.
Mind the invisible characters: Bidi controls are zero-width and invisible. When debugging bidi issues, use a tool that reveals them — such as the UnicodeFYI character analyzer or a hex editor.
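If no dedicated tool is at hand, a small helper can make the invisible characters show up in logs or test failures. This is a rough sketch; the reveal_bidi name and the angle-bracket tag format are assumptions, and the code points are the marks, isolates, embeddings, and overrides listed earlier in this guide.

```python
VISIBLE_NAMES = {
    "\u200E": "LRM", "\u200F": "RLM", "\u061C": "ALM",
    "\u202A": "LRE", "\u202B": "RLE", "\u202C": "PDF",
    "\u202D": "LRO", "\u202E": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}


def reveal_bidi(text):
    """Replace each bidi control or mark with a visible <NAME> tag."""
    return "".join(
        f"<{VISIBLE_NAMES[ch]}>" if ch in VISIBLE_NAMES else ch
        for ch in text
    )


print(reveal_bidi("see \u0645\u0642\u0627\u0644\u200E (article)"))
print(reveal_bidi("user\u202E \u2066# Check if admin\u2069 \u2066"))
```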