The Unicode Bidirectional Algorithm

The Unicode Bidirectional Algorithm (UBA) determines how text containing a mix of left-to-right and right-to-left scripts is displayed on screen. This guide explains how the algorithm works and the security risks that arise from abusing bidirectional control characters.

Most writing systems flow in one direction — English runs left-to-right (LTR), Arabic and Hebrew run right-to-left (RTL). But real-world text is rarely that simple. An Arabic sentence that mentions a product name in English, a Hebrew paragraph containing a JavaScript code snippet, or an English UI label with an Arabic user name — these all mix scripts with opposite directionalities. The Unicode Bidirectional Algorithm (UBA, commonly called "Bidi") is the specification that determines how such mixed text is ordered for display.

Why Bidi Is Hard

Consider this logical string stored in memory:

"The word مرحبا means hello"

In memory, the characters are stored in logical order — the order you would type them. But on screen, the Arabic word must render right-to-left while the English words render left-to-right. The display engine must reorder the characters into visual order without any explicit instructions from the author.

The UBA handles this automatically in most cases. But when it guesses wrong — or when the text contains numbers, punctuation, or nested direction changes — the result can be confusing or even dangerous.

Key Terminology

| Term | Meaning |
| --- | --- |
| LTR | Left-to-right (English, Latin, Cyrillic, Greek, CJK) |
| RTL | Right-to-left (Arabic, Hebrew, Thaana, Syriac, N'Ko) |
| Bidi | Bidirectional: text that contains both LTR and RTL runs |
| Logical order | The order characters are stored in memory (typing order) |
| Visual order | The order characters appear on screen after reordering |
| Run | A maximal contiguous sequence of characters with the same direction |
| Embedding level | A number (0–125) assigned to each character indicating its nesting depth and direction |
| Base direction | The paragraph's overall direction (determines alignment and default behavior) |

How the Algorithm Works

The UBA, defined in Unicode Standard Annex #9, operates in four phases:

Phase 1: Determine the Paragraph Embedding Level

The algorithm scans the paragraph for the first character with a strong directional type (a letter, not a number or punctuation). If that character is LTR (e.g., Latin, Greek, CJK), the paragraph embedding level is 0 (LTR). If it is RTL (e.g., Arabic, Hebrew), the level is 1 (RTL).

This paragraph level determines text alignment (left for LTR, right for RTL) and how ambiguous characters (like spaces and punctuation) are positioned.
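
The first-strong-character rule can be sketched in a few lines of Python. This is a simplified illustration, not the full rule from UAX #9: the function name `paragraph_level` is made up for this example, and the sketch does not skip characters that sit inside isolates.

```python
import unicodedata


def paragraph_level(text: str) -> int:
    """Return 0 (LTR) or 1 (RTL) based on the first strong character.

    Simplification: characters inside isolates are not skipped here.
    When no strong character is found, the level defaults to 0.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return 0
        if bidi_class in ("R", "AL"):
            return 1
    return 0


# "Hey" comes first, so the paragraph is LTR.
print(paragraph_level("Hey \u0645\u0631\u062d\u0628 OK"))   # 0
# Digits and spaces are not strong, so the Arabic letter decides.
print(paragraph_level("123 \u0645\u0631\u062d\u0628\u0627"))  # 1
```

Note that digits and punctuation at the start of a paragraph are skipped, which is why a message that begins with a number can still resolve to RTL.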

Phase 2: Determine Explicit Embedding Levels

Explicit bidi control characters (like LRE, RLE, LRO, RLO, PDF, and the newer LRI, RLI, FSI, PDI) embed, override, or isolate the direction of text spans. The algorithm processes these to assign an embedding level to every character:

  • Even levels (0, 2, 4, ...) = LTR
  • Odd levels (1, 3, 5, ...) = RTL

Phase 3: Resolve Weak and Neutral Types

Many characters do not have an inherent direction:

| Type | Examples | Behavior |
| --- | --- | --- |
| Strong | Letters (A, ا, א) | Direction is intrinsic |
| Weak | Digits (0–9), currency signs, number separators (+ - , .), combining marks | Direction depends on context |
| Neutral | Spaces, most other punctuation (! ? @), paragraph separators | Direction inherited from surrounding characters |

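The algorithm resolves weak types first (for example, European digits that follow Arabic letters are treated as Arabic numbers, and separators or terminators next to digits join the number) and then neutral types (a space between two RTL words becomes RTL; a space between an RTL and LTR word resolves based on the embedding level). The digits inside a number always keep their left-to-right order.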
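
The neutral-resolution step can be sketched as follows. This is a deliberately reduced model: the input is a list that already contains only `"L"`, `"R"`, and `"N"` (neutral), the function name `resolve_neutrals` is hypothetical, and the paragraph direction stands in for the start and end of the text. The real algorithm also treats numbers specially and handles paired brackets, which this sketch ignores.

```python
def resolve_neutrals(types, paragraph_dir):
    """Resolve each run of neutrals ('N') to 'L' or 'R'.

    A run takes the direction of its neighbors when both sides agree;
    otherwise it falls back to the paragraph direction.
    """
    resolved = list(types)
    n = len(resolved)
    i = 0
    while i < n:
        if resolved[i] != "N":
            i += 1
            continue
        start = i
        while i < n and resolved[i] == "N":
            i += 1
        # Use the paragraph direction at the edges of the text.
        before = resolved[start - 1] if start > 0 else paragraph_dir
        after = resolved[i] if i < n else paragraph_dir
        fill = before if before == after else paragraph_dir
        resolved[start:i] = [fill] * (i - start)
    return resolved


# A space between two RTL words becomes RTL.
print(resolve_neutrals(["R", "N", "R"], "L"))  # ['R', 'R', 'R']
# A space between RTL and LTR words follows the paragraph direction.
print(resolve_neutrals(["R", "N", "L"], "L"))  # ['R', 'L', 'L']
```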

Phase 4: Reorder for Display

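Finally, the algorithm works from the highest embedding level down to the lowest odd level and, at each step, reverses every contiguous run of characters at that level or higher. When only levels 0 and 1 are present, this means level-0 text stays in logical order and each level-1 run is reversed.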
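
A minimal sketch of the reversal step (rule L2 in UAX #9), assuming the embedding levels have already been computed. The function name `reorder` is made up for this example, and the uppercase ASCII letters below merely stand in for RTL letters so the output is easy to read.

```python
def reorder(chars, levels):
    """Return the characters of one line in visual order.

    From the highest level down to the lowest odd level, reverse every
    contiguous run of characters whose level is at least that level.
    """
    chars = list(chars)
    odd_levels = [lv for lv in levels if lv % 2 == 1]
    if not odd_levels:
        return chars  # pure LTR text needs no reordering
    for level in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(chars):
            if levels[i] >= level:
                j = i
                while j < len(chars) and levels[j] >= level:
                    j += 1
                chars[i:j] = reversed(chars[i:j])
                i = j
            else:
                i += 1
    return chars


# One RTL run (level 1) after LTR text (level 0).
print("".join(reorder("abcXYZ", [0, 0, 0, 1, 1, 1])))  # abcZYX
# A number (level 2) nested inside RTL text keeps its digit order.
print("".join(reorder("AB12CD", [1, 1, 2, 2, 1, 1])))  # DC12BA
```

The second call shows why the reversal is applied level by level: the digits are reversed once at level 2 and again at level 1, so they end up in their original order.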

A Worked Example

Logical order in memory (L = Latin letter, A = Arabic letter):

L L L   A A A A   L L
H e y   م ر ح ب   O K

Step 1: First strong character is 'H' (Latin) → paragraph level = 0 (LTR).

Step 2: No explicit embedding controls → default levels apply:

  • H, e, y: level 0 (LTR)
  • Arabic letters: level 1 (RTL)
  • O, K: level 0 (LTR)
  • Spaces: resolved to adjacent levels

Step 3: The space between "Hey" and the Arabic word is neutral and sits between an LTR and an RTL character, so it takes the paragraph direction: level 0. The space between the Arabic word and "OK" resolves the same way.

Step 4: Reverse level-1 runs. The Arabic letters are reversed for display:

Visual: H e y   ب ح ر م   O K

The Arabic word, which was stored as mem-ra-ha-ba in logical order, is laid out on screen from left to right as ba-ha-ra-mem. Read from right to left, that is mem-ra-ha-ba again, which is what Arabic readers expect.
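
The worked example can be replayed in Python to see the code points move. This sketch hard-codes the simplifications of this particular example (an LTR paragraph, no numbers, no explicit controls, spaces at level 0), so it is not a general implementation.

```python
import unicodedata

logical = "Hey \u0645\u0631\u062d\u0628 OK"  # stored (logical) order

# Levels from the example: Arabic letters are 1, everything else is 0.
levels = [1 if unicodedata.bidirectional(ch) == "AL" else 0 for ch in logical]

# With only levels 0 and 1, reordering means reversing each level-1 run.
visual = []
run = []
for ch, level in zip(logical, levels):
    if level == 1:
        run.append(ch)
    else:
        visual.extend(reversed(run))
        run = []
        visual.append(ch)
visual.extend(reversed(run))

# Print the visual order, left to right, one character per line.
for ch in visual:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

In the printed list the four Arabic letters appear in the opposite order from how they are stored, while the Latin letters and spaces stay where they were.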

Bidi Character Types

Every Unicode character has a Bidi_Class property. The most important classes:

| Class | Name | Examples |
| --- | --- | --- |
| L | Left-to-Right | Latin, Greek, Cyrillic, CJK, Thai |
| R | Right-to-Left | Hebrew, N'Ko, Mandaic |
| AL | Arabic Letter | Arabic, Syriac, Thaana |
| EN | European Number | 0 1 2 3 4 5 6 7 8 9 |
| AN | Arabic Number | Arabic-Indic digits: ٠ ١ ٢ ٣ |
| ES | European Separator | + - |
| CS | Common Separator | , . : / |
| ET | European Terminator | $ € % # |
| ON | Other Neutral | ! ? @ { } |
| WS | Whitespace | Space, form feed, line separator (U+2028) |
| B | Paragraph Separator | Line feed, paragraph separator |
| BN | Boundary Neutral | Most control characters, BOM |

You can look up any character's Bidi_Class in the Unicode Character Database or on UnicodeFYI.com character pages.

Explicit Bidi Control Characters

When the automatic algorithm produces wrong results, you can insert invisible control characters to correct the direction:

Marks (Zero-Width)

| Character | Code Point | Purpose |
| --- | --- | --- |
| LRM (Left-to-Right Mark) | U+200E | Forces LTR context at insertion point |
| RLM (Right-to-Left Mark) | U+200F | Forces RTL context at insertion point |
| ALM (Arabic Letter Mark) | U+061C | Forces Arabic Letter context (Unicode 6.3+) |

Isolates (Recommended)

| Character | Code Point | Purpose |
| --- | --- | --- |
| LRI (Left-to-Right Isolate) | U+2066 | Start an LTR-isolated span |
| RLI (Right-to-Left Isolate) | U+2067 | Start an RTL-isolated span |
| FSI (First Strong Isolate) | U+2068 | Auto-detect direction for the span |
| PDI (Pop Directional Isolate) | U+2069 | End the most recent isolate |

Legacy Embeddings and Overrides (Discouraged)

These characters are not formally deprecated in Unicode, but isolates are recommended in their place.

| Character | Code Point | Purpose |
| --- | --- | --- |
| LRE (Left-to-Right Embedding) | U+202A | Start LTR embedding |
| RLE (Right-to-Left Embedding) | U+202B | Start RTL embedding |
| LRO (Left-to-Right Override) | U+202D | Force LTR for all characters |
| RLO (Right-to-Left Override) | U+202E | Force RTL for all characters |
| PDF (Pop Directional Formatting) | U+202C | End the most recent embedding/override |

Always prefer isolates over embeddings. Isolates were introduced in Unicode 6.3 specifically because embeddings have a well-known flaw: they leak directional influence into surrounding text. Isolates create a clean boundary.
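
As a rough illustration, a small Python helper can wrap a span in the matching isolate characters. The name `isolate` and its `direction` parameter are invented for this sketch; the code points are the ones listed in the isolates table.

```python
LRI = "\u2066"  # Left-to-Right Isolate
RLI = "\u2067"  # Right-to-Left Isolate
FSI = "\u2068"  # First Strong Isolate
PDI = "\u2069"  # Pop Directional Isolate


def isolate(text: str, direction: str = "auto") -> str:
    """Wrap text in an isolate so it cannot reorder its surroundings.

    direction is "ltr", "rtl", or "auto" (first strong character decides).
    """
    initiator = {"ltr": LRI, "rtl": RLI, "auto": FSI}[direction]
    return f"{initiator}{text}{PDI}"


name = "\u0645\u062d\u0645\u062f"  # an Arabic user name
message = f"User {isolate(name)} posted 3 comments."
print(ascii(message))  # ascii() makes the invisible controls visible
```

This sketch does not check whether the wrapped text already contains unmatched isolate or PDI characters. Untrusted input would need that check, or the controls stripped, before wrapping.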

Common Bidi Bugs and Fixes

Bug 1: Punctuation Sticks to the Wrong Side

English text: "see مقال (article)"

The parentheses and the word "article" should be LTR, but the bidi algorithm may pull the opening parenthesis toward the Arabic text, displaying something like:

see مقال) article(

Fix: Insert a Left-to-Right Mark (LRM, U+200E) after the Arabic text:

see مقال‎ (article)
         ^ LRM here (invisible)

In HTML: see مقال&lrm; (article)
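
When the string is built in code rather than typed, the mark can be added explicitly. A small sketch, with variable names chosen only for this example:

```python
LRM = "\u200e"  # Left-to-Right Mark, invisible when rendered

arabic_word = "\u0645\u0642\u0627\u0644"
text = f"see {arabic_word}{LRM} (article)"

# ascii() escapes every non-ASCII character, so the LRM becomes visible.
print(ascii(text))
```

Printing with `ascii()` is a convenient way to confirm that the mark really is in the string, since it leaves no visible trace in normal output.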

Bug 2: Numbers in RTL Context

Arabic text with a number:

Logical: سعر ١٢٣ دولار

European digits (0–9) and Arabic-Indic digits (٠–٩) have different bidi classes: European Number (EN) and Arabic Number (AN). Both always display their own digits left-to-right, but the two classes interact differently with neighboring separators, currency signs, and letters. This can cause numbers to appear on the wrong side of adjacent text.

Fix: Use FSI/PDI isolates around the number, or ensure the paragraph direction is set correctly with the dir attribute in HTML.
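
The class difference between the two digit sets is easy to confirm with Python's `unicodedata` module. The snippet below is only a quick check, not a fix in itself:

```python
import unicodedata

european_one = "1"
arabic_indic_one = "\u0661"

for ch in (european_one, arabic_indic_one):
    # Same numeric value, different Bidi_Class.
    print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch), unicodedata.digit(ch))

# int() accepts Unicode decimal digits, which helps when normalizing input.
print(int("\u0661\u0662\u0663"))  # 123
```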

Bug 3: File Paths and URLs in RTL Pages

A URL like https://example.com/path/to/page embedded in an RTL paragraph may render with the slashes in wrong positions:

Displayed: https://example.com/path/to/page  ← might become garbled

Fix: Wrap the URL in an LTR isolate:

<a href="..." dir="ltr">https://example.com/path/to/page</a>

Or use the <bdi> element, which automatically isolates its content:

<bdi>https://example.com/path/to/page</bdi>

Bug 4: User-Generated Content Injection

When displaying a username from user input inside a sentence, the username's direction can corrupt the surrounding layout:

<!-- Dangerous: user-supplied name can flip the sentence -->
<p>Logged in as USERNAME.</p>

If USERNAME contains RTL characters, the period and surrounding text may reorder unexpectedly.

Fix: Always isolate user-generated content:

<p>Logged in as <bdi>USERNAME</bdi>.</p>
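
On the server side, the same fix might look like the sketch below. The function name `render_login_banner` is hypothetical; the point is that the user-supplied value is both HTML-escaped and wrapped in `<bdi>`.

```python
from html import escape


def render_login_banner(username: str) -> str:
    """Return the banner HTML with the user name escaped and isolated."""
    return f"<p>Logged in as <bdi>{escape(username)}</bdi>.</p>"


print(render_login_banner("\u0645\u062d\u0645\u062f"))
print(render_login_banner("<script>"))
```

Escaping and isolation solve different problems: `escape` stops markup injection, while `<bdi>` stops the name's direction from reordering the period and the surrounding words.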

HTML and CSS Bidi Controls

The dir Attribute

The dir attribute on HTML elements is the primary mechanism for controlling text direction:

<html dir="rtl" lang="ar">        <!-- RTL page -->
<p dir="ltr">English paragraph</p> <!-- LTR override within RTL page -->
<p dir="auto">User content</p>     <!-- Auto-detect from first strong char -->

The dir="auto" value is particularly useful for user-generated content — it examines the first strong character and sets the direction accordingly.

The <bdo> and <bdi> Elements

<!-- bdo: Override direction (force all characters to this direction) -->
<bdo dir="rtl">Hello</bdo>   <!-- Renders: olleH -->

<!-- bdi: Isolate content (prevent it from affecting surroundings) -->
<p>User <bdi>محمد</bdi> posted 3 comments.</p>

CSS Properties

/* Set direction on an element */
.rtl-block {
    direction: rtl;
    unicode-bidi: isolate;      /* Recommended: isolate this element */
}

/* unicode-bidi values:
   normal         — default, no special behavior
   embed          — legacy, opens an embedding (avoid)
   isolate        — recommended, creates an isolate
   bidi-override  — forces direction on all content
   isolate-override — isolate + override combined
   plaintext      — determines direction from content, ignoring parent
*/

Important: The CSS direction property should only be used for document structure and layout. For inline text direction, prefer HTML dir attributes and Unicode bidi characters.

Bidi and Security: The Trojan Source Attack

In 2021, researchers at Cambridge described the Trojan Source attack (CVE-2021-42574), which exploits bidi override characters to make source code appear different from what it actually does.

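Consider this Python code, with the invisible control characters written as escapes so they can be seen:

access_level = "user\u202e \u2066# Check if admin\u2069 \u2066"

When the RLO (U+202E), LRI (U+2066), and PDI (U+2069) characters appear literally in a source file, some editors display part of the string as if it were a trailing comment. A reviewer then sees different code from what the interpreter actually runs.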

Defenses:

  1. Lint for bidi controls: Reject source files containing U+202A–U+202E, U+2066–U+2069 in string literals and comments. Compilers like Rust and GCC now warn by default.

  2. Render bidi controls visibly: IDEs like VS Code show bidi control characters as visible glyphs.

  3. Code review: Review diffs in an environment that reveals invisible characters.

  4. GitHub: GitHub now highlights bidi characters in source code views.
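
The linting defense can be sketched as a small scanner. This is an illustrative check rather than a complete linter: the names `BIDI_CONTROLS` and `find_bidi_controls` are made up here, and the scanner flags the controls anywhere in the text instead of parsing string literals and comments.

```python
# U+202A to U+202E (embeddings and overrides) and U+2066 to U+2069 (isolates).
BIDI_CONTROLS = {
    chr(cp) for cp in (*range(0x202A, 0x202F), *range(0x2066, 0x206A))
}


def find_bidi_controls(text):
    """Yield (line, column, code point) for every bidi control in text."""
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                yield line_no, col, f"U+{ord(ch):04X}"


# The escapes below put real control characters into the sample string.
sample = 'access_level = "user\u202e \u2066# Check if admin\u2069 \u2066"'
hits = list(find_bidi_controls(sample))
for line_no, col, code_point in hits:
    print(f"line {line_no}, column {col}: {code_point}")
```

A pre-commit hook or CI job could run a function like this over each changed file and fail when it reports anything.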

Bidi in Programming Languages

Python

import unicodedata

# Check bidi class of a character
unicodedata.bidirectional("A")    # 'L'  (Left-to-Right)
unicodedata.bidirectional("\u0627")  # 'AL' (Arabic Letter) — alef
unicodedata.bidirectional("5")    # 'EN' (European Number)
unicodedata.bidirectional(" ")    # 'WS' (Whitespace)
unicodedata.bidirectional("\u200E")  # 'L'  (LRM acts as strong LTR)
unicodedata.bidirectional("\u200F")  # 'R'  (RLM acts as strong RTL)

JavaScript

// Insert bidi marks programmatically
const LRM = "\u200E";
const RLM = "\u200F";

function isolateBidi(text) {
    // Wrap in First Strong Isolate / Pop Directional Isolate
    return "\u2068" + text + "\u2069";
}

const username = "\u0645\u062D\u0645\u062F";  // محمد
const msg = `Logged in as ${isolateBidi(username)}.`;

Java

// Check bidi class
Character.getDirectionality('A');
// → Character.DIRECTIONALITY_LEFT_TO_RIGHT (0)

Character.getDirectionality('\u0627');
// → Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC (2)

// java.text.Bidi class for paragraph-level analysis
java.text.Bidi bidi = new java.text.Bidi("Hello \u0645\u0631\u062D\u0628\u0627", 0);
bidi.getBaseLevel();     // 0 (LTR paragraph)
bidi.getRunCount();      // 2 (LTR run "Hello ", then the RTL run)
bidi.isLeftToRight();    // false (mixed)
bidi.isMixed();          // true

Testing Bidi Behavior

Use these test strings to verify that your application handles bidirectional text correctly:

| Test Case | Logical String | Expected Visual |
| --- | --- | --- |
| Simple RTL | مرحبا | Right-to-left word |
| Mixed LTR-RTL | Hello مرحبا World | "Hello" LTR, Arabic RTL, "World" LTR |
| Number in RTL | سعر 100 دولار | Number between RTL words |
| Nested directions | English عربي English عربي English | Alternating direction runs |
| Paren in RTL | see (مقال) here | Parentheses should stay with content |
| URL in RTL | انظر https://x.com/ هنا | URL should remain LTR |
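
Visual order can only be verified in a renderer, but a test suite can at least confirm that each fixture contains the directions its label promises. A possible sketch, with the helper `strong_directions` and the `FIXTURES` names invented for this example:

```python
import unicodedata


def strong_directions(text):
    """Return the set of strong directions found in text."""
    found = set()
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            found.add("ltr")
        elif bidi_class in ("R", "AL"):
            found.add("rtl")
    return found


# Strings from the table above, written as escapes to keep the source LTR.
FIXTURES = {
    "simple_rtl": ("\u0645\u0631\u062d\u0628\u0627", {"rtl"}),
    "mixed": ("Hello \u0645\u0631\u062d\u0628\u0627 World", {"ltr", "rtl"}),
    "number_in_rtl": ("\u0633\u0639\u0631 100 \u062f\u0648\u0644\u0627\u0631", {"rtl"}),
    "url_in_rtl": ("\u0627\u0646\u0638\u0631 https://x.com/ \u0647\u0646\u0627", {"ltr", "rtl"}),
}

for name, (text, expected) in FIXTURES.items():
    assert strong_directions(text) == expected, name
print("all fixtures contain the expected directions")
```

A check like this catches fixtures that were accidentally replaced with placeholder LTR text, which would make an RTL test pass without exercising any reordering.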

Best Practices

  1. Always set dir on your <html> element: Declare the base direction explicitly rather than relying on browser defaults.

  2. Use dir="auto" for user-generated content: Let the browser detect the direction from the first strong character. Wrap in <bdi> when the content is inline within a sentence.

  3. Prefer isolates over embeddings: Use LRI/RLI/FSI with PDI (Unicode 6.3+) or the HTML <bdi> element. Avoid the legacy LRE/RLE/LRO/RLO characters.

  4. Test with real RTL text: Use actual Arabic or Hebrew strings in your test suite, not just LTR text with dir="rtl".

  5. Scan source code for bidi controls: Add a pre-commit hook or CI check that flags U+202A–U+202E and U+2066–U+2069 in source files to prevent Trojan Source attacks.

  6. Use CSS unicode-bidi: isolate: When setting direction in CSS, always pair it with unicode-bidi: isolate to prevent directional leakage.

  7. Handle numbers carefully: European digits and Arabic-Indic digits have different bidi classes. Test formatting of prices, dates, and phone numbers in both LTR and RTL contexts.

  8. Mind the invisible characters: Bidi controls are zero-width and invisible. When debugging bidi issues, use a tool that reveals them — such as the UnicodeFYI character analyzer or a hex editor.
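
For quick debugging without an external tool, a few lines of Python can make the controls visible. The mapping and the function name `reveal_bidi` are assumptions made for this sketch; the abbreviations match the tables earlier in this guide.

```python
BIDI_NAMES = {
    "\u061c": "ALM", "\u200e": "LRM", "\u200f": "RLM",
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}


def reveal_bidi(text: str) -> str:
    """Replace each bidi control with a visible tag such as <RLO>."""
    return "".join(
        f"<{BIDI_NAMES[ch]}>" if ch in BIDI_NAMES else ch for ch in text
    )


print(reveal_bidi("user\u202e \u2066# Check if admin\u2069 \u2066"))
# user<RLO> <LRI># Check if admin<PDI> <LRI>
```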
