
Unicode Text Direction: LTR vs RTL

Unicode supports both left-to-right and right-to-left text through the bidirectional algorithm and explicit directional control characters, enabling correct display of Arabic, Hebrew, and other RTL scripts alongside LTR text. This guide explains text direction in Unicode, how to use the dir attribute in HTML, and common RTL layout mistakes to avoid.


Most of the world's writing systems flow left-to-right (LTR) — Latin, Cyrillic, Greek, Devanagari, Thai, and many others. But several major scripts flow right-to-left (RTL) — Arabic, Hebrew, Syriac, and Thaana among them. A few East Asian scripts can also be written vertically. Unicode handles all of these directions through the Unicode Bidirectional Algorithm (UBA), a set of rules for determining display order when text contains a mix of LTR and RTL characters. This guide explains how text direction works in Unicode, how to control it in HTML and CSS, and how to avoid the most common pitfalls in mixed-direction text.

Writing System Directions

Unicode assigns a Bidi_Class property to every character, which determines its default direction:

| Bidi Class | Direction / type | Examples |
|---|---|---|
| L | Left-to-right | Latin, Cyrillic, Greek, Devanagari, Thai |
| R | Right-to-left | Hebrew, N'Ko |
| AL | Right-to-left Arabic | Arabic script (Arabic, Urdu, Pashto), Syriac, Thaana |
| EN | European number | 0-9 |
| AN | Arabic number | ٠-٩ |
| CS | Common separator | , . : (context-dependent) |
| NSM | Nonspacing mark | Combining characters (direction inherited) |
| BN | Boundary neutral | Format characters (U+200B, etc.) |
| WS | Whitespace | Spaces, tabs |
| ON | Other neutral | Most symbols and punctuation |

The critical point: numbers are weak characters and most punctuation is neutral, so neither has a fixed direction of its own. Where they end up on screen depends on the surrounding text. This is where most confusion arises.
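To check the Bidi_Class of a given character, Python's standard `unicodedata` module exposes it through `unicodedata.bidirectional()`. The following is a small sketch; the `samples` mapping and the `bidi_class` helper are illustrative names, not part of any API.

```python
import unicodedata

# Illustrative sample characters, one per class discussed above.
samples = {
    "a": "Latin letter",
    "\u05d0": "Hebrew letter alef",
    "\u0627": "Arabic letter alef",
    "7": "ASCII digit",
    "\u0663": "Arabic-Indic digit three",
    ",": "comma",
    " ": "space",
    "!": "exclamation mark",
}


def bidi_class(ch: str) -> str:
    """Return the Bidi_Class abbreviation (L, R, AL, EN, ...) of one character."""
    return unicodedata.bidirectional(ch)


for ch, label in samples.items():
    print(f"U+{ord(ch):04X}  {bidi_class(ch):<3}  {label}")
```

The values come from the Unicode Character Database bundled with the Python interpreter, so very recently added characters may report an empty string on older Python versions.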

The Unicode Bidirectional Algorithm (UBA)

The UBA, specified in Unicode Standard Annex #9 (UAX #9), is the algorithm that determines the visual ordering of characters when a paragraph contains both LTR and RTL text. Browsers, operating systems, word processors, and other Unicode-compliant text rendering systems all implement this algorithm.

How It Works

The algorithm processes text in several phases:

  1. Determine paragraph direction: Based on the first strong character (L, R, or AL), unless the direction is set explicitly (for example by the HTML dir attribute)
  2. Resolve embedding levels: Assign a numerical "embedding level" to each character (even = LTR, odd = RTL)
  3. Resolve weak and neutral types: Determine the direction of numbers, punctuation, and whitespace based on context
  4. Reverse RTL runs: Reorder characters within each directional run for display
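Step 1 can be sketched in a few lines of Python. This is a simplified illustration, not a full implementation of the UBA: the real rule also skips characters inside isolates, which this sketch ignores. The function name `paragraph_direction` and its `default` parameter are assumptions made for the example.

```python
import unicodedata


def paragraph_direction(text: str, default: str = "ltr") -> str:
    """Guess the base direction from the first strong character.

    Simplified: characters inside isolates are not skipped, unlike the full UBA.
    """
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls == "L":
            return "ltr"
        if cls in ("R", "AL"):
            return "rtl"
    # No strong character found (digits, punctuation, whitespace only).
    return default


print(paragraph_direction("Hello world"))
print(paragraph_direction("123 \u05e9\u05dc\u05d5\u05dd"))
```

Leading digits and punctuation are skipped because they are not strong characters, which is why the second call reports an RTL paragraph even though it starts with "123".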

A Concrete Example

Consider this mixed text (paragraph direction is LTR):

Stored text:    "The word שלום means peace."
Code points:    T h e   w o r d   ש ל ו ם   m e a n s   p e a c e .
Bidi classes:   L L L   L L L L   R R R R   L L L L L   L L L L L ON
                (the spaces between words are WS)

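The UBA detects that שלום forms an RTL run within an LTR paragraph. Reading the screen from left to right, the glyphs appear in this order:

Display:        "The word םולש means peace."

The stored order never changes; only the layout does. The Hebrew run is laid out right-to-left, so its first stored letter, ש (U+05E9), sits at the right end of the run, while the overall paragraph remains left-to-right. (The "Display" line spells out that visual order literally, as U+05DD U+05D5 U+05DC U+05E9, so a bidi-aware viewer will reorder it once more.) Moving forward through the text in logical order, the cursor moves right through "The word", jumps to the right side of the Hebrew run and moves left through it, then jumps forward again for "means peace."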
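The reordering in this example can be reproduced with a deliberately tiny model of the algorithm. The sketch below assumes an LTR paragraph, treats everything that is not a strong character (including digits) as neutral, and ignores control characters and mirroring, so it is a teaching toy rather than a UBA implementation. `toy_visual_order` is a hypothetical name.

```python
import unicodedata


def toy_visual_order(text: str) -> str:
    """Toy reordering for an LTR paragraph containing only letters and neutrals."""
    # Classify each character as strong "L", strong "R", or None (treated as neutral).
    kinds = []
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        kinds.append("L" if cls == "L" else "R" if cls in ("R", "AL") else None)

    # Neutrals take the direction of their strong neighbours when both agree,
    # otherwise the paragraph direction (L). Text edges count as L.
    resolved = list(kinds)
    n, i = len(kinds), 0
    while i < n:
        if kinds[i] is not None:
            i += 1
            continue
        j = i
        while j < n and kinds[j] is None:
            j += 1
        before = kinds[i - 1] if i > 0 else "L"
        after = kinds[j] if j < n else "L"
        fill = before if before == after else "L"
        resolved[i:j] = [fill] * (j - i)
        i = j

    # Reverse every maximal run of R characters; keep L runs as stored.
    out, i = [], 0
    while i < n:
        j = i
        while j < n and resolved[j] == resolved[i]:
            j += 1
        run = text[i:j]
        out.append(run[::-1] if resolved[i] == "R" else run)
        i = j
    return "".join(out)


stored = "The word \u05e9\u05dc\u05d5\u05dd means peace."
visual = toy_visual_order(stored)
print([f"U+{ord(c):04X}" for c in visual[9:13]])
```

Printing code points rather than the string itself avoids the terminal applying its own bidi reordering to the result.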

HTML and the dir Attribute

HTML provides the dir attribute to set text direction explicitly:

<!-- Set document direction -->
<html dir="rtl" lang="ar">

<!-- Override for a specific element -->
<p dir="rtl">مرحبا بالعالم</p>

<!-- Auto-detect direction from content -->
<p dir="auto">User-submitted text goes here</p>

The dir="auto" Attribute

For user-generated content where you cannot predict the direction, dir="auto" tells the browser to apply the UBA's "first strong character" heuristic:

<!-- First strong character is Latin → LTR -->
<p dir="auto">Hello world</p>

<!-- First strong character is Arabic → RTL -->
<p dir="auto">مرحبا</p>

This is essential for any input field, comment section, or content area that might receive text in any language.

The <bdi> and <bdo> Elements

HTML provides two elements specifically for bidirectional text control: <bdo>, which dates back to HTML 4, and <bdi>, which was added in HTML5:

<!-- <bdi>: Bidirectional Isolate -->
<!-- Isolates inline content from surrounding directional context -->
<p>User <bdi>أحمد</bdi> posted 3 comments</p>

<!-- <bdo>: Bidirectional Override -->
<!-- Forces a specific direction regardless of content -->
<bdo dir="ltr">Force عربي left-to-right</bdo>

The <bdi> element is critical for preventing spillover — the phenomenon where an RTL name or string causes surrounding punctuation or numbers to reorder incorrectly.
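When the markup is generated on the server, the same pattern can be applied in a template helper. The following is a minimal sketch using only the standard library; the function name `comment_summary` and the sentence wording are assumptions for illustration.

```python
from html import escape


def comment_summary(user_name: str, count: int) -> str:
    """Build a summary line with the user-supplied name isolated in <bdi>."""
    # Escape first: the name is untrusted input and may contain markup.
    safe_name = escape(user_name)
    return f"<p>User <bdi>{safe_name}</bdi> posted {count} comments</p>"


print(comment_summary("\u0623\u062d\u0645\u062f", 3))
```

Because <bdi> determines its direction from its own content by default, the helper does not need to know whether the name is LTR or RTL.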

CSS Direction Properties

CSS provides several properties for controlling text direction:

/* Base direction */
.rtl-container {
  direction: rtl;
  unicode-bidi: embed;
}

/* Isolate an element's content from surroundings */
.isolated {
  unicode-bidi: isolate;
}

/* Override direction completely */
.forced-ltr {
  direction: ltr;
  unicode-bidi: bidi-override;
}

The unicode-bidi Values

| Value | Effect |
|---|---|
| normal | No additional embedding |
| embed | Open an embedding level in the specified direction |
| isolate | Isolate content from surrounding bidi context (like <bdi>) |
| bidi-override | Force all text to the specified direction |
| isolate-override | Combine isolation and override |
| plaintext | Determine direction from content (like dir="auto") |

Writing Mode: Vertical Text

For vertical text layouts (traditional CJK, Mongolian), CSS provides writing-mode:

/* Vertical, right-to-left columns (traditional Chinese/Japanese) */
.vertical-rl {
  writing-mode: vertical-rl;
}

/* Vertical, left-to-right columns (Mongolian) */
.vertical-lr {
  writing-mode: vertical-lr;
}

/* Horizontal, top-to-bottom (default) */
.horizontal {
  writing-mode: horizontal-tb;
}

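When writing-mode is vertical, the writing-mode value itself sets the direction in which lines stack: vertical-rl stacks lines from right to left (traditional CJK) and vertical-lr stacks them from left to right (Mongolian). The direction property still controls only the inline base direction, which in vertical text runs top-to-bottom for ltr and bottom-to-top for rtl.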

Unicode Direction Control Characters

In addition to HTML attributes, Unicode provides invisible formatting characters for fine-grained direction control:

| Character | Code Point | Effect |
|---|---|---|
| LRM (Left-to-Right Mark) | U+200E | Strong LTR marker |
| RLM (Right-to-Left Mark) | U+200F | Strong RTL marker |
| LRE (Left-to-Right Embedding) | U+202A | Start LTR embedding |
| RLE (Right-to-Left Embedding) | U+202B | Start RTL embedding |
| PDF (Pop Directional Formatting) | U+202C | End embedding/override |
| LRO (Left-to-Right Override) | U+202D | Force LTR |
| RLO (Right-to-Left Override) | U+202E | Force RTL |
| LRI (Left-to-Right Isolate) | U+2066 | Start LTR isolate |
| RLI (Right-to-Left Isolate) | U+2067 | Start RTL isolate |
| FSI (First Strong Isolate) | U+2068 | Auto-detect isolate |
| PDI (Pop Directional Isolate) | U+2069 | End isolate |
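Because these characters are invisible, it can help to print their names and classes when debugging a string. This short sketch uses Python's `unicodedata` module; the `CONTROLS` list and `describe` helper are illustrative names.

```python
import unicodedata

# The eleven direction control characters from the table above.
CONTROLS = [
    "\u200e", "\u200f",                                # marks
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # embeddings and overrides
    "\u2066", "\u2067", "\u2068", "\u2069",            # isolates
]


def describe(ch: str) -> str:
    """Return the code point, Unicode name, and Bidi_Class of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} (Bidi_Class {unicodedata.bidirectional(ch)})"


for ch in CONTROLS:
    print(describe(ch))
```

Note that the two marks report the ordinary strong classes L and R, while the embedding, override, and isolate characters each have a Bidi_Class of their own.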

Modern Best Practice: Isolates Over Embeddings

The older embedding characters (LRE, RLE, LRO, RLO, PDF) are now considered legacy. The newer isolate characters (LRI, RLI, FSI, PDI) are preferred because they prevent directional spillover — one of the most common sources of bidi bugs.

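# Insert isolates around a user-generated string.
# FSI is used because the direction of user input is not known in advance;
# LRI or RLI would be appropriate only when the direction is known.
FSI = "\u2068"  # First Strong Isolate: direction taken from the content
PDI = "\u2069"  # Pop Directional Isolate

user_name = "\u0623\u062d\u0645\u062f"  # the Arabic name أحمد
message = f"User {FSI}{user_name}{PDI} posted a comment."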

Common Pitfalls and Solutions

1. Numbers Next to RTL Text

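Numbers are weak characters: the digits keep their own left-to-right order, but the position of the number as a whole is resolved from the surrounding strong characters and the paragraph direction, so it can end up somewhere unexpected: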

Intended:    "מחיר: 50 שקל"
May render:  "50 מחיר: שקל"

Fix: Use LRM/RLM marks or <bdi> to anchor the number.

2. Punctuation at Boundaries

Neutral punctuation that sits between an LTR run and an RTL run takes the paragraph's base direction, so a period or parenthesis at such a boundary may land on the wrong side:

Intended:    "See שלום (peace) for details."
May render:  "See (peace) םולש for details."

Fix: Wrap the foreign-script segment in <bdi> or isolate characters.

3. Source Code and Bidi

The Trojan Source vulnerability (CVE-2021-42574) demonstrated that bidi control characters (overrides such as RLO and isolates such as LRI and PDI) can be embedded in source code to make malicious code look benign. For example, a string that appears to contain a safe URL could actually contain a different one, because the order shown on screen differs from the order the compiler reads.

Fix: Linters and editors should flag invisible bidi control characters in source files. GitHub, GitLab, and most modern code editors now warn about this.
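A basic check of this kind is easy to sketch. The function below reports every embedding, override, and isolate character in a piece of source text; the name `find_bidi_controls` and the choice to leave LRM and RLM out of the flagged set are assumptions of this example, not the behavior of any particular linter.

```python
# Characters to flag, mapped to their abbreviations.
BIDI_CONTROLS = {
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}


def find_bidi_controls(source: str) -> list[tuple[int, int, str]]:
    """Return (line, column, abbreviation) for each flagged character, 1-based."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((lineno, col, BIDI_CONTROLS[ch]))
    return hits


sample = 'x = 1\ny = "a\u202eb"\n'
print(find_bidi_controls(sample))
```

In a real project the function would be run over each file's decoded text, with any hit failing the check unless explicitly allowed.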

4. File Paths and URLs

URLs and file paths are always LTR in protocol terms, but embedding them in RTL text can cause visual reordering of slashes and directory separators.

Fix: Always wrap URLs and file paths in LTR isolates when embedding them in RTL text.
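In plain text, where no markup is available, the wrapping can be done with the isolate characters directly. This is a minimal sketch: `isolate_ltr` is a hypothetical helper, and the URL and the Hebrew word (meaning "see") are placeholder values.

```python
LRI = "\u2066"  # Left-to-Right Isolate
PDI = "\u2069"  # Pop Directional Isolate


def isolate_ltr(text: str) -> str:
    """Wrap text that is known to be LTR (a URL, a file path) in an LTR isolate."""
    return f"{LRI}{text}{PDI}"


url = "https://example.com/docs/page"
sentence = f"\u05e8\u05d0\u05d5 {isolate_ltr(url)}"
print(repr(sentence))
```

LRI is appropriate here because the direction of a URL is known in advance; in HTML, an element with dir="ltr" serves the same purpose without adding invisible characters to the text.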

Key Takeaways

  1. The UBA is automatic — every Unicode-compliant renderer implements it, so LTR and RTL text generally "just works" for simple cases
  2. Use dir="auto" on any element that may contain user-generated text
  3. Use <bdi> to isolate names, numbers, and other inline content from surrounding directional context
  4. Prefer isolates over embeddings — U+2066-2069 are safer than U+202A-202E
  5. Test with real RTL text — rendering engines can disagree on edge cases
  6. Watch for bidi in source code — invisible direction characters are a security risk
  7. Vertical text is handled by CSS writing-mode, not by Unicode direction characters
