
Unicode Text Direction: LTR vs RTL

Unicode supports both left-to-right and right-to-left text through the bidirectional algorithm and explicit directional control characters, enabling correct display of Arabic, Hebrew, and other RTL scripts alongside LTR text. This guide explains text direction in Unicode, how to use the dir attribute in HTML, and common RTL layout mistakes to avoid.

Published 2024-04-08 · Updated 2024-12-02

Most of the world's writing systems flow left-to-right (LTR): Latin, Cyrillic, Greek, Devanagari, Thai, and many others. Several major scripts flow right-to-left (RTL), including Arabic, Hebrew, Syriac, and Thaana. Some East Asian scripts can also be written vertically, which is a layout concern handled by CSS rather than by Unicode direction rules. For horizontal text, Unicode defines the Unicode Bidirectional Algorithm (UBA), a set of rules for determining display order when text contains a mix of LTR and RTL characters. This guide explains how text direction works in Unicode, how to control it in HTML and CSS, and how to avoid the most common pitfalls in mixed-direction text.

Writing System Directions

Unicode assigns a Bidi_Class property to every character, which determines its default direction:

Bidi Class	Direction	Examples
L	Left-to-right	Latin, Cyrillic, Greek, Devanagari, Thai
R	Right-to-left	Hebrew, N'Ko
AL	Right-to-left (Arabic letter)	Arabic, Syriac, Thaana
EN	European number	0-9
AN	Arabic number	٠-٩
CS	Common separator	, . : (context-dependent)
NSM	Nonspacing mark	Combining characters (direction inherited)
BN	Boundary neutral	Format characters (U+200B, etc.)
WS	Whitespace	Spaces, tabs
ON	Other neutral	Most symbols and punctuation

The critical point: numbers are weak types and most punctuation is neutral, so their visual order depends on the surrounding text. This is where most confusion arises.
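To check the class of a particular character, Python's standard unicodedata module exposes the Bidi_Class property. The sketch below is illustrative only: the sample characters and the bidi_class helper name are assumptions made for this example, and the values returned reflect the Unicode version bundled with your Python build.

```python
import unicodedata

# Sample characters chosen for illustration (hypothetical selection):
# Latin A, Hebrew alef, Arabic ain, ASCII digit, Arabic-Indic digit,
# comma, space, exclamation mark.
SAMPLES = ["A", "\u05d0", "\u0639", "7", "\u0667", ",", " ", "!"]


def bidi_class(ch: str) -> str:
    """Return the Bidi_Class abbreviation (such as 'L', 'R', 'AL') for one character."""
    return unicodedata.bidirectional(ch)


for ch in SAMPLES:
    # Print the code point next to its class so invisible characters stay readable.
    print(f"U+{ord(ch):04X}  {bidi_class(ch)}")
```

Printing code points rather than the characters themselves keeps the output readable in terminals that do not render bidirectional text well.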

The Unicode Bidirectional Algorithm (UBA)

The UBA, specified in Unicode Standard Annex #9 (UAX #9), determines the visual ordering of characters when a paragraph contains both LTR and RTL text. Unicode-compliant text rendering systems, including browsers, operating systems, and word processors, implement this algorithm.

How It Works

The algorithm processes text in several phases:

1. Determine paragraph direction: Based on the first strong character (L, R, or AL), unless a higher-level protocol such as the HTML dir attribute sets it explicitly
2. Resolve embedding levels: Assign a numerical "embedding level" to each character (even = LTR, odd = RTL)
3. Resolve weak and neutral types: Determine the direction of numbers, punctuation, and whitespace based on context
4. Reorder for display: Reverse the characters within each RTL run, working from the highest embedding level down
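The first phase listed above can be sketched in a few lines of Python. This is a simplified illustration, not a conformant implementation: the function name paragraph_direction is an assumption for this example, and unlike the full algorithm it does not skip characters that sit inside directional isolates.

```python
import unicodedata


def paragraph_direction(text: str) -> str:
    """Return 'ltr' or 'rtl' based on the first strong character in text.

    Simplification: the full algorithm ignores characters inside isolates
    when searching for the first strong character; this sketch does not.
    """
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls == "L":
            return "ltr"
        if cls in ("R", "AL"):
            return "rtl"
    # No strong character at all (for example, digits only): default to LTR.
    return "ltr"
```

Note that leading digits and punctuation do not decide the direction, so a string such as "123 שלום" is detected as RTL.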

A Concrete Example

Consider this mixed text (paragraph direction is LTR):

Storage text:   "The word שלום means peace."
Characters:     T h e   w o r d   ש ל ו ם   m e a n s   p e a c e .
Bidi classes:   L L L   L L L L   R  R  R  R    L L L L L   L L L L L ON

The UBA detects that שלום forms an RTL run within an LTR paragraph. The display result:

Display:        "The word םולש means peace."

The Hebrew characters are laid out right-to-left, so they appear reversed relative to storage order (the Display line lists the glyphs as they appear from left to right), while the overall paragraph remains left-to-right. A cursor moving in logical order would move right through "The word", then jump to the right side of the Hebrew run and move left through it, then jump forward again for "means peace."
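The reordering in this example can be imitated with a deliberately small toy model. The sketch below is not the UBA: it only reverses contiguous runs of R and AL characters inside an LTR paragraph, and it ignores weak and neutral resolution, nested levels, combining marks, and mirroring. The function name toy_visual_order is an assumption for this illustration.

```python
import unicodedata
from itertools import groupby


def toy_visual_order(logical: str) -> str:
    """Reverse each contiguous run of R/AL characters in an LTR paragraph.

    Toy model only: correct for simple inputs where every RTL run is an
    unbroken sequence of strong RTL letters, as in the example above.
    """
    def is_rtl(ch: str) -> bool:
        return unicodedata.bidirectional(ch) in ("R", "AL")

    pieces = []
    for rtl, run in groupby(logical, key=is_rtl):
        chars = list(run)
        # RTL runs are emitted last-character-first; LTR runs are kept as stored.
        pieces.append("".join(reversed(chars)) if rtl else "".join(chars))
    return "".join(pieces)


# Storage (logical) order: shin, lamed, vav, final mem.
logical = "The word \u05e9\u05dc\u05d5\u05dd means peace."
visual = toy_visual_order(logical)
```

Here visual holds the code points in left-to-right glyph order, with the four Hebrew letters reversed and everything else unchanged. Real text with spaces, digits, or punctuation inside the RTL run needs the full algorithm.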

HTML and the `dir` Attribute

HTML provides the dir attribute to set text direction explicitly:

<!-- Set document direction -->
<html dir="rtl" lang="ar">

<!-- Override for a specific element -->
<p dir="rtl">مرحبا بالعالم</p>

<!-- Auto-detect direction from content -->
<p dir="auto">User-submitted text goes here</p>

The `dir="auto"` Attribute

For user-generated content where you cannot predict the direction, dir="auto" tells the browser to apply the UBA's "first strong character" heuristic:

<!-- First strong character is Latin → LTR -->
<p dir="auto">Hello world</p>

<!-- First strong character is Arabic → RTL -->
<p dir="auto">مرحبا</p>

This is essential for any input field, comment section, or content area that might receive text in any language.

The `<bdi>` and `<bdo>` Elements

HTML provides two elements specifically for bidirectional text control. <bdi> was introduced in HTML5, while <bdo> dates back to HTML 4:

<!-- <bdi>: Bidirectional Isolate -->
<!-- Isolates inline content from surrounding directional context -->
<p>User <bdi>أحمد</bdi> posted 3 comments</p>

<!-- <bdo>: Bidirectional Override -->
<!-- Forces a specific direction regardless of content -->
<bdo dir="ltr">Force عربي left-to-right</bdo>

The <bdi> element is critical for preventing spillover — the phenomenon where an RTL name or string causes surrounding punctuation or numbers to reorder incorrectly.
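When markup like this is generated on the server, the isolation can be applied at the point where the user-supplied value is inserted. The following is a minimal sketch using only Python's standard library; the function name comment_count_html and the sentence template are assumptions for illustration, not part of any framework.

```python
from html import escape


def comment_count_html(user_name: str, count: int) -> str:
    """Build a sentence in which the user-supplied name is wrapped in <bdi>.

    The name is HTML-escaped first, then isolated so that its direction
    cannot affect the placement of the number that follows it.
    """
    return f"<p>User <bdi>{escape(user_name)}</bdi> posted {count} comments</p>"


# Arabic name from the example above, written with escapes.
html_line = comment_count_html("\u0623\u062d\u0645\u062f", 3)
```

Escaping and isolating in the same helper means callers cannot forget one of the two steps.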

CSS Direction Properties

CSS provides several properties for controlling text direction:

/* Base direction */
.rtl-container {
  direction: rtl;
  unicode-bidi: embed;
}

/* Isolate an element's content from surroundings */
.isolated {
  unicode-bidi: isolate;
}

/* Override direction completely */
.forced-ltr {
  direction: ltr;
  unicode-bidi: bidi-override;
}

The `unicode-bidi` Values

Value	Effect
`normal`	No additional embedding
`embed`	Open an embedding level in the specified direction
`isolate`	Isolate content from surrounding bidi context (like `<bdi>`)
`bidi-override`	Force all text to the specified direction
`isolate-override`	Combine isolation and override
`plaintext`	Determine direction from content (like `dir="auto"`)

Writing Mode: Vertical Text

For vertical text layouts (traditional CJK, Mongolian), CSS provides writing-mode:

/* Vertical, right-to-left columns (traditional Chinese/Japanese) */
.vertical-rl {
  writing-mode: vertical-rl;
}

/* Vertical, left-to-right columns (Mongolian) */
.vertical-lr {
  writing-mode: vertical-lr;
}

/* Horizontal, top-to-bottom (default) */
.horizontal {
  writing-mode: horizontal-tb;
}

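When writing-mode is vertical, the writing-mode value itself determines whether lines progress from right to left (vertical-rl, traditional CJK) or left to right (vertical-lr, Mongolian). The direction property still sets the inline base direction, which in vertical modes means whether text within a line runs top-to-bottom (ltr) or bottom-to-top (rtl).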

Unicode Direction Control Characters

In addition to HTML attributes, Unicode provides invisible formatting characters for fine-grained direction control:

Character	Code Point	Effect
LRM (Left-to-Right Mark)	U+200E	Strong LTR marker
RLM (Right-to-Left Mark)	U+200F	Strong RTL marker
LRE (Left-to-Right Embedding)	U+202A	Start LTR embedding
RLE (Right-to-Left Embedding)	U+202B	Start RTL embedding
PDF (Pop Directional Formatting)	U+202C	End embedding/override
LRO (Left-to-Right Override)	U+202D	Force LTR
RLO (Right-to-Left Override)	U+202E	Force RTL
LRI (Left-to-Right Isolate)	U+2066	Start LTR isolate
RLI (Right-to-Left Isolate)	U+2067	Start RTL isolate
FSI (First Strong Isolate)	U+2068	Auto-detect isolate
PDI (Pop Directional Isolate)	U+2069	End isolate

Modern Best Practice: Isolates Over Embeddings

The older embedding characters (LRE, RLE, LRO, RLO, PDF) are now considered legacy. The newer isolate characters (LRI, RLI, FSI, PDI) are preferred because they prevent directional spillover — one of the most common sources of bidi bugs.

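# Insert isolates around a user-generated string
# FSI is used because the direction of user input is not known in advance;
# use LRI or RLI instead when the direction is known.
FSI = "\u2068"  # First Strong Isolate: direction is detected from the content
PDI = "\u2069"  # Pop Directional Isolate: closes the isolate

user_name = "\u0623\u062d\u0645\u062f"  # the Arabic name "أحمد"
message = f"User {FSI}{user_name}{PDI} posted a comment."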

Common Pitfalls and Solutions

1. Numbers Next to RTL Text

Numbers are weak types whose placement depends on the surrounding text, so they can reorder unexpectedly:

Intended:    "מחיר: 50 שקל"
May render:  "50 מחיר: שקל"

Fix: Use LRM/RLM marks or <bdi> to anchor the number.

2. Punctuation at Boundaries

A period or parenthesis between LTR and RTL runs may jump to the wrong side:

Intended:    "See שלום (peace) for details."
May render:  "See (peace) םולש for details."

Fix: Wrap the foreign-script segment in <bdi> or isolate characters.

3. Source Code and Bidi

The Trojan Source vulnerability (CVE-2021-42574) demonstrated that bidi control characters (such as RLO, LRI, and PDI) can be embedded in source code to make malicious code look benign. For example, code that appears to sit inside a comment or string literal may actually be executed, because the order shown on screen differs from the order the compiler reads.

Fix: Linters and editors should flag invisible bidi control characters in source files. GitHub, GitLab, and many modern code editors now warn about this.
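A basic check of this kind is easy to sketch. The scanner below is a minimal illustration, not a replacement for an established linter: the names BIDI_CONTROLS and find_bidi_controls are assumptions for this example, and the character set covers only the embedding, override, and isolate controls listed in the table earlier in this guide.

```python
# Embedding, override, and isolate controls, keyed by character.
BIDI_CONTROLS = {
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}


def find_bidi_controls(source: str) -> list[tuple[int, int, str]]:
    """Return (line, column, name) for each bidi control found, both 1-based."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col, BIDI_CONTROLS[ch]))
    return hits


# Usage: report every hit so a reviewer can inspect the exact position.
for line_no, col, name in find_bidi_controls("x = 1  # \u202e hidden"):
    print(f"line {line_no}, column {col}: {name}")
```

A check like this could run as a pre-commit hook or CI step. Legitimate uses of these characters in string data would need an allowlist.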

4. File Paths and URLs

URLs and file paths are always LTR in protocol terms, but embedding them in RTL text can cause visual reordering of slashes and directory separators.

Fix: Always wrap URLs and file paths in LTR isolates when embedding them in RTL text.
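In plain-text contexts where HTML markup is not available, the wrapping can be done with the isolate characters directly. This is a minimal sketch: the helper name isolate_ltr, the example URL, and the Arabic label are all assumptions made for illustration.

```python
LRI = "\u2066"  # Left-to-Right Isolate
PDI = "\u2069"  # Pop Directional Isolate


def isolate_ltr(text: str) -> str:
    """Wrap text in an LTR isolate so its internal order stays left-to-right."""
    return f"{LRI}{text}{PDI}"


# Hypothetical URL embedded after an Arabic label meaning "the link".
url = "https://example.com/a/b"
sentence = f"\u0627\u0644\u0631\u0627\u0628\u0637: {isolate_ltr(url)}"
```

Because the isolate characters are invisible, it is worth stripping them again before the URL is used for anything other than display, such as opening a connection or comparing strings.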

Key Takeaways

The UBA is automatic — every Unicode-compliant renderer implements it, so LTR and RTL text generally "just works" for simple cases
Use dir="auto" on any element that may contain user-generated text
Use <bdi> to isolate names, numbers, and other inline content from surrounding directional context
Prefer isolates over embeddings — U+2066-2069 are safer than U+202A-202E
Test with real RTL text — rendering engines can disagree on edge cases
Watch for bidi in source code — invisible direction characters are a security risk
Vertical text is handled by CSS writing-mode, not by Unicode direction characters
