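What is the Unicode Bidirectional Algorithm (UBA)?

An algorithm that determines the display order of mixed-direction text (for example, English with Arabic) from each character's bidirectional category and from explicit directional overrides.

What is RTL (Right-to-Left)?

A text direction in which characters flow from right to left. It is used by Arabic, Hebrew, Thaana, and other scripts, and correct display requires the bidirectional algorithm.

Security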

Bidi Text Attack

Exploiting Unicode bidirectional control characters to disguise malicious code or filenames. The 'Trojan Source' attack (CVE-2021-42574) uses bidi overrides to hide backdoors in source code.

What is a Bidirectional Text Attack?

A bidirectional text attack (also called a Bidi attack or Trojan Source attack) exploits the Unicode Bidirectional Algorithm (UBA) to make text appear to have different content than it actually contains. Because developers, reviewers, and administrators read text through rendering engines that apply the Bidi algorithm, they may see a safe-looking string while the underlying bytes contain something entirely different.

The Unicode Bidirectional Algorithm

The Unicode Bidirectional Algorithm (UAX #9) allows a single string to contain mixed left-to-right and right-to-left text, for example an English sentence with an embedded Arabic phrase. Most characters are ordered implicitly from their bidirectional class, but the standard also defines invisible control characters that explicitly change the rendering direction. The key control characters include:

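LRE (U+202A): Left-to-Right Embedding
RLE (U+202B): Right-to-Left Embedding
PDF (U+202C): Pop Directional Formatting (ends the most recent override or embedding)
LRO (U+202D): Left-to-Right Override, forces following characters to render left-to-right until a PDF or the end of the paragraph
RLO (U+202E): Right-to-Left Override, forces following characters to render right-to-left until a PDF or the end of the paragraph
LRI (U+2066), RLI (U+2067), FSI (U+2068): isolate variants, which keep the enclosed text from affecting the ordering of the text around it
PDI (U+2069): Pop Directional Isolate (ends the most recent isolate)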
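
The control characters listed above can be inspected with Python's standard unicodedata module. This is a minimal sketch; the constant BIDI_CONTROLS and the function describe_bidi_controls are names chosen for illustration, not part of any standard API.

```python
import unicodedata

# Explicit directional formatting characters defined by UAX #9
# (embeddings, overrides, isolates, and their terminators).
BIDI_CONTROLS = "\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069"

def describe_bidi_controls(chars: str = BIDI_CONTROLS) -> list[tuple[str, str, str]]:
    """Return (code point, bidi class, Unicode name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.bidirectional(ch), unicodedata.name(ch))
        for ch in chars
    ]

# Print one row per control character.
for code_point, bidi_class, name in describe_bidi_controls():
    print(f"{code_point}  {bidi_class:<3}  {name}")
```

The bidi class reported for each of these characters is its own abbreviation (for example RLO for U+202E), which makes the class a convenient way to recognize them.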

Trojan Source — CVE-2021-42574

In November 2021, researchers Nicholas Boucher and Ross Anderson published Trojan Source, demonstrating how Bidi control characters can be used to inject malicious code into source files in a way that is invisible during code review but interpreted differently by the compiler.

The classic Trojan Source example, the "stretched string" variant, hides comment text inside a string literal so that the literal appears to end earlier than it really does:

// The attack (conceptual; control characters are shown as escapes, do not paste the raw characters into an editor)
// access_level = "user\u202E \u2066// Check if admin\u2069 \u2066"

// What the compiler sees:  access_level = "user<RLO> <LRI>// Check if admin<PDI> <LRI>"  (one string literal, not equal to "user")
// What reviewers see (rendered):  access_level = "user" // Check if admin

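The RLO and isolate characters cause the rendering engine to reorder part of the line, so the string literal appears to close right after "user" and the comment appears to sit outside it. In the source bytes, the comment text and the control characters are still inside the literal, so the value is not the plain string "user" that the reviewer believes they are reading.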
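
The gap between the stored and the displayed string can be made visible by escaping the control characters. The sketch below reuses the literal from the example above; escape_bidi and strip_bidi are illustrative helper names, not functions from an existing library.

```python
BIDI_CONTROLS = set("\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069")

# Logical (stored) content of the string literal from the example above.
stored = "user\u202E \u2066// Check if admin\u2069 \u2066"

def escape_bidi(text: str) -> str:
    """Replace each bidi control character with a visible \\uXXXX escape."""
    return "".join(f"\\u{ord(ch):04X}" if ch in BIDI_CONTROLS else ch for ch in text)

def strip_bidi(text: str) -> str:
    """Remove bidi control characters, leaving the remaining stored text."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)

print(stored == "user")       # False: the literal holds more than "user"
print(escape_bidi(stored))    # control characters shown as escapes
print(repr(strip_bidi(stored)))  # comment text is part of the string
```

Escaping rather than stripping is usually the better choice for review tooling, because it shows where the controls sit without changing what the reviewer is asked to approve.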

RLO-Based Filename Disguise

Long before Trojan Source, attackers used U+202E (RLO) to disguise executable file extensions in filenames. A file named:

Invoice_[U+202E]gpj.exe

is displayed by Windows Explorer as Invoice_exe.jpg — the extension appears to be .jpg because RLO reverses the display of the characters after the control character. Users double-clicking the "image" run the .exe file.
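
A program can see through the disguise because it works on the stored character order, not the rendered one. The following is a minimal sketch; inspect_filename is a hypothetical helper, and a real upload or mail filter would need further checks.

```python
import os

BIDI_CONTROLS = set("\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069")

def inspect_filename(name: str) -> dict:
    """Report the real extension and whether the name contains bidi controls."""
    has_bidi = any(ch in BIDI_CONTROLS for ch in name)
    # splitext operates on the stored order, so it returns the true extension.
    _, extension = os.path.splitext(name)
    # Make each control character visible as [U+XXXX].
    escaped = "".join(
        f"[U+{ord(ch):04X}]" if ch in BIDI_CONTROLS else ch for ch in name
    )
    return {"has_bidi": has_bidi, "extension": extension, "escaped": escaped}

report = inspect_filename("Invoice_\u202Egpj.exe")
print(report)
```

For the filename above, the reported extension is .exe even though the rendered name ends in .jpg.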

GitHub and GitLab Mitigations

Following the Trojan Source disclosure, major code hosting platforms introduced countermeasures:

GitHub added a warning banner on any file view that contains Bidi override or embedding characters, stating "This file contains bidirectional Unicode text that may be interpreted differently than what appears below."
GitLab implemented similar warnings in the diff view and file viewer.
GCC added the `-Wbidi-chars` warning and Clang added a comparable warning for Bidi control characters in string literals and comments.
CVE-2021-42574 was assigned to the issue, and multiple compilers and editors shipped mitigations for it.

Defense Strategies

Lint for Bidi control characters — add a pre-commit hook or CI check that rejects files containing U+202A–U+202E, U+2066–U+2069
Configure editors — VS Code, JetBrains IDEs, and Vim can be configured to render Bidi control characters visibly
Audit existing code: search codebases for the hex byte sequences E2 80 AA through E2 80 AE (UTF-8 for U+202A–U+202E) and E2 81 A6 through E2 81 A9 (UTF-8 for U+2066–U+2069)
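
The lint and audit steps above could be sketched as a small scanner. This is an illustration under stated assumptions, not a drop-in tool: find_bidi_controls and scan_paths are hypothetical names, and files are assumed to be UTF-8.

```python
import unicodedata
from pathlib import Path

BIDI_CONTROLS = frozenset("\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069")

def find_bidi_controls(text: str) -> list[tuple[int, int, str]]:
    """Return (line, column, character name) for every bidi control in text."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col_no, unicodedata.name(ch)))
    return hits

def scan_paths(paths: list[str]) -> int:
    """Print findings per file; return 1 if any file contains bidi controls."""
    exit_code = 0
    for path in paths:
        # Assumption: sources are UTF-8; undecodable bytes are replaced.
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        for line_no, col_no, name in find_bidi_controls(text):
            print(f"{path}:{line_no}:{col_no}: {name}")
            exit_code = 1
    return exit_code

# Small in-memory demonstration.
sample = 'a = "user\u202E"\nb = 1\n'
print(find_bidi_controls(sample))
```

In a pre-commit hook or CI job, the return value of scan_paths would be passed to sys.exit so that a finding fails the check. Projects that legitimately contain right-to-left text may need an allowlist of files.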

Quick Facts

Property	Value
CVE	CVE-2021-42574 (Trojan Source)
Researchers	Nicholas Boucher and Ross Anderson, Cambridge
Publication date	November 2021
Key control chars	U+202E (RLO), U+202D (LRO), U+202B (RLE)
Attack surfaces	Source code review, filenames, web content, emails
Compiler mitigations	gcc `-Wbidi-chars`, clang warning added
Platform mitigations	GitHub/GitLab Bidi warning banners
Unicode standard	UAX#9 — Unicode Bidirectional Algorithm

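More Security terms

Bidi Override Attack

An attack that uses Unicode bidirectional control characters (U+202A–U+202E, U+2066–U+2069) to disguise malicious filenames or code. For example, 'readme[U+202E]fdp.exe' is displayed as 'readmeexe.pdf'.

IDN Homograph Attack

An attack that impersonates a legitimate site by using visually similar Unicode characters in a domain name. For example, аpple.com (with a Cyrillic а) looks like apple.com. Browsers defend against this with rules for when to display the Punycode form.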
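
The Punycode form of a spoofed name can be produced with Python's built-in idna codec, which implements the older IDNA 2003 rules. This sketch only shows that the two names differ once encoded; to_ascii is an illustrative helper name.

```python
spoofed = "\u0430pple.com"   # first letter is CYRILLIC SMALL LETTER A
genuine = "apple.com"        # all Latin letters

def to_ascii(domain: str) -> str:
    """Encode a domain name with Python's built-in IDNA (2003) codec."""
    return domain.encode("idna").decode("ascii")

print(spoofed == genuine)    # False, although the two render almost identically
print(to_ascii(genuine))     # pure ASCII names pass through unchanged
print(to_ascii(spoofed))     # the non-ASCII label gets an "xn--" prefix
```

Seeing the xn-- form in the address bar is what tells a user that the domain is not the ASCII name it resembles.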

Normalization Attack

Exploiting Unicode normalization to bypass security filters. Input validated before normalization may …
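
One way the general idea can play out, as a minimal sketch: a hypothetical filter (naive_filter, invented here for illustration) rejects ASCII angle brackets, but the input is NFKC-normalized only after the check.

```python
import unicodedata

def naive_filter(text: str) -> bool:
    """Hypothetical validator: accept input with no ASCII angle brackets."""
    return "<" not in text and ">" not in text

# FULLWIDTH LESS-THAN SIGN and FULLWIDTH GREATER-THAN SIGN around "script".
payload = "\uFF1Cscript\uFF1E"

accepted = naive_filter(payload)                      # check runs first
normalized = unicodedata.normalize("NFKC", payload)   # normalization runs later

print(accepted)     # True: the filter saw no ASCII brackets
print(normalized)   # <script>
```

Normalizing first and validating the normalized form avoids this particular ordering problem.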

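Unicode Spoofing

Using Unicode features to deceive users: homoglyphs for fake domains, Bidi overrides for fake file extensions, and invisible characters for hidden text.

Zero Width Joiner (ZWJ)

U+200D. Requests that adjacent characters be joined. It is essential to emoji sequences (👩 + ZWJ + 💻 = 👩‍💻), and in Indic scripts it requests conjunct (ligature) forms. It can also be used to hide text boundaries.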
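
The emoji sequence above is three code points that a renderer may draw as one glyph, and the same invisible character can make two strings that look alike compare unequal. A short sketch using only the standard library:

```python
import unicodedata

woman, zwj, laptop = "\U0001F469", "\u200D", "\U0001F4BB"
sequence = woman + zwj + laptop

print(len(sequence))                              # 3 code points
print([unicodedata.name(ch) for ch in sequence])  # names of each code point

# An invisible ZWJ inside ordinary text changes the string without
# changing how it usually looks on screen.
hidden = "pay\u200Dpal"
print(hidden == "paypal")                         # False
print(hidden.replace("\u200D", "") == "paypal")   # True
```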

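Zero Width Non-Joiner (ZWNJ)

U+200C. Prevents adjacent characters from joining. It is required in Persian and Arabic for correct letter forms, and it is also used in Devanagari to prevent conjuncts.

Homoglyph

Characters from different scripts that look identical or very similar, for example Latin 'a' and Cyrillic 'а'. They are used in phishing, spoofing, and social engineering attacks.

Confusable

Unicode's official term for visually confusable character pairs, as defined in confusables.txt (published with the UTS #39 security data). The concept is broader than homoglyphs and includes characters that are merely similar.

Mixed-Script Detection

Identifies text that mixes characters from different scripts (for example, Latin with Cyrillic). It is a primary defense against homoglyph attacks, and browsers use it to trigger Punycode display.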
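
Python's standard library does not expose the Unicode Script property, so the sketch below approximates it from the first word of each character's Unicode name. That shortcut is an assumption made for illustration; production checks are normally based on the Unicode script data and the UTS #39 rules. The function names are hypothetical.

```python
import unicodedata

def scripts_used(text: str) -> set[str]:
    """Approximate each letter's script from the first word of its Unicode name."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            # e.g. "LATIN SMALL LETTER A" -> "LATIN"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def is_mixed_script(text: str) -> bool:
    """Return True when letters from more than one (approximate) script appear."""
    return len(scripts_used(text)) > 1

print(scripts_used("apple"))            # only Latin letters
print(scripts_used("\u0430pple"))       # Cyrillic а followed by Latin letters
print(is_mixed_script("\u0430pple"))    # True
```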
