Bidi Text Attack
Exploiting Unicode bidirectional control characters to disguise malicious code or filenames. The 'Trojan Source' attack (CVE-2021-42574) uses bidi overrides to hide backdoors in source code.
What is a Bidirectional Text Attack?
A bidirectional text attack (also called a Bidi attack or Trojan Source attack) exploits the Unicode Bidirectional Algorithm (UBA) to make text appear to have different content than it actually contains. Because developers, reviewers, and administrators read text through rendering engines that apply the Bidi algorithm, they may see a safe-looking string while the underlying bytes contain something entirely different.
The Unicode Bidirectional Algorithm
The Unicode Bidirectional Algorithm (UAX #9) allows a single string to contain mixed left-to-right and right-to-left text, for example an English sentence with an embedded Arabic phrase. Most reordering is derived implicitly from the directional class of each character; where that is not enough, the standard defines invisible control characters that explicitly change the rendering direction. The key control characters include:
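- RLO (U+202E), Right-to-Left Override: forces all following characters to render right-to-left
- LRO (U+202D), Left-to-Right Override: forces all following characters to render left-to-right
- RLE (U+202B) and LRE (U+202A), Right-to-Left and Left-to-Right Embedding: start an embedded run in the given direction without overriding each character's own direction
- PDF (U+202C), Pop Directional Formatting: ends the most recent override or embedding
- RLI (U+2067), LRI (U+2066), FSI (U+2068): isolate variants, which keep the enclosed run from affecting the ordering of the surrounding text and are therefore considered safer than embeddings
- PDI (U+2069), Pop Directional Isolate: ends the most recent isolate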
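The characters listed above can be collected into a small lookup table. The following Python sketch is illustrative only: the names `BIDI_CONTROLS` and `reveal_bidi` are hypothetical helpers, and LRE (U+202A) is included so that the table covers the full U+202A–U+202E range mentioned under Defense Strategies. It prints each character's official Unicode name and replaces the invisible characters with visible tags.

```python
import unicodedata

# Explicit directional formatting characters from UAX #9,
# keyed by code point and mapped to their usual abbreviations.
BIDI_CONTROLS = {
    "\u202A": "LRE",
    "\u202B": "RLE",
    "\u202C": "PDF",
    "\u202D": "LRO",
    "\u202E": "RLO",
    "\u2066": "LRI",
    "\u2067": "RLI",
    "\u2068": "FSI",
    "\u2069": "PDI",
}


def reveal_bidi(text: str) -> str:
    """Replace each Bidi control character with a visible tag such as <RLO>."""
    return "".join(
        f"<{BIDI_CONTROLS[ch]}>" if ch in BIDI_CONTROLS else ch for ch in text
    )


# Print code point, abbreviation, and the official Unicode character name.
for ch, abbr in BIDI_CONTROLS.items():
    print(f"U+{ord(ch):04X}  {abbr:<3}  {unicodedata.name(ch)}")

print(reveal_bidi("abc\u202Edef\u202C"))  # prints: abc<RLO>def<PDF>
```

Making the characters visible in this way is the same idea that editors and code hosting platforms apply when they highlight hidden Unicode.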
Trojan Source — CVE-2021-42574
In November 2021, researchers Nicholas Boucher and Ross Anderson published Trojan Source, demonstrating how Bidi control characters can be used to inject malicious code into source files in a way that is invisible during code review but interpreted differently by the compiler.
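The classic Trojan Source "stretched string" example hides Bidi control characters inside a string literal so that part of the literal is displayed as if it were a trailing comment:
// The attack (conceptual; control characters are shown as escapes, do not paste the raw characters into an editor)
// access_level = "user\u202E \u2066// Check if admin\u2069 \u2066"
// What the compiler sees: one string literal, "user" followed by the control characters and the text "// Check if admin", so the value is not equal to "user"
// What reviewers see (rendered, approximately): access_level = "user" // Check if admin
The RLO and isolate characters cause the rendering engine to reorder part of the line, so the string literal appears to end right after user and the remainder looks like an ordinary comment. In the source bytes the closing quote comes last, so the compiler treats everything between the quotes as string data, and any logic that depends on the value being exactly "user" behaves differently from what the reviewer expects.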
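One way to check what a compiler or interpreter actually receives is to rebuild the literal from escape sequences and inspect it in logical (memory) order. The Python sketch below does that for the string in the example above; `escape_bidi` is a hypothetical helper, and Python stands in here for whatever language the affected source file is written in.

```python
RLO, LRI, PDI = "\u202E", "\u2066", "\u2069"
BIDI = set("\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069")

# Logical contents of the string literal, in the order the bytes appear.
literal = f"user{RLO} {LRI}// Check if admin{PDI} {LRI}"


def escape_bidi(text: str) -> str:
    """Show Bidi control characters as \\uXXXX escapes instead of rendering them."""
    return "".join(f"\\u{ord(c):04X}" if c in BIDI else c for c in text)


# The literal starts with "user" but is not equal to it: the text that a
# renderer may display as a trailing comment is part of the string data.
print(literal == "user")           # prints: False
print(literal.startswith("user"))  # prints: True
print(escape_bidi(literal))
```

Comparing against the escaped form, rather than the rendered form, is what makes the difference between the displayed code and the compiled code visible.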
RLO-Based Filename Disguise
Long before Trojan Source, attackers used U+202E (RLO) to disguise executable file extensions in filenames. A file named:
Invoice_[U+202E]gpj.exe
is displayed by Windows Explorer as Invoice_exe.jpg — the extension appears to be .jpg because RLO reverses the display of the characters after the control character. Users double-clicking the "image" run the .exe file.
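The effect can be approximated in a few lines of Python. This is a deliberately simplified sketch, not an implementation of the Bidi algorithm: it assumes a single RLO in an otherwise left-to-right name and simply reverses everything after it. The function names are hypothetical.

```python
import os

RLO = "\u202E"
BIDI = set("\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069")


def naive_display(name: str) -> str:
    """Approximate the on-screen form of a name that contains one RLO.

    Simplification: the text after the RLO is reversed as a whole; a real
    renderer applies the full Unicode Bidirectional Algorithm.
    """
    head, sep, tail = name.partition(RLO)
    return head + tail[::-1] if sep else name


def real_extension(name: str) -> str:
    """Return the extension the operating system acts on (logical order)."""
    return os.path.splitext(name)[1]


def is_suspicious(name: str) -> bool:
    """Flag any filename that contains a Bidi control character."""
    return any(ch in BIDI for ch in name)


name = "Invoice_" + RLO + "gpj.exe"
print(naive_display(name))   # prints: Invoice_exe.jpg
print(real_extension(name))  # prints: .exe
print(is_suspicious(name))   # prints: True
```

Because legitimate filenames rarely need an explicit override character, rejecting or flagging any name for which `is_suspicious` returns True is a reasonable default for upload handlers and mail filters.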
GitHub and GitLab Mitigations
Following the Trojan Source disclosure, major code hosting platforms introduced countermeasures:
- GitHub added a warning banner on any file view that contains Bidi override or embedding characters, stating "This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below."
- GitLab implemented similar warnings in the diff view and file viewer.
- GCC 12 added the -Wbidi-chars warning for Bidi control characters in string literals and comments, and the LLVM project added a clang-tidy check (misc-misleading-bidirectional).
- CVE-2021-42574 was assigned to the issue, and multiple compilers and editors shipped mitigations in response.
Defense Strategies
- Lint for Bidi control characters: add a pre-commit hook or CI check that rejects files containing U+202A–U+202E or U+2066–U+2069
- Configure editors: VS Code, JetBrains IDEs, and Vim can be configured to render Bidi control characters visibly
- Audit existing code: search codebases for the UTF-8 byte sequences E2 80 AA through E2 80 AE (U+202A–U+202E) and E2 81 A6 through E2 81 A9 (U+2066–U+2069)
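The lint and audit steps above could be sketched as a small Python scanner. This is a minimal illustration, not a hardened tool: `find_bidi` and `main` are hypothetical names, files are assumed to be UTF-8, and any legitimate right-to-left content in a repository would need an allow-list that is not shown here.

```python
import re
from pathlib import Path

# Matches the embedding/override range and the isolate range.
BIDI_RE = re.compile("[\u202A-\u202E\u2066-\u2069]")


def find_bidi(text: str):
    """Return (line, column, code point) for every Bidi control character."""
    hits = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        for match in BIDI_RE.finditer(line):
            hits.append((lineno, match.start() + 1, f"U+{ord(match.group()):04X}"))
    return hits


def main(paths) -> int:
    """Report findings for each path; return 1 if any were found, else 0."""
    failed = False
    for path in paths:
        # Undecodable bytes are replaced rather than raising, so binary
        # files do not abort the scan.
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        for lineno, col, codepoint in find_bidi(text):
            print(f"{path}:{lineno}:{col}: Bidi control character {codepoint}")
            failed = True
    return 1 if failed else 0
```

A pre-commit hook or CI job could call `main` with the list of changed files and pass its return value to `sys.exit`, so that a nonzero status blocks the commit or fails the build.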
Quick Facts
| Property | Value |
|---|---|
| CVE | CVE-2021-42574 (Trojan Source) |
| Researchers | Nicholas Boucher and Ross Anderson, University of Cambridge |
| Publication date | November 2021 |
| Key control chars | U+202E (RLO), U+202D (LRO), U+202B (RLE) |
| Attack surfaces | Source code review, filenames, web content, emails |
| Compiler mitigations | GCC -Wbidi-chars, clang-tidy misc-misleading-bidirectional |
| Platform mitigations | GitHub/GitLab Bidi warning banners |
| Unicode standard | UAX#9 — Unicode Bidirectional Algorithm |