Unicode Security Overview
Unicode's vast character set introduces a range of security vulnerabilities including homograph attacks, bidirectional text spoofing, normalization exploits, and invisible character injection. This overview explains the major categories of Unicode security risks and provides a framework for defending against them in web applications and APIs.
Unicode's mission to encode every writing system creates an enormous attack surface for security exploits. Characters that look identical but have different code points, invisible formatting marks that alter text flow, and bidirectional override controls that reverse displayed text — all of these are legitimate Unicode features that can be weaponized. This guide provides a comprehensive overview of Unicode security threats, from confusable characters and homograph attacks to bidirectional injection and invisible text manipulation, along with the defense mechanisms defined in Unicode Technical Standard #39 (UTS #39) and practical mitigation strategies.
The Fundamental Security Problem
Unicode encodes over 149,000 characters from more than 160 scripts. Many characters from different scripts look identical or nearly identical when rendered:
| Latin | Cyrillic | Greek | Look |
|---|---|---|---|
| a (U+0061) | а (U+0430) | α (U+03B1) | Similar |
| e (U+0065) | е (U+0435) | ε (U+03B5) | Similar |
| o (U+006F) | о (U+043E) | ο (U+03BF) | Similar |
| p (U+0070) | р (U+0440) | ρ (U+03C1) | Similar |
| c (U+0063) | с (U+0441) | ς (U+03C2) | Similar |
| x (U+0078) | х (U+0445) | χ (U+03C7) | Similar |
An attacker can construct a string that looks identical to a legitimate one but contains characters from a different script. The human eye sees "apple.com", but the URL bar contains "аррӏе.com" (Cyrillic а, р, р, the Cyrillic palochka ӏ (U+04CF), and Cyrillic е).
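One way to see through such a string is to list the code point and character name behind each glyph. The sketch below uses only Python's standard `unicodedata` module; the `describe` helper is a hypothetical name introduced here for illustration.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

# The spoofed label is spelled with escapes so the code points are explicit.
spoofed = "\u0430\u0440\u0440\u04cf\u0435"
for code_point, name in describe(spoofed):
    print(code_point, name)
```

Every name printed for the spoofed label starts with CYRILLIC, while the same call on the genuine "apple" reports only LATIN letters.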
Categories of Unicode Security Threats
1. Confusable Characters (Visual Spoofing)
Confusable characters are distinct code points that render identically or near-identically in common fonts. The Unicode Consortium maintains a comprehensive database of confusables in UTS #39 (Unicode Security Mechanisms).
Types of confusables:
| Type | Example | Risk |
|---|---|---|
| Cross-script | Latin "a" vs. Cyrillic "а" | Domain spoofing, username impersonation |
| Same-script | Latin "l" vs. Latin "I" | Password confusion, code bugs |
| Composed vs. decomposed | "e" + combining acute (U+0065 U+0301) vs. precomposed "é" (U+00E9) | String comparison failures |
| Fullwidth vs. halfwidth | Ａ (U+FF21) vs. A (U+0041) | Filter bypass |
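The fullwidth row can be illustrated with a small filter-bypass sketch. It assumes a hypothetical blocklist (`BLOCKED`) and two hypothetical helper functions; only `unicodedata.normalize` and `str.casefold` come from the standard library. NFKC compatibility folding maps fullwidth letters to their ASCII counterparts, but it does not fold cross-script confusables such as Cyrillic "а".

```python
import unicodedata

BLOCKED = {"admin"}  # hypothetical reserved names

def naive_is_blocked(name):
    # Compares the raw input, so compatibility variants slip through.
    return name.lower() in BLOCKED

def folded_is_blocked(name):
    # Fold compatibility variants (fullwidth, ligatures, ...) before comparing.
    return unicodedata.normalize("NFKC", name).casefold() in BLOCKED

fullwidth_admin = "\uff41\uff44\uff4d\uff49\uff4e"  # "admin" in fullwidth letters
print(naive_is_blocked(fullwidth_admin))   # False
print(folded_is_blocked(fullwidth_admin))  # True
```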
2. Bidirectional Text Attacks
Unicode's Bidirectional Algorithm (BiDi) handles mixed left-to-right and right-to-left text. BiDi control characters (overrides and isolates) can manipulate the displayed order of text:
| Character | Code Point | Effect |
|---|---|---|
| LRO | U+202D | Forces left-to-right display |
| RLO | U+202E | Forces right-to-left display |
| LRI | U+2066 | Left-to-right isolate |
| RLI | U+2067 | Right-to-left isolate |
| PDF | U+202C | Pops directional formatting |
| PDI | U+2069 | Pops directional isolate |
The "Trojan Source" attack (CVE-2021-42574) demonstrated that BiDi override characters embedded in source code can make code appear to do one thing while actually doing another:
# DANGER: This example shows the concept (BiDi chars removed for safety)
# In a Trojan Source attack, RLI/PDI characters can reorder
# code so that what appears as:
# if access_level == "user":
# actually evaluates as:
# if access_level == "admin":
# because BiDi overrides reorder the string literals visually
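Because these controls are invisible when rendered, a reviewer may want to see them spelled out. The following is a minimal sketch, assuming a hypothetical `reveal_bidi` helper that replaces each BiDi control with a visible label; the sample line is invented for illustration and is not taken from the Trojan Source paper.

```python
# Map each BiDi control character to its conventional abbreviation.
BIDI_LABELS = {
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}

def reveal_bidi(text):
    """Replace BiDi controls with visible <NAME> markers; leave other text alone."""
    return "".join(f"<{BIDI_LABELS[ch]}>" if ch in BIDI_LABELS else ch for ch in text)

# Hypothetical line of source code with hidden controls inside a string literal.
line = 'if access_level != "user\u202e \u2066# check\u2069 \u2066":'
print(reveal_bidi(line))
```

Printing the revealed form in diffs or review tools shows the logical character order that the compiler or interpreter actually sees.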
3. Invisible Character Injection
Unicode contains numerous characters that are invisible or zero-width:
| Character | Code Point | Name | Risk |
|---|---|---|---|
| ZWSP | U+200B | Zero Width Space | Invisible word boundaries |
| ZWJ | U+200D | Zero Width Joiner | Alters rendering of adjacent chars |
| ZWNJ | U+200C | Zero Width Non-Joiner | Prevents ligature formation |
| BOM | U+FEFF | Byte Order Mark / ZWNBSP | Invisible at string start |
| Various | U+2000–U+200A | Different-width spaces | Width spoofing |
| SHY | U+00AD | Soft Hyphen | Invisible except at line breaks |
| WJ | U+2060 | Word Joiner | Prevents line break |
These can be used to:
- Create usernames that appear identical but differ in invisible characters
- Bypass keyword filters ("bad" + ZWSP + "word" displays as "badword" but does not match a "badword" filter)
- Alter string comparisons in security-critical code
- Watermark text for tracking copy-paste origins
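The filter bypass in the list above can be reproduced in a few lines. This sketch assumes a hypothetical `find_format_chars` helper that reports characters in the Unicode general category Cf (format), which covers the zero-width characters, the soft hyphen, and the word joiner from the table. The different-width spaces are category Zs and would need a separate check, and some Cf characters (ZWJ, ZWNJ) are legitimate in certain scripts and emoji sequences.

```python
import unicodedata

def find_format_chars(text):
    """Return (index, code point) for every format (Cf) character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]

hidden = "bad\u200bword"            # ZERO WIDTH SPACE between "bad" and "word"
print("badword" in hidden)          # False: a substring filter misses it
print(find_format_chars(hidden))    # [(3, 'U+200B')]
```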
4. Normalization Attacks
Unicode normalization (NFC, NFD, NFKC, NFKD) transforms text into canonical forms, but inconsistent normalization across systems creates vulnerabilities:
import unicodedata

# Same visual appearance, different code points
nfc = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE (precomposed)
nfd = "e\u0301"  # "e" + COMBINING ACUTE ACCENT (decomposed)
print(nfc == nfd)  # False (without normalization)
print(nfc)  # é
print(nfd)  # é (looks identical)
print(unicodedata.normalize("NFC", nfd) == nfc)  # True
If a login system normalizes at registration but not at authentication (or vice versa), an attacker can register a visually identical username.
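One possible mitigation is to route every registration and every lookup through the same canonicalization function. The sketch below is an assumption-laden illustration: `canonical_username` and `UserRegistry` are hypothetical names, and the choice of NFKC plus case folding is one reasonable policy, not the only one.

```python
import unicodedata

def canonical_username(name):
    # Single source of truth: the same function runs at registration and login.
    return unicodedata.normalize("NFKC", name).casefold()

class UserRegistry:
    """Toy in-memory registry keyed by the canonical form of each username."""

    def __init__(self):
        self._users = set()

    def register(self, name):
        key = canonical_username(name)
        if key in self._users:
            return False  # a canonically equal name already exists
        self._users.add(key)
        return True

    def exists(self, name):
        return canonical_username(name) in self._users

registry = UserRegistry()
print(registry.register("caf\u00e9"))   # True: precomposed form is accepted
print(registry.register("cafe\u0301"))  # False: decomposed look-alike is rejected
```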
UTS #39: Unicode Security Mechanisms
The Unicode Consortium publishes UTS #39 (Unicode Security Mechanisms), which provides:
Confusable Detection
A machine-readable mapping of confusable characters (confusables.txt). The algorithm computes a skeleton, a canonical form in which all confusable variants map to the same string:
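# Conceptual skeleton computation (simplified)
import unicodedata

# Tiny illustrative subset; the actual data comes from Unicode confusables.txt
PROTOTYPES = {"\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p"}

def skeleton(text):
    # UTS #39: normalize to NFD, map each character to its prototype, NFD again
    text = unicodedata.normalize("NFD", text)
    mapped = "".join(PROTOTYPES.get(ch, ch) for ch in text)
    return unicodedata.normalize("NFD", mapped)

# The second string contains Cyrillic U+0430 in place of a Latin "a"
print(skeleton("paypal.com") == skeleton("p\u0430ypal.com"))  # True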
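For real use, the prototype mapping would be loaded from confusables.txt rather than written by hand. The sketch below parses lines in that file's semicolon-separated layout (source code point, target code point sequence, type, then a trailing comment). `parse_confusables` is a hypothetical helper, and `SAMPLE` is a hand-written excerpt in the file's layout, not the full data.

```python
def parse_confusables(lines):
    """Build a {source character: prototype string} table from confusables.txt lines."""
    table = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2:
            continue  # not a mapping line (for example, a stray BOM)
        source = chr(int(fields[0], 16))
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        table[source] = target
    return table

SAMPLE = [
    "# excerpt in the confusables.txt layout",
    "0430 ;\t0061 ;\tMA\t# CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A",
    "0440 ;\t0070 ;\tMA\t# CYRILLIC SMALL LETTER ER -> LATIN SMALL LETTER P",
]

table = parse_confusables(SAMPLE)
print(table["\u0430"])  # a
```

When reading the actual file from disk, opening it with `encoding="utf-8-sig"` discards the leading byte order mark.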
Mixed-Script Detection
UTS #39 defines script restriction levels:
| Level | Description | Example |
|---|---|---|
| ASCII-Only | All characters in the ASCII range | "hello" |
| Single Script | All characters from one script + Common | "café" (Latin only) |
| Highly Restrictive | Limited script combinations (e.g., CJK + Latin) | "東京Tower" |
| Moderately Restrictive | Latin plus one other recommended script, except Cyrillic and Greek | Most legitimate text |
| Minimally Restrictive | No script restrictions, but only identifier-allowed characters | Any combination |
| Unrestricted | No checks at all | Not recommended |
Identifier Profiles
UTS #39 also defines which characters are safe for identifiers (usernames, domain labels, variable names):
| Profile | Use Case | Restrictions |
|---|---|---|
| Identifier | Programming languages | Limited to identifier-safe characters |
| Domain Label | Internationalized Domain Names | Strict script mixing rules |
| Username | Social platforms | Confusable + mixed-script checks |
Practical Defenses
For Domain Names (IDN)
# Check for mixed scripts in domain labels
import unicodedata

def get_scripts(text):
    scripts = set()
    for ch in text:
        # Approximate the Unicode Script property from the character name
        name = unicodedata.name(ch, "")
        if "CYRILLIC" in name:
            scripts.add("Cyrillic")
        elif "LATIN" in name:
            scripts.add("Latin")
        elif "GREEK" in name:
            scripts.add("Greek")
        # ... (simplified; use ICU or regex for production)
    return scripts

# Cyrillic "а" (U+0430) followed by Latin "pple". An all-Cyrillic label such as
# "аррӏе" is single-script, passes this check, and needs confusable detection.
domain_label = "\u0430pple"
scripts = get_scripts(domain_label)
if len(scripts) > 1:
    print(f"MIXED SCRIPTS DETECTED: {scripts}")
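When a label looks suspicious, showing its ASCII-compatible (Punycode) form makes the spoof obvious. The sketch below uses Python's built-in `punycode` codec; `to_ace` is a hypothetical helper, and it deliberately skips the mapping and validation steps that full IDNA processing performs, so treat it as a display aid only.

```python
def to_ace(label):
    """Return the xn-- (Punycode) form of a single domain label."""
    if label.isascii():
        return label  # already ASCII, nothing to encode
    return "xn--" + label.encode("punycode").decode("ascii")

spoofed_label = "\u0430\u0440\u0440\u04cf\u0435"  # all-Cyrillic look-alike of "apple"
print(to_ace("apple"))        # apple
print(to_ace(spoofed_label))  # xn--80ak6aa92e
```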
For User Input
import unicodedata
# Strip invisible characters
INVISIBLE_CHARS = {
    0x200B,  # ZERO WIDTH SPACE
    0x200C,  # ZERO WIDTH NON-JOINER
    0x200D,  # ZERO WIDTH JOINER
    0x200E,  # LEFT-TO-RIGHT MARK
    0x200F,  # RIGHT-TO-LEFT MARK
    0x202A,  # LEFT-TO-RIGHT EMBEDDING
    0x202B,  # RIGHT-TO-LEFT EMBEDDING
    0x202C,  # POP DIRECTIONAL FORMATTING
    0x202D,  # LEFT-TO-RIGHT OVERRIDE
    0x202E,  # RIGHT-TO-LEFT OVERRIDE
    0x2060,  # WORD JOINER
    0x2061,  # FUNCTION APPLICATION
    0x2062,  # INVISIBLE TIMES
    0x2063,  # INVISIBLE SEPARATOR
    0x2064,  # INVISIBLE PLUS
    0xFEFF,  # ZERO WIDTH NO-BREAK SPACE
}

def sanitize_input(text):
    return "".join(ch for ch in text if ord(ch) not in INVISIBLE_CHARS)
For Source Code Review
# Detect BiDi control characters (embeddings, overrides, isolates) in source files
BIDI_OVERRIDES = {0x202A, 0x202B, 0x202C, 0x202D, 0x202E,
                  0x2066, 0x2067, 0x2068, 0x2069}

def check_source_file(filepath):
    with open(filepath, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, 1):
            for i, ch in enumerate(line):
                if ord(ch) in BIDI_OVERRIDES:
                    print(f"Line {line_no}, col {i}: BiDi override "
                          f"U+{ord(ch):04X} detected")
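The same per-file check can be extended to a whole repository, for example as a pre-merge step. This is a minimal sketch under stated assumptions: `scan_tree` and its `suffixes` parameter are hypothetical, and it returns findings as tuples instead of printing so that a caller can decide whether to fail a build.

```python
from pathlib import Path

BIDI_CONTROLS = frozenset({0x202A, 0x202B, 0x202C, 0x202D, 0x202E,
                           0x2066, 0x2067, 0x2068, 0x2069})

def scan_tree(root, suffixes=(".py",)):
    """Return (path, line, column, code point) for each BiDi control under root."""
    findings = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        # errors="replace" keeps the scan going on files that are not valid UTF-8.
        text = path.read_text(encoding="utf-8", errors="replace")
        for line_no, line in enumerate(text.splitlines(), 1):
            for col, ch in enumerate(line):
                if ord(ch) in BIDI_CONTROLS:
                    findings.append((str(path), line_no, col, f"U+{ord(ch):04X}"))
    return findings
```

A caller might run `scan_tree(".")` and exit with a non-zero status when the returned list is not empty. Columns are zero-based, matching the per-file check above.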
Key Takeaways
- Unicode's character diversity creates security risks through confusable characters (visual spoofing), BiDi overrides (text reordering), invisible characters (hidden manipulation), and normalization inconsistencies.
- UTS #39 (Unicode Security Mechanisms) provides confusable mappings, mixed-script detection, and identifier restriction levels — the authoritative framework for Unicode security.
- The Trojan Source attack (CVE-2021-42574) proved that BiDi override characters in source code can make malicious code appear benign, affecting most programming languages.
- Defense in depth requires: normalization (consistent NFC/NFKC), confusable detection (skeleton comparison), script restriction (reject mixed scripts where inappropriate), and invisible character stripping.
- Browsers defend against IDN homograph attacks by displaying Punycode (xn--...) when a domain label mixes scripts or otherwise looks like a spoof.
- Always normalize and sanitize Unicode input before security-critical operations like authentication, authorization, and string comparison.