
Unicode Security Overview

Unicode's vast character set introduces a range of security vulnerabilities including homograph attacks, bidirectional text spoofing, normalization exploits, and invisible character injection. This overview explains the major categories of Unicode security risks and provides a framework for defending against them in web applications and APIs.


Unicode's mission to encode every writing system creates an enormous attack surface for security exploits. Characters that look identical but have different code points, invisible formatting marks that alter text flow, and bidirectional override controls that reverse displayed text — all of these are legitimate Unicode features that can be weaponized. This guide provides a comprehensive overview of Unicode security threats, from confusable characters and homograph attacks to bidirectional injection and invisible text manipulation, along with the defense mechanisms defined in Unicode Technical Standard #39 (UTS #39) and practical mitigation strategies.

The Fundamental Security Problem

Unicode encodes over 149,000 characters from more than 160 scripts. Many characters in different scripts look identical or nearly identical when rendered:

Latin        Cyrillic     Greek        Look
a (U+0061)   а (U+0430)   α (U+03B1)   Similar
e (U+0065)   е (U+0435)   ε (U+03B5)   Similar
o (U+006F)   о (U+043E)   ο (U+03BF)   Similar
p (U+0070)   р (U+0440)   ρ (U+03C1)   Similar
c (U+0063)   с (U+0441)   ς (U+03C2)   Similar
x (U+0078)   х (U+0445)   χ (U+03C7)   Similar

An attacker can construct a string that looks identical to a legitimate string but contains characters from a different script. The human eye sees "apple.com", but the URL bar contains "аррӏе.com", written entirely in Cyrillic: а (U+0430), р (U+0440), р (U+0440), the palochka ӏ (U+04CF), and е (U+0435).
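One way to see the difference is to print the code point and name of every character. The sketch below uses only the Python standard library; the helper name `describe` is an illustrative choice, and the spoofed label is written with escapes so that its contents are unambiguous.

```python
import unicodedata

def describe(text):
    """Return (char, code point, Unicode name) for each character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
            for ch in text]

genuine = "apple"
# Spoofed label from the example above, spelled out as escapes
spoofed = "\u0430\u0440\u0440\u04cf\u0435"

for ch, code_point, name in describe(spoofed):
    print(code_point, name)

print(genuine == spoofed)  # False

# ASCII-compatible (Punycode) form of the spoofed label
print("xn--" + spoofed.encode("punycode").decode("ascii"))
```

Every name printed for the spoofed label starts with CYRILLIC, even though the rendered text reads as Latin.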

Categories of Unicode Security Threats

1. Confusable Characters (Visual Spoofing)

Confusable characters are distinct code points that render identically or near-identically in common fonts. The Unicode Consortium maintains a comprehensive database of confusables in UTS #39 (Unicode Security Mechanisms).

Types of confusables:

Type                      Example                                         Risk
Cross-script              Latin "a" vs. Cyrillic "а"                      Domain spoofing, username impersonation
Same-script               Latin "l" (lowercase L) vs. Latin "I" (uppercase i)   Password confusion, code bugs
Composed vs. decomposed   "e" + combining acute (U+0301) vs. precomposed "é" (U+00E9)   String comparison failures
Fullwidth vs. halfwidth   Ａ (U+FF21) vs. A (U+0041)                       Filter bypass

2. Bidirectional Text Attacks

Unicode's Bidirectional Algorithm (BiDi) handles text that mixes left-to-right and right-to-left scripts. Directional control characters (overrides, isolates, and the characters that terminate them) can change the displayed order of text:

Character   Code Point   Effect
LRO         U+202D       Forces left-to-right display
RLO         U+202E       Forces right-to-left display
LRI         U+2066       Left-to-right isolate
RLI         U+2067       Right-to-left isolate
PDF         U+202C       Pops directional formatting
PDI         U+2069       Pops directional isolate

The "Trojan Source" attack (CVE-2021-42574) demonstrated that BiDi override characters embedded in source code can make code appear to do one thing while actually doing another:

# DANGER: This example shows the concept (BiDi chars removed for safety)
# In a Trojan Source attack, BiDi controls such as RLO, LRI, and PDI
# are hidden inside a comment or string literal so that a reviewer sees:
#   if access_level == "user":
# while the interpreter, which reads the characters in logical order,
# compares against a different string literal or treats part of the
# line as a comment. The rendered code and the executed code diverge.

3. Invisible Character Injection

Unicode contains numerous characters that are invisible or zero-width:

Character   Code Point      Name                       Risk
ZWSP        U+200B          Zero Width Space           Invisible word boundaries
ZWJ         U+200D          Zero Width Joiner          Alters rendering of adjacent chars
ZWNJ        U+200C          Zero Width Non-Joiner      Prevents ligature formation
BOM         U+FEFF          Byte Order Mark / ZWNBSP   Invisible at string start
Various     U+2000–U+200A   Different-width spaces     Width spoofing
SHY         U+00AD          Soft Hyphen                Invisible except at line breaks
WJ          U+2060          Word Joiner                Prevents line break

These can be used to:
  • Create usernames that appear identical but differ in invisible characters
  • Bypass keyword filters (a ZWSP inside "bad-word", as in "b" + U+200B + "ad-word", no longer matches a filter for "bad-word")
  • Alter string comparisons in security-critical code
  • Watermark text to track where copied text originated
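The filter bypass can be reproduced in a few lines. This is a minimal sketch, not a production filter: the blocklist and both function names are hypothetical, and the cleanup step assumes that dropping every character in general category Cf (format) is acceptable for the input being checked.

```python
import unicodedata

BLOCKED = {"bad-word"}  # hypothetical blocklist

def naive_filter(text):
    """Return True if the text contains a blocked word (no cleanup)."""
    return any(word in text for word in BLOCKED)

def strip_format_chars(text):
    """Drop characters in general category Cf (format), e.g. ZWSP, ZWJ."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

evasive = "b\u200bad-word"  # ZERO WIDTH SPACE after the "b"

print(naive_filter(evasive))                      # False
print(naive_filter(strip_format_chars(evasive)))  # True
```

Removing all Cf characters is a blunt tool: ZWJ and ZWNJ carry meaning in some scripts and in emoji sequences, so this kind of stripping is best applied to a copy used for matching rather than to the stored text.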

4. Normalization Attacks

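Unicode normalization (NFC, NFD, NFKC, NFKD) transforms text into a standard form: NFC and NFD apply canonical equivalence, while NFKC and NFKD also fold compatibility variants such as fullwidth letters. Inconsistent normalization across systems creates vulnerabilities: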

import unicodedata

# Same visual appearance, different code points
nfc = "\u00e9"         # LATIN SMALL LETTER E WITH ACUTE (precomposed)
nfd = "e\u0301"        # "e" + COMBINING ACUTE ACCENT (decomposed)

print(nfc == nfd)      # False (without normalization)
print(nfc)             # é
print(nfd)             # é (looks identical)

print(unicodedata.normalize("NFC", nfd) == nfc)  # True

If a login system normalizes at registration but not at authentication (or vice versa), an attacker can register a username that looks identical to an existing one yet is treated as a distinct account.
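A common way to avoid that mismatch is to route every code path through one canonicalization function. The sketch below is an assumption about how such a system might look, not a description of any particular product: `canonical_username`, `register`, `lookup`, and the in-memory `accounts` store are all hypothetical, and NFKC followed by case folding is one reasonable choice among several.

```python
import unicodedata

def canonical_username(name):
    """NFKC-normalize, then case-fold; used on every code path."""
    return unicodedata.normalize("NFKC", name).casefold()

accounts = {}  # hypothetical store: canonical name -> display name

def register(name):
    key = canonical_username(name)
    if key in accounts:
        raise ValueError(f"username already taken: {name!r}")
    accounts[key] = name

def lookup(name):
    return accounts.get(canonical_username(name))

register("Ren\u00e9")        # precomposed é
print(lookup("Rene\u0301"))  # decomposed spelling resolves to the same account
```

Because registration and lookup share the same key function, a decomposed or fullwidth spelling of an existing name collides with it instead of creating a look-alike account.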

UTS #39: Unicode Security Mechanisms

The Unicode Consortium publishes UTS #39 (Unicode Security Mechanisms), which provides:

Confusable Detection

A machine-readable mapping of confusable characters (confusables.txt). The algorithm computes a skeleton — a canonical form where all confusable variants are mapped to the same string:

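# Conceptual skeleton computation (simplified)
# The actual data comes from Unicode confusables.txt; this small
# table is an illustrative subset, not the real data file.
import unicodedata

PROTOTYPES = {"\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p"}

def skeleton(text):
    # Normalize to NFD, map each character to its confusable
    # prototype, then normalize to NFD again
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)

# The second string contains Cyrillic U+0430 in place of its first "a"
print(skeleton("paypal.com") == skeleton("p\u0430ypal.com"))  # True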
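In practice the prototype table is loaded from confusables.txt rather than typed by hand. The following sketch parses lines in that file's layout (source ; target ; type, with # comments). The three sample lines are written out for illustration, and the function names `parse_confusables` and `skeleton_from_table` are hypothetical; a real loader would read the published file, ideally opened with the `utf-8-sig` encoding to skip its byte order mark.

```python
import unicodedata

# Sample lines in the confusables.txt layout: source ; target ; type # comment
SAMPLE = """\
# illustrative excerpt, not the full data file
0430 ;\t0061 ;\tMA\t# CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
0435 ;\t0065 ;\tMA\t# CYRILLIC SMALL LETTER IE -> LATIN SMALL LETTER E
043E ;\t006F ;\tMA\t# CYRILLIC SMALL LETTER O -> LATIN SMALL LETTER O
"""

def parse_confusables(lines):
    """Build a {source char: prototype string} table from data lines."""
    table = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop comments
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        source = chr(int(fields[0], 16))
        # The target may be a sequence of several code points
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        table[source] = target
    return table

def skeleton_from_table(text, table):
    """NFD, map each character to its prototype, NFD again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(table.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)

table = parse_confusables(SAMPLE.splitlines())
spoof = "p\u0430yp\u0430l"  # Cyrillic U+0430 in place of each "a"
print(skeleton_from_table(spoof, table) == skeleton_from_table("paypal", table))  # True
```

Two identifiers are treated as confusable when their skeletons are equal, so a registration system can store the skeleton alongside each name and reject new names whose skeleton already exists.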

Mixed-Script Detection

UTS #39 defines script restriction levels:

Level                    Description                                                               Example
ASCII-Only               All characters in the ASCII range                                         "user_01"
Single Script            All characters from one script + Common                                   "hello" (Latin only)
Highly Restrictive       Limited script combinations (e.g., Latin with Han, Hiragana, Katakana)    "東京Tower"
Moderately Restrictive   Latin plus one other Recommended script, except Cyrillic or Greek         Latin + Arabic
Minimally Restrictive    Any mixture of scripts, limited to identifier-profile characters          Any combination
Unrestricted             No checks at all, any code point allowed                                  Not recommended

Identifier Profiles

UTS #39 also defines a General Security Profile that identifies which characters are recommended for identifiers (usernames, domain labels, variable names). Typical applications of such a profile:

Profile        Use Case                         Restrictions
Identifier     Programming languages            Limited to identifier-safe characters
Domain Label   Internationalized Domain Names   Strict script mixing rules
Username       Social platforms                 Confusable + mixed-script checks

Practical Defenses

For Domain Names (IDN)

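# Check for mixed scripts in domain labels
import unicodedata

def get_scripts(text):
    scripts = set()
    for ch in text:
        # Approximate the Unicode Script property from the character
        # name; the standard library does not expose Script directly.
        name = unicodedata.name(ch, "")
        if "CYRILLIC" in name:
            scripts.add("Cyrillic")
        elif "LATIN" in name:
            scripts.add("Latin")
        elif "GREEK" in name:
            scripts.add("Greek")
        # ... (simplified: use ICU or regex for production)
    return scripts

# Latin letters with Cyrillic "а" (U+0430) in place of each "a"
domain_label = "p\u0430yp\u0430l"
scripts = get_scripts(domain_label)
if len(scripts) > 1:
    print(f"MIXED SCRIPTS DETECTED: {scripts}")

# An all-Cyrillic label such as "аррӏе" contains a single script, so this
# check does not flag it; catching it requires a confusable (skeleton)
# comparison instead.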

For User Input

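# Strip invisible characters
# Caution: ZWJ and ZWNJ are meaningful in some scripts and in emoji
# sequences, so apply this only where such text is not expected.
INVISIBLE_CHARS = {
    0x00AD,  # SOFT HYPHEN
    0x200B,  # ZERO WIDTH SPACE
    0x200C,  # ZERO WIDTH NON-JOINER
    0x200D,  # ZERO WIDTH JOINER
    0x200E,  # LEFT-TO-RIGHT MARK
    0x200F,  # RIGHT-TO-LEFT MARK
    0x202A,  # LEFT-TO-RIGHT EMBEDDING
    0x202B,  # RIGHT-TO-LEFT EMBEDDING
    0x202C,  # POP DIRECTIONAL FORMATTING
    0x202D,  # LEFT-TO-RIGHT OVERRIDE
    0x202E,  # RIGHT-TO-LEFT OVERRIDE
    0x2060,  # WORD JOINER
    0x2061,  # FUNCTION APPLICATION
    0x2062,  # INVISIBLE TIMES
    0x2063,  # INVISIBLE SEPARATOR
    0x2064,  # INVISIBLE PLUS
    0x2066,  # LEFT-TO-RIGHT ISOLATE
    0x2067,  # RIGHT-TO-LEFT ISOLATE
    0x2068,  # FIRST STRONG ISOLATE
    0x2069,  # POP DIRECTIONAL ISOLATE
    0xFEFF,  # ZERO WIDTH NO-BREAK SPACE
}

def sanitize_input(text):
    # Keep every character whose code point is not in the set above
    return "".join(ch for ch in text if ord(ch) not in INVISIBLE_CHARS)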

For Source Code Review

# Detect BiDi control characters (embeddings, overrides, isolates)
# in source files
BIDI_CONTROLS = {0x202A, 0x202B, 0x202C, 0x202D, 0x202E,
                 0x2066, 0x2067, 0x2068, 0x2069}

def check_source_file(filepath):
    # Report each BiDi control with 1-based line and column numbers
    with open(filepath, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, 1):
            for col, ch in enumerate(line, 1):
                if ord(ch) in BIDI_CONTROLS:
                    print(f"Line {line_no}, col {col}: BiDi control "
                          f"U+{ord(ch):04X} detected")
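Detection is more useful to a reviewer when the hidden characters are also made visible. As a rough sketch, the helper below replaces each BiDi control with a printable marker so that a diff or log shows where the controls sit; `reveal_bidi` and the marker format are illustrative choices, and the sample line is a made-up stand-in for a Trojan Source payload.

```python
BIDI_CONTROLS = {
    0x202A, 0x202B, 0x202C, 0x202D, 0x202E,  # embeddings, overrides, PDF
    0x2066, 0x2067, 0x2068, 0x2069,          # isolates, PDI
}

def reveal_bidi(text):
    """Replace each BiDi control with a visible <U+XXXX> marker."""
    return "".join(
        f"<U+{ord(ch):04X}>" if ord(ch) in BIDI_CONTROLS else ch
        for ch in text
    )

# Made-up line with an override and isolates hidden in a string literal
line = 'if access_level != "user\u202e \u2066# check\u2069 \u2066":'
print(reveal_bidi(line))
```

Running a step like this in a pre-commit hook or code review tool turns an invisible reordering into an obvious anomaly in the rendered text.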

Key Takeaways

  • Unicode's character diversity creates security risks through confusable characters (visual spoofing), BiDi overrides (text reordering), invisible characters (hidden manipulation), and normalization inconsistencies.
  • UTS #39 (Unicode Security Mechanisms) provides confusable mappings, mixed-script detection, and identifier restriction levels — the authoritative framework for Unicode security.
  • The Trojan Source attack (CVE-2021-42574) showed that BiDi control characters in source code can make malicious code appear benign, affecting most programming languages.
  • Defense in depth requires: normalization (consistent NFC/NFKC), confusable detection (skeleton comparison), script restriction (reject mixed scripts where inappropriate), and invisible character stripping.
  • Browsers mitigate IDN homograph attacks by displaying Punycode (xn--...) when a domain label mixes scripts suspiciously. Single-script spoofs such as the all-Cyrillic "аррӏе" need additional confusable checks.
  • Always normalize and sanitize Unicode input before security-critical operations like authentication, authorization, and string comparison.
