Unicode Security Overview

Unicode's mission to encode every writing system creates an enormous attack surface for security exploits. Characters that look identical but have different code points, invisible formatting marks that alter text flow, and bidirectional override controls that reverse displayed text — all of these are legitimate Unicode features that can be weaponized. This guide provides a comprehensive overview of Unicode security threats, from confusable characters and homograph attacks to bidirectional injection and invisible text manipulation, along with the defense mechanisms defined in Unicode Technical Standard #39 (UTS #39) and practical mitigation strategies.

The Fundamental Security Problem

Unicode encodes over 149,000 characters from more than 160 scripts. Many characters across different scripts look identical or nearly identical when rendered:

Latin	Cyrillic	Greek	Look
a (U+0061)	а (U+0430)	α (U+03B1)	Similar
e (U+0065)	е (U+0435)	ε (U+03B5)	Similar
o (U+006F)	о (U+043E)	ο (U+03BF)	Similar
p (U+0070)	р (U+0440)	ρ (U+03C1)	Similar
c (U+0063)	с (U+0441)	ς (U+03C2)	Similar
x (U+0078)	х (U+0445)	χ (U+03C7)	Similar

An attacker can construct a string that looks identical to a legitimate one but is built from characters of another script. The human eye sees "apple.com", but the URL bar actually contains "аррӏе.com", written entirely in Cyrillic: а (U+0430), р (U+0440), р (U+0440), the palochka ӏ (U+04CF), and е (U+0435).
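The difference between the two strings only becomes visible when each character is inspected by code point. The following is a minimal sketch using Python's standard unicodedata module; the helper name describe is an assumption made for illustration.

```python
import unicodedata

def describe(text):
    """Return one (char, code point, Unicode name) tuple per character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
            for ch in text]

latin = "apple"
# Cyrillic lookalikes for "apple", written as escapes to avoid ambiguity
spoof = "\u0430\u0440\u0440\u04cf\u0435"

for row in describe(spoof):
    print(row)

print(latin == spoof)  # False
```

Every row printed for the spoofed string carries a CYRILLIC name, even though the rendered text is hard to tell apart from the Latin original.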

Categories of Unicode Security Threats

1. Confusable Characters (Visual Spoofing)

Confusable characters are distinct code points that render identically or near-identically in common fonts. The Unicode Consortium maintains a comprehensive database of confusables in UTS #39 (Unicode Security Mechanisms).

Types of confusables:

Type	Example	Risk
Cross-script	Latin "a" vs. Cyrillic "а"	Domain spoofing, username impersonation
Same-script	Latin "l" vs. Latin "I"	Password confusion, code bugs
Composed vs. decomposed	"e" + combining acute (U+0065 U+0301) vs. precomposed "é" (U+00E9)	String comparison failures
Fullwidth vs. halfwidth	Ａ (U+FF21) vs. A (U+0041)	Filter bypass
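The fullwidth row can be illustrated with a reserved-name check. This is a sketch under stated assumptions: the BLOCKED set and both function names are hypothetical, and NFKC plus casefold is only one possible folding strategy.

```python
import unicodedata

BLOCKED = {"admin"}  # hypothetical list of reserved names

def naive_is_blocked(name):
    # Compares the raw input, so compatibility variants slip through
    return name.lower() in BLOCKED

def folded_is_blocked(name):
    # NFKC maps fullwidth letters to their ASCII counterparts
    folded = unicodedata.normalize("NFKC", name).casefold()
    return folded in BLOCKED

fullwidth = "\uff41\uff44\uff4d\uff49\uff4e"  # fullwidth "admin"

print(naive_is_blocked(fullwidth))   # False
print(folded_is_blocked(fullwidth))  # True
```

Note that NFKC does not fold cross-script confusables such as Cyrillic "а"; those need the confusable mappings from UTS #39.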

2. Bidirectional Text Attacks

Unicode's Bidirectional Algorithm (BiDi) handles mixed left-to-right and right-to-left text. BiDi control characters (overrides, isolates, and the characters that terminate them) can manipulate the displayed order of text:

Character	Code Point	Effect
LRO	U+202D	Forces left-to-right display
RLO	U+202E	Forces right-to-left display
LRI	U+2066	Left-to-right isolate
RLI	U+2067	Right-to-left isolate
PDF	U+202C	Pops directional formatting
PDI	U+2069	Pops directional isolate

The "Trojan Source" attack (CVE-2021-42574) demonstrated that BiDi control characters embedded in source code can make code appear to do one thing while actually doing another:

# DANGER: This example shows the concept (BiDi chars removed for safety)
# In a Trojan Source attack, BiDi controls such as RLO, LRI, and PDI
# reorder the displayed text so that a line which appears as:
#   if access_level != "user":  # Check if admin
# is actually compiled as a comparison against a longer string literal
# that contains the "comment" text and the control characters, so the
# check does not do what a reviewer reads on screen.
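One way to review such a line safely is to replace each control character with a visible tag. The sketch below assumes a hypothetical helper named reveal_bidi and builds the suspicious literal from escapes, so no raw control characters appear in the example itself.

```python
# Short names for the BiDi control characters discussed above
BIDI_CONTROLS = {
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}

def reveal_bidi(text):
    """Replace each BiDi control character with a visible <NAME> tag."""
    return "".join(
        f"<{BIDI_CONTROLS[ch]}>" if ch in BIDI_CONTROLS else ch
        for ch in text
    )

# Logical content of a string literal in the style of a Trojan Source attack
literal = "user\u202e \u2066# Check if admin\u2069 \u2066"

print(literal == "user")     # False
print(reveal_bidi(literal))  # user<RLO> <LRI># Check if admin<PDI> <LRI>
```

The revealed form shows that the text a reviewer reads as a trailing comment is part of the string being compared.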

3. Invisible Character Injection

Unicode contains numerous characters that are invisible or zero-width:

Character	Code Point	Name	Risk
ZWSP	U+200B	Zero Width Space	Invisible word boundaries
ZWJ	U+200D	Zero Width Joiner	Alters rendering of adjacent chars
ZWNJ	U+200C	Zero Width Non-Joiner	Prevents ligature formation
BOM	U+FEFF	Byte Order Mark / ZWNBSP	Invisible at string start
Various	U+2000–U+200A	Different-width spaces	Width spoofing
SHY	U+00AD	Soft Hyphen	Invisible except at line breaks
WJ	U+2060	Word Joiner	Prevents line break

These can be used to:
- Create usernames that appear identical but differ in invisible characters
- Bypass keyword filters ("bad" + ZWSP + "word" passes a filter that looks for "badword")
- Alter string comparisons in security-critical code
- Watermark text for tracking copy-paste origins
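The filter-bypass case is easy to reproduce. The following sketch assumes a naive substring filter; the function name contains_banned and its banned-word list are hypothetical.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE

def contains_banned(text, banned=("badword",)):
    """Naive substring filter over a hypothetical banned-word list."""
    lowered = text.lower()
    return any(word in lowered for word in banned)

plain = "badword"
evasive = "bad" + ZWSP + "word"  # renders the same as "badword"

print(contains_banned(plain))    # True
print(contains_banned(evasive))  # False
print(len(plain), len(evasive))  # 7 8
```

The two strings render alike, yet the second is one code point longer and no longer contains the banned substring.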

4. Normalization Attacks

Unicode normalization (NFC, NFD, NFKC, NFKD) converts equivalent character sequences to a single standard form, but inconsistent normalization across systems creates vulnerabilities:

import unicodedata

# Same visual appearance, different code points
nfc = "\u00e9"         # LATIN SMALL LETTER E WITH ACUTE (precomposed)
nfd = "e\u0301"        # "e" + COMBINING ACUTE ACCENT (decomposed)

print(nfc == nfd)      # False (without normalization)
print(nfc)             # é
print(nfd)             # é (looks identical)

print(unicodedata.normalize("NFC", nfd) == nfc)  # True

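If a login system normalizes at registration but not at authentication (or vice versa), the same visible name can correspond to two different stored strings, and an attacker can register a username that is visually identical to an existing one.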
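A common mitigation is to route every username through one canonicalization function before it is stored or compared. The sketch below assumes NFKC followed by casefold as the policy; the names canonical_username, register, and registered are hypothetical, and a production system might apply a stricter profile.

```python
import unicodedata

def canonical_username(raw):
    """Apply the same normalization wherever a username is stored or compared."""
    return unicodedata.normalize("NFKC", raw).casefold()

registered = set()  # stands in for a user database

def register(raw):
    """Register a name unless its canonical form is already taken."""
    key = canonical_username(raw)
    if key in registered:
        return False
    registered.add(key)
    return True

print(register("Jos\u00e9"))   # True  (precomposed é)
print(register("Jose\u0301"))  # False (decomposed form of the same name)
```

Because both registration and lookup use the same function, the precomposed and decomposed spellings collapse to one key.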

UTS #39: Unicode Security Mechanisms

The Unicode Consortium publishes UTS #39 (Unicode Security Mechanisms), which provides:

Confusable Detection

A machine-readable mapping of confusable characters (confusables.txt). The algorithm computes a skeleton — a canonical form where all confusable variants are mapped to the same string:

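# Conceptual skeleton computation (simplified)
# The actual data comes from Unicode confusables.txt; this tiny table is
# an illustrative excerpt, not the real data set
import unicodedata

PROTOTYPES = {"\u0430": "a", "\u0435": "e", "\u043e": "o",
              "\u0440": "p", "\u0441": "c"}

def skeleton(text):
    # UTS #39: normalize to NFD, map each character to its confusable
    # prototype, then normalize to NFD again
    text = unicodedata.normalize("NFD", text)
    mapped = "".join(PROTOTYPES.get(ch, ch) for ch in text)
    return unicodedata.normalize("NFD", mapped)

print(skeleton("paypal.com") == skeleton("p\u0430ypal.com"))  # True
# (\u0430 is CYRILLIC SMALL LETTER A)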

Mixed-Script Detection

UTS #39 defines script restriction levels:

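Level	Description	Example
ASCII-Only	All characters are in the ASCII range	"hello"
Single Script	All characters from one script (plus Common and Inherited)	"café" (Latin only)
Highly Restrictive	Single script, or Latin combined with a fixed set of CJK scripts (e.g., Han + Hiragana + Katakana)	"東京Tower"
Moderately Restrictive	Also allows Latin with one other Recommended script, except Cyrillic or Greek	Latin mixed with Thai or Arabic
Minimally Restrictive	No script restrictions, but characters must still satisfy the identifier profile	Any combination of allowed characters
Unrestricted	No checks at all	Not recommended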

Identifier Profiles

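UTS #39 also defines a General Security Profile for Identifiers, which specifies the characters considered safe in identifiers (usernames, domain labels, variable names). Applications typically tighten it for their own context:

Identifier Kind	Use Case	Typical Restrictions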
Identifier	Programming languages	Limited to identifier-safe characters
Domain Label	Internationalized Domain Names	Strict script mixing rules
Username	Social platforms	Confusable + mixed-script checks
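Programming languages that accept Unicode identifiers inherit the same problem. As a rough illustration, the sketch below applies the strictest possible policy (ASCII only) on top of Python's own identifier rules; the function name is_ascii_identifier is hypothetical, and this policy is far stricter than the UTS #39 profile.

```python
def is_ascii_identifier(name):
    """Accept only names that are valid Python identifiers and pure ASCII."""
    return name.isascii() and name.isidentifier()

spoofed = "\u0430dmin"  # starts with CYRILLIC SMALL LETTER A

print(spoofed.isidentifier())          # True: Python accepts it
print(is_ascii_identifier(spoofed))    # False
print(is_ascii_identifier("admin"))    # True
```

An ASCII-only rule is rarely acceptable for user-facing names, but it can be a reasonable lint rule for source code in projects that do not need non-ASCII identifiers.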

Practical Defenses

For Domain Names (IDN)

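# Check for mixed scripts in domain labels
import unicodedata

def get_scripts(text):
    scripts = set()
    for ch in text:
        # Heuristic: the standard library does not expose the Unicode
        # Script property, so infer the script from the character name
        name = unicodedata.name(ch, "")
        if "CYRILLIC" in name:
            scripts.add("Cyrillic")
        elif "LATIN" in name:
            scripts.add("Latin")
        elif "GREEK" in name:
            scripts.add("Greek")
        # ... (simplified; use ICU or the third-party regex module for production)
    return scripts

domain_label = "p\u0430ypal"  # Latin letters with a Cyrillic "а" (U+0430)
scripts = get_scripts(domain_label)
if len(scripts) > 1:
    print(f"MIXED SCRIPTS DETECTED: {scripts}")

# Caveat: a whole-script spoof such as "аррӏе" is entirely Cyrillic, so it
# passes this check; catching it requires skeleton (confusable) comparison.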
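Displaying the ASCII-compatible form of a label is another way to expose a spoof. The sketch below uses Python's built-in punycode codec; the helper name to_ascii_label is an assumption, and real IDNA processing involves additional mapping and validation steps that are omitted here.

```python
def to_ascii_label(label):
    """Return the xn-- form of a non-ASCII label (simplified, no IDNA mapping)."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")

# All-Cyrillic lookalike of "apple", written as escapes
spoof = "\u0430\u0440\u0440\u04cf\u0435"

print(to_ascii_label("apple"))  # apple
print(to_ascii_label(spoof))    # xn--80ak6aa92e
```

Shown in this form, the spoofed label no longer resembles the brand name it imitates.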

For User Input

# Strip invisible and directional formatting characters
INVISIBLE_CHARS = {
    0x00AD,  # SOFT HYPHEN
    0x200B,  # ZERO WIDTH SPACE
    0x200C,  # ZERO WIDTH NON-JOINER
    0x200D,  # ZERO WIDTH JOINER
    0x200E,  # LEFT-TO-RIGHT MARK
    0x200F,  # RIGHT-TO-LEFT MARK
    0x202A,  # LEFT-TO-RIGHT EMBEDDING
    0x202B,  # RIGHT-TO-LEFT EMBEDDING
    0x202C,  # POP DIRECTIONAL FORMATTING
    0x202D,  # LEFT-TO-RIGHT OVERRIDE
    0x202E,  # RIGHT-TO-LEFT OVERRIDE
    0x2060,  # WORD JOINER
    0x2061,  # FUNCTION APPLICATION
    0x2062,  # INVISIBLE TIMES
    0x2063,  # INVISIBLE SEPARATOR
    0x2064,  # INVISIBLE PLUS
    0x2066,  # LEFT-TO-RIGHT ISOLATE
    0x2067,  # RIGHT-TO-LEFT ISOLATE
    0x2068,  # FIRST STRONG ISOLATE
    0x2069,  # POP DIRECTIONAL ISOLATE
    0xFEFF,  # ZERO WIDTH NO-BREAK SPACE
}

def sanitize_input(text):
    # Note: ZWJ and ZWNJ are meaningful in some scripts and in emoji
    # sequences, so apply this to identifiers rather than free-form text
    return "".join(ch for ch in text if ord(ch) not in INVISIBLE_CHARS)
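A fixed list can miss characters, so an alternative is to flag everything in the Unicode general category Cf (format). This is a sketch of that idea, not a replacement for the list above; the function name find_format_chars is hypothetical, and Cf also includes characters with legitimate uses, so the result is better treated as a review list than as a strip list.

```python
import unicodedata

def find_format_chars(text):
    """Return (index, code point, name) for every Cf (format) character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]

# "pay" + ZERO WIDTH SPACE + "pal" + LEFT-TO-RIGHT ISOLATE
sample = "pay\u200bpal\u2066"

for finding in find_format_chars(sample):
    print(finding)
```

Reporting the index and name of each hit makes it possible to log or reject suspicious input instead of silently altering it.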

For Source Code Review

# Detect BiDi control characters (embeddings, overrides, isolates) in source files
BIDI_CONTROLS = {0x202A, 0x202B, 0x202C, 0x202D, 0x202E,
                 0x2066, 0x2067, 0x2068, 0x2069}

def check_source_file(filepath):
    with open(filepath, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, 1):
            # Columns are reported 1-based to match editor conventions
            for col, ch in enumerate(line, 1):
                if ord(ch) in BIDI_CONTROLS:
                    print(f"Line {line_no}, col {col}: BiDi control "
                          f"U+{ord(ch):04X} detected")

Key Takeaways

Unicode's character diversity creates security risks through confusable characters (visual spoofing), BiDi overrides (text reordering), invisible characters (hidden manipulation), and normalization inconsistencies.
UTS #39 (Unicode Security Mechanisms) provides confusable mappings, mixed-script detection, and identifier restriction levels — the authoritative framework for Unicode security.
The Trojan Source attack (CVE-2021-42574) showed that BiDi control characters in source code can make malicious code appear benign, affecting most programming languages.
Defense in depth requires: normalization (consistent NFC/NFKC), confusable detection (skeleton comparison), script restriction (reject mixed scripts where inappropriate), and invisible character stripping.
Many browsers defend against IDN homograph attacks by displaying Punycode (xn--...) when a domain label mixes scripts or otherwise fails their IDN display policy; whole-script spoofs need additional confusable checks.
Always normalize and sanitize Unicode input before security-critical operations like authentication, authorization, and string comparison.