The Developer's Unicode Handbook · Chapter 7

Security Hardening

Practical defensive techniques against Unicode attacks: confusable detection with ICU, normalization before comparison, bidi sandboxing, and secure identifier validation.


Unicode is a security attack surface that most developers don't think about until after an incident. Homoglyph attacks, bidirectional text exploits, normalization inconsistencies, and zero-width character injection are real vulnerabilities with CVE numbers assigned to them. This chapter provides a concrete security checklist with code examples for each threat.

Normalize Everything at the Boundary

The foundation of Unicode security is normalization. If your application stores some strings in NFC and others in NFD, two strings that represent the same text will compare as unequal. This creates authentication bypasses, duplicate account creation exploits, and path traversal vulnerabilities.

import unicodedata

# Security-critical normalization function
def security_normalize(text: str) -> str:
    """Normalize text for security-sensitive operations.

    Apply at every trust boundary: API inputs, form submissions, file paths.
    """
    # NFC: canonical composed form
    # Use NFKC for additional compatibility folding (ligatures, width variants)
    return unicodedata.normalize("NFC", text)

# The vulnerability: login bypass via normalization difference
# Without normalization:
stored_username = "caf\u00E9"      # NFC stored in DB ("café")
login_attempt = "cafe\u0301"       # NFD sent by attacker (also renders "café")

print(stored_username == login_attempt)  # False — same text, unequal strings
# What if password hashing normalizes but the username check doesn't?
# An inconsistent normalization policy creates exactly these gaps.

# The fix: normalize at every boundary
def authenticate(username: str, password: str) -> bool:
    # `db` and `verify_password` stand in for your app's storage and hashing layer
    normalized_username = security_normalize(username)
    stored = db.get_user(normalized_username)  # query with normalized form
    return stored and verify_password(password, stored.password_hash)

The rule: normalize once, at write time. Store the normalized form. Query with the same normalization applied to the query string. Consistency eliminates the gap.
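Applied at write time, the same rule also blocks the duplicate-account exploit: a visually identical NFD spelling collides with the stored NFC form instead of creating a second account. A minimal sketch, using a plain dict as a stand-in for the user store:

```python
import unicodedata

def register(user_store: dict, username: str) -> bool:
    # Normalize once, at write time; the NFC form is the canonical key
    key = unicodedata.normalize("NFC", username)
    if key in user_store:
        return False  # collision — even across NFC/NFD spellings
    user_store[key] = {"name": key}
    return True

users = {}
print(register(users, "caf\u00E9"))   # True  — NFC form stored
print(register(users, "cafe\u0301"))  # False — NFD lookalike collides after NFC
```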

Homoglyph Detection and Prevention

Homoglyphs are visually identical or near-identical characters from different Unicode scripts. Phishing attacks use them to register domains like pаypal.com (Cyrillic а) that look identical to paypal.com (Latin a).

# Method 1: Reject mixed-script identifiers
import re

# Character ranges for common scripts (simplified)
LATIN_RE = re.compile(r"^[\u0000-\u024F]+$")      # Latin + Latin Extended (includes ASCII)
CYRILLIC_RE = re.compile(r"^[\u0400-\u04FF]+$")   # Cyrillic
GREEK_RE = re.compile(r"^[\u0370-\u03FF]+$")      # Greek

def has_mixed_dangerous_scripts(text: str) -> bool:
    """Detect if text mixes Latin with visually similar scripts (Cyrillic, Greek).

    This is a heuristic; use confusable-homoglyphs for production.
    """
    has_latin = bool(re.search(r"[a-zA-Z\u00C0-\u024F]", text))
    has_cyrillic = bool(re.search(r"[\u0400-\u04FF]", text))
    has_greek = bool(re.search(r"[\u0370-\u03FF]", text))

    dangerous_mixes = [
        has_latin and has_cyrillic,
        has_latin and has_greek,
        has_cyrillic and has_greek,
    ]
    return any(dangerous_mixes)

# Method 2: Use the confusable-homoglyphs database (Unicode TR39)
# pip install confusable-homoglyphs
from confusable_homoglyphs import confusables

def check_identifier_safety(identifier: str) -> list[str]:
    """Return warnings for characters that have confusable alternatives."""
    warnings = []
    for char in identifier:
        result = confusables.is_confusable(char, preferred_aliases=["latin"])
        if result:
            # is_confusable returns a list of dicts; each carries a "homoglyphs"
            # list of {"c": lookalike_char, "n": character_name} entries
            alternatives = [h["c"] for entry in result for h in entry["homoglyphs"]]
            warnings.append(f"'{char}' (U+{ord(char):04X}) looks like: {alternatives}")
    return warnings

identifier = "pаypal"  # Cyrillic а
warnings = check_identifier_safety(identifier)
for w in warnings:
    print(w)  # 'а' (U+0430) looks like: ['a', ...]

Stripping Bidi Override Characters

Unicode Bidi override characters can reverse the visual order of text, making malicious filenames, URLs, or code appear benign. CVE-2021-42574 ("Trojan Source") demonstrated that bidi overrides in source code comments could hide malicious code that looks different in an editor vs. compiler.

import re
import unicodedata

# Complete set of bidi control characters that are dangerous in user input
BIDI_DANGEROUS = {
    "\u202A",  # LEFT-TO-RIGHT EMBEDDING
    "\u202B",  # RIGHT-TO-LEFT EMBEDDING
    "\u202C",  # POP DIRECTIONAL FORMATTING
    "\u202D",  # LEFT-TO-RIGHT OVERRIDE
    "\u202E",  # RIGHT-TO-LEFT OVERRIDE  ← most dangerous
    "\u2066",  # LEFT-TO-RIGHT ISOLATE
    "\u2067",  # RIGHT-TO-LEFT ISOLATE
    "\u2068",  # FIRST STRONG ISOLATE
    "\u2069",  # POP DIRECTIONAL ISOLATE
}

BIDI_DANGEROUS_RE = re.compile("[" + "".join(BIDI_DANGEROUS) + "]")

def contains_bidi_override(text: str) -> bool:
    return bool(BIDI_DANGEROUS_RE.search(text))

def strip_bidi_overrides(text: str) -> str:
    return BIDI_DANGEROUS_RE.sub("", text)

# Example attack: filename with bidi override
# The codepoints end in "txt." but the RLO makes the tail render reversed,
# so the name displays as "innocent.txt"
attack_filename = "innocent\u202Etxt."
print(attack_filename)              # Renders as "innocent.txt" in bidi-aware displays
print(contains_bidi_override(attack_filename))  # True
print(strip_bidi_overrides(attack_filename))    # "innocenttxt." — attack revealed

# Apply to all user-provided filenames, URLs, identifiers
def sanitize_user_input(text: str) -> str:
    if contains_bidi_override(text):
        raise ValueError(f"Input contains bidi override characters: {repr(text)}")
    return text

IDN Validation with UTS #46

Unicode Technical Standard #46 defines rules for safe processing of internationalized domain names. Its compatibility mapping and validity profile reject many malformed or deceptive labels, though script-mixing detection is a separate concern (UTS #39):

import idna  # pip install idna

# IDNA2008 with UTS#46 compatibility processing
def validate_domain_uts46(domain: str) -> str | None:
    """Validate and normalize a domain name per UTS #46.

    Returns the normalized ASCII (punycode) form, or None if invalid.
    """
    try:
        # The idna library implements IDNA2008; uts46=True applies the UTS #46
        # compatibility mapping first. transitional=False keeps IDNA2008
        # semantics for deviation characters (ß, ZWJ/ZWNJ) instead of
        # mapping them away.
        ascii_domain = idna.encode(
            domain,
            uts46=True,
            transitional=False,
        ).decode("ascii")
        return ascii_domain
    except idna.IDNAError:
        return None

# Security tests
print(validate_domain_uts46("münchen.de"))        # "xn--mnchen-3ya.de"
print(validate_domain_uts46("example.com"))       # "example.com" — plain ASCII passes
print(validate_domain_uts46("xn--nxasmq6b.com"))  # Valid punycode passes through
print(validate_domain_uts46("pаypal.com"))        # Encodes — script mixing is UTS #39's job
print(validate_domain_uts46("ab--test.com"))      # None — hyphens in positions 3-4 rejected

SQL Injection via Unicode

SQL injection via Unicode typically works through normalization bypass: an attack string that doesn't look like SQL injection gets normalized by the database into a SQL fragment.
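To make the bypass concrete: a naive filter that scans for ASCII quote characters passes a fullwidth apostrophe (U+FF07), but NFKC folds it back into a real quote. A sketch of the filter gap, not of any specific database's behavior:

```python
import unicodedata

payload = "\uFF07; DROP TABLE users; --"  # fullwidth apostrophe, U+FF07

# A naive blocklist sees no ASCII quote and lets the payload through
print("'" in payload)                                  # False
# After NFKC, the fullwidth apostrophe becomes a plain ASCII quote
print("'" in unicodedata.normalize("NFKC", payload))   # True
```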

# Mitigation: parameterized queries (always)
# This is the same defense as ASCII SQL injection — use parameterized queries

import unicodedata
import psycopg2

# WRONG: string concatenation (vulnerable regardless of Unicode normalization)
def unsafe_query(cursor, username: str) -> None:
    query = f"SELECT * FROM users WHERE name = '{username}'"
    # Attacker provides: '; DROP TABLE users; --
    # Normalization does not help here — the quote is already ASCII
    cursor.execute(query)  # VULNERABLE

# RIGHT: parameterized query (safe regardless of Unicode content)
def safe_query(conn, username: str) -> list:
    normalized_username = unicodedata.normalize("NFC", username)
    with conn.cursor() as cursor:
        cursor.execute(
            "SELECT * FROM users WHERE name = %s",
            (normalized_username,)  # Parameterized — database handles escaping
        )
        return cursor.fetchall()

The rule for SQL security doesn't change with Unicode: always use parameterized queries. Normalization is an additional defense in depth, not a substitute.

XSS via Unicode in HTML Context

Unicode provides multiple ways to represent characters that have HTML significance (<, >, &, ", '). Template engines that HTML-escape only the ASCII forms can be bypassed.

import unicodedata
from html import escape as html_escape

# HTML-significant characters written in Unicode compatibility forms
attack_strings = [
    "\uFE64script\uFE65alert(1)",   # SMALL LESS-THAN/GREATER-THAN SIGN (U+FE64/U+FE65)
    "\uFF1Cscript\uFF1E alert(1)",  # FULLWIDTH LESS-THAN/GREATER-THAN SIGN
]

# The fix: normalize first, then escape
def safe_html(text: str) -> str:
    # NFKC converts fullwidth < > to ASCII < >
    normalized = unicodedata.normalize("NFKC", text)
    # Now html.escape catches them
    return html_escape(normalized)

for attack in attack_strings:
    print(safe_html(attack))
    # Both are properly escaped after NFKC normalization

WAF Bypass via Unicode Normalization

Web Application Firewalls that pattern-match on ASCII representations can be bypassed using Unicode equivalents. A WAF looking for <script> won't catch ＜script＞ (fullwidth brackets). This is why normalization must happen in application code, not just at the WAF layer.

# WAF bypass patterns and their normalized forms
bypass_examples = {
    "\uFF1Cscript\uFF1E": "<script>",    # Fullwidth brackets
    "\uFE64script\uFE65": "<script>",    # Small less-than/greater-than signs
    "\uFF33ELECT": "SELECT",             # Fullwidth S in keyword
    "javascript\uFF1A": "javascript:",   # Fullwidth colon in URL scheme
}

for attack, expected in bypass_examples.items():
    normalized = unicodedata.normalize("NFKC", attack)
    print(f"WAF sees: {attack}")
    print(f"App sees: {normalized}")
    print()

Complete Security Hardening Checklist

Apply each of these to user-controlled input before any security-sensitive operation:

import re
import unicodedata

# All-in-one security sanitizer
def security_sanitize(
    text: str,
    allow_bidi: bool = False,
    normalization: str = "NFC",
) -> str:
    """Apply all Unicode security hardening steps.

    Args:
        text: Raw user input
        allow_bidi: Set True only for RTL content editors
        normalization: "NFC" for most use cases, "NFKC" for identifiers
    """
    # 1. Unicode normalize
    text = unicodedata.normalize(normalization, text)

    # 2. Reject bidi overrides (unless explicitly allowed)
    if not allow_bidi:
        if re.search("[\u202A-\u202E\u2066-\u2069]", text):
            raise ValueError("Bidi override characters not allowed")

    # 3. Strip zero-width characters
    text = re.sub("[\u200B-\u200D\uFEFF\u2060\u180E]", "", text)

    # 4. Strip null bytes (common attack vector)
    text = text.replace("\x00", "")

    # 5. Collapse runs of whitespace
    text = re.sub(r"\s+", " ", text)
    text = text.strip()

    return text
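Step 3 is easy to underestimate: zero-width characters let an attacker smuggle a blocklisted name past an exact-match check. A self-contained illustration:

```python
import re

ZERO_WIDTH_RE = re.compile("[\u200B-\u200D\uFEFF\u2060]")

tainted = "ad\u200Bmin"                # zero-width space hidden inside "admin"
print(tainted == "admin")              # False — an exact-match blocklist misses it
print(ZERO_WIDTH_RE.sub("", tainted))  # "admin" — stripped, the check works again
```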

The security mindset for Unicode: treat every unusual Unicode character in user input as potentially adversarial until proven otherwise. Normalize aggressively, reject what you can't handle, and validate after normalization (not before).
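That last point — validate after normalization — deserves a demonstration, since getting the order wrong silently reopens every hole NFKC was meant to close. A sketch with a hypothetical blocklist check for "<":

```python
import unicodedata

def is_safe_wrong(text: str) -> bool:
    # WRONG: validate first — NFKC can later reintroduce rejected characters
    if "<" in text:
        return False
    text = unicodedata.normalize("NFKC", text)
    return True

def is_safe_right(text: str) -> bool:
    # RIGHT: normalize first, then validate the form you will actually use
    text = unicodedata.normalize("NFKC", text)
    return "<" not in text

attack = "\uFF1Cscript\uFF1E"          # fullwidth brackets
print(is_safe_wrong(attack))           # True  — slips past the pre-normalization check
print(is_safe_right(attack))           # False — caught after normalization
```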