
Preventing Unicode-based Phishing

Phishing attacks increasingly exploit Unicode confusables, bidirectional overrides, and invisible characters to create deceptive URLs, spoofed sender addresses, and misleading link text. This guide covers the techniques used in Unicode-based phishing attacks and the detection, prevention, and user-education strategies to defend against them.


Phishing attacks cost billions of dollars annually, and Unicode provides attackers with a sophisticated toolkit for making malicious content look legitimate. Visual spoofing through confusable characters, homograph domain attacks, invisible text manipulation, and deceptive display of URLs and email addresses all exploit the gap between what Unicode text contains and what humans see. This guide covers the major Unicode-based phishing techniques, detection algorithms, and prevention strategies for developers building systems that handle user-facing text, URLs, email addresses, and identity information.

The Landscape of Unicode Phishing

Unicode-based phishing differs from traditional phishing in a fundamental way: instead of relying on user carelessness ("did you notice the URL was amaz0n.com?"), it exploits the impossibility of visual verification. When Latin "a" (U+0061) and Cyrillic "а" (U+0430) are pixel-identical in common fonts, no amount of user education will help a reader tell them apart.
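The gap between glyph and code point is easy to see programmatically. The snippet below is a minimal illustration using only Python's standard `unicodedata` module; the helper name `describe` is hypothetical and not part of any library.

```python
import unicodedata

def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"

latin_a = "a"
cyrillic_a = "\u0430"  # renders like Latin "a" in many fonts

print(describe(latin_a))      # U+0061 LATIN SMALL LETTER A
print(describe(cyrillic_a))   # U+0430 CYRILLIC SMALL LETTER A
print(latin_a == cyrillic_a)  # False
```

The two strings look the same on screen but never compare equal, which is why detection has to happen in code rather than by eye.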

Attack Surface

| Vector | Technique | Impact |
| --- | --- | --- |
| Domain names | IDN homograph attacks | User visits malicious site believing it is legitimate |
| Email addresses | Confusable local parts | User replies to attacker's email |
| Display names | Visual impersonation | User trusts message from "admin" (Cyrillic а) |
| URLs in text | Mixed-script URLs | User clicks link they believe goes to trusted site |
| File names | Bidirectional override in filename | User opens "invoice.pdf" that is actually .exe |
| UI text | Invisible characters in labels | Buttons/labels show different text than they contain |

Technique 1: Domain Homograph Attacks

The best-documented Unicode phishing technique uses visually confusable characters to register domains that appear identical to legitimate ones:

| Target | Homograph | Substitution |
| --- | --- | --- |
| apple.com | аррӏе.com | Cyrillic а, р, ӏ, е |
| google.com | ɡооɡӏе.com | Latin ɡ, Cyrillic о, ӏ, е |
| paypal.com | раураӏ.com | Cyrillic р, а, у, ӏ |
| facebook.com | fаcebook.com | Single Cyrillic а |

Detection

import unicodedata

# Known confusable pairs (subset — full list in Unicode confusables.txt)
CONFUSABLE_MAP = {
    0x0430: 0x0061,  # Cyrillic а -> Latin a
    0x0435: 0x0065,  # Cyrillic е -> Latin e
    0x043E: 0x006F,  # Cyrillic о -> Latin o
    0x0440: 0x0070,  # Cyrillic р -> Latin p
    0x0441: 0x0063,  # Cyrillic с -> Latin c
    0x0443: 0x0079,  # Cyrillic у -> Latin y
    0x0445: 0x0078,  # Cyrillic х -> Latin x
    0x04CF: 0x006C,  # Cyrillic ӏ -> Latin l
    0x0261: 0x0067,  # Latin ɡ -> Latin g
    0x03B1: 0x0061,  # Greek α -> Latin a
    0x03BF: 0x006F,  # Greek ο -> Latin o
}

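def compute_skeleton(text):
    # UTS #39 order: decompose (NFD), map each character to its
    # prototype, then decompose the result again
    decomposed = unicodedata.normalize("NFD", text)
    skeleton = []
    for ch in decomposed:
        cp = ord(ch)
        # Map confusable to prototype (unmapped characters pass through)
        mapped = CONFUSABLE_MAP.get(cp, cp)
        skeleton.append(chr(mapped))
    return unicodedata.normalize("NFD", "".join(skeleton))

def is_homograph(domain, target):
    # A string is not a homograph of itself, so exclude exact matches
    return domain != target and compute_skeleton(domain) == compute_skeleton(target)

# Test
print(is_homograph("аррӏе", "apple"))  # True
print(is_homograph("google", "google"))  # False (identical string, not a spoof)
print(is_homograph("g\u043e\u043egle", "google"))  # True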
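The hand-written map above covers only a handful of pairs. As a rough sketch of how the full data could be loaded instead, the parser below reads lines in the `source ; target ; type # comment` layout used by the Unicode confusables.txt file. The function names `parse_confusables` and `skeleton` are hypothetical, and `SAMPLE` holds a few illustrative lines written in that layout rather than the real file (which would be opened with `encoding="utf-8-sig"`, since it starts with a byte order mark).

```python
import unicodedata

def parse_confusables(text):
    """Parse confusables.txt-style lines into a {source: prototype} dict."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2:
            continue
        # Each field is one or more space-separated hex code points
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        mapping[source] = target
    return mapping

def skeleton(text, mapping):
    """NFD, map each character to its prototype, then NFD again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(mapping.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)

# Illustrative lines only; not a copy of the real data file
SAMPLE = """\
# source ; target ; type
0430 ; 0061 ; MA  # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
043E ; 006F ; MA  # CYRILLIC SMALL LETTER O -> LATIN SMALL LETTER O
006D ; 0072 006E ; MA  # a target may span several code points
"""

mapping = parse_confusables(SAMPLE)
print(skeleton("g\u043e\u043egle", mapping))  # google
```

Keying the dictionary on strings rather than integers keeps multi-character prototypes simple to handle, since a single source character can map to a sequence.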

Prevention

| Layer | Defense |
| --- | --- |
| Browser | Display Punycode for mixed-script domains |
| Registrar | Block registration of confusable domains |
| Email gateway | Flag emails from confusable domains |
| Application | Skeleton comparison against known brand domains |
| Certificate Authority | Verify domain ownership for look-alike requests |
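The browser-layer defense can be approximated in application code. The sketch below converts any domain label that mixes scripts into its Punycode (`xn--`) form using Python's built-in `idna` codec, which implements the older IDNA 2003 rules. The function names `label_scripts` and `display_form` are hypothetical, and taking the first word of a character's Unicode name as its script is a heuristic, not the Script property that UTS #39 specifies.

```python
import unicodedata

def label_scripts(label):
    """Heuristically collect the scripts used by the letters in one label."""
    scripts = set()
    for ch in label:
        if not ch.isalpha():
            continue  # digits and hyphens carry no script information here
        # Assumption: the first word of the Unicode name names the script
        scripts.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return scripts

def display_form(domain):
    """Show mixed-script labels as Punycode; leave other labels untouched."""
    shown = []
    for label in domain.split("."):
        if len(label_scripts(label)) > 1:
            shown.append(label.encode("idna").decode("ascii"))
        else:
            shown.append(label)
    return ".".join(shown)

print(display_form("bank.com"))        # unchanged: single script
print(display_form("b\u0430nk.com"))   # mixed Latin/Cyrillic label becomes xn--...
```

A mixed-script rule alone does not catch whole-script spoofs, where every letter in the label comes from Cyrillic, so it complements the skeleton comparison rather than replacing it.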

Technique 2: Email Address Spoofing

Email local parts (before the @) can contain Unicode characters via EAI (Email Address Internationalization, RFC 6531). An attacker can register an email address with confusable characters:

| Legitimate | Spoofed | Difference |
| --- | --- | --- |
| admin@… | аdmin@… | Cyrillic а in "аdmin" |
| support@… | ѕupport@… | Cyrillic ѕ in "ѕupport" |

Detection for Email

def check_email_confusables(email):
    local_part, _, domain = email.rpartition("@")
    issues = []

    for i, ch in enumerate(local_part):
        cp = ord(ch)
        name = unicodedata.name(ch, "")

        # Flag non-Latin characters in a predominantly Latin context
        if cp > 0x007F:
            script = "UNKNOWN"
            for s in ["CYRILLIC", "GREEK", "ARMENIAN"]:
                if s in name:
                    script = s
                    break
            if script != "UNKNOWN":
                issues.append({
                    "position": i,
                    "char": ch,
                    "codepoint": f"U+{cp:04X}",
                    "script": script,
                    # None when the character has no entry in the map
                    "confusable_with": chr(CONFUSABLE_MAP[cp]) if cp in CONFUSABLE_MAP else None,
                })

    return {
        "email": email,
        "suspicious": len(issues) > 0,
        "issues": issues,
    }

Technique 3: Filename Spoofing with BiDi Override

The right-to-left override (RLO, U+202E) character can reverse the displayed order of characters in a filename, hiding the true extension:

| Actual filename | Displayed as | Technique |
| --- | --- | --- |
| invoice\u202Efdp.exe | invoiceexe.pdf | RLO reverses "fdp.exe" to show "exe.pdf" |
| photo\u202Egnp.scr | photorcs.png | RLO reverses "gnp.scr" to show "rcs.png" |

The user sees "invoice...pdf" and opens what they believe is a PDF, but the operating system executes the .exe file.
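To make the reversal concrete, the sketch below approximates what a BiDi-aware interface shows for a name containing a single RLO. It is a deliberate simplification: real rendering follows the full Unicode Bidirectional Algorithm, and the helper name `approximate_display` is hypothetical.

```python
import os

RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE

def approximate_display(filename):
    """Roughly simulate rendering: text after the RLO appears reversed."""
    head, sep, tail = filename.partition(RLO)
    if not sep:
        return filename  # no override present
    return head + tail[::-1]

actual = "invoice" + RLO + "fdp.exe"

print(approximate_display(actual))   # invoiceexe.pdf
print(os.path.splitext(actual)[1])   # .exe
```

The display ends in "pdf" while the extension the operating system acts on is still ".exe".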

Detection

BIDI_OVERRIDES = {
    0x202A, 0x202B, 0x202C, 0x202D, 0x202E,
    0x2066, 0x2067, 0x2068, 0x2069,
}

def check_filename_safety(filename):
    issues = []
    for i, ch in enumerate(filename):
        cp = ord(ch)
        if cp in BIDI_OVERRIDES:
            issues.append({
                "position": i,
                "codepoint": f"U+{cp:04X}",
                "name": unicodedata.name(ch, "UNKNOWN"),
                "risk": "BiDi override can disguise file extension",
            })
    return {
        "filename": repr(filename),
        "safe": len(issues) == 0,
        "issues": issues,
    }

# Check
result = check_filename_safety("invoice\u202Efdp.exe")
print(f"Safe: {result['safe']}")  # Safe: False

Prevention

  • Strip all BiDi override characters from filenames before display
  • Show file extensions in a separate, non-reversible UI element
  • Warn users when filenames contain Unicode control characters
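The first two measures can be combined in one step. The following is a minimal sketch, assuming the UI can render the stem and the extension in separate elements; `split_for_display` and `BIDI_CONTROLS` are hypothetical names, and the set mirrors the embedding, override, and isolate controls listed in the detection code.

```python
import os

# Embedding, override, and isolate controls (U+202A-U+202E, U+2066-U+2069)
BIDI_CONTROLS = {
    0x202A, 0x202B, 0x202C, 0x202D, 0x202E,
    0x2066, 0x2067, 0x2068, 0x2069,
}

def split_for_display(filename):
    """Strip BiDi controls, then return the stem and real extension separately."""
    cleaned = "".join(ch for ch in filename if ord(ch) not in BIDI_CONTROLS)
    stem, extension = os.path.splitext(cleaned)
    return {
        "stem": stem,
        "extension": extension,
        "had_bidi_controls": cleaned != filename,  # drives the user warning
    }

print(split_for_display("invoice\u202Efdp.exe"))
```

Because the extension is computed after stripping, the value shown to the user is the one the operating system will actually act on.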

Technique 4: Display Name Impersonation

Social platforms, messaging apps, and email clients display names that users set for themselves. Unicode enables near-perfect impersonation:

| Legitimate | Spoofed | Method |
| --- | --- | --- |
| admin | аdmin | Cyrillic а |
| admin | admin\u200B | Trailing ZWSP |
| John Smith | Јohn Smith | Cyrillic Ј |
| CEO | СЕО | All Cyrillic |

Comprehensive Display Name Validation

def validate_display_name(name, check_confusables=True):
    issues = []

    # 1. Strip invisible characters
    visible = []
    # enumerate() reports the position in the original, unstripped name
    for pos, ch in enumerate(name):
        cp = ord(ch)
        if cp in {0x200B, 0x200C, 0x200D, 0x200E, 0x200F,
                  0x202A, 0x202B, 0x202C, 0x202D, 0x202E,
                  0x2060, 0x2061, 0x2062, 0x2063, 0x2064,
                  0x2066, 0x2067, 0x2068, 0x2069, 0xFEFF}:
            issues.append(f"Invisible character U+{cp:04X} at position {pos}")
        else:
            visible.append(ch)
    clean_name = "".join(visible)

    # 2. Check for mixed scripts
    scripts = set()
    for ch in clean_name:
        name_str = unicodedata.name(ch, "")
        for script in ["LATIN", "CYRILLIC", "GREEK", "ARABIC", "HEBREW"]:
            if script in name_str:
                scripts.add(script)
                break

    if len(scripts) > 1:
        issues.append(f"Mixed scripts detected: {scripts}")

    # 3. Check for confusables with reserved names
    if check_confusables:
        reserved = ["admin", "administrator", "moderator", "support", "system"]
        skel = compute_skeleton(clean_name.lower())
        for r in reserved:
            if compute_skeleton(r) == skel and clean_name.lower() != r:
                issues.append(f"Confusable with reserved name: {r}")

    return {
        "original": name,
        "cleaned": clean_name,
        "issues": issues,
        "safe": len(issues) == 0,
    }

Technique 5: URL Spoofing in Text

In messaging and social media, URLs embedded in text can use Unicode to appear as different URLs:

| Displayed Text | Actual URL | Technique |
| --- | --- | --- |
| https://bank.com/login | https://bаnk.com/login | Cyrillic а |
| Click here | (any URL) | HTML link text (not Unicode-specific) |
| bank.com\u2060/verify | bank.com + WJ + /verify | Word joiner in URL |

URL Validation

def validate_url_text(url_text):
    issues = []

    # Check for non-ASCII in domain portion
    try:
        # Split URL into components
        from urllib.parse import urlparse
        parsed = urlparse(url_text)
        domain = parsed.hostname or ""

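        # enumerate() gives the true position even when a character repeats
        # (domain.index(ch) would always return the first occurrence)
        for pos, ch in enumerate(domain):
            if ord(ch) > 0x007F:
                name = unicodedata.name(ch, f"U+{ord(ch):04X}")
                issues.append(
                    f"Non-ASCII in domain: '{ch}' ({name}) "
                    f"at position {pos}"
                )
    except ValueError:
        # urlparse raises ValueError for malformed hosts such as bad IPv6 literals
        issues.append("Failed to parse URL")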

    # Check for invisible characters anywhere in URL
    for i, ch in enumerate(url_text):
        cp = ord(ch)
        if cp in {0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF}:
            issues.append(f"Invisible char U+{cp:04X} at position {i}")

    return {
        "url": url_text,
        "suspicious": len(issues) > 0,
        "issues": issues,
    }

Building a Defense System

Layered Approach

| Layer | Check | Action |
| --- | --- | --- |
| Input validation | Strip invisible chars, detect BiDi overrides | Clean or reject |
| Script analysis | Detect mixed scripts | Warn or restrict |
| Confusable check | Skeleton comparison against known brands | Flag or block |
| Display safeguards | Show Punycode for IDN domains | Inform user |
| Monitoring | Log Unicode anomalies | Detect attacks |
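The layers compose naturally into a single screening function. The sketch below wires together input validation, script analysis, the confusable check, and monitoring for one identifier; the display layer is omitted because it belongs to rendering code. Everything named here (`screen_identifier`, the logger name, the tiny `CONFUSABLES` subset, and the `BRANDS` set) is a hypothetical stand-in, and the script detection is the same name-based heuristic used earlier in this guide.

```python
import logging
import unicodedata

logger = logging.getLogger("unicode_anomalies")  # hypothetical logger name

STRIP = {0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF,          # invisible characters
         0x202A, 0x202B, 0x202C, 0x202D, 0x202E,          # BiDi embeddings/overrides
         0x2066, 0x2067, 0x2068, 0x2069}                  # BiDi isolates
CONFUSABLES = {"\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p"}  # tiny subset
BRANDS = {"paypal", "apple"}  # hypothetical protected names

def screen_identifier(text):
    findings = []

    # Layer 1: input validation (strip invisible and BiDi control characters)
    cleaned = "".join(ch for ch in text if ord(ch) not in STRIP)
    if cleaned != text:
        findings.append("invisible-or-bidi")

    # Layer 2: script analysis (first word of the Unicode name as the script)
    scripts = {unicodedata.name(ch, "UNKNOWN").split(" ")[0]
               for ch in cleaned if ch.isalpha()}
    if len(scripts) > 1:
        findings.append("mixed-script")

    # Layer 3: confusable check against protected names
    folded = cleaned.lower()
    skeleton = "".join(CONFUSABLES.get(ch, ch) for ch in folded)
    if skeleton in BRANDS and folded not in BRANDS:
        findings.append("brand-confusable")

    # Layer 5: monitoring
    if findings:
        logger.warning("Unicode anomaly in %r: %s", text, findings)

    return {"cleaned": cleaned, "findings": findings}

print(screen_identifier("p\u0430ypal")["findings"])
```

Returning the findings as a list, rather than a single boolean, lets the caller apply the table's per-layer actions: clean, warn, or block.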

Implementation Checklist

For any system processing user-facing text:

| Check | Priority | Context |
| --- | --- | --- |
| Strip BiDi overrides from user input | Critical | All input fields |
| Normalize (NFC) before comparison | Critical | Auth, search, matching |
| Mixed-script detection for usernames | High | Registration, login |
| Confusable check against reserved names | High | Username registration |
| Strip invisible characters from identifiers | High | Usernames, slugs, IDs |
| IDN homograph check for URLs | High | Link preview, email |
| BiDi override check in filenames | High | File upload |
| Log Unicode anomalies | Medium | Security monitoring |
| Confusable check for display names | Medium | Social/messaging |
| Full UTS #39 compliance | Ideal | Enterprise applications |
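The "Normalize (NFC) before comparison" item is worth a short illustration, since it is easy to overlook. The sketch below shows two encodings of the same visible word that only compare equal after normalization; `same_identifier` is a hypothetical helper name.

```python
import unicodedata

def same_identifier(a, b):
    """Compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "caf\u00e9"      # é as one precomposed code point
decomposed = "cafe\u0301"   # e followed by COMBINING ACUTE ACCENT

print(composed == decomposed)                 # False
print(same_identifier(composed, decomposed))  # True
print(same_identifier("a", "\u0430"))         # False
```

The last line is a reminder that normalization only unifies canonically equivalent sequences. Latin "a" and Cyrillic "а" stay distinct, so the confusable checks in the list are still needed.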

Key Takeaways

  • Unicode-based phishing exploits the gap between what text contains (code points) and what humans see (rendered glyphs), making visual verification impossible for confusable characters.
  • The five major techniques are: domain homograph attacks (confusable domain names), email address spoofing (confusable local parts), filename spoofing (BiDi override hiding extensions), display name impersonation (confusable usernames), and URL spoofing (confusable characters in links).
  • Effective defense requires a layered approach: input validation (strip invisible chars), script analysis (detect mixing), confusable detection (skeleton comparison), display safeguards (Punycode), and monitoring (log anomalies).
  • The Unicode Consortium's confusables.txt database and UTS #39 specification are the authoritative references — use libraries that implement them (ICU, PyICU) rather than hand-rolled confusable lists.
  • BiDi override characters should be stripped from virtually all user input — filenames, usernames, messages, and URLs — as they have almost no legitimate use in those contexts.
  • Prevention must be balanced with internationalization — not all non-Latin text is suspicious, and legitimate script combinations (Japanese Han+Kana, Korean Hangul+Hanja) must be allowed.
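One way to strike the balance described in the last point is an allowlist of script combinations. The sketch below is loosely modeled on the combinations that UTS #39 permits at its "Highly Restrictive" level; the names `ALLOWED_COMBINATIONS`, `scripts_of`, and `is_acceptable_mix` are hypothetical, and deriving the script from the first word of the Unicode character name is a heuristic that a production system would replace with real Script property data (for example via ICU).

```python
import unicodedata

# Assumption: combinations patterned on UTS #39 "Highly Restrictive"
ALLOWED_COMBINATIONS = [
    {"LATIN", "HAN", "HIRAGANA", "KATAKANA"},  # Japanese
    {"LATIN", "HAN", "HANGUL"},                # Korean
    {"LATIN", "HAN", "BOPOMOFO"},              # Chinese
]

def scripts_of(text):
    """Heuristically collect the scripts of the letters in text."""
    scripts = set()
    for ch in text:
        if not ch.isalpha():
            continue
        first_word = unicodedata.name(ch, "UNKNOWN").split(" ")[0]
        # Han ideographs are named "CJK UNIFIED IDEOGRAPH-XXXX"
        scripts.add("HAN" if first_word == "CJK" else first_word)
    return scripts

def is_acceptable_mix(text):
    """Single-script text passes; mixed text must fit an allowed combination."""
    scripts = scripts_of(text)
    if len(scripts) <= 1:
        return True
    return any(scripts <= allowed for allowed in ALLOWED_COMBINATIONS)

print(is_acceptable_mix("\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8"))  # True: Han + Hiragana + Katakana
print(is_acceptable_mix("p\u0430ypal"))                                      # False: Latin + Cyrillic
```

With this rule, ordinary Japanese and Korean text passes while the Latin and Cyrillic mixtures used in the attacks above are rejected.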
