Preventing Unicode-based Phishing

Phishing attacks cost billions of dollars annually, and Unicode provides attackers with a sophisticated toolkit for making malicious content look legitimate. Visual spoofing through confusable characters, homograph domain attacks, invisible text manipulation, and deceptive display of URLs and email addresses all exploit the gap between what Unicode text contains and what humans see. This guide covers the major Unicode-based phishing techniques, detection algorithms, and prevention strategies for developers building systems that handle user-facing text, URLs, email addresses, and identity information.

The Landscape of Unicode Phishing

Unicode-based phishing differs from traditional phishing in a fundamental way: instead of relying on user carelessness ("did you notice the URL was amaz0n.com?"), it exploits the impossibility of visual verification. When Latin "a" and Cyrillic "а" are pixel-identical in common fonts, no amount of user education helps, because there is nothing visible for the user to notice.
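The difference only shows up at the code point level. As a small illustration using Python's standard unicodedata module (the variable names are just for this sketch):

```python
import unicodedata

latin_a = "a"          # U+0061
cyrillic_a = "\u0430"  # U+0430, renders like Latin "a" in many fonts

# Print the code point and official Unicode name of each character.
for ch in (latin_a, cyrillic_a):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# The two strings may look the same on screen, but they never compare equal.
print(latin_a == cyrillic_a)
```

Any check that relies on the rendered appearance will miss this; a check that inspects code points will not.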

Attack Surface

Vector	Technique	Impact
Domain names	IDN homograph attacks	User visits malicious site believing it is legitimate
Email addresses	Confusable local parts	User replies to attacker's email
Display names	Visual impersonation	User trusts message from "admin" (Cyrillic а)
URLs in text	Mixed-script URLs	User clicks link they believe goes to trusted site
File names	Bidirectional override in filename	User opens "invoice.pdf" that is actually .exe
UI text	Invisible characters in labels	Buttons/labels show different text than they contain

Technique 1: Domain Homograph Attacks

The most well-documented Unicode phishing technique uses visually confusable characters to register domains that appear identical to legitimate ones:

Target	Homograph	Substitution
apple.com	аррӏе.com	Cyrillic а, р, ӏ, е
google.com	ɡооɡӏе.com	Latin ɡ, Cyrillic о, ӏ, е
paypal.com	раураӏ.com	Cyrillic р, а, у, ӏ
facebook.com	fаcebook.com	Single Cyrillic а

Detection

import unicodedata

# Known confusable pairs (subset — full list in Unicode confusables.txt)
CONFUSABLE_MAP = {
    0x0430: 0x0061,  # Cyrillic а -> Latin a
    0x0435: 0x0065,  # Cyrillic е -> Latin e
    0x043E: 0x006F,  # Cyrillic о -> Latin o
    0x0440: 0x0070,  # Cyrillic р -> Latin p
    0x0441: 0x0063,  # Cyrillic с -> Latin c
    0x0443: 0x0079,  # Cyrillic у -> Latin y
    0x0445: 0x0078,  # Cyrillic х -> Latin x
    0x04CF: 0x006C,  # Cyrillic ӏ -> Latin l
    0x0261: 0x0067,  # Latin ɡ -> Latin g
    0x03B1: 0x0061,  # Greek α -> Latin a
    0x03BF: 0x006F,  # Greek ο -> Latin o
}

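def compute_skeleton(text):
    # UTS #39 skeleton: NFD, map each character to its prototype, then NFD again
    decomposed = unicodedata.normalize("NFD", text)
    skeleton = []
    for ch in decomposed:
        cp = ord(ch)
        # Map confusable to prototype (unmapped characters pass through)
        mapped = CONFUSABLE_MAP.get(cp, cp)
        skeleton.append(chr(mapped))
    # Normalize result
    return unicodedata.normalize("NFD", "".join(skeleton))

def is_homograph(domain, target):
    # True when both strings reduce to the same skeleton.
    # Identical strings also match, so callers should exclude exact equality.
    return compute_skeleton(domain) == compute_skeleton(target)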

# Test
print(is_homograph("аррӏе", "apple"))  # True: all-Cyrillic look-alike
print(is_homograph("google", "google"))  # True: identical strings share a skeleton
print(is_homograph("g\u043e\u043egle", "google"))  # True: two Cyrillic о
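The hand-written map above covers only a few pairs. One way to go further is to load mappings from the Unicode confusables.txt data file, whose data lines have the form "source ; target ; type # comment" with code points in hex. The following is a minimal parsing sketch under that assumption; `parse_confusables` and `SAMPLE` are hypothetical names, and the sample lines are written by hand in the file's format rather than copied from the real file.

```python
def parse_confusables(lines):
    """Build {source code point: prototype string} from confusables.txt-style lines."""
    mapping = {}
    for line in lines:
        # Drop a leading byte order mark and any trailing "# comment".
        data = line.lstrip("\ufeff").split("#", 1)[0].strip()
        if not data:
            continue  # blank line or comment-only line
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2:
            continue  # not a data line
        source = int(fields[0], 16)
        # The target may be a sequence of several code points.
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        mapping[source] = target
    return mapping

# Illustrative lines in the file's format (assumption: not the real file contents).
SAMPLE = [
    "# sample data",
    "0430 ;\t0061 ;\tMA\t# CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A",
    "043E ;\t006F ;\tMA\t# CYRILLIC SMALL LETTER O -> LATIN SMALL LETTER O",
]

confusables = parse_confusables(SAMPLE)
print(confusables)
```

Because a target can be more than one character, a skeleton function built on this map should append the whole target string rather than a single `chr()` value.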

Prevention

Layer	Defense
Browser	Display Punycode for mixed-script domains
Registrar	Block registration of confusable domains
Email gateway	Flag emails from confusable domains
Application	Skeleton comparison against known brand domains
Certificate Authority	Verify domain ownership for look-alike requests
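The "display Punycode" defense can be approximated with Python's built-in punycode codec. This is a rough sketch only: real IDNA processing also maps and validates labels, which is skipped here, and `to_punycode_display` is a hypothetical helper name.

```python
def to_punycode_display(domain):
    """Return an ASCII form of a domain, encoding non-ASCII labels with Punycode."""
    labels = []
    for label in domain.split("."):
        if label.isascii():
            # Plain ASCII labels are shown unchanged.
            labels.append(label)
        else:
            # Non-ASCII labels get the "xn--" prefix plus their Punycode encoding.
            encoded = label.encode("punycode").decode("ascii")
            labels.append("xn--" + encoded)
    return ".".join(labels)

# "apple" spelled with Cyrillic look-alikes, as in the table above.
spoofed = "\u0430\u0440\u0440\u04cf\u0435.com"
print(to_punycode_display(spoofed))
print(to_punycode_display("apple.com"))
```

The spoofed domain prints as an "xn--" label that no longer resembles the brand name, while the genuine ASCII domain is shown as typed.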

Technique 2: Email Address Spoofing

Email local parts (before the @) can contain Unicode characters via EAI (Email Address Internationalization, RFC 6531). An attacker can register an email address with confusable characters:

Legitimate local part	Spoofed local part	Difference
admin	аdmin	Cyrillic а in "аdmin"
support	ѕupport	Cyrillic ѕ in "ѕupport"
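A cheap first signal is whether the local part contains any non-ASCII character at all, since such addresses can only be carried by the SMTPUTF8 extension that RFC 6531 defines. A minimal sketch follows; `needs_smtputf8` is a hypothetical helper name and example.com is a placeholder domain.

```python
def needs_smtputf8(address):
    """Return True if the local part (before the last @) has non-ASCII characters."""
    local_part, _, _domain = address.rpartition("@")
    return not local_part.isascii()

print(needs_smtputf8("admin@example.com"))        # plain ASCII local part
print(needs_smtputf8("\u0430dmin@example.com"))   # Cyrillic а in the local part
```

A system that has no legitimate need for internationalized local parts can treat a positive result as grounds for closer inspection, without rejecting internationalized mail outright.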

Detection for Email

def check_email_confusables(email):
    local_part, _, domain = email.rpartition("@")
    issues = []

    for i, ch in enumerate(local_part):
        cp = ord(ch)
        name = unicodedata.name(ch, "")

        # Flag non-Latin characters in a predominantly Latin context
        if cp > 0x007F:
            script = "UNKNOWN"
            for s in ["CYRILLIC", "GREEK", "ARMENIAN"]:
                if s in name:
                    script = s
                    break
            if script != "UNKNOWN":
                issues.append({
                    "position": i,
                    "char": ch,
                    "codepoint": f"U+{cp:04X}",
                    "script": script,
                    "confusable_with": chr(CONFUSABLE_MAP.get(cp, cp)),
                })

    return {
        "email": email,
        "suspicious": len(issues) > 0,
        "issues": issues,
    }

Technique 3: Filename Spoofing with BiDi Override

The right-to-left override (RLO, U+202E) character can reverse the displayed order of characters in a filename, hiding the true extension:

Actual filename	Displayed as	Technique
invoice\u202Efdp.exe	invoiceexe.pdf	RLO reverses "fdp.exe" to show "exe.pdf"
photo\u202Egnp.scr	photorcs.png	RLO reverses "gnp.scr" to show "rcs.png"

The user sees a name ending in ".pdf" and opens what they believe is a PDF, but the operating system runs the file according to its real extension, .exe.
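The effect can be imitated in a few lines. The sketch below is a deliberate simplification of the Unicode Bidirectional Algorithm: it assumes a single RLO followed only by left-to-right characters, and `simulate_rlo_display` is a hypothetical helper name.

```python
import os.path

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE

def simulate_rlo_display(name):
    """Approximate the rendered order: text after the first RLO appears reversed."""
    head, _, tail = name.partition(RLO)
    return head + tail[::-1]

actual = "invoice" + RLO + "fdp.exe"

print(simulate_rlo_display(actual))   # what the user roughly sees
print(os.path.splitext(actual)[1])    # what the system treats as the extension
```

The simulated display ends in ".pdf" while the stored name still ends in ".exe", which is exactly the mismatch the attack depends on.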

Detection

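# BiDi formatting controls: embeddings, overrides, and pop (U+202A to U+202E)
# plus isolates (U+2066 to U+2069). Only U+202D and U+202E are overrides proper,
# but all of them can reorder displayed text.
BIDI_OVERRIDES = {
    0x202A, 0x202B, 0x202C, 0x202D, 0x202E,
    0x2066, 0x2067, 0x2068, 0x2069,
}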

def check_filename_safety(filename):
    issues = []
    for i, ch in enumerate(filename):
        cp = ord(ch)
        if cp in BIDI_OVERRIDES:
            issues.append({
                "position": i,
                "codepoint": f"U+{cp:04X}",
                "name": unicodedata.name(ch, "UNKNOWN"),
                "risk": "BiDi override can disguise file extension",
            })
    return {
        "filename": repr(filename),
        "safe": len(issues) == 0,
        "issues": issues,
    }

# Check
result = check_filename_safety("invoice\u202Efdp.exe")
print(f"Safe: {result['safe']}")  # Safe: False

Prevention

Strip all BiDi override characters from filenames before display
Show file extensions in a separate, non-reversible UI element
Warn users when filenames contain Unicode control characters
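The first two measures can be combined: strip the controls, then hand the extension to the UI as a separate value. This is a sketch under the assumption that the UI renders the two parts in separate elements; `split_for_display` and its return keys are hypothetical.

```python
import os.path

# Same BiDi formatting controls as in the detection code above.
BIDI_CONTROLS = {
    0x202A, 0x202B, 0x202C, 0x202D, 0x202E,
    0x2066, 0x2067, 0x2068, 0x2069,
}

def split_for_display(filename):
    """Strip BiDi controls and return the stem and extension separately."""
    cleaned = "".join(ch for ch in filename if ord(ch) not in BIDI_CONTROLS)
    stem, extension = os.path.splitext(cleaned)
    return {
        "stem": stem,
        "extension": extension,
        # Lets the caller show a warning when something was removed.
        "had_bidi_controls": cleaned != filename,
    }

print(split_for_display("invoice\u202Efdp.exe"))
```

With the control removed, the name reads "invoicefdp" with the extension ".exe", so the disguise no longer works and the caller knows to warn the user.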

Technique 4: Display Name Impersonation

Social platforms, messaging apps, and email clients display names that users set for themselves. Unicode enables near-perfect impersonation:

Legitimate	Spoofed	Method
admin	аdmin	Cyrillic а
admin	admin\u200B	Trailing ZWSP
John Smith	Јohn Smith	Cyrillic Ј
CEO	СЕО	All Cyrillic

Comprehensive Display Name Validation

def validate_display_name(name, check_confusables=True):
    issues = []

    # 1. Strip invisible characters
    visible = []
    for ch in name:
        cp = ord(ch)
        if cp in {0x200B, 0x200C, 0x200D, 0x200E, 0x200F,
                  0x202A, 0x202B, 0x202C, 0x202D, 0x202E,
                  0x2060, 0x2061, 0x2062, 0x2063, 0x2064,
                  0x2066, 0x2067, 0x2068, 0x2069, 0xFEFF}:
            issues.append(f"Invisible character U+{cp:04X} at position {len(visible)}")
        else:
            visible.append(ch)
    clean_name = "".join(visible)

    # 2. Check for mixed scripts
    scripts = set()
    for ch in clean_name:
        name_str = unicodedata.name(ch, "")
        for script in ["LATIN", "CYRILLIC", "GREEK", "ARABIC", "HEBREW"]:
            if script in name_str:
                scripts.add(script)
                break

    if len(scripts) > 1:
        issues.append(f"Mixed scripts detected: {scripts}")

    # 3. Check for confusables with reserved names
    if check_confusables:
        reserved = ["admin", "administrator", "moderator", "support", "system"]
        skel = compute_skeleton(clean_name.lower())
        for r in reserved:
            if compute_skeleton(r) == skel and clean_name.lower() != r:
                issues.append(f"Confusable with reserved name: {r}")

    return {
        "original": name,
        "cleaned": clean_name,
        "issues": issues,
        "safe": len(issues) == 0,
    }

Technique 5: URL Spoofing in Text

In messaging and social media, URLs embedded in text can use Unicode to look like a different, trusted URL:

Displayed Text	Actual URL	Technique
https://bank.com/login	https://bаnk.com/login	Cyrillic а
Click here	(any URL)	HTML link text (not Unicode-specific)
bank.com\u2060/verify	bank.com + WJ + /verify	Word joiner in URL

URL Validation

def validate_url_text(url_text):
    issues = []

    # Check for non-ASCII in domain portion
    try:
        # Split URL into components
        from urllib.parse import urlparse
        parsed = urlparse(url_text)
        domain = parsed.hostname or ""

        # enumerate() reports the true position even when a character repeats
        for pos, ch in enumerate(domain):
            if ord(ch) > 0x007F:
                name = unicodedata.name(ch, f"U+{ord(ch):04X}")
                issues.append(
                    f"Non-ASCII in domain: '{ch}' ({name}) "
                    f"at position {pos}"
                )
    except ValueError:
        issues.append("Failed to parse URL")

    # Check for invisible characters anywhere in URL
    for i, ch in enumerate(url_text):
        cp = ord(ch)
        if cp in {0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF}:
            issues.append(f"Invisible char U+{cp:04X} at position {i}")

    return {
        "url": url_text,
        "suspicious": len(issues) > 0,
        "issues": issues,
    }
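When the displayed link text is itself a URL, it is also worth comparing its host with the host of the real target. A minimal sketch using urllib.parse follows; `hosts_differ` is a hypothetical helper, and link text that is not a URL (such as "Click here") is simply skipped.

```python
from urllib.parse import urlsplit

def hosts_differ(display_text, href):
    """Return True if the link text is a URL whose host differs from the target's."""
    shown = urlsplit(display_text.strip()).hostname
    actual = urlsplit(href.strip()).hostname
    if shown is None:
        # Link text such as "Click here" carries no host to compare.
        return False
    return shown != actual

# Text shows the real bank; the target uses Cyrillic а in "bank".
print(hosts_differ("https://bank.com/login", "https://b\u0430nk.com/login"))
print(hosts_differ("https://bank.com/login", "https://bank.com/login"))
print(hosts_differ("Click here", "https://bank.com/login"))
```

The comparison is on code points, so a host that only looks identical is reported as different.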

Building a Defense System

Layered Approach

Layer	Check	Action
Input validation	Strip invisible chars, detect BiDi overrides	Clean or reject
Script analysis	Detect mixed scripts	Warn or restrict
Confusable check	Skeleton comparison against known brands	Flag or block
Display safeguards	Show Punycode for IDN domains	Inform user
Monitoring	Log Unicode anomalies	Detect attacks
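For the script analysis layer, mixed-script detection should not flag combinations that are normal in some writing systems. The sketch below reuses the character-name heuristic from the earlier examples; the `SCRIPT_PREFIXES` table and the `ALLOWED_COMBINATIONS` allowlist are assumptions for illustration, loosely modelled on the restriction levels in UTS #39, and a production system should use real script property data instead.

```python
import unicodedata

# Heuristic: infer the script from the start of the Unicode character name.
SCRIPT_PREFIXES = {
    "LATIN": "Latin",
    "CYRILLIC": "Cyrillic",
    "GREEK": "Greek",
    "CJK UNIFIED IDEOGRAPH": "Han",
    "HIRAGANA": "Hiragana",
    "KATAKANA": "Katakana",
    "HANGUL": "Hangul",
}

# Assumed allowlist of script sets that legitimately occur together.
ALLOWED_COMBINATIONS = [
    {"Latin", "Han", "Hiragana", "Katakana"},  # Japanese text
    {"Latin", "Han", "Hangul"},                # Korean text
]

def scripts_in(text):
    """Collect the scripts (per the heuristic table) that appear in text."""
    found = set()
    for ch in text:
        name = unicodedata.name(ch, "")
        for prefix, script in SCRIPT_PREFIXES.items():
            if name.startswith(prefix):
                found.add(script)
                break
    return found

def is_suspicious_mix(text):
    """Return True if text mixes scripts outside every allowed combination."""
    scripts = scripts_in(text)
    if len(scripts) <= 1:
        return False
    return not any(scripts <= allowed for allowed in ALLOWED_COMBINATIONS)

print(is_suspicious_mix("p\u0430ypal"))                          # Latin + Cyrillic
print(is_suspicious_mix("\u65e5\u672c\u306e\u30ab\u30e1\u30e9"))  # Han + Hiragana + Katakana
```

Latin mixed with Cyrillic falls outside both allowed sets and is flagged, while ordinary Japanese text passes.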

Implementation Checklist

For any system processing user-facing text:

Check	Priority	Context
Strip BiDi overrides from user input	Critical	All input fields
Normalize (NFC) before comparison	Critical	Auth, search, matching
Mixed-script detection for usernames	High	Registration, login
Confusable check against reserved names	High	Username registration
Strip invisible characters from identifiers	High	Usernames, slugs, IDs
IDN homograph check for URLs	High	Link preview, email
BiDi override check in filenames	High	File upload
Log Unicode anomalies	Medium	Security monitoring
Confusable check for display names	Medium	Social/messaging
Full UTS #39 compliance	Ideal	Enterprise applications
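The "Normalize (NFC) before comparison" item matters because the same visible text can be encoded in more than one way. A short sketch with the standard unicodedata module (`same_identifier` is a hypothetical helper name):

```python
import unicodedata

def same_identifier(a, b):
    """Compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "caf\u00e9"     # "é" as a single precomposed code point
decomposed = "cafe\u0301"  # "e" followed by COMBINING ACUTE ACCENT

print(composed == decomposed)                 # raw comparison
print(same_identifier(composed, decomposed))  # comparison after NFC

# Normalization does not fold confusables: Cyrillic а stays distinct from Latin a.
print(same_identifier("admin", "\u0430dmin"))
```

Normalization and confusable detection are therefore separate checks, which is why the checklist lists them as separate items.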

Key Takeaways

Unicode-based phishing exploits the gap between what text contains (code points) and what humans see (rendered glyphs), making visual verification impossible for confusable characters.
The five major techniques are: domain homograph attacks (confusable domain names), email address spoofing (confusable local parts), filename spoofing (BiDi override hiding extensions), display name impersonation (confusable usernames), and URL spoofing (confusable characters in links).
Effective defense requires a layered approach: input validation (strip invisible chars), script analysis (detect mixing), confusable detection (skeleton comparison), display safeguards (Punycode), and monitoring (log anomalies).
The Unicode Consortium's confusables.txt database and UTS #39 specification are the authoritative references — use libraries that implement them (ICU, PyICU) rather than hand-rolled confusable lists.
BiDi override characters should be stripped from virtually all user input — filenames, usernames, messages, and URLs — as they have almost no legitimate use in those contexts.
Prevention must be balanced with internationalization — not all non-Latin text is suspicious, and legitimate script combinations (Japanese Han+Kana, Korean Hangul+Hanja) must be allowed.
