Preventing Unicode-based Phishing
Phishing attacks increasingly exploit Unicode confusables, bidirectional overrides, and invisible characters to create deceptive URLs, spoofed sender addresses, and misleading link text. This guide covers the techniques used in Unicode-based phishing attacks and the detection, prevention, and user-education strategies to defend against them.
Phishing attacks cost billions of dollars annually, and Unicode provides attackers with a sophisticated toolkit for making malicious content look legitimate. Visual spoofing through confusable characters, homograph domain attacks, invisible text manipulation, and deceptive display of URLs and email addresses all exploit the gap between what Unicode text contains and what humans see. The sections below walk through each technique with detection code and prevention measures, aimed at developers building systems that handle user-facing text, URLs, email addresses, and identity information.
The Landscape of Unicode Phishing
Unicode-based phishing differs from traditional phishing in a fundamental way: instead of relying on user carelessness ("did you notice the URL was amaz0n.com?"), it makes visual verification impossible. When Latin "a" (U+0061) and Cyrillic "а" (U+0430) render identically in common fonts, no amount of user education will help a reader tell them apart.
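To make the gap concrete, here is a minimal sketch that compares the two letters at the code-point level. The helper name `describe` is illustrative only and does not come from any library.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED")) for ch in text]

latin = "a"           # U+0061
cyrillic = "\u0430"   # looks like Latin "a" in many fonts

print(latin == cyrillic)    # False: the code points differ
print(describe(latin))      # [('U+0061', 'LATIN SMALL LETTER A')]
print(describe(cyrillic))   # [('U+0430', 'CYRILLIC SMALL LETTER A')]
```

Dumping code points and names like this is often the quickest way to confirm a suspected spoof during an investigation.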
Attack Surface
| Vector | Technique | Impact |
|---|---|---|
| Domain names | IDN homograph attacks | User visits malicious site believing it is legitimate |
| Email addresses | Confusable local parts | User replies to attacker's email |
| Display names | Visual impersonation | User trusts message from "admin" (Cyrillic а) |
| URLs in text | Mixed-script URLs | User clicks link they believe goes to trusted site |
| File names | Bidirectional override in filename | User opens "invoice.pdf" that is actually .exe |
| UI text | Invisible characters in labels | Buttons/labels show different text than they contain |
Technique 1: Domain Homograph Attacks
The best-documented Unicode phishing technique uses visually confusable characters to register domains that appear identical to legitimate ones:
| Target | Homograph | Substitution |
|---|---|---|
| apple.com | аррӏе.com | Cyrillic а, р, ӏ, е |
| google.com | ɡооɡӏе.com | Latin ɡ, Cyrillic о, ӏ, е |
| paypal.com | раураӏ.com | Cyrillic р, а, у, ӏ |
| facebook.com | fаcebook.com | Single Cyrillic а |
Detection
import unicodedata

# Known confusable pairs (subset; the full list is in Unicode's confusables.txt)
CONFUSABLE_MAP = {
    0x0430: 0x0061,  # Cyrillic а -> Latin a
    0x0435: 0x0065,  # Cyrillic е -> Latin e
    0x043E: 0x006F,  # Cyrillic о -> Latin o
    0x0440: 0x0070,  # Cyrillic р -> Latin p
    0x0441: 0x0063,  # Cyrillic с -> Latin c
    0x0443: 0x0079,  # Cyrillic у -> Latin y
    0x0445: 0x0078,  # Cyrillic х -> Latin x
    0x04CF: 0x006C,  # Cyrillic ӏ -> Latin l
    0x0261: 0x0067,  # Latin ɡ -> Latin g
    0x03B1: 0x0061,  # Greek α -> Latin a
    0x03BF: 0x006F,  # Greek ο -> Latin o
}

def compute_skeleton(text):
    """Simplified skeleton: decompose, map confusables to prototypes, normalize."""
    # Decompose first so precomposed characters expose their base letters
    decomposed = unicodedata.normalize("NFD", text)
    # Map each confusable to its prototype; unmapped characters pass through
    mapped = "".join(chr(CONFUSABLE_MAP.get(ord(ch), ord(ch))) for ch in decomposed)
    # Normalize the result
    return unicodedata.normalize("NFD", mapped)

def is_homograph(domain, target):
    """True when domain differs from target but has the same skeleton."""
    return domain != target and compute_skeleton(domain) == compute_skeleton(target)

# Test
print(is_homograph("аррӏе", "apple"))  # True
print(is_homograph("google", "google"))  # False: an identical string is not a spoof
print(is_homograph("g\u043e\u043egle", "google"))  # True
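A hand-written map like the one above covers only a few pairs. Where a maintained UTS #39 library is not an option, the mapping can be loaded from the Unicode data file instead. The sketch below assumes the semicolon-separated layout of confusables.txt (source code point, target code point sequence, type, then a `#` comment). The `SAMPLE` lines are hand-written in that layout for illustration rather than copied from the real file, and the function names are hypothetical.

```python
import unicodedata

def parse_confusables(lines):
    """Build {source_char: prototype_string} from confusables.txt-style lines."""
    mapping = {}
    for line in lines:
        line = line.lstrip("\ufeff")            # tolerate a leading BOM if present
        data = line.split("#", 1)[0].strip()    # drop comments
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2:
            continue
        source = chr(int(fields[0], 16))
        # A target may consist of several code points
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        mapping[source] = target
    return mapping

def skeleton(text, mapping):
    """Roughly the UTS #39 outline: NFD, map to prototypes, NFD again."""
    nfd = unicodedata.normalize("NFD", text)
    mapped = "".join(mapping.get(ch, ch) for ch in nfd)
    return unicodedata.normalize("NFD", mapped)

# Illustrative lines only (assumption: same layout as the real data file)
SAMPLE = [
    "# sample lines in the confusables.txt layout",
    "0430 ;\t0061 ;\tMA\t# CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A",
    "0441 ;\t0063 ;\tMA\t# CYRILLIC SMALL LETTER ES -> LATIN SMALL LETTER C",
    "006D ;\t0072 006E ;\tMA\t# a target made of two code points",
]

mapping = parse_confusables(SAMPLE)
print(skeleton("\u0441\u0430t", mapping))  # cat
```

Because targets can be multi-character strings, the mapping stores strings rather than single code points, which the integer-to-integer `CONFUSABLE_MAP` above cannot represent.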
Prevention
| Layer | Defense |
|---|---|
| Browser | Display Punycode for mixed-script domains |
| Registrar | Block registration of confusable domains |
| Email gateway | Flag emails from confusable domains |
| Application | Skeleton comparison against known brand domains |
| Certificate Authority | Verify domain ownership for look-alike requests |
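The browser-layer defense in the table can be approximated with Python's built-in `punycode` codec. This is a sketch under one stated assumption: the codec performs only the raw RFC 3492 encoding, with none of the IDNA mapping or validation a real browser applies. The function name `to_punycode` is hypothetical.

```python
def to_punycode(domain):
    """Encode each non-ASCII label as an ASCII 'xn--' label."""
    labels = []
    for label in domain.split("."):
        if label.isascii():
            labels.append(label)
        else:
            # Raw Punycode only; no IDNA mapping or validation is applied here
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)

spoof = "\u0430\u0440\u0440\u04cf\u0435.com"  # Cyrillic look-alike of "apple.com"
print(to_punycode("apple.com"))  # apple.com
print(to_punycode(spoof))        # an "xn--" form that no longer resembles apple.com
```

Showing the ASCII form removes the visual resemblance, though it also makes legitimate internationalized domains unreadable, which is why browsers typically restrict it to suspicious cases such as mixed scripts.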
Technique 2: Email Address Spoofing
Email local parts (before the @) can contain Unicode characters via EAI (Email Address Internationalization, RFC 6531). An attacker can register an email address with confusable characters:
| Legitimate | Spoofed | Difference |
|---|---|---|
| admin@… | аdmin@… | Cyrillic а in "аdmin" |
| support@… | ѕupport@… | Cyrillic ѕ in "ѕupport" |
Detection for Email
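# Reuses unicodedata and CONFUSABLE_MAP from the domain example above
def check_email_confusables(email):
    """Flag Cyrillic, Greek, or Armenian characters in the local part."""
    local_part, _, _ = email.rpartition("@")
    issues = []
    for i, ch in enumerate(local_part):
        cp = ord(ch)
        # Flag non-Latin characters in a predominantly Latin context
        if cp <= 0x007F:
            continue
        # Infer the script from the character name (a heuristic, not the Script property)
        name = unicodedata.name(ch, "")
        script = next((s for s in ("CYRILLIC", "GREEK", "ARMENIAN") if s in name), None)
        if script is None:
            continue
        mapped = CONFUSABLE_MAP.get(cp)
        issues.append({
            "position": i,
            "char": ch,
            "codepoint": f"U+{cp:04X}",
            "script": script,
            # None when the character is missing from the partial CONFUSABLE_MAP
            "confusable_with": chr(mapped) if mapped is not None else None,
        })
    return {
        "email": email,
        "suspicious": len(issues) > 0,
        "issues": issues,
    }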
Technique 3: Filename Spoofing with BiDi Override
The right-to-left override (RLO, U+202E) character can reverse the displayed order of characters in a filename, hiding the true extension:
| Actual filename | Displayed as | Technique |
|---|---|---|
| invoice\u202Efdp.exe | invoiceexe.pdf | RLO reverses "fdp.exe" to show "exe.pdf" |
| photo\u202Egnp.scr | photorcs.png | RLO reverses "gnp.scr" to show "rcs.png" |
The user sees "invoiceexe.pdf" and opens what they believe is a PDF, but the operating system treats the file as an .exe and executes it.
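The effect can be reproduced with a deliberately rough simulation. The sketch below assumes that everything after the first RLO is simply shown reversed, which ignores the rest of the Unicode Bidirectional Algorithm. `simulate_rlo_display` is a hypothetical helper, not a rendering API.

```python
import os.path

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE

def simulate_rlo_display(filename):
    """Approximate the rendered text: characters after the first RLO appear reversed."""
    head, sep, tail = filename.partition(RLO)
    return head + tail[::-1] if sep else filename

actual = "invoice\u202efdp.exe"
print(simulate_rlo_display(actual))  # invoiceexe.pdf
print(os.path.splitext(actual)[1])   # .exe
```

The stored name still ends in `.exe`, so any code that dispatches on the real extension behaves differently from what the user expects from the displayed name.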
Detection
import unicodedata

# BiDi embedding, override, and isolate controls
# (strictly, only U+202D and U+202E are overrides)
BIDI_OVERRIDES = {
    0x202A, 0x202B, 0x202C, 0x202D, 0x202E,
    0x2066, 0x2067, 0x2068, 0x2069,
}

def check_filename_safety(filename):
    """Report every BiDi control character found in a filename."""
    issues = []
    for i, ch in enumerate(filename):
        cp = ord(ch)
        if cp in BIDI_OVERRIDES:
            issues.append({
                "position": i,
                "codepoint": f"U+{cp:04X}",
                "name": unicodedata.name(ch, "UNKNOWN"),
                "risk": "BiDi control can disguise file extension",
            })
    return {
        "filename": repr(filename),  # repr() makes the control character visible
        "safe": len(issues) == 0,
        "issues": issues,
    }

# Check
result = check_filename_safety("invoice\u202Efdp.exe")
print(f"Safe: {result['safe']}")  # Safe: False
Prevention
- Strip all BiDi override characters from filenames before display
- Show file extensions in a separate, non-reversible UI element
- Warn users when filenames contain Unicode control characters
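The first two measures can be combined in one small helper. This is a sketch only: the name `split_for_display` and the returned dictionary shape are assumptions, and the control set adds the implicit marks (U+200E, U+200F, U+061C) to the explicit controls listed earlier.

```python
import os.path

# Explicit embedding, override, and isolate controls, plus the implicit marks
BIDI_CONTROLS = set(
    "\u202a\u202b\u202c\u202d\u202e"
    "\u2066\u2067\u2068\u2069"
    "\u200e\u200f\u061c"
)

def split_for_display(filename):
    """Strip BiDi controls, then return the stem and the real extension separately."""
    stripped = "".join(ch for ch in filename if ch not in BIDI_CONTROLS)
    stem, ext = os.path.splitext(stripped)
    return {
        "stem": stem,
        "extension": ext,
        "had_bidi_controls": stripped != filename,
    }

print(split_for_display("invoice\u202efdp.exe"))
# {'stem': 'invoicefdp', 'extension': '.exe', 'had_bidi_controls': True}
```

Rendering `extension` in its own UI element keeps the real file type visible whatever the stem contains, and `had_bidi_controls` can drive the user warning.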
Technique 4: Display Name Impersonation
Social platforms, messaging apps, and email clients display names that users set for themselves. Unicode enables near-perfect impersonation:
| Legitimate | Spoofed | Method |
|---|---|---|
| admin | аdmin | Cyrillic а |
| admin | admin\u200B | Trailing ZWSP |
| John Smith | Јohn Smith | Cyrillic Ј |
| CEO | СЕО | All Cyrillic |
Comprehensive Display Name Validation
# Reuses unicodedata and compute_skeleton from the domain example above
def validate_display_name(name, check_confusables=True):
    issues = []

    # 1. Strip invisible characters (zero-width, BiDi controls, invisible operators, BOM)
    invisible = {0x200B, 0x200C, 0x200D, 0x200E, 0x200F,
                 0x202A, 0x202B, 0x202C, 0x202D, 0x202E,
                 0x2060, 0x2061, 0x2062, 0x2063, 0x2064,
                 0x2066, 0x2067, 0x2068, 0x2069, 0xFEFF}
    visible = []
    for i, ch in enumerate(name):
        cp = ord(ch)
        if cp in invisible:
            # Report the position in the original input, not in the cleaned string
            issues.append(f"Invisible character U+{cp:04X} at position {i}")
        else:
            visible.append(ch)
    clean_name = "".join(visible)

    # 2. Check for mixed scripts (script inferred from the character name)
    scripts = set()
    for ch in clean_name:
        name_str = unicodedata.name(ch, "")
        for script in ["LATIN", "CYRILLIC", "GREEK", "ARABIC", "HEBREW"]:
            if script in name_str:
                scripts.add(script)
                break
    if len(scripts) > 1:
        # sorted() keeps the message stable across runs
        issues.append(f"Mixed scripts detected: {sorted(scripts)}")

    # 3. Check for confusables with reserved names
    if check_confusables:
        reserved = ["admin", "administrator", "moderator", "support", "system"]
        lowered = clean_name.lower()
        skel = compute_skeleton(lowered)
        for r in reserved:
            # An exact match is the reserved name itself, not an impersonation
            if compute_skeleton(r) == skel and lowered != r:
                issues.append(f"Confusable with reserved name: {r}")

    return {
        "original": name,
        "cleaned": clean_name,
        "issues": issues,
        "safe": len(issues) == 0,
    }
Technique 5: URL Spoofing in Text
In messaging and social media, URLs embedded in text can use Unicode to appear as different URLs:
| Displayed Text | Actual URL | Technique |
|---|---|---|
| https://bank.com/login | https://bаnk.com/login | Cyrillic а |
| Click here | (any URL) | HTML link text (not Unicode-specific) |
| bank.com/verify | bank.com\u2060/verify | Invisible word joiner (U+2060) embedded in the URL |
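The rows above share one pattern: the host a reader believes they see differs from the host the link resolves to. A minimal sketch of that comparison follows, assuming the caller has already extracted the visible text and the target URL from the message. `host_of` and `link_is_deceptive` are hypothetical names.

```python
from urllib.parse import urlparse

INVISIBLE = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def host_of(text):
    """Return the lowercased hostname of a URL-like string, or None."""
    candidate = text.strip()
    if "://" not in candidate:
        candidate = "//" + candidate  # lets urlparse read "bank.com/x" as host + path
    try:
        return urlparse(candidate).hostname
    except ValueError:
        return None

def link_is_deceptive(display_text, href):
    """Flag links whose visible host differs from the target, or that hide characters."""
    if any(ch in INVISIBLE for ch in href):
        return True
    shown, actual = host_of(display_text), host_of(href)
    # Compare only when the visible text itself looks like a URL
    looks_like_url = shown is not None and "." in shown
    return looks_like_url and shown != actual

print(link_is_deceptive("https://bank.com/login", "https://b\u0430nk.com/login"))  # True
print(link_is_deceptive("Click here", "https://example.org/"))                     # False
```

Generic link text such as "Click here" is deliberately not flagged, since it makes no claim about the destination. Catching that case needs a separate policy.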
URL Validation
import unicodedata
from urllib.parse import urlparse

def validate_url_text(url_text):
    issues = []

    # Check for non-ASCII in domain portion
    # (urlparse only finds a hostname when the text has a scheme or starts with "//")
    try:
        domain = urlparse(url_text).hostname or ""
    except ValueError:
        domain = ""
        issues.append("Failed to parse URL")
    for i, ch in enumerate(domain):
        if ord(ch) > 0x007F:
            name = unicodedata.name(ch, f"U+{ord(ch):04X}")
            # enumerate() gives the true position even when a character repeats
            issues.append(f"Non-ASCII in domain: '{ch}' ({name}) at position {i}")

    # Check for invisible characters anywhere in URL
    for i, ch in enumerate(url_text):
        cp = ord(ch)
        if cp in {0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF}:
            issues.append(f"Invisible char U+{cp:04X} at position {i}")

    return {
        "url": url_text,
        "suspicious": len(issues) > 0,
        "issues": issues,
    }
Building a Defense System
Layered Approach
| Layer | Check | Action |
|---|---|---|
| Input validation | Strip invisible chars, detect BiDi overrides | Clean or reject |
| Script analysis | Detect mixed scripts | Warn or restrict |
| Confusable check | Skeleton comparison against known brands | Flag or block |
| Display safeguards | Show Punycode for IDN domains | Inform user |
| Monitoring | Log Unicode anomalies | Detect attacks |
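The script-analysis layer has to tolerate legitimate combinations while rejecting suspicious ones. The sketch below infers a script from the first word of each character's Unicode name, a rough stand-in for the real Script property, which Python's standard library does not expose. The allowed combinations loosely follow the "Highly Restrictive" level described in UTS #39. All names here are hypothetical.

```python
import unicodedata

# First words of Unicode character names that this sketch treats as scripts
KNOWN = {"LATIN", "CYRILLIC", "GREEK", "ARMENIAN", "HEBREW", "ARABIC",
         "HIRAGANA", "KATAKANA", "HANGUL", "CJK", "BOPOMOFO"}

# Multi-script sets accepted as legitimate (assumption: modeled on UTS #39)
ALLOWED_COMBINATIONS = [
    {"LATIN", "CJK", "HIRAGANA", "KATAKANA"},  # Japanese
    {"LATIN", "CJK", "BOPOMOFO"},              # Chinese
    {"LATIN", "CJK", "HANGUL"},                # Korean
]

def scripts_in(text):
    """Collect the (approximate) scripts of the alphabetic characters in text."""
    found = set()
    for ch in text:
        if not ch.isalpha():
            continue
        first_word = unicodedata.name(ch, "").split(" ")[0]
        if first_word in KNOWN:
            found.add(first_word)
    return found

def is_acceptable_mix(text):
    """Single-script text passes; mixed text must fit one allowed combination."""
    scripts = scripts_in(text)
    if len(scripts) <= 1:
        return True
    return any(scripts <= allowed for allowed in ALLOWED_COMBINATIONS)

print(is_acceptable_mix("\u6771\u4eac\u30bf\u30ef\u30fc"))  # True: Han + Katakana
print(is_acceptable_mix("p\u0430ypal"))                     # False: Latin + Cyrillic
```

A production system would take the Script and Script_Extensions properties from ICU or a similar library rather than parsing character names.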
Implementation Checklist
For any system processing user-facing text:
| Check | Priority | Context |
|---|---|---|
| Strip BiDi overrides from user input | Critical | All input fields |
| Normalize (NFC) before comparison | Critical | Auth, search, matching |
| Mixed-script detection for usernames | High | Registration, login |
| Confusable check against reserved names | High | Username registration |
| Strip invisible characters from identifiers | High | Usernames, slugs, IDs |
| IDN homograph check for URLs | High | Link preview, email |
| BiDi override check in filenames | High | File upload |
| Log Unicode anomalies | Medium | Security monitoring |
| Confusable check for display names | Medium | Social/messaging |
| Full UTS #39 compliance | Ideal | Enterprise applications |
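The "Normalize (NFC) before comparison" item is easy to get wrong, so a small illustration may help. The helper name `nfc_equal` is hypothetical.

```python
import unicodedata

def nfc_equal(a, b):
    """Compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "caf\u00e9"      # é as a single code point
decomposed = "cafe\u0301"   # e followed by COMBINING ACUTE ACCENT

print(composed == decomposed)           # False
print(nfc_equal(composed, decomposed))  # True
print(nfc_equal("a", "\u0430"))         # False: NFC does not fold confusables
```

The last line shows why normalization and confusable detection are separate checklist items: NFC unifies equivalent encodings of the same character, but Latin "a" and Cyrillic "а" remain distinct.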
Key Takeaways
- Unicode-based phishing exploits the gap between what text contains (code points) and what humans see (rendered glyphs), making visual verification impossible for confusable characters.
- The five major techniques are: domain homograph attacks (confusable domain names), email address spoofing (confusable local parts), filename spoofing (BiDi override hiding extensions), display name impersonation (confusable usernames), and URL spoofing (confusable characters in links).
- Effective defense requires a layered approach: input validation (strip invisible chars), script analysis (detect mixing), confusable detection (skeleton comparison), display safeguards (Punycode), and monitoring (log anomalies).
- The Unicode Consortium's confusables.txt database and UTS #39 specification are the authoritative references — use libraries that implement them (ICU, PyICU) rather than hand-rolled confusable lists.
- BiDi override characters should be stripped from virtually all user input — filenames, usernames, messages, and URLs — as they have almost no legitimate use in those contexts.
- Prevention must be balanced with internationalization — not all non-Latin text is suspicious, and legitimate script combinations (Japanese Han+Kana, Korean Hangul+Hanja) must be allowed.