IDN Homograph Attack Detection
IDN homograph attacks use look-alike Unicode characters to register domain names that appear identical to legitimate ones, for example by replacing a Latin 'a' with a Cyrillic 'а'.
Internationalized Domain Names (IDNs) allow domain names to contain characters from non-Latin scripts — enabling domains like "пример.com" (Russian), "例え.jp" (Japanese), or "مثال.com" (Arabic). While this is essential for a multilingual internet, it opens the door to IDN homograph attacks: registering domains that look identical to legitimate ones by substituting visually similar characters from different scripts. This guide explains how IDN homograph attacks work, how browsers and registrars defend against them, and how developers can detect and prevent script mixing in domain names.
How IDN Homograph Attacks Work
The Punycode Foundation
IDNs are encoded using Punycode (RFC 3492), which converts Unicode domain labels to ASCII-compatible encoding (ACE) prefixed with "xn--":
| Display Form | Punycode (ACE) | Script |
|---|---|---|
| münchen.de | xn--mnchen-3ya.de | Latin (with umlaut) |
| 例え.jp | xn--r8jz45g.jp | Japanese (Han + Hiragana) |
| аррӏе.com | xn--80ak6aa92e.com | Cyrillic (including palochka ӏ, U+04CF) |
DNS resolves the Punycode form. The browser displays the Unicode form for user convenience. The attack vector: an attacker registers the Punycode domain whose Unicode rendering looks like a legitimate domain.
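The mapping between the two forms can be reproduced with Python's built-in `punycode` codec. This is a minimal sketch, not a full IDNA implementation: the helper names are illustrative, and it skips the normalization and validity checks that real IDNA processing applies before encoding.

```python
ACE_PREFIX = "xn--"


def label_to_ace(label: str) -> str:
    """Encode one Unicode label to ACE form; ASCII labels pass through."""
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def label_to_unicode(label: str) -> str:
    """Decode one ACE label back to Unicode; other labels pass through."""
    if label.lower().startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label


def domain_to_ace(domain: str) -> str:
    """Convert every label of a domain to the form DNS resolves."""
    return ".".join(label_to_ace(part) for part in domain.split("."))


def domain_to_unicode(domain: str) -> str:
    """Convert every label of a domain to the form a browser may display."""
    return ".".join(label_to_unicode(part) for part in domain.split("."))


print(domain_to_ace("m\u00fcnchen.de"))          # xn--mnchen-3ya.de
print(domain_to_unicode("xn--80ak6aa92e.com"))   # the all-Cyrillic look-alike
```

Decoding a suspicious `xn--` label this way is a quick first step when triaging a reported link, because it exposes exactly which code points the label contains.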
A Classic Example
The most famous demonstration targeted "apple.com":
| Domain | Characters | Script |
|---|---|---|
| apple.com | a-p-p-l-e | Latin (legitimate) |
| аррӏе.com | а-р-р-ӏ-е | All Cyrillic: а, р, р, palochka ӏ (U+04CF), е |
In a browser that displays IDNs without protection, both appear as "apple.com" in the address bar. The Punycode form xn--80ak6aa92e.com reveals the deception, but users never see Punycode unless the browser intervenes.
Partial Homographs
Not all attacks require a perfect match. Partial homographs exploit the fact that humans read words holistically rather than letter-by-letter:
| Legitimate | Homograph | Substituted Characters |
|---|---|---|
| google.com | gооgle.com | Two Cyrillic "о" (U+043E) |
| paypal.com | pаypal.com | One Cyrillic "а" (U+0430) |
| microsoft.com | miсrosoft.com | One Cyrillic "с" (U+0441) |
Even a single substituted character can create a convincing spoof because users do not examine every character individually.
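Because the substitution is invisible to the eye, it helps to compare the two labels code point by code point. The following is a small sketch using only the standard library; `find_substitutions` is an illustrative helper name, and it assumes both labels have the same length.

```python
import unicodedata


def find_substitutions(legitimate: str, candidate: str):
    """Return (index, expected_char, actual_code_point, actual_name) per mismatch."""
    if len(legitimate) != len(candidate):
        raise ValueError("labels must have the same length for a positional diff")
    diffs = []
    for index, (want, got) in enumerate(zip(legitimate, candidate)):
        if want != got:
            diffs.append(
                (index, want, f"U+{ord(got):04X}", unicodedata.name(got, "UNNAMED"))
            )
    return diffs


pairs = [
    ("google", "g\u043e\u043egle"),
    ("paypal", "p\u0430ypal"),
    ("microsoft", "mi\u0441rosoft"),
]
for legit, spoof in pairs:
    # Each tuple names the position and the character that was swapped in.
    print(legit, find_substitutions(legit, spoof))
```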
Browser Defenses
Modern browsers implement IDN display policies based on script analysis:
Chrome's Algorithm
Chrome (and other Chromium-based browsers) displays Punycode instead of Unicode when a domain label triggers any of these conditions:
| Rule | Description |
|---|---|
| Mixed-script | Label mixes scripts that are not in an approved combination |
| Dangerous overlap | Label contains characters from scripts known for Latin confusables (Cyrillic, Greek) |
| Whole-script confusable | Entire label can be interpreted as another script |
| Single character | Label is a single non-ASCII character |
| ICU skeleton check | Skeleton of the label matches a top-domain skeleton |
If triggered, the user sees xn--80ak6aa92e.com instead of аррӏе.com.
Firefox's Approach
Firefox follows a similar sequence of checks:
- Check if all characters are from a single script (safe if so)
- Check against a list of allowed script combinations (e.g., Han + Hiragana + Katakana)
- Apply confusable detection using Unicode confusables data
- Fall back to Punycode display if any check fails
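The four checks above can be sketched as a single display decision. This is an illustration under stated assumptions, not Firefox's actual code: script detection uses Unicode character names as a rough stand-in for the Script property, `ALLOWED_COMBINATIONS` is a hypothetical allowlist, and the confusable check is a pluggable callback that defaults to "never confusable".

```python
import unicodedata

# Keywords matched against Unicode character names (rough approximation).
SCRIPT_KEYWORDS = ("LATIN", "CYRILLIC", "GREEK", "CJK", "HANGUL", "HIRAGANA", "KATAKANA")

# Hypothetical allowlist: scripts inside one set may appear together.
ALLOWED_COMBINATIONS = (
    {"CJK", "HIRAGANA", "KATAKANA"},
    {"CJK", "HANGUL"},
)


def scripts_in(label):
    """Collect the scripts whose keyword appears in each character's name."""
    found = set()
    for ch in label:
        name = unicodedata.name(ch, "")
        for keyword in SCRIPT_KEYWORDS:
            if keyword in name:
                found.add(keyword)
                break
    return found


def never_confusable(label):
    """Placeholder for a real confusable check (assumption of this sketch)."""
    return False


def to_punycode(label):
    return "xn--" + label.encode("punycode").decode("ascii")


def display_form(label, is_confusable=never_confusable):
    """Return the label as it would be shown: Unicode or Punycode."""
    if label.isascii():
        return label
    scripts = scripts_in(label)
    if len(scripts) <= 1:                      # step 1: single script
        return label
    if not any(scripts <= combo for combo in ALLOWED_COMBINATIONS):
        return to_punycode(label)              # step 2 failed
    if is_confusable(label):
        return to_punycode(label)              # step 3 failed
    return label


print(display_form("g\u043e\u043egle"))        # mixed Latin + Cyrillic: Punycode
print(display_form("\u6771\u4eac\u30bf\u30ef\u30fc"))  # Japanese mix: Unicode
```

Note that an all-Cyrillic label passes step 1 in this sketch, which is why browsers pair script checks with whole-script confusable detection.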
Safari's Policy
Safari takes one of the strictest approaches, displaying Punycode for any domain label that mixes scripts in ways not on Apple's allowlist.
Detecting Script Mixing in Code
Python Implementation
```python
import unicodedata
from collections import defaultdict


def get_script(char):
    """Infer a character's script from its Unicode name (simplified)."""
    # Production code should use ICU or the Unicode Script property.
    name = unicodedata.name(char, "")
    if not name:
        return "UNKNOWN"

    # Check major scripts (simplified)
    script_keywords = [
        "LATIN", "CYRILLIC", "GREEK", "ARABIC", "HEBREW",
        "DEVANAGARI", "CJK", "HANGUL", "HIRAGANA", "KATAKANA",
        "THAI", "ARMENIAN", "GEORGIAN", "ETHIOPIC",
    ]
    for script in script_keywords:
        if script in name:
            return script

    # Digits, punctuation, spaces, and anything else not matched above are
    # treated as COMMON. Letters from scripts missing from the list also land
    # here, so this heuristic under-reports mixing for those scripts.
    return "COMMON"


def analyze_domain_label(label):
    """Group the characters of one label by script and flag script mixing."""
    scripts = defaultdict(list)
    for ch in label:
        script = get_script(ch)
        scripts[script].append(ch)

    non_common = {s for s in scripts if s != "COMMON" and s != "UNKNOWN"}
    result = {
        "label": label,
        "scripts_found": dict(scripts),
        "non_common_scripts": non_common,
        "is_mixed_script": len(non_common) > 1,
    }
    return result


# Test
print(analyze_domain_label("apple"))
# is_mixed_script: False (all Latin)

print(analyze_domain_label("\u0430\u0440\u0440\u04cf\u0435"))
# is_mixed_script: False (all Cyrillic, so a script-mixing check alone
# misses this whole-script spoof)

print(analyze_domain_label("g\u043e\u043egle"))
# is_mixed_script: True (Latin + Cyrillic)
```
JavaScript Implementation
```javascript
// Using Unicode property escapes (ES2018+)
function getScript(char) {
  if (/\p{Script=Latin}/u.test(char)) return "Latin";
  if (/\p{Script=Cyrillic}/u.test(char)) return "Cyrillic";
  if (/\p{Script=Greek}/u.test(char)) return "Greek";
  if (/\p{Script=Arabic}/u.test(char)) return "Arabic";
  if (/\p{Script=Hebrew}/u.test(char)) return "Hebrew";
  if (/\p{Script=Han}/u.test(char)) return "Han";
  if (/\p{Script=Hiragana}/u.test(char)) return "Hiragana";
  if (/\p{Script=Katakana}/u.test(char)) return "Katakana";
  if (/\p{Script=Common}/u.test(char)) return "Common";
  return "Other";
}

// Note: the whole string is examined, including the TLD, so a Cyrillic label
// under ".com" is reported as mixed. Pass individual labels to avoid that.
function detectMixedScript(domain) {
  const scripts = new Set();
  for (const char of domain) {
    const script = getScript(char);
    if (script !== "Common" && script !== "Other") {
      scripts.add(script);
    }
  }
  return {
    scripts: [...scripts],
    isMixed: scripts.size > 1,
  };
}

console.log(detectMixedScript("apple.com"));
// { scripts: ["Latin"], isMixed: false }

console.log(detectMixedScript("g\u043E\u043Egle.com"));
// { scripts: ["Latin", "Cyrillic"], isMixed: true }
```
Confusable Skeleton Comparison
For production use, compare skeletons using the Unicode confusables data:
```python
# Using the ICU library via PyICU (requires the ICU C/C++ libraries)
# pip install PyICU
import icu

spoof_checker = icu.SpoofChecker()
result = spoof_checker.areConfusable("apple", "\u0430\u0440\u0440\u04cf\u0435")
# result is non-zero if the two strings are confusable

# Alternative: manual skeleton comparison using confusables.txt
# Download from: https://unicode.org/Public/security/latest/confusables.txt
```
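The manual alternative can be sketched roughly as follows. It parses lines in the `source ; target ; type # comment` layout used by confusables.txt and applies a simplified skeleton transform (NFD, per-character replacement, NFD again). The sample entries are hand-written for this sketch and are an assumption, not a copy of the published data; load the real file for anything beyond experimentation.

```python
import unicodedata

# Hand-written sample in the confusables.txt layout (assumed mappings).
SAMPLE_CONFUSABLES = """\
# source ; target ; type
0430 ; 0061 ; MA  # Cyrillic a  -> Latin a
0440 ; 0070 ; MA  # Cyrillic er -> Latin p
04CF ; 006C ; MA  # palochka    -> Latin l
0435 ; 0065 ; MA  # Cyrillic ie -> Latin e
043E ; 006F ; MA  # Cyrillic o  -> Latin o
0441 ; 0063 ; MA  # Cyrillic es -> Latin c
"""


def parse_confusables(text):
    """Build a {source: target} map from confusables.txt-style lines."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blanks
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2:
            continue
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        mapping[source] = target
    return mapping


def skeleton(text, mapping):
    """Simplified skeleton: NFD, replace each character, NFD again."""
    decomposed = unicodedata.normalize("NFD", text)
    replaced = "".join(mapping.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", replaced)


def are_confusable(a, b, mapping):
    """Two distinct strings are confusable when their skeletons match."""
    return a != b and skeleton(a, mapping) == skeleton(b, mapping)


mapping = parse_confusables(SAMPLE_CONFUSABLES)
print(are_confusable("apple", "\u0430\u0440\u0440\u04cf\u0435", mapping))
print(are_confusable("google", "g\u043e\u043egle", mapping))
```

Comparing skeletons rather than raw strings means a candidate label can be checked against a list of protected names with one precomputed skeleton per name.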
Registrar-Level Defenses
Domain registrars implement additional protections:
| Defense | Description |
|---|---|
| Script-homogeneous labels | Require each label to use characters from a single script |
| Blocked confusable registrations | Prevent registration of domains confusable with existing ones |
| Registry-level restrictions | Some TLDs restrict allowed scripts (e.g., .рф requires Cyrillic) |
| ICANN IDN guidelines | Framework for TLD operators to define IDN policies |
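As a rough illustration of the script-homogeneity and per-TLD rows, a registration check might look like the sketch below. The policy table, function names, and name-based script detection are all assumptions made for this example; real registries publish their own IDN tables and would also need to accommodate legitimate multi-script labels.

```python
import unicodedata

SCRIPT_KEYWORDS = (
    "LATIN", "CYRILLIC", "GREEK", "ARABIC", "HEBREW",
    "CJK", "HANGUL", "HIRAGANA", "KATAKANA",
)

# Hypothetical policy table: TLD -> scripts a label under it may use.
TLD_ALLOWED_SCRIPTS = {
    "\u0440\u0444": {"CYRILLIC"},   # .рф
}


def label_scripts(label):
    """Approximate the scripts in a label from Unicode character names."""
    found = set()
    for ch in label:
        name = unicodedata.name(ch, "")
        for keyword in SCRIPT_KEYWORDS:
            if keyword in name:
                found.add(keyword)
                break
    return found


def registration_problems(label, tld):
    """Return a list of policy violations; an empty list means acceptable."""
    problems = []
    scripts = label_scripts(label)
    if len(scripts) > 1:
        problems.append("label mixes scripts: " + ", ".join(sorted(scripts)))
    allowed = TLD_ALLOWED_SCRIPTS.get(tld)
    if allowed is not None and not scripts <= allowed:
        problems.append("TLD does not permit: " + ", ".join(sorted(scripts - allowed)))
    return problems


print(registration_problems("\u043f\u0440\u0438\u043c\u0435\u0440", "\u0440\u0444"))  # Cyrillic under .рф
print(registration_problems("g\u043e\u043egle", "com"))                               # mixed-script label
```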
ICANN's Approach
ICANN's Label Generation Rules (LGR) define per-script character repertoires for the root zone. Each script community determines which characters are permissible and which cross-script variants should be blocked.
Allowed Script Combinations
Not all script mixing is malicious. Legitimate combinations include:
| Combination | Example | Legitimate Use |
|---|---|---|
| Han + Hiragana + Katakana | 東京タワー.jp | Standard Japanese |
| Latin + Han | university大学.cn | Bilingual Chinese |
| Latin + Common | café.com | Accented letters such as é are still Latin script; Common covers digits and hyphens |
| Hangul + Han | 서울大.kr | Korean with Hanja |
Browsers and validation libraries maintain allowlists of these legitimate combinations.
Key Takeaways
- IDN homograph attacks exploit the fact that characters from different scripts (Latin "a" vs. Cyrillic "а") can look identical, enabling domain spoofing.
- Browsers defend by displaying Punycode (xn--...) instead of Unicode when mixed-script or confusable domain labels are detected.
- Detection requires script analysis (identifying which scripts are present in a label) and confusable comparison (checking if a label's skeleton matches a known domain).
- Not all script mixing is malicious — Japanese domains legitimately combine Han, Hiragana, and Katakana. Allowlists of legitimate combinations are essential.
- Registrars provide an additional defense layer through blocked confusable registrations and ICANN Label Generation Rules.
- For production systems, use libraries with access to the Unicode confusables.txt database (ICU, PyICU) rather than hand-rolled heuristics.