IDN Homograph Attack Detection

Internationalized Domain Names (IDNs) allow domain names to contain characters from non-Latin scripts — enabling domains like "пример.com" (Russian), "例え.jp" (Japanese), or "مثال.com" (Arabic). While this is essential for a multilingual internet, it opens the door to IDN homograph attacks: registering domains that look identical to legitimate ones by substituting visually similar characters from different scripts. This guide explains how IDN homograph attacks work, how browsers and registrars defend against them, and how developers can detect and prevent script mixing in domain names.

How IDN Homograph Attacks Work

The Punycode Foundation

IDNs are encoded using Punycode (RFC 3492), which converts Unicode domain labels to ASCII-compatible encoding (ACE) prefixed with "xn--":

Display Form	Punycode (ACE)	Script
münchen.de	xn--mnchen-3ya.de	Latin (ü is a Latin-script letter)
例え.jp	xn--r8jz45g.jp	Japanese
аррӏе.com	xn--80ak6aa92e.com	Cyrillic (including palochka ӏ, U+04CF)

DNS only ever resolves the Punycode form; the browser converts it back to Unicode for display. That gap is the attack vector: an attacker registers a Punycode domain whose Unicode rendering looks like a legitimate domain.
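
The round trip between display form and ACE form can be reproduced with Python's built-in punycode codec. This is a minimal sketch: the helper names to_ace and to_unicode are illustrative, and it skips the IDNA mapping and validation steps that a real resolver or registrar would apply.

```python
ACE_PREFIX = "xn--"

def to_ace(domain: str) -> str:
    """Convert each non-ASCII label of a domain to its xn-- form."""
    labels = []
    for label in domain.lower().split("."):
        if label.isascii():
            labels.append(label)  # ASCII labels are left unchanged
        else:
            encoded = label.encode("punycode").decode("ascii")
            labels.append(ACE_PREFIX + encoded)
    return ".".join(labels)

def to_unicode(domain: str) -> str:
    """Decode xn-- labels back to their Unicode display form."""
    labels = []
    for label in domain.split("."):
        if label.startswith(ACE_PREFIX):
            raw = label[len(ACE_PREFIX):]
            labels.append(raw.encode("ascii").decode("punycode"))
        else:
            labels.append(label)
    return ".".join(labels)

print(to_ace("münchen.de"))             # xn--mnchen-3ya.de
print(to_unicode("xn--mnchen-3ya.de"))  # münchen.de
```

Decoding a suspicious xn-- label this way is often the quickest route to seeing what a user would be shown.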

A Classic Example

The most famous demonstration targeted "apple.com":

Domain	Characters	Script
apple.com	a-p-p-l-e	Latin (legitimate)
аррӏе.com	а-р-р-ӏ-е	All Cyrillic: а, р, р, palochka ӏ (U+04CF), е

In a browser that displays IDNs without protection, both appear as "apple.com" in the address bar. The Punycode form xn--80ak6aa92e.com reveals the deception, but users never see Punycode unless the browser intervenes.
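
Listing the code points behind a label makes the substitution visible even when the rendering does not. The snippet below is a small sketch using only the standard library; describe_label is a hypothetical helper name.

```python
import unicodedata

def describe_label(label: str) -> list:
    """Return one 'U+XXXX NAME' string per character in the label."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}"
        for ch in label
    ]

for line in describe_label("\u0430\u0440\u0440\u04cf\u0435"):
    print(line)
# U+0430 CYRILLIC SMALL LETTER A
# U+0440 CYRILLIC SMALL LETTER ER
# U+0440 CYRILLIC SMALL LETTER ER
# U+04CF CYRILLIC SMALL LETTER PALOCHKA
# U+0435 CYRILLIC SMALL LETTER IE
```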

Partial Homographs

Not all attacks require a perfect match. Partial homographs exploit the fact that humans read words holistically rather than letter-by-letter:

Legitimate	Homograph	Substituted Characters
google.com	gооgle.com	Two Cyrillic "о" (U+043E)
paypal.com	pаypal.com	One Cyrillic "а" (U+0430)
microsoft.com	miсrosoft.com	One Cyrillic "с" (U+0441)

Even a single substituted character can create a convincing spoof because users do not examine every character individually.
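
When the legitimate name is known, a position-by-position comparison pinpoints exactly which characters were swapped. This is an illustrative sketch (find_substitutions is a hypothetical helper) and assumes both labels have the same length.

```python
def find_substitutions(legitimate: str, candidate: str) -> list:
    """List (index, expected_char, actual_char, code_point) where labels differ."""
    if len(legitimate) != len(candidate):
        raise ValueError("labels must have the same length")
    return [
        (i, expected, actual, f"U+{ord(actual):04X}")
        for i, (expected, actual) in enumerate(zip(legitimate, candidate))
        if expected != actual
    ]

print(find_substitutions("paypal", "p\u0430ypal"))
# [(1, 'a', 'а', 'U+0430')]
print(find_substitutions("google", "g\u043e\u043egle"))
# [(1, 'o', 'о', 'U+043E'), (2, 'o', 'о', 'U+043E')]
```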

Browser Defenses

Modern browsers implement IDN display policies based on script analysis:

Chrome's Algorithm

Chrome (and Chromium-based browsers) display Punycode instead of Unicode when a domain label triggers any of these conditions:

Rule	Description
Mixed-script	Label mixes scripts that are not in an approved combination
Dangerous overlap	Label mixes Latin with Cyrillic or Greek, the scripts with the most Latin lookalikes
Whole-script confusable	Every character in the label has a lookalike in another script, so the whole label reads as that script (e.g., all-Cyrillic аррӏе reading as Latin)
Single character	Label is a single non-ASCII character
Skeleton check	Confusable skeleton of the label (computed with ICU) matches the skeleton of a popular domain

If triggered, the user sees xn--80ak6aa92e.com instead of аррӏе.com.
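
As a rough illustration of how rules like these become a display decision, the sketch below chains two simplified checks and falls back to the xn-- form when either fires. The names script_of, RULES, and display_form are hypothetical, script detection is guessed from Unicode character names, and none of this is Chromium's actual code.

```python
import unicodedata

LATIN_LOOKALIKE_SCRIPTS = {"CYRILLIC", "GREEK"}

def script_of(ch: str) -> str:
    """Crude script guess: first word of the Unicode character name."""
    if ch.isascii():
        return "LATIN" if ch.isalpha() else "COMMON"
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"

def mixes_latin_with_lookalikes(label: str) -> bool:
    scripts = {script_of(ch) for ch in label}
    return "LATIN" in scripts and bool(scripts & LATIN_LOOKALIKE_SCRIPTS)

def is_single_non_ascii_char(label: str) -> bool:
    return len(label) == 1 and not label.isascii()

# Assumption: a real browser has more rules; these two are stand-ins.
RULES = [mixes_latin_with_lookalikes, is_single_non_ascii_char]

def display_form(label: str) -> str:
    """Return the Unicode label if no rule fires, otherwise its xn-- form."""
    if any(rule(label) for rule in RULES):
        return "xn--" + label.encode("punycode").decode("ascii")
    return label

print(display_form("apple"))             # shown as typed
print(display_form("g\u043e\u043egle"))  # shown in xn-- form
```

An all-Cyrillic spoof such as аррӏе passes both of these simplified checks, which is why the whole-script and skeleton rules exist.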

Firefox's Approach

Firefox applies a comparable sequence of checks:

1. Check whether all characters in the label come from a single script (no mixing to evaluate if so)
2. Otherwise, check the scripts against a list of allowed combinations (e.g., Han + Hiragana + Katakana)
3. Apply confusable detection using Unicode confusables data
4. Fall back to Punycode display if any check fails

Safari's Policy

Safari takes one of the strictest approaches, displaying Punycode for any domain label that mixes scripts in ways not on Apple's allowlist.

Detecting Script Mixing in Code

Python Implementation

import unicodedata
from collections import defaultdict

def get_script(char):
    # Use Unicode character name to infer script
    # Production code should use ICU or the Unicode Script property
    name = unicodedata.name(char, "")
    if not name:
        return "UNKNOWN"

    # Check major scripts (simplified)
    script_keywords = [
        "LATIN", "CYRILLIC", "GREEK", "ARABIC", "HEBREW",
        "DEVANAGARI", "CJK", "HANGUL", "HIRAGANA", "KATAKANA",
        "THAI", "ARMENIAN", "GEORGIAN", "ETHIOPIC",
    ]
    for script in script_keywords:
        if script in name:
            return script

    # Everything else (digits, hyphens, punctuation, and any script not
    # listed above) falls through to COMMON in this simplified version
    return "COMMON"

def analyze_domain_label(label):
    scripts = defaultdict(list)
    for ch in label:
        script = get_script(ch)
        scripts[script].append(ch)

    non_common = {s for s in scripts if s != "COMMON" and s != "UNKNOWN"}

    result = {
        "label": label,
        "scripts_found": dict(scripts),
        "non_common_scripts": non_common,
        "is_mixed_script": len(non_common) > 1,
    }
    return result

# Test
print(analyze_domain_label("apple"))
# is_mixed_script: False (all Latin)

print(analyze_domain_label("\u0430\u0440\u0440\u04cf\u0435"))
# is_mixed_script: False (every character is Cyrillic, so this check alone misses the spoof)

print(analyze_domain_label("g\u043e\u043egle"))
# is_mixed_script: True (Latin + Cyrillic)

JavaScript Implementation

// Using Unicode property escapes (ES2018+)
function getScript(char) {
  if (/\p{Script=Latin}/u.test(char)) return "Latin";
  if (/\p{Script=Cyrillic}/u.test(char)) return "Cyrillic";
  if (/\p{Script=Greek}/u.test(char)) return "Greek";
  if (/\p{Script=Arabic}/u.test(char)) return "Arabic";
  if (/\p{Script=Hebrew}/u.test(char)) return "Hebrew";
  if (/\p{Script=Han}/u.test(char)) return "Han";
  if (/\p{Script=Hiragana}/u.test(char)) return "Hiragana";
  if (/\p{Script=Katakana}/u.test(char)) return "Katakana";
  if (/\p{Script=Common}/u.test(char)) return "Common";
  return "Other";
}

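// Analyze each label separately: a Cyrillic label under a Latin TLD
// (e.g. "пример.com") is not a mixed-script label.
function detectMixedScript(domain) {
  return domain.split(".").map((label) => {
    const scripts = new Set();
    for (const char of label) {
      const script = getScript(char);
      // Scripts that getScript does not list come back as "Other" and are
      // skipped here, so extend getScript before relying on this check.
      if (script !== "Common" && script !== "Other") {
        scripts.add(script);
      }
    }
    return { label, scripts: [...scripts], isMixed: scripts.size > 1 };
  });
}

console.log(detectMixedScript("apple.com"));
// [ { label: "apple", scripts: ["Latin"], isMixed: false },
//   { label: "com", scripts: ["Latin"], isMixed: false } ]

console.log(detectMixedScript("g\u043E\u043Egle.com"));
// [ { label: "gооgle", scripts: ["Latin", "Cyrillic"], isMixed: true },
//   { label: "com", scripts: ["Latin"], isMixed: false } ]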

Confusable Skeleton Comparison

For production use, compare skeletons using the Unicode confusables data:

# Using the ICU library via PyICU
# pip install PyICU
import icu

spoof_checker = icu.SpoofChecker()
result = spoof_checker.areConfusable("apple", "\u0430\u0440\u0440\u04cf\u0435")
# A non-zero result means ICU considers the two strings confusable
print(result != 0)

# Alternative: manual skeleton comparison using confusables.txt
# Download from: https://unicode.org/Public/security/latest/confusables.txt
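
The manual alternative can be sketched without any third-party dependency. The example below parses a handful of entries written in the same "source ; target ; type" layout as confusables.txt and compares skeletons. The embedded sample and the helper names are assumptions for illustration; a real deployment would load the full file, and the skeleton function only approximates the procedure in UTS #39.

```python
import unicodedata

SAMPLE_CONFUSABLES = """\
# source ; target ; type  (same field layout as confusables.txt)
0430 ; 0061 ; MA  # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
0435 ; 0065 ; MA  # CYRILLIC SMALL LETTER IE -> LATIN SMALL LETTER E
043E ; 006F ; MA  # CYRILLIC SMALL LETTER O -> LATIN SMALL LETTER O
0440 ; 0070 ; MA  # CYRILLIC SMALL LETTER ER -> LATIN SMALL LETTER P
0441 ; 0063 ; MA  # CYRILLIC SMALL LETTER ES -> LATIN SMALL LETTER C
"""

def parse_confusables(text: str) -> dict:
    """Build {source_char: prototype_string} from confusables-style lines."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        source, target = [field.strip() for field in line.split(";")[:2]]
        table[chr(int(source, 16))] = "".join(
            chr(int(cp, 16)) for cp in target.split()
        )
    return table

def skeleton(label: str, table: dict) -> str:
    """Approximate skeleton: NFD, map each character to its prototype, NFD."""
    decomposed = unicodedata.normalize("NFD", label)
    mapped = "".join(table.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)

def are_confusable(a: str, b: str, table: dict) -> bool:
    return skeleton(a, table) == skeleton(b, table)

TABLE = parse_confusables(SAMPLE_CONFUSABLES)
print(are_confusable("paypal", "p\u0430ypal", TABLE))  # True
print(are_confusable("paypal", "payment", TABLE))      # False
```

Comparing skeletons rather than raw strings is what lets this catch single-script spoofs that a mixed-script check lets through.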

Registrar-Level Defenses

Domain registrars implement additional protections:

Defense	Description
Script-homogeneous labels	Require each label to use characters from a single script
Blocked confusable registrations	Prevent registration of domains confusable with existing ones
Registry-level restrictions	Some TLDs restrict allowed scripts (e.g., .рф requires Cyrillic)
ICANN IDN guidelines	Framework for TLD operators to define IDN policies
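
A registry-level restriction amounts to checking every character of a requested label against the repertoire the TLD permits. The sketch below uses a hypothetical repertoire for .рф (lowercase Russian letters, digits, and hyphen); the table and the function name are assumptions for illustration, and a real registry publishes its own IDN table.

```python
import string

# Hypothetical repertoire: а-я (U+0430..U+044F) plus ё (U+0451),
# ASCII digits, and hyphen.
RUSSIAN_LETTERS = {chr(cp) for cp in range(0x0430, 0x0450)} | {"\u0451"}
REPERTOIRES = {
    "рф": RUSSIAN_LETTERS | set(string.digits) | {"-"},
}

def rejected_characters(label: str, tld: str) -> list:
    """Return the characters of label that fall outside the TLD's repertoire."""
    allowed = REPERTOIRES[tld]
    return [ch for ch in label.lower() if ch not in allowed]

print(rejected_characters("пример", "рф"))
# []
print(rejected_characters("\u0430\u0440\u0440\u04cf\u0435", "рф"))
# ['ӏ']
```

Under this assumed repertoire the palochka in the аррӏе spoof is rejected, because it is a Cyrillic letter that the Russian alphabet does not use.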

ICANN's Approach

ICANN's Label Generation Rules (LGR) define per-script character repertoires for the root zone. Each script community determines which characters are permissible and which cross-script variants should be blocked.

Allowed Script Combinations

Not all script mixing is malicious. Legitimate combinations include:

Combination	Example	Legitimate Use
Han + Hiragana + Katakana	東京タワー.jp	Standard Japanese
Latin + Han	university大学.cn	Bilingual Chinese
Latin + Common	café-24.com	Accented Latin with digits and a hyphen (é itself is Latin script)
Hangul + Han	서울大.kr	Korean with Hanja

Browsers and validation libraries maintain allowlists of these legitimate combinations.
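
An allowlist check of this kind can be sketched as a subset test against permitted script sets. The combinations below follow the East Asian pairings in the table and the "Highly Restrictive" profile of UTS #39. Script detection is again a crude guess from character names, and all names in the snippet are hypothetical.

```python
import unicodedata

# First word of the Unicode character name -> script label (simplified).
NAME_PREFIX_TO_SCRIPT = {
    "LATIN": "Latin", "CYRILLIC": "Cyrillic", "GREEK": "Greek",
    "CJK": "Han", "HIRAGANA": "Hiragana", "KATAKANA": "Katakana",
    "HANGUL": "Hangul", "BOPOMOFO": "Bopomofo",
    "KATAKANA-HIRAGANA": "Common",  # e.g. the prolonged sound mark ー
}

ALLOWED_COMBINATIONS = [
    {"Latin", "Han", "Hiragana", "Katakana"},
    {"Latin", "Han", "Hangul"},
    {"Latin", "Han", "Bopomofo"},
]

def script_of(ch: str) -> str:
    if ch.isascii():
        return "Latin" if ch.isalpha() else "Common"
    first_word = unicodedata.name(ch, "UNKNOWN").split(" ")[0]
    return NAME_PREFIX_TO_SCRIPT.get(first_word, "Unknown")

def is_allowed_combination(label: str) -> bool:
    """True for single-script labels and for allowlisted script mixes."""
    scripts = {script_of(ch) for ch in label} - {"Common"}
    if len(scripts) <= 1:
        return True
    return any(scripts <= combo for combo in ALLOWED_COMBINATIONS)

print(is_allowed_combination("東京タワー"))        # True
print(is_allowed_combination("서울大"))            # True
print(is_allowed_combination("g\u043e\u043egle"))  # False
```

A label that passes this check can still be a whole-script confusable, so it complements skeleton comparison instead of replacing it.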

Key Takeaways

IDN homograph attacks exploit the fact that characters from different scripts (Latin "a" vs. Cyrillic "а") can look identical, enabling domain spoofing.
Browsers defend by displaying Punycode (xn--...) instead of Unicode when mixed-script or confusable domain labels are detected.
Detection requires script analysis (identifying which scripts are present in a label) and confusable comparison (checking if a label's skeleton matches a known domain).
Not all script mixing is malicious — Japanese domains legitimately combine Han, Hiragana, and Katakana. Allowlists of legitimate combinations are essential.
Registrars provide an additional defense layer through blocked confusable registrations and ICANN Label Generation Rules.
For production systems, use libraries with access to the Unicode confusables.txt database (ICU, PyICU) rather than hand-rolled heuristics.