
IDN Homograph Attack Detection

IDN homograph attacks use look-alike Unicode characters to register domain names that appear identical to legitimate ones — for example, replacing a Latin 'a' with a Cyrillic 'а'. This guide explains how to detect and prevent IDN homograph attacks in domain registration systems, browser UI, and link validation code.


Internationalized Domain Names (IDNs) allow domain names to contain characters from non-Latin scripts — enabling domains like "пример.com" (Russian), "例え.jp" (Japanese), or "مثال.com" (Arabic). While this is essential for a multilingual internet, it opens the door to IDN homograph attacks: registering domains that look identical to legitimate ones by substituting visually similar characters from different scripts. This guide explains how IDN homograph attacks work, how browsers and registrars defend against them, and how developers can detect and prevent script mixing in domain names.

How IDN Homograph Attacks Work

The Punycode Foundation

IDNs are encoded using Punycode (RFC 3492), which converts Unicode domain labels to ASCII-compatible encoding (ACE) prefixed with "xn--":

| Display Form | Punycode (ACE) | Script |
| --- | --- | --- |
| münchen.de | xn--mnchen-3ya.de | Latin (with ü) |
| 例え.jp | xn--r8jz45g.jp | Japanese (Han + Hiragana) |
| аррӏе.com | xn--80ak6aa92e.com | Cyrillic (including the palochka ӏ, U+04CF) |

DNS resolves the Punycode form. The browser displays the Unicode form for user convenience. The attack vector: an attacker registers the Punycode domain whose Unicode rendering looks like a legitimate domain.
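The mapping between the two forms can be reproduced with Python's built-in "punycode" codec. This is a minimal sketch: the helper names label_to_ace, ace_to_label, and domain_to_ace are illustrative, and the code deliberately skips the mapping and validation steps that a full IDNA implementation performs before encoding.

```python
def label_to_ace(label: str) -> str:
    """Return the ASCII-compatible form of one label (no IDNA mapping or validation)."""
    if label.isascii():
        return label  # plain ASCII labels are stored unchanged
    return "xn--" + label.encode("punycode").decode("ascii")


def ace_to_label(ace: str) -> str:
    """Return the Unicode form of one label, or the input if it is not an ACE label."""
    if ace.lower().startswith("xn--"):
        return ace[4:].encode("ascii").decode("punycode")
    return ace


def domain_to_ace(domain: str) -> str:
    """Convert each dot-separated label independently."""
    return ".".join(label_to_ace(part) for part in domain.split("."))


print(domain_to_ace("m\u00fcnchen.de"))  # xn--mnchen-3ya.de
```

Converting a displayed name back to its ACE form before comparing or logging it is a cheap way to make a look-alike label visibly different from the ASCII original.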

A Classic Example

The most famous demonstration targeted "apple.com":

| Domain | Characters | Script |
| --- | --- | --- |
| apple.com | a-p-p-l-e | Latin (legitimate) |
| аррӏе.com | а-р-р-ӏ-е | Entirely Cyrillic: а, р, р, ӏ (palochka, U+04CF), е |

In a browser that displays IDNs without protection, both appear as "apple.com" in the address bar. The Punycode form xn--80ak6aa92e.com reveals the deception, but users never see Punycode unless the browser intervenes.
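Listing the code points side by side makes the difference visible even when the glyphs are not. The sketch below uses only the standard unicodedata module; the helper name describe is illustrative.

```python
import unicodedata


def describe(label: str) -> list[str]:
    """Return 'U+XXXX NAME' for every character in the label."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}" for ch in label]


legit = "apple"
spoof = "\u0430\u0440\u0440\u04cf\u0435"  # the Cyrillic look-alike

# Print the two labels character by character, side by side.
for left, right in zip(describe(legit), describe(spoof)):
    print(f"{left:<30} | {right}")
```

Every character on the right-hand side reports a CYRILLIC name, which is why a mixed-script check alone does not catch this particular spoof.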

Partial Homographs

Not all attacks require a perfect match. Partial homographs exploit the fact that humans read words holistically rather than letter-by-letter:

| Legitimate | Homograph | Substituted Characters |
| --- | --- | --- |
| google.com | gооgle.com | Two Cyrillic "о" (U+043E) |
| paypal.com | pаypal.com | One Cyrillic "а" (U+0430) |
| microsoft.com | miсrosoft.com | One Cyrillic "с" (U+0441) |

Even a single substituted character can create a convincing spoof because users do not examine every character individually.
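A quick way to surface such substitutions is to report every non-ASCII character in a label that is otherwise ASCII. This is a small sketch for inspection only (the function name non_ascii_positions is illustrative); it is not a verdict on whether a label is malicious.

```python
def non_ascii_positions(label: str) -> list[tuple[int, str, str]]:
    """Return (index, character, code point) for each non-ASCII character."""
    return [
        (i, ch, f"U+{ord(ch):04X}")
        for i, ch in enumerate(label)
        if ord(ch) > 0x7F
    ]


for candidate in ("g\u043e\u043egle", "p\u0430ypal", "mi\u0441rosoft"):
    print(candidate, non_ascii_positions(candidate))
```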

Browser Defenses

Modern browsers implement IDN display policies based on script analysis:

Chrome's Algorithm

Chrome (and Chromium-based browsers) display Punycode instead of Unicode when a domain label triggers any of these conditions:

| Rule | Description |
| --- | --- |
| Mixed-script | Label mixes scripts that are not in an approved combination |
| Dangerous overlap | Label contains characters from scripts known for Latin confusables (Cyrillic, Greek) |
| Whole-script confusable | Entire label can be interpreted as another script |
| Single character | Label is a single non-ASCII character |
| ICU skeleton check | Skeleton of the label matches a top-domain skeleton |

If triggered, the user sees xn--80ak6aa92e.com instead of аррӏе.com.
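Two of these rules, the single-character rule and the skeleton check, can be sketched in a few lines. This is not Chrome's implementation: CONFUSABLES is a hand-picked five-entry map and TOP_DOMAIN_SKELETONS is a placeholder list, both assumptions made for illustration. A real checker would use the full Unicode confusables data and a maintained list of popular domains.

```python
# Hypothetical, tiny confusable map (Cyrillic -> Latin look-alikes).
CONFUSABLES = {
    "\u0430": "a",
    "\u0435": "e",
    "\u043e": "o",
    "\u0440": "p",
    "\u0441": "c",
}

# Placeholder list of protected names, already in skeleton form.
TOP_DOMAIN_SKELETONS = {"google", "paypal", "apple", "microsoft"}


def skeleton(label: str) -> str:
    """Map each character to its look-alike prototype, if one is known."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in label.lower())


def display_form(label: str) -> str:
    """Return 'punycode' or 'unicode' for a single label."""
    if label.isascii():
        return "unicode"  # nothing to hide
    if len(label) == 1:
        return "punycode"  # single non-ASCII character rule
    if skeleton(label) in TOP_DOMAIN_SKELETONS:
        return "punycode"  # skeleton matches a protected name
    return "unicode"


print(display_form("g\u043e\u043egle"))  # punycode
```

An all-Cyrillic word such as "пример" keeps its Unicode display under this sketch, because its skeleton does not collide with any protected name.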

Firefox's Approach

Firefox uses a similar algorithm that differs in some details:

  1. Check if all characters are from a single script (safe if so)
  2. Check against a list of allowed script combinations (e.g., Han + Hiragana + Katakana)
  3. Apply confusable detection using Unicode confusables data
  4. Fall back to Punycode display if any check fails

Safari's Policy

Safari takes one of the strictest approaches, displaying Punycode for any domain label that mixes scripts in ways not on Apple's allowlist.

Detecting Script Mixing in Code

Python Implementation

import unicodedata
from collections import defaultdict

def get_script(char):
    # Use Unicode character name to infer script
    # Production code should use ICU or the Unicode Script property
    name = unicodedata.name(char, "")
    if not name:
        return "UNKNOWN"

    # Check major scripts (simplified)
    script_keywords = [
        "LATIN", "CYRILLIC", "GREEK", "ARABIC", "HEBREW",
        "DEVANAGARI", "CJK", "HANGUL", "HIRAGANA", "KATAKANA",
        "THAI", "ARMENIAN", "GEORGIAN", "ETHIOPIC",
    ]
    for script in script_keywords:
        if script in name:
            return script

    category = unicodedata.category(char)
    if category[0] in ("N", "P", "Z"):
        return "COMMON"  # digits, punctuation, separators
    # Letters and marks from scripts not listed above should not pass as COMMON
    return "UNKNOWN"

def analyze_domain_label(label):
    scripts = defaultdict(list)
    for ch in label:
        script = get_script(ch)
        scripts[script].append(ch)

    non_common = {s for s in scripts if s != "COMMON" and s != "UNKNOWN"}

    result = {
        "label": label,
        "scripts_found": dict(scripts),
        "non_common_scripts": non_common,
        "is_mixed_script": len(non_common) > 1,
    }
    return result

# Test
print(analyze_domain_label("apple"))
# is_mixed_script: False (all Latin)

print(analyze_domain_label("\u0430\u0440\u0440\u04cf\u0435"))
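# is_mixed_script: False (every character is Cyrillic, so this whole-script
# spoof passes a mixing check and needs a confusable comparison instead)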

print(analyze_domain_label("g\u043e\u043egle"))
# is_mixed_script: True (Latin + Cyrillic)

JavaScript Implementation

// Using Unicode property escapes (ES2018+)
function getScript(char) {
  if (/\p{Script=Latin}/u.test(char)) return "Latin";
  if (/\p{Script=Cyrillic}/u.test(char)) return "Cyrillic";
  if (/\p{Script=Greek}/u.test(char)) return "Greek";
  if (/\p{Script=Arabic}/u.test(char)) return "Arabic";
  if (/\p{Script=Hebrew}/u.test(char)) return "Hebrew";
  if (/\p{Script=Han}/u.test(char)) return "Han";
  if (/\p{Script=Hiragana}/u.test(char)) return "Hiragana";
  if (/\p{Script=Katakana}/u.test(char)) return "Katakana";
  if (/\p{Script=Common}/u.test(char)) return "Common";
  return "Other";
}

function detectMixedScript(label) {
  // Expects a single label. Split the domain on "." first; otherwise
  // "пример.com" (Cyrillic label, Latin TLD) is reported as mixed.
  const scripts = new Set();
  for (const char of label) {
    const script = getScript(char);
    // Note: scripts not listed in getScript() come back as "Other" and are ignored here.
    if (script !== "Common" && script !== "Other") {
      scripts.add(script);
    }
  }
  return {
    scripts: [...scripts],
    isMixed: scripts.size > 1,
  };
}

console.log(detectMixedScript("apple"));
// { scripts: ["Latin"], isMixed: false }

console.log(detectMixedScript("g\u043E\u043Egle"));
// { scripts: ["Latin", "Cyrillic"], isMixed: true }

Confusable Skeleton Comparison

For production use, compare skeletons using the Unicode confusables data:

# Using the ICU library via PyICU (pip install PyICU)
import icu

spoof_checker = icu.SpoofChecker()
result = spoof_checker.areConfusable("apple", "\u0430\u0440\u0440\u04cf\u0435")
# result is a bitmask of check flags; non-zero means ICU considers the strings confusable

# Alternative: manual skeleton comparison using confusables.txt
# Download from: https://unicode.org/Public/security/latest/confusables.txt
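
The manual alternative can be sketched as a small parser plus a skeleton function. The sketch assumes the file's "source ; target ; type # comment" line layout. It follows the NFD, map, NFD outline of the UTS #39 skeleton operation and leaves out its other steps. SAMPLE_LINES stands in for the downloaded file and holds only a few illustrative entries.

```python
import unicodedata


def parse_confusables(lines):
    """Build {source_char: target_string} from lines in confusables.txt format."""
    table = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2 or not fields[1]:
            continue  # blank line, comment-only line, or BOM remnant
        source = chr(int(fields[0], 16))
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        table[source] = target
    return table


def skeleton(text, table):
    """Simplified skeleton: NFD, map characters to prototypes, NFD again."""
    text = unicodedata.normalize("NFD", text)
    mapped = "".join(table.get(ch, ch) for ch in text)
    return unicodedata.normalize("NFD", mapped)


# Illustrative lines in the file's format; load the real file in practice.
SAMPLE_LINES = [
    "# sample data",
    "0430 ;\t0061 ;\tMA\t# CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A",
    "043E ;\t006F ;\tMA\t# CYRILLIC SMALL LETTER O -> LATIN SMALL LETTER O",
    "0441 ;\t0063 ;\tMA\t# CYRILLIC SMALL LETTER ES -> LATIN SMALL LETTER C",
]

table = parse_confusables(SAMPLE_LINES)
print(skeleton("g\u043e\u043egle", table) == skeleton("google", table))  # True
```

Two labels are treated as confusable when their skeletons are equal, so a registration or link check can compare a candidate's skeleton against the skeletons of the names it wants to protect.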

Registrar-Level Defenses

Domain registrars implement additional protections:

| Defense | Description |
| --- | --- |
| Script-homogeneous labels | Require each label to use characters from a single script |
| Blocked confusable registrations | Prevent registration of domains confusable with existing ones |
| Registry-level restrictions | Some TLDs restrict allowed scripts (e.g., .рф requires Cyrillic) |
| ICANN IDN guidelines | Framework for TLD operators to define IDN policies |
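
The first and third rows can be combined into a simple registration-time check. The sketch below is an assumption-laden illustration. TLD_ALLOWED_SCRIPTS is a hypothetical policy table, not a real registry's, and the script of a character is approximated by the first word of its Unicode name, which a production system would replace with the Unicode Script property.

```python
import unicodedata

# Hypothetical per-TLD policy; real registries publish their own IDN tables.
TLD_ALLOWED_SCRIPTS = {"\u0440\u0444": {"CYRILLIC"}}  # .рф


def label_scripts(label):
    """Approximate the set of scripts in a label (digits and hyphens ignored)."""
    scripts = set()
    for ch in label:
        if ch.isdigit() or ch == "-":
            continue
        name = unicodedata.name(ch, "")
        # Crude approximation: first word of the character name, e.g. "CYRILLIC".
        scripts.add(name.split(" ", 1)[0] if name else "UNKNOWN")
    return scripts


def may_register(label, tld):
    """Accept only single-script labels that satisfy the TLD's script policy."""
    scripts = label_scripts(label)
    if len(scripts) > 1:
        return False  # script-homogeneous labels only
    allowed = TLD_ALLOWED_SCRIPTS.get(tld)
    return allowed is None or scripts <= allowed


print(may_register("\u043f\u0440\u0438\u043c\u0435\u0440", "\u0440\u0444"))  # True
print(may_register("example", "\u0440\u0444"))  # False
```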

ICANN's Approach

ICANN's Label Generation Rules (LGR) define per-script character repertoires for the root zone. Each script community determines which characters are permissible and which cross-script variants should be blocked.

Allowed Script Combinations

Not all script mixing is malicious. Legitimate combinations include:

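| Combination | Example | Legitimate Use |
| --- | --- | --- |
| Han + Hiragana + Katakana | 東京タワー.jp | Standard Japanese |
| Latin + Han | university大学.cn | Bilingual Chinese |
| Latin with diacritics | café.com | Accented Latin (é is itself a Latin-script character, so no scripts are mixed) |
| Hangul + Han | 서울大.kr | Korean with Hanja |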

Browsers and validation libraries maintain allowlists of these legitimate combinations.
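
An allowlist check of this kind can be sketched as a subset test. ALLOWED_COMBINATIONS below contains only the combinations from the table above, and NAME_PREFIX_TO_SCRIPT is a simplified, assumed mapping from Unicode character-name prefixes to scripts. Characters from scripts outside that mapping are silently ignored, which a production check should not do.

```python
import unicodedata

# Combinations taken from the table above; real allowlists are longer.
ALLOWED_COMBINATIONS = [
    {"HAN", "HIRAGANA", "KATAKANA"},
    {"LATIN", "HAN"},
    {"HANGUL", "HAN"},
]

# Simplified assumption: infer the script from the character-name prefix.
NAME_PREFIX_TO_SCRIPT = {
    "LATIN": "LATIN",
    "CYRILLIC": "CYRILLIC",
    "GREEK": "GREEK",
    "CJK": "HAN",
    "HIRAGANA": "HIRAGANA",
    "KATAKANA": "KATAKANA",
    "HANGUL": "HANGUL",
}


def scripts_in(label):
    """Return the set of recognized scripts used in the label."""
    found = set()
    for ch in label:
        name = unicodedata.name(ch, "")
        for prefix, script in NAME_PREFIX_TO_SCRIPT.items():
            if name.startswith(prefix):
                found.add(script)
                break
    return found


def is_allowed_mix(label):
    """Single-script labels pass; mixed labels must fit an allowed combination."""
    found = scripts_in(label)
    return len(found) <= 1 or any(found <= combo for combo in ALLOWED_COMBINATIONS)


print(is_allowed_mix("\u6771\u4eac\u30bf\u30ef\u30fc"))  # 東京タワー -> True
print(is_allowed_mix("g\u043e\u043egle"))  # Latin + Cyrillic -> False
```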

Key Takeaways

  • IDN homograph attacks exploit the fact that characters from different scripts (Latin "a" vs. Cyrillic "а") can look identical, enabling domain spoofing.
  • Browsers defend by displaying Punycode (xn--...) instead of Unicode when mixed-script or confusable domain labels are detected.
  • Detection requires script analysis (identifying which scripts are present in a label) and confusable comparison (checking if a label's skeleton matches a known domain).
  • Not all script mixing is malicious — Japanese domains legitimately combine Han, Hiragana, and Katakana. Allowlists of legitimate combinations are essential.
  • Registrars provide an additional defense layer through blocked confusable registrations and ICANN Label Generation Rules.
  • For production systems, use libraries with access to the Unicode confusables.txt database (ICU, PyICU) rather than hand-rolled heuristics.
