
Unicode in Domain Names (IDN)

Internationalized Domain Names (IDNs) allow domain names to contain non-ASCII characters from many Unicode scripts, which are converted to an ASCII-compatible encoding using the Punycode algorithm. This guide explains how IDNs work, how Punycode conversion functions, and the security risks posed by homograph attacks on internationalized domain names.


The web was born in English. Early domain names were restricted to a tiny set of ASCII characters — letters a-z, digits 0-9, and hyphens. For the billions of internet users whose languages use Arabic, Chinese, Cyrillic, Devanagari, or any other non-Latin script, this restriction meant every domain name was foreign. Internationalized Domain Names (IDN) changed that by allowing Unicode characters in domain names through an encoding layer called Punycode. This guide explains how IDN works under the hood, walks through the encoding process, covers the security risks of homograph attacks, and details the IDNA 2008 rules that govern modern internationalized domains.

The Problem: ASCII-Only Domains

The Domain Name System (DNS) was designed in the 1980s with a strict character set. RFC 1035 defined domain name labels as sequences of ASCII letters, digits, and hyphens, with a maximum length of 63 octets per label. This restriction was baked into DNS protocol packets, resolver libraries, and every piece of network infrastructure in between.
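The label rules above can be sketched as a small check. This is an illustrative helper, not part of any DNS library: the name is_ldh_label is hypothetical, and the ban on leading or trailing hyphens is an assumption taken from the usual hostname conventions rather than from the paragraph above.

```python
import string

# Characters permitted in a classic hostname label: letters, digits, hyphen (LDH).
LDH_CHARS = set(string.ascii_letters + string.digits + "-")
MAX_LABEL_OCTETS = 63  # per-label limit from RFC 1035


def is_ldh_label(label: str) -> bool:
    """Return True if `label` fits the ASCII-only label rules (hypothetical helper)."""
    if not label or len(label.encode("utf-8")) > MAX_LABEL_OCTETS:
        return False
    if label.startswith("-") or label.endswith("-"):
        return False  # assumed hostname convention: no leading or trailing hyphen
    return all(ch in LDH_CHARS for ch in label)


print(is_ldh_label("example"))       # True
print(is_ldh_label("m\u00fcnchen"))  # False: ü is outside the LDH set
print(is_ldh_label("a" * 64))        # False: longer than 63 octets
```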

For English speakers, this was invisible. For everyone else, it was a barrier:

| Language | Desired Domain | Status Before IDN |
| --- | --- | --- |
| Chinese | 中国.cn | Impossible |
| Arabic | موقع.عرب | Impossible |
| Russian | пример.ru | Impossible |
| Hindi | उदाहरण.भारत | Impossible |
| Korean | 예시.한국 | Impossible |
| Japanese | 例え.日本 | Impossible |

Users could not type domain names in their own scripts. They had to memorize arbitrary ASCII transliterations, which was like asking English speakers to use domain names written in Cyrillic.

How IDN Works: The Encoding Stack

IDN solves the problem without changing DNS itself. Instead, it adds a presentation layer that converts Unicode domain labels into ASCII-compatible encoding (ACE) labels that DNS can handle natively. The process involves three key standards:

  1. IDNA (Internationalized Domain Names in Applications) — the overall framework
  2. Nameprep / UTS #46 — normalization and validation rules
  3. Punycode — the encoding algorithm (RFC 3492)

The Encoding Pipeline

When you type a Unicode domain name into a browser, the following steps occur:

User types:    münchen.de
       ↓
Step 1:  Normalize (NFC) → münchen.de
Step 2:  Check validity (IDNA rules)
Step 3:  Punycode encode → xn--mnchen-3ya.de
Step 4:  DNS lookup on xn--mnchen-3ya.de
Step 5:  Display münchen.de to user

The xn-- prefix is the ACE prefix that signals "this label is Punycode-encoded." Every internationalized label in DNS starts with xn--.
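The pipeline can be approximated with Python's standard library alone. The sketch below is simplified on purpose: the function name to_ace is hypothetical, lowercasing stands in for the full mapping step, and the IDNA validity checks of Step 2 are skipped entirely, so it should not be used to validate real input.

```python
import unicodedata

ACE_PREFIX = "xn--"


def to_ace(domain: str) -> str:
    """Convert a Unicode domain to ACE form (simplified, no IDNA validation)."""
    # Step 1: lowercase and normalize to NFC
    normalized = unicodedata.normalize("NFC", domain.lower())
    ace_labels = []
    for label in normalized.split("."):
        if label.isascii():
            # Plain ASCII labels pass through unchanged
            ace_labels.append(label)
        else:
            # Step 3: Punycode-encode the label and add the ACE prefix
            encoded = label.encode("punycode").decode("ascii")
            ace_labels.append(ACE_PREFIX + encoded)
    return ".".join(ace_labels)


print(to_ace("m\u00fcnchen.de"))   # xn--mnchen-3ya.de
# The decomposed spelling (u + combining diaeresis) normalizes to the same name
print(to_ace("mu\u0308nchen.de"))  # xn--mnchen-3ya.de
```

Normalizing before encoding matters: without Step 1, the two spellings of ü would produce two different ACE labels for what users perceive as the same name.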

Punycode: The Encoding Algorithm

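Punycode (RFC 3492) is a Bootstring encoding that represents Unicode code points using only ASCII characters. It copies the ASCII characters of a label to the output first, appends a hyphen as a delimiter if there were any, and then encodes the code point and insertion position of each non-ASCII character as a compact run of base-36 digits (a-z, 0-9). In mnchen-3ya, for example, mnchen is the ASCII part and 3ya tells the decoder to insert ü after the m.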

Here are some real-world examples:

| Unicode Label | Punycode (ACE) | Language |
| --- | --- | --- |
| münchen | xn--mnchen-3ya | German |
| рф | xn--p1ai | Russian |
| 中国 | xn--fiqs8s | Chinese (Simplified) |
| ไทย | xn--o3cw4h | Thai |
| مصر | xn--wgbh1c | Arabic (Egypt) |
| бг | xn--90ae | Bulgarian |
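To make the algorithm concrete, here is a minimal sketch of the encoding procedure, written to follow the pseudocode in RFC 3492. The function names are my own, overflow handling is omitted, and the result is compared against Python's built-in punycode codec rather than taken on trust. Note that it produces the bare Punycode string, without the xn-- prefix.

```python
# Bootstring parameters for Punycode (RFC 3492, section 5)
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"


def _adapt(delta: int, numpoints: int, first: bool) -> int:
    """Bias adaptation: keeps the variable-length integers short."""
    delta = delta // DAMP if first else delta // 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + (BASE - TMIN + 1) * delta // (delta + SKEW)


def punycode_encode(label: str) -> str:
    """Encode one label (sketch; no overflow checks)."""
    basic = [c for c in label if ord(c) < 128]
    output = basic[:]
    if basic:
        output.append("-")  # delimiter between ASCII part and encoded part
    n, delta, bias = INITIAL_N, 0, INITIAL_BIAS
    h = b = len(basic)  # h = code points handled so far
    while h < len(label):
        # Next-smallest code point that has not been handled yet
        m = min(ord(c) for c in label if ord(c) >= n)
        delta += (m - n) * (h + 1)
        n = m
        for c in label:
            cp = ord(c)
            if cp < n:
                delta += 1
            elif cp == n:
                # Emit delta as a variable-length base-36 integer
                q, k = delta, BASE
                while True:
                    t = TMIN if k <= bias else TMAX if k >= bias + TMAX else k - bias
                    if q < t:
                        break
                    output.append(DIGITS[t + (q - t) % (BASE - t)])
                    q = (q - t) // (BASE - t)
                    k += BASE
                output.append(DIGITS[q])
                bias = _adapt(delta, h + 1, h == b)
                delta = 0
                h += 1
        delta += 1
        n += 1
    return "".join(output)


for sample in ["m\u00fcnchen", "\u4e2d\u56fd"]:
    mine = punycode_encode(sample)
    builtin = sample.encode("punycode").decode("ascii")
    print(sample, mine, mine == builtin)
```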

Encoding in Python

Python's encodings.idna module and the idna library handle the full pipeline:

# Built-in codec (IDNA 2003, the older standard)
"m\u00fcnchen.de".encode("idna")
# b'xn--mnchen-3ya.de'

b"xn--mnchen-3ya.de".decode("idna")
# 'münchen.de'

# Third-party idna library (IDNA 2008, recommended)
import idna

idna.encode("m\u00fcnchen.de")
# b'xn--mnchen-3ya.de'

idna.decode(b"xn--mnchen-3ya.de")
# 'münchen.de'

Encoding in JavaScript

// Using the URL API (browser-native IDNA handling)
const url = new URL("https://m\u00fcnchen.de");
console.log(url.hostname);  // "xn--mnchen-3ya.de"

// For display, browsers convert back to Unicode
// The address bar shows: münchen.de

IDNA 2003 vs. IDNA 2008

There are two versions of the IDNA standard, and the differences matter:

| Feature | IDNA 2003 (RFC 3490) | IDNA 2008 (RFC 5890-5894) |
| --- | --- | --- |
| Allowed characters | Fixed tables in Nameprep (Stringprep) | Derived from Unicode character properties |
| Case mapping | Folded to lowercase by Nameprep | Not part of the protocol; left to the application (e.g. UTS #46 mapping) |
| ß (German eszett) | Mapped to "ss" | Valid as itself |
| ς (Greek final sigma) | Mapped to σ | Valid as itself |
| ZWJ / ZWNJ | Stripped | Allowed in specific contexts |
| Unicode version | Tied to Unicode 3.2 | Tracks latest Unicode |
| Emoji in domains | Not addressed | Generally disallowed |

The key practical difference: in IDNA 2003, straße.de (with ß) was silently converted to strasse.de. In IDNA 2008, straße.de and strasse.de are two distinct domains. This caused real confusion during the transition, because someone could register xn--strae-oqa.de (straße) as a different domain from strasse.de.

Modern browsers use a hybrid approach defined by UTS #46 (Unicode IDNA Compatibility Processing), which applies IDNA 2008 rules but with some IDNA 2003 compatibility mappings.
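The ß difference is easy to observe from Python, because the built-in idna codec still implements IDNA 2003. The second half of this sketch uses the raw punycode codec only to show what the IDNA 2008 ACE label looks like; it is not a full IDNA 2008 implementation.

```python
label = "stra\u00dfe"  # straße

# Built-in "idna" codec = IDNA 2003: Nameprep maps ß to "ss"
idna2003 = (label + ".de").encode("idna").decode("ascii")
print(idna2003)  # strasse.de

# IDNA 2008 keeps ß as a distinct character, so the label is
# Punycode-encoded as is (raw codec used for illustration only)
idna2008 = "xn--" + label.encode("punycode").decode("ascii") + ".de"
print(idna2008)  # xn--strae-oqa.de
```

An application that mixes the two behaviors can end up resolving one name while displaying another, which is why it helps to pick one processing mode and apply it consistently.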

The Homograph Attack Problem

IDN introduced a serious security vulnerability: homograph attacks (also called IDN spoofing). Many Unicode characters from different scripts look identical or nearly identical to ASCII letters:

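| ASCII | Look-alike | Script | Code Point |
| --- | --- | --- | --- |
| a | а | Cyrillic | U+0430 |
| e | е | Cyrillic | U+0435 |
| o | о | Cyrillic | U+043E |
| p | р | Cyrillic | U+0440 |
| c | с | Cyrillic | U+0441 |
| x | х | Cyrillic | U+0445 |
| a | ɑ | Latin (IPA) | U+0251 |
| g | ɡ | Latin (IPA) | U+0261 |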

An attacker could register аpple.com (with Cyrillic а, U+0430, in place of the Latin a), which looks identical to apple.com in many fonts. The Punycode form would be xn--pple-43d.com: clearly different in the raw DNS, but visually indistinguishable to users when shown as Unicode.
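A short sketch makes the difference visible. It swaps only the first letter of apple for Cyrillic а (U+0430), written as an escape so the two strings cannot be confused in source code, and uses the standard library's punycode codec to produce the ACE label.

```python
latin = "apple"
spoof = "\u0430pple"  # first letter is CYRILLIC SMALL LETTER A

# The strings render alike but are not equal
print(latin == spoof)  # False

# Listing code points exposes the substitution
code_points = [f"U+{ord(c):04X}" for c in spoof]
print(code_points)  # ['U+0430', 'U+0070', 'U+0070', 'U+006C', 'U+0065']

# The ACE form that DNS actually sees
ace = "xn--" + spoof.encode("punycode").decode("ascii")
print(ace)  # xn--pple-43d
```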

Real-World Incidents

In 2017, security researcher Xudong Zheng demonstrated that xn--80ak6aa92e.com could render as apple.com in Chrome's address bar, using a domain composed entirely of Cyrillic characters. This prompted Chrome and several other browsers to tighten their IDN display policies.

Browser Defenses

Modern browsers use script mixing rules to decide whether to display the Unicode form or fall back to showing the raw Punycode:

  1. Single-script rule: If all characters in a label come from one script, the label is eligible for Unicode display; labels that mix Latin with Cyrillic or Greek are not
  2. Allowlist approach: Some browsers, historically Firefox, have maintained a list of TLDs whose registries enforce their own homograph policies
  3. Confusable detection: Browsers check labels against the Unicode confusables data (confusables.txt) to detect mixed-script spoofing
  4. Punycode fallback: If any check fails, the browser shows xn--... in the address bar
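The single-script rule and the Punycode fallback can be sketched as follows. This is a rough approximation, not how any browser implements it: Python's standard library does not expose the Unicode Script property, so the hypothetical helper label_scripts guesses the script from the first word of each character's Unicode name.

```python
import unicodedata


def label_scripts(label: str) -> set:
    """Approximate the scripts used in a label via Unicode character names."""
    scripts = set()
    for ch in label:
        if not ch.isalpha():
            continue  # digits, hyphens, combining marks treated as script-neutral
        # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
        scripts.add(unicodedata.name(ch).split()[0])
    return scripts


def display_form(label: str) -> str:
    """Show Unicode for single-script labels, else fall back to Punycode."""
    if len(label_scripts(label)) <= 1:
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


print(display_form("m\u00fcnchen"))  # münchen (Latin only)
print(display_form("\u0430pple"))    # xn--pple-43d (Cyrillic mixed with Latin)
```

A check like this does nothing against whole-script spoofs, where every character comes from the same non-Latin script, which is why browsers layer confusable detection on top of it.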

Registrar-Level Protections

Many domain registrars and registries now enforce their own restrictions:

  • .de (Germany): Allows a curated set of Latin characters with diacritics
  • .jp (Japan): Allows only Japanese scripts (Hiragana, Katakana, Kanji)
  • .рф (Russia): Allows only Cyrillic characters
  • .com (Verisign): Allows multiple scripts but blocks known confusable combinations

Top-Level Domains in Unicode

Since 2010, ICANN has allowed internationalized top-level domains (IDN TLDs). Some examples:

| IDN TLD | Punycode | Country/Purpose |
| --- | --- | --- |
| .中国 | .xn--fiqs8s | China |
| .рф | .xn--p1ai | Russia |
| .مصر | .xn--wgbh1c | Egypt |
| .ไทย | .xn--o3cw4h | Thailand |
| .भारत | .xn--h2brj9c | India |
| .السعودية | .xn--mgberp4a5d4ar | Saudi Arabia |

These TLDs mean that, for the first time, an entire domain name, including the TLD, can be written in a non-Latin script.
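Going the other way, from the ACE form stored in DNS back to the Unicode form, can be sketched with the standard library's punycode codec. The helper name to_unicode is hypothetical, and no IDNA validation is performed on the decoded result.

```python
def to_unicode(domain: str) -> str:
    """Decode every xn-- label in a domain back to Unicode (no validation)."""
    labels = []
    for label in domain.split("."):
        if label.lower().startswith("xn--"):
            # Strip the ACE prefix, then Punycode-decode the remainder
            labels.append(label[4:].encode("ascii").decode("punycode"))
        else:
            labels.append(label)
    return ".".join(labels)


print(to_unicode("xn--fiqs8s"))         # 中国
print(to_unicode("xn--mnchen-3ya.de"))  # münchen.de
```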

Practical Considerations for Developers

Validating IDN Input

When your application accepts domain names from users, you need to handle IDN correctly:

import idna

def validate_domain(domain: str) -> str:
    """Validate an internationalized domain name and return its Unicode form."""
    try:
        # Encode to ACE form; uts46=True applies UTS #46 mapping first,
        # then the IDNA 2008 validity checks run
        ace = idna.encode(domain, uts46=True)
        # Decode back to Unicode for display
        return idna.decode(ace)
    except idna.core.InvalidCodepoint as e:
        raise ValueError(f"Invalid character in domain: {e}") from e
    except idna.core.InvalidCodepointContext as e:
        raise ValueError(f"Invalid character context: {e}") from e
    except idna.core.IDNAError as e:
        # Remaining failures, such as empty or over-long labels
        raise ValueError(f"Invalid domain name: {e}") from e

Storing Domains

Store both forms:

  • ACE form (xn--...) for DNS operations, comparisons, and database lookups
  • Unicode form for display to users
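One way to keep the two forms together is a small record type. The sketch below is an assumption about how such a record might look, with hypothetical names StoredDomain and make_record; it uses the built-in idna codec (IDNA 2003) only to stay dependency-free, and a production system would more likely call an IDNA 2008 library here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredDomain:
    ace: str      # canonical key for DNS, comparisons, and lookups
    unicode: str  # human-readable form for display


def make_record(domain: str) -> StoredDomain:
    """Build both forms from user input (built-in IDNA 2003 codec as stand-in)."""
    # ASCII labels pass through the codec unchanged, so lowercase explicitly
    ace = domain.encode("idna").decode("ascii").lower()
    return StoredDomain(ace=ace, unicode=ace.encode("ascii").decode("idna"))


a = make_record("M\u00dcNCHEN.DE")
b = make_record("m\u00fcnchen.de")
print(a.ace)      # xn--mnchen-3ya.de
print(a.unicode)  # münchen.de
print(a == b)     # True: differently cased input maps to the same record
```

Deriving the Unicode form from the ACE form, rather than storing the raw user input, keeps the two columns consistent with each other.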

Email Addresses and IDN

Email addresses can use IDN domains (the part after @), but the local part (before @) has its own internationalization standard: EAI (Email Address Internationalization, RFC 6530-6533). Full internationalized email like 用户@中国.中国 requires SMTPUTF8 support, which is still not universal.
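The split between the two standards can be sketched as a helper that converts only the domain part and refuses non-ASCII local parts. The function name email_for_legacy_smtp is hypothetical, and the built-in idna codec (IDNA 2003) is used as a stand-in for a full IDNA 2008 library.

```python
def email_for_legacy_smtp(address: str) -> str:
    """Return an all-ASCII address for servers without SMTPUTF8 support."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError("not an email address")
    if not local.isascii():
        # The local part cannot be Punycode-encoded; it needs EAI end to end
        raise ValueError("non-ASCII local part requires SMTPUTF8 (EAI)")
    # Only the domain part goes through IDNA
    ace_domain = domain.encode("idna").decode("ascii")
    return f"{local}@{ace_domain}"


print(email_for_legacy_smtp("info@m\u00fcnchen.de"))  # info@xn--mnchen-3ya.de
```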

Key Takeaways

  1. IDN bridges the gap between Unicode text and the ASCII-only DNS through Punycode
  2. Punycode encodes Unicode labels with the xn-- prefix so DNS resolvers never see non-ASCII bytes
  3. IDNA 2008 is the current standard; UTS #46 provides backward compatibility
  4. Homograph attacks are a real security risk — browsers and registrars have layered defenses but no perfect solution
  5. Always validate domain names through a proper IDNA library, never with simple regex patterns
  6. Store both forms — ACE for machine use, Unicode for human display
