Unicode in Domain Names (IDN)
Internationalized Domain Names (IDNs) allow domain names to contain non-ASCII characters from any Unicode script, converted to ASCII-compatible encoding using the Punycode algorithm. This guide explains how IDNs work, how Punycode conversion functions, and the security risks posed by homograph attacks on international domain names.
The web was born in English. Early domain names were restricted to a tiny set of ASCII characters — letters a-z, digits 0-9, and hyphens. For the billions of internet users whose languages use Arabic, Chinese, Cyrillic, Devanagari, or any other non-Latin script, this restriction meant every domain name was foreign. Internationalized Domain Names (IDN) changed that by allowing Unicode characters in domain names through an encoding layer called Punycode. This guide explains how IDN works under the hood, walks through the encoding process, covers the security risks of homograph attacks, and details the IDNA 2008 rules that govern modern internationalized domains.
The Problem: ASCII-Only Domains
The Domain Name System (DNS) was designed in the 1980s with a strict character set. RFC 1035 defined domain name labels as sequences of ASCII letters, digits, and hyphens, with a maximum length of 63 octets per label. This restriction was baked into DNS protocol packets, resolver libraries, and every piece of network infrastructure in between.
For English speakers, this was invisible. For everyone else, it was a barrier:
| Language | Desired Domain | Status Before IDN |
|---|---|---|
| Chinese | 中国.cn | Impossible |
| Arabic | موقع.عرب | Impossible |
| Russian | пример.ru | Impossible |
| Hindi | उदाहरण.भारत | Impossible |
| Korean | 예시.한국 | Impossible |
| Japanese | 例え.日本 | Impossible |
Users could not type domain names in their own scripts. They had to memorize arbitrary ASCII transliterations, which was like asking English speakers to use domain names written in Cyrillic.
How IDN Works: The Encoding Stack
IDN solves the problem without changing DNS itself. Instead, it adds a presentation layer that converts Unicode domain labels into ASCII-compatible encoding (ACE) labels that DNS can handle natively. The process involves three key standards:
- IDNA (Internationalized Domain Names in Applications) — the overall framework
- Nameprep / UTS #46 — normalization and validation rules
- Punycode — the encoding algorithm (RFC 3492)
The Encoding Pipeline
When you type a Unicode domain name into a browser, the following steps occur:
User types: münchen.de
↓
Step 1: Normalize (NFC) → münchen.de
Step 2: Check validity (IDNA rules)
Step 3: Punycode encode → xn--mnchen-3ya.de
Step 4: DNS lookup on xn--mnchen-3ya.de
Step 5: Display münchen.de to user
The xn-- prefix is the ACE prefix that signals "this label is Punycode-encoded."
Every internationalized label in DNS starts with xn--.
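The steps above can be sketched with nothing but the Python standard library. This is a simplified illustration, not a conforming IDNA implementation: the helper names `label_to_ace` and `domain_to_ace` are made up for this example, the validity check of Step 2 is skipped entirely, and plain lowercasing stands in for the real mapping rules.

```python
import unicodedata

ACE_PREFIX = "xn--"

def label_to_ace(label: str) -> str:
    """Convert one label to ASCII-compatible form (no IDNA validity checks)."""
    # Step 1: normalize to NFC; lowercasing is a simplification of real mapping
    label = unicodedata.normalize("NFC", label).lower()
    # Pure-ASCII labels pass through unchanged and get no prefix
    if label.isascii():
        return label
    # Step 3: Punycode-encode and mark the label with the ACE prefix
    return ACE_PREFIX + label.encode("punycode").decode("ascii")

def domain_to_ace(domain: str) -> str:
    """Apply the per-label conversion to every dot-separated label."""
    return ".".join(label_to_ace(part) for part in domain.split("."))

print(domain_to_ace("m\u00fcnchen.de"))   # xn--mnchen-3ya.de
print(domain_to_ace("mu\u0308nchen.de"))  # decomposed input, same result after NFC
```

Each label is converted on its own, which is why `de` stays untouched while only the non-ASCII label gains the prefix.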
Punycode: The Encoding Algorithm
Punycode (RFC 3492) is a bootstring encoding that represents Unicode code points using only ASCII characters. It works by separating the ASCII characters in a label from the non-ASCII ones, then encoding the positions and values of the non-ASCII characters as a compact ASCII string.
Here are some real-world examples:
| Unicode Label | Punycode (ACE) | Language |
|---|---|---|
| münchen | xn--mnchen-3ya | German |
| рф | xn--p1ai | Russian |
| 中国 | xn--fiqs8s | Chinese (Simplified) |
| ไทย | xn--o3cw4h | Thai |
| مصر | xn--wgbh1c | Arabic (Egypt) |
| бг | xn--90ae | Bulgarian |
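To make the "positions and values" idea concrete, here is a compact sketch of the encoding half of Punycode, written from the pseudocode in RFC 3492. It is meant for reading rather than production use: the function names are this example's own, overflow checks are omitted, and Python's built-in `punycode` codec serves as the reference it is compared against.

```python
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"

def adapt(delta: int, numpoints: int, first: bool) -> int:
    """Bias adaptation from RFC 3492, section 6.1."""
    delta = delta // DAMP if first else delta // 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + (BASE - TMIN + 1) * delta // (delta + SKEW)

def punycode_encode(label: str) -> str:
    # Copy the ASCII ("basic") characters first, followed by a delimiter
    basic = [c for c in label if ord(c) < INITIAL_N]
    out = basic[:]
    if basic:
        out.append("-")
    n, delta, bias, h = INITIAL_N, 0, INITIAL_BIAS, len(basic)
    while h < len(label):
        # Next-smallest code point that has not been handled yet
        m = min(ord(c) for c in label if ord(c) >= n)
        delta += (m - n) * (h + 1)
        n = m
        for c in label:
            if ord(c) < n:
                delta += 1
            elif ord(c) == n:
                # Emit delta as a variable-length base-36 integer
                q, k = delta, BASE
                while True:
                    t = TMIN if k <= bias else TMAX if k >= bias + TMAX else k - bias
                    if q < t:
                        break
                    out.append(DIGITS[t + (q - t) % (BASE - t)])
                    q = (q - t) // (BASE - t)
                    k += BASE
                out.append(DIGITS[q])
                bias = adapt(delta, h + 1, h == len(basic))
                delta = 0
                h += 1
        delta += 1
        n += 1
    return "".join(out)

for label in ("m\u00fcnchen", "\u0440\u0444", "\u4e2d\u56fd"):
    # The sketch should agree with the standard library codec
    assert punycode_encode(label) == label.encode("punycode").decode("ascii")
    print("xn--" + punycode_encode(label))
```

The ASCII characters of `münchen` survive verbatim in front of the hyphen, and the suffix `3ya` carries both the position and the code point of the single `ü`.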
Encoding in Python
Python's encodings.idna module and the idna library handle the full pipeline:
# Built-in (IDNA 2003 — older standard)
"m\u00fcnchen.de".encode("idna")
# b'xn--mnchen-3ya.de'
b"xn--mnchen-3ya.de".decode("idna")
# 'm\u00fcnchen.de'
# Modern library (IDNA 2008 — recommended)
import idna
idna.encode("m\u00fcnchen.de")
# b'xn--mnchen-3ya.de'
idna.decode(b"xn--mnchen-3ya.de")
# 'm\u00fcnchen.de'
Encoding in JavaScript
// Using the URL API (browser-native IDNA handling)
const url = new URL("https://m\u00fcnchen.de");
console.log(url.hostname); // "xn--mnchen-3ya.de"
// For display, browsers convert back to Unicode
// The address bar shows: münchen.de
IDNA 2003 vs. IDNA 2008
There are two versions of the IDNA standard, and the differences matter:
| Feature | IDNA 2003 (RFC 3490) | IDNA 2008 (RFC 5890-5894) |
|---|---|---|
| Normalization | Nameprep (stringprep) | Based on Unicode properties |
| Case mapping | Folded to lowercase by Nameprep | Uppercase disallowed; mapping left to the application |
| ß (German eszett) | Mapped to "ss" | Valid as itself |
| ς (Greek final sigma) | Mapped to σ | Valid as itself |
| ZWJ / ZWNJ | Stripped | Allowed in specific contexts |
| Unicode version | Tied to Unicode 3.2 | Tracks latest Unicode |
| Emoji in domains | Not addressed | Generally disallowed |
The key practical difference: in IDNA 2003, straße.de (with ß) was silently
converted to strasse.de. In IDNA 2008, straße.de and strasse.de are two
distinct domains. This caused real confusion during the transition, because someone could register
xn--strae-oqa.de (straße) as a different domain from strasse.de.
Modern browsers use a hybrid approach defined by UTS #46 (Unicode IDNA Compatibility Processing), which applies IDNA 2008 rules but with some IDNA 2003 compatibility mappings.
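The ß difference is easy to observe from Python, because the built-in `idna` codec still implements IDNA 2003. The snippet below is only a demonstration of the two behaviors; encoding the label with the raw `punycode` codec is used here as a stand-in for what an IDNA 2008 encoder would produce for this particular label.

```python
# Built-in codec follows IDNA 2003: Nameprep folds the eszett to "ss"
legacy = "stra\u00dfe.de".encode("idna")
print(legacy)  # b'strasse.de'

# Encoding the label directly keeps the eszett, as IDNA 2008 does
modern_label = "xn--" + "stra\u00dfe".encode("punycode").decode("ascii")
print(modern_label)  # xn--strae-oqa
```

The two results name different DNS labels, which is exactly why the choice of IDNA version matters for lookups.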
The Homograph Attack Problem
IDN introduced a serious security vulnerability: homograph attacks (also called IDN spoofing). Many Unicode characters from different scripts look identical or nearly identical to ASCII letters:
| ASCII | Look-alike | Script | Code Point |
|---|---|---|---|
| a | а | Cyrillic | U+0430 |
| e | е | Cyrillic | U+0435 |
| o | о | Cyrillic | U+043E |
| p | р | Cyrillic | U+0440 |
| c | с | Cyrillic | U+0441 |
| x | х | Cyrillic | U+0445 |
| a | ɑ | Latin (IPA) | U+0251 |
| g | ɡ | Latin (IPA) | U+0261 |
An attacker could register аррle.com (using Cyrillic а, р, р),
which looks identical to apple.com in many fonts. The Punycode form would be
xn--le-6kc8da.com: clearly different in the raw DNS, but visually indistinguishable
to users when rendered as Unicode.
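A rough way to catch this kind of spoof is to fold known look-alikes onto their ASCII counterparts and compare the results. The sketch below uses only the eight pairs from the table above, so it is far from complete; the names `LOOKALIKES`, `skeleton`, and `looks_like` are invented for this example.

```python
# Look-alike pairs copied from the table above (not the full Unicode data)
LOOKALIKES = {
    "\u0430": "a",  # Cyrillic a
    "\u0435": "e",  # Cyrillic e
    "\u043e": "o",  # Cyrillic o
    "\u0440": "p",  # Cyrillic er
    "\u0441": "c",  # Cyrillic es
    "\u0445": "x",  # Cyrillic ha
    "\u0251": "a",  # Latin alpha (IPA)
    "\u0261": "g",  # Latin script g (IPA)
}

def skeleton(label: str) -> str:
    """Replace each known look-alike with the ASCII letter it imitates."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in label)

def looks_like(label: str, target: str) -> bool:
    """True when two different labels collapse to the same skeleton."""
    return label != target and skeleton(label) == skeleton(target)

spoof = "\u0430\u0440\u0440le"  # Cyrillic a, er, er followed by Latin "le"
print(looks_like(spoof, "apple"))                          # True
print("xn--" + spoof.encode("punycode").decode("ascii"))   # xn--le-6kc8da
```

The same skeleton idea underlies the Unicode confusables data, which covers far more characters than this table.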
Real-World Incidents
In 2017, security researcher Xudong Zheng demonstrated that xn--80ak6aa92e.com could
render as apple.com in Chrome's address bar, using a domain composed entirely of
Cyrillic characters. The demonstration prompted Chrome to tighten its IDN display policy and renewed scrutiny of the display rules in other browsers.
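Decoding that label shows why a single-script check alone did not help. The snippet below simply inspects the code points with the standard library; nothing in it is specific to any browser.

```python
import unicodedata

# Strip the xn-- prefix and decode the remaining Punycode
label = b"80ak6aa92e".decode("punycode")

for ch in label:
    # Print each code point with its official Unicode name
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

all_cyrillic = all(unicodedata.name(ch).startswith("CYRILLIC") for ch in label)
print(all_cyrillic)  # True
```

Every character is Cyrillic, including U+04CF (palochka) standing in for the Latin "l", so the label never mixes scripts.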
Browser Defenses
Modern browsers use script mixing rules to decide whether to display the Unicode form or fall back to showing the raw Punycode:
- Single-script rule: If all characters in a label come from one script, display as Unicode; labels that mix Cyrillic or Greek with Latin fall back to Punycode
- Allowlist approach: Chrome maintains a list of TLDs and registrars that enforce their own homograph policies
- Confusable detection: Browsers check labels against the Unicode confusables data (confusables.txt) to detect mixed-script spoofing
- Punycode fallback: If any check fails, the browser shows xn--... in the address bar
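The display decision can be approximated in a few lines. This is a deliberately crude sketch and not any browser's actual policy: it guesses a character's script from the first word of its Unicode name (the standard library has no Script property), and the helper names `scripts_of` and `display_form` are hypothetical.

```python
import unicodedata

def scripts_of(label: str) -> set:
    """Approximate the scripts in a label from Unicode character names."""
    scripts = set()
    for ch in label:
        if ch.isdigit() or ch == "-":
            continue  # digits and hyphens are shared by all scripts
        scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def display_form(ace_label: str) -> str:
    """Return the Unicode form if it looks safe, else keep the raw ACE label."""
    if not ace_label.startswith("xn--"):
        return ace_label
    unicode_label = ace_label[4:].encode("ascii").decode("punycode")
    if len(scripts_of(unicode_label)) > 1:
        return ace_label  # mixed scripts: Punycode fallback
    return unicode_label

print(display_form("xn--mnchen-3ya"))  # single script, shown as Unicode
print(display_form("xn--le-6kc8da"))   # Cyrillic + Latin, stays as Punycode
```

Two limitations are worth noting. An all-Cyrillic label such as xn--80ak6aa92e passes this check, which is why confusable detection is a separate layer. The name-based heuristic would also reject legitimate combinations such as Japanese kanji with hiragana, which real policies explicitly permit.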
Registrar-Level Protections
Many domain registrars and registries now enforce their own restrictions:
- .de (Germany): Allows a curated set of Latin characters with diacritics
- .jp (Japan): Allows only Japanese scripts (Hiragana, Katakana, Kanji)
- .рф (Russia): Allows only Cyrillic characters
- .com (Verisign): Allows multiple scripts but blocks known confusable combinations
Top-Level Domains in Unicode
Since 2010, ICANN has allowed internationalized top-level domains (IDN TLDs). Some examples:
| IDN TLD | Punycode | Country/Purpose |
|---|---|---|
| .中国 | .xn--fiqs8s | China |
| .рф | .xn--p1ai | Russia |
| .مصر | .xn--wgbh1c | Egypt |
| .ไทย | .xn--o3cw4h | Thailand |
| .भारत | .xn--h2brj9c | India |
| .السعودية | .xn--mgberp4a5d4ar | Saudi Arabia |
These TLDs mean that for the first time, an entire URL — including the domain and TLD — can be written in a non-Latin script.
Practical Considerations for Developers
Validating IDN Input
When your application accepts domain names from users, you need to handle IDN correctly:
import idna

def validate_domain(domain: str) -> str:
    """Validate and normalize an internationalized domain name."""
    try:
        # Encode to ACE form (applies UTS #46 mapping, then IDNA 2008 checks)
        ace = idna.encode(domain, uts46=True)
        # Decode back to Unicode for display
        return idna.decode(ace)
    except idna.core.InvalidCodepoint as e:
        raise ValueError(f"Invalid character in domain: {e}") from e
    except idna.core.InvalidCodepointContext as e:
        raise ValueError(f"Invalid character context: {e}") from e
    except idna.IDNAError as e:
        # Base class: catches remaining failures such as bidi or length errors
        raise ValueError(f"Invalid domain name: {e}") from e
Storing Domains
Store both forms:
- ACE form (xn--...) for DNS operations, comparisons, and database lookups
- Unicode form for display to users
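One possible shape for such a record is sketched below. It assumes the ACE form is the canonical key and uses Python's built-in `idna` codec (IDNA 2003) purely to stay dependency-free; an IDNA 2008 library would be the better choice in practice. `DomainRecord` and `make_record` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DomainRecord:
    ace: str      # canonical key: DNS operations, comparisons, lookups
    display: str  # Unicode form shown to users

def make_record(user_input: str) -> DomainRecord:
    # The built-in codec leaves pure-ASCII labels untouched, so lowercase after
    ace = user_input.strip().encode("idna").decode("ascii").lower()
    # Derive the display form from the ACE form so the two always agree
    display = ace.encode("ascii").decode("idna")
    return DomainRecord(ace=ace, display=display)

# Different spellings of the same domain collapse onto one key
records = {}
for raw in ("M\u00dcNCHEN.de", "xn--mnchen-3ya.de"):
    rec = make_record(raw)
    records[rec.ace] = rec

print(len(records))  # 1
```

Deriving the display form from the stored ACE form, rather than keeping the raw user input, avoids two records that look the same but compare differently.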
Email Addresses and IDN
Email addresses can use IDN domains (the part after @), but the local part (before @)
has its own internationalization standard: EAI (Email Address Internationalization,
RFC 6530-6533). Full internationalized email like 用户@中国.中国
requires SMTPUTF8 support, which is still not universal.
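In code, this split means only the domain part goes through IDNA conversion. The helper below is a minimal sketch with a made-up name, `email_domain_to_ace`; it uses the built-in IDNA 2003 codec as a stand-in and leaves the local part exactly as given, since handling a non-ASCII local part is the job of SMTPUTF8.

```python
def email_domain_to_ace(address: str) -> str:
    """Convert only the domain of an email address to ACE form."""
    # Split on the last "@" so the local part is never touched
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"Not an email address: {address!r}")
    ace_domain = domain.encode("idna").decode("ascii")
    return f"{local}@{ace_domain}"

print(email_domain_to_ace("user@m\u00fcnchen.de"))  # user@xn--mnchen-3ya.de
```

An address whose local part is ASCII can be delivered by servers without SMTPUTF8 once the domain is in ACE form; a non-ASCII local part still needs end-to-end EAI support.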
Key Takeaways
- IDN bridges the gap between Unicode text and the ASCII-only DNS through Punycode
- Punycode encodes Unicode labels with the xn-- prefix so DNS resolvers never see non-ASCII bytes
- IDNA 2008 is the current standard; UTS #46 provides backward compatibility
- Homograph attacks are a real security risk — browsers and registrars have layered defenses but no perfect solution
- Always validate domain names through a proper IDNA library, never with simple regex patterns
- Store both forms — ACE for machine use, Unicode for human display