Unicode in URLs
URLs are technically restricted to ASCII characters, so non-ASCII text must be percent-encoded as UTF-8 bytes before being included in a URL path or query string. This guide explains percent-encoding, Internationalized Resource Identifiers (IRIs), Punycode for domains, and how to encode and decode Unicode URLs correctly.
URLs were designed in the early 1990s for a world that spoke ASCII. The original
RFC 1738 (1994) restricted URLs to a small subset of ASCII characters: letters,
digits, and a handful of special characters like /, ?, and #. But the web
is global, and people need URLs that contain Chinese, Arabic, Cyrillic, emoji,
and every other script. This guide explains how Unicode characters travel through
the URL system — from the address bar to the server and back.
The Anatomy of a URL
A URL has several components, and each handles Unicode differently:
https://例え.jp/パス/ファイル?検索=値#断片
scheme: https
host: 例え.jp
path: /パス/ファイル
query: 検索=値
fragment: 断片
| Component | Unicode Handling |
|---|---|
| Scheme | ASCII only (https, ftp, etc.) |
| Host (domain) | Punycode encoding (IDNA) |
| Path | Percent-encoding (UTF-8 bytes) |
| Query | Percent-encoding (UTF-8 bytes) |
| Fragment | Percent-encoding (UTF-8 bytes) |
URIs vs. IRIs
The distinction between URIs and IRIs is fundamental:
- URI (Uniform Resource Identifier, RFC 3986) — restricted to ASCII. Non-ASCII characters must be percent-encoded.
- IRI (Internationalized Resource Identifier, RFC 3987) — extends URIs to allow Unicode characters directly.
In practice, browsers display IRIs in the address bar (showing Unicode characters) but send URIs over the wire (with percent-encoding). The IRI-to-URI conversion is well defined; going back from URI to IRI works whenever the percent-encoded bytes are valid UTF-8.
IRI: https://ko.wikipedia.org/wiki/유니코드
URI: https://ko.wikipedia.org/wiki/%EC%9C%A0%EB%8B%88%EC%BD%94%EB%93%9C
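The IRI-to-URI mapping can be sketched in Python with the standard library. This is a minimal sketch, not a full RFC 3987 implementation. The function name iri_to_uri is hypothetical, the sketch assumes the IRI carries no userinfo, and it relies on Python's built-in idna codec (which follows the older IDNA 2003 rules) for the host.

```python
from urllib.parse import urlsplit, urlunsplit, quote

def iri_to_uri(iri: str) -> str:
    """Convert an IRI to an ASCII-only URI (illustrative helper, not a library API)."""
    parts = urlsplit(iri)
    # Host: Punycode via the built-in codec (IDNA 2003 rules).
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    # Path, query, fragment: UTF-8 bytes, percent-encoded.
    # "%" stays in `safe` so existing escapes are not encoded a second time.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, host, path, query, fragment))

print(iri_to_uri("https://ko.wikipedia.org/wiki/유니코드"))
# https://ko.wikipedia.org/wiki/%EC%9C%A0%EB%8B%88%EC%BD%94%EB%93%9C
```

Each component is handled separately because the host and the rest of the URL use different encodings.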
Percent-Encoding (URL Encoding)
Percent-encoding converts each byte of a UTF-8 sequence into %XX format, where
XX is the hexadecimal byte value.
How It Works
- Encode the Unicode string as UTF-8 bytes
- For each byte that is not an "unreserved" ASCII character, replace it with %XX
Unreserved characters (never encoded): A-Z a-z 0-9 - _ . ~
"café" → UTF-8 bytes: 63 61 66 C3 A9
→ percent-encoded: caf%C3%A9
"日本語" → UTF-8 bytes: E6 97 A5 E6 9C AC E8 AA 9E
→ percent-encoded: %E6%97%A5%E6%9C%AC%E8%AA%9E
"🐍" → UTF-8 bytes: F0 9F 90 8D
→ percent-encoded: %F0%9F%90%8D
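The two steps above can be written out by hand to make the byte-level behaviour visible. This is only a sketch for illustration (the name percent_encode is hypothetical); in real code, urllib.parse.quote does the same job.

```python
import string

# Unreserved characters from RFC 3986, as byte values.
UNRESERVED = set((string.ascii_letters + string.digits + "-_.~").encode("ascii"))

def percent_encode(text: str) -> str:
    """Percent-encode every byte of the UTF-8 form that is not unreserved."""
    out = []
    for byte in text.encode("utf-8"):       # step 1: Unicode -> UTF-8 bytes
        if byte in UNRESERVED:
            out.append(chr(byte))           # keep unreserved ASCII as-is
        else:
            out.append(f"%{byte:02X}")      # step 2: byte -> %XX (uppercase hex)
    return "".join(out)

print(percent_encode("café"))    # caf%C3%A9
print(percent_encode("日本語"))  # %E6%97%A5%E6%9C%AC%E8%AA%9E
print(percent_encode("🐍"))      # %F0%9F%90%8D
```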
In Different Languages
# Python
from urllib.parse import quote, unquote
print(quote("café")) # "caf%C3%A9"
print(unquote("caf%C3%A9")) # "café"
# Full path encoding (preserves /)
print(quote("パス/ファイル", safe="/")) # "%E3%83%91%E3%82%B9/%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB"
// JavaScript
encodeURIComponent("café") // "caf%C3%A9"
decodeURIComponent("caf%C3%A9") // "café"
// encodeURI preserves URI structure characters (/, ?, #, etc.)
// Note: it percent-encodes the host too; it does not convert it to Punycode
encodeURI("https://例え.jp/パス")
// "https://%E4%BE%8B%E3%81%88.jp/%E3%83%91%E3%82%B9"
// PHP
echo rawurlencode("café"); // "caf%C3%A9"
echo rawurldecode("caf%C3%A9"); // "café"
// urlencode() uses + for spaces (form encoding), rawurlencode() uses %20
echo urlencode("hello world"); // "hello+world"
echo rawurlencode("hello world"); // "hello%20world"
Double Encoding — A Common Bug
Double encoding happens when an already-encoded string is encoded again:
"café" → first encode: "caf%C3%A9"
→ double encode: "caf%25C3%25A9" (% itself gets encoded to %25)
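This produces broken URLs. Encode exactly once, at the point where a raw value is placed into a URL, and keep track of whether a string is raw or already encoded. Inspecting the string is not always enough, because a literal % can be legitimate data.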
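The bug is easy to reproduce with Python's urllib.parse. The helper encode_once below is a hypothetical name for a common workaround (decode first, then encode). It assumes the input never contains a literal % that is meant as data, so treat it as a sketch rather than a general fix.

```python
from urllib.parse import quote, unquote

raw = "café"
once = quote(raw)     # correct: caf%C3%A9
twice = quote(once)   # bug: the "%" signs are themselves encoded as %25

print(once)
print(twice)

def encode_once(value: str) -> str:
    """Decode any existing escapes, then encode (assumes no literal '%' as data)."""
    return quote(unquote(value))

# Raw and already-encoded input now converge on the same result.
print(encode_once("café"), encode_once("caf%C3%A9"))
```

Decoding a double-encoded string once only removes the outer layer: unquote(twice) gives back the singly encoded string, not the original text.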
Internationalized Domain Names (IDN)
Domain names have stricter rules than URL paths. The Domain Name System (DNS) only supports ASCII labels (letters, digits, hyphens), so Unicode domain names must be converted to an ASCII-compatible encoding called Punycode.
Punycode and the xn-- Prefix
The IDNA (Internationalized Domain Names in Applications) standard defines how Unicode domain names are converted to Punycode:
Unicode domain: 例え.jp
Punycode: xn--r8jz45g.jp
Unicode domain: münchen.de
Punycode: xn--mnchen-3ya.de
Unicode domain: 中文.com
Punycode: xn--fiq228c.com
The xn-- prefix signals that the label is Punycode-encoded. Each label
(separated by dots) is encoded independently.
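The label-by-label encoding can be sketched with Python's built-in punycode codec. The function names are hypothetical, and the sketch deliberately skips the mapping and validation steps that IDNA performs before encoding (case folding, normalization, disallowed characters), so it only illustrates the xn-- mechanics.

```python
def to_ascii_label(label: str) -> str:
    """Encode one DNS label; ASCII labels pass through unchanged."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")

def to_ascii_domain(domain: str) -> str:
    """Encode each dot-separated label independently."""
    return ".".join(to_ascii_label(label) for label in domain.split("."))

print(to_ascii_domain("例え.jp"))      # xn--r8jz45g.jp
print(to_ascii_domain("münchen.de"))   # xn--mnchen-3ya.de
print(to_ascii_domain("中文.com"))     # xn--fiq228c.com
```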
How Browsers Handle IDN
When you type münchen.de in a browser:
- The browser converts each label to Punycode: xn--mnchen-3ya.de
- DNS lookup uses the Punycode form
- The address bar displays the Unicode form (if the TLD is trusted)
Browsers apply security filtering to IDN display. If a domain mixes scripts (e.g., Cyrillic and Latin characters that look identical), the browser may show the Punycode form instead to prevent phishing:
аpple.com (Cyrillic "а" + Latin "pple")
→ displayed as: xn--pple-43d.com (phishing protection)
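A rough version of the mixed-script check can be sketched in Python. The standard library has no script property lookup, so this sketch uses the first word of each character's Unicode name as a stand-in for its script. That is a heuristic chosen for illustration, not what browsers actually implement, and the function names are hypothetical.

```python
import unicodedata

def scripts_in_label(label: str) -> set[str]:
    """Approximate the scripts in a label from Unicode character names."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def looks_mixed_script(label: str) -> bool:
    return len(scripts_in_label(label)) > 1

spoof = "\u0430pple"                    # Cyrillic "а" followed by Latin "pple"
print(scripts_in_label(spoof))          # contains both CYRILLIC and LATIN
print(looks_mixed_script(spoof))        # True
print(looks_mixed_script("münchen"))    # False: Latin only
print("xn--" + spoof.encode("punycode").decode("ascii"))  # xn--pple-43d
```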
IDNA 2003 vs. IDNA 2008
Two versions of the IDNA standard exist, and they disagree on some characters:
| Character | IDNA 2003 | IDNA 2008 |
|---|---|---|
| ß (German sharp s) | Mapped to ss | Allowed as-is |
| ς (Greek final sigma) | Mapped to σ | Allowed as-is |
| ZWJ / ZWNJ | Removed | Allowed in specific contexts |
Most modern browsers and registrars use IDNA 2008 (via the UTS #46 compatibility mapping), but inconsistencies still exist.
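The IDNA 2003 column of the table can be observed directly in Python, whose built-in idna codec follows the 2003 rules. The sketch below only demonstrates that older behaviour. An IDNA 2008 library would keep ß and ς instead of mapping them, which is not shown here.

```python
# Python's built-in "idna" codec applies the IDNA 2003 mappings.
sharp_s = "straße.de".encode("idna")
print(sharp_s)  # b'strasse.de': ß was mapped to ss, so no xn-- label is needed

# Final sigma is folded to regular sigma before encoding,
# so both spellings produce the same ASCII label.
final_sigma = "ας.gr".encode("idna")
plain_sigma = "ασ.gr".encode("idna")
print(final_sigma == plain_sigma)  # True
```

Because the two standards can resolve the same typed name to different ASCII labels, it matters which one a given library implements.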
Working with IDN in Code
# Python (uses the third-party idna package: pip install idna)
import idna
print(idna.encode("münchen.de")) # b"xn--mnchen-3ya.de"
print(idna.decode("xn--mnchen-3ya.de")) # "münchen.de"
// JavaScript (URL API handles IDN automatically)
const url = new URL("https://münchen.de/path");
console.log(url.hostname); // "xn--mnchen-3ya.de" (Punycode)
console.log(url.href); // "https://xn--mnchen-3ya.de/path"
How Browsers Display Unicode URLs
Modern browsers perform a complex dance with Unicode URLs:
- Address bar input: Accept Unicode, convert domain to Punycode, percent-encode path/query
- Address bar display: Show Unicode (decoded) for trusted domains, Punycode for suspicious mixed-script domains
- Copy from address bar: Some browsers copy the encoded form, others copy the Unicode form (varies by browser and OS)
- Link display on hover: Varies by browser; the status bar may show either the decoded or the percent-encoded form
Example Flow
User types: https://ko.wikipedia.org/wiki/유니코드
DNS lookup: ko.wikipedia.org (already ASCII)
HTTP request: GET /wiki/%EC%9C%A0%EB%8B%88%EC%BD%94%EB%93%9C HTTP/1.1
Address bar: https://ko.wikipedia.org/wiki/유니코드 (decoded display)
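The last step of the flow, turning the wire form back into something readable, can be sketched in Python. The function name uri_to_display is hypothetical, and the result is meant for display only: decoding the query can turn an escaped & or = into a structural character, so the output should not be parsed again as a URL.

```python
from urllib.parse import urlsplit, urlunsplit, unquote

def uri_to_display(uri: str) -> str:
    """Decode Punycode and percent-encoding for display (not for re-parsing)."""
    parts = urlsplit(uri)
    # Host: ASCII-compatible labels back to Unicode via the built-in codec.
    host = parts.hostname.encode("ascii").decode("idna") if parts.hostname else ""
    return urlunsplit((
        parts.scheme,
        host,
        unquote(parts.path),      # %XX bytes -> UTF-8 text
        unquote(parts.query),
        unquote(parts.fragment),
    ))

print(uri_to_display("https://ko.wikipedia.org/wiki/%EC%9C%A0%EB%8B%88%EC%BD%94%EB%93%9C"))
# https://ko.wikipedia.org/wiki/유니코드
print(uri_to_display("https://xn--mnchen-3ya.de/path"))
# https://münchen.de/path
```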
Email Addresses and Unicode
Email addresses have their own Unicode extension: EAI (Email Address Internationalization, RFC 6531). It allows Unicode in both the local part and the domain:
用户@例え.jp — fully internationalized
café@münchen.de — Unicode local part + IDN domain
Support is still limited — many mail servers and services do not accept EAI addresses. Domain-only internationalization (using Punycode) is more widely supported.
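Domain-only internationalization can be sketched as follows: the local part is left untouched and only the domain is converted to its ASCII form. The function name is hypothetical, and the sketch uses Python's built-in idna codec (IDNA 2003 rules). Whether a given mail server accepts the result is outside what this code can tell you.

```python
def email_domain_to_ascii(address: str) -> str:
    """Convert only the domain of an email address to Punycode."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        raise ValueError("not an email address: missing '@'")
    ascii_domain = domain.encode("idna").decode("ascii")
    return f"{local}@{ascii_domain}"

print(email_domain_to_ascii("info@münchen.de"))  # info@xn--mnchen-3ya.de
print(email_domain_to_ascii("café@münchen.de"))  # local part still needs EAI support
```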
SEO and Unicode URLs
Search engines handle Unicode URLs well:
- Google indexes and ranks Unicode URLs identically to their percent-encoded equivalents
- Sitemaps: the sitemaps.org protocol asks for UTF-8 files with percent-encoded URLs, plus XML entity escaping for characters such as & and <; some crawlers also accept the IRI form
- Canonical URLs should be consistent — pick either encoded or decoded and use it everywhere
- Percent-encoded URLs are long and hard to read when shared on social platforms; use Unicode-friendly URL slugs when possible
<!-- Sitemap with Unicode URL -->
<url>
<loc>https://example.com/%E6%97%A5%E6%9C%AC%E8%AA%9E</loc>
</url>
<!-- Or with IRI (less common, but valid) -->
<url>
<loc>https://example.com/日本語</loc>
</url>
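Generating the percent-encoded form of a sitemap entry takes two separate escaping steps: percent-encoding for the URL, then entity escaping for XML. The sketch below uses a hypothetical helper, sitemap_loc, and assumes the host is already ASCII (an internationalized host would need Punycode first).

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

def sitemap_loc(url: str) -> str:
    """Build a <loc> element: percent-encode, then XML-escape."""
    # Keep URL structure characters and existing escapes intact.
    encoded = quote(url, safe=":/?#[]@!$&'()*+,;=%")
    # escape() turns &, < and > into XML entities.
    return f"<loc>{escape(encoded)}</loc>"

print(sitemap_loc("https://example.com/日本語"))
# <loc>https://example.com/%E6%97%A5%E6%9C%AC%E8%AA%9E</loc>
print(sitemap_loc("https://example.com/?a=1&b=2"))
# <loc>https://example.com/?a=1&amp;b=2</loc>
```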
Common Pitfalls
1. Assuming URLs Are Unicode Strings
URLs on the wire are byte sequences, not Unicode text. Always encode before sending and decode after receiving.
2. Using the Wrong Encoding
Some legacy systems encode URL paths using Latin-1 or Shift-JIS instead of UTF-8. The web standard (WHATWG URL spec) mandates UTF-8 for percent-encoding, but older systems may not comply.
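The effect of a mismatched encoding is easy to see with urllib.parse, which lets the caller choose the encoding. The sketch uses Latin-1 as the example legacy encoding; which encoding a particular legacy system really uses is something you would have to find out separately.

```python
from urllib.parse import quote, unquote

legacy = quote("café", encoding="latin-1")
print(legacy)  # caf%E9: one byte for é instead of UTF-8's two

# Decoding as UTF-8 (the default) fails softly: the stray byte
# becomes U+FFFD, the replacement character.
print(unquote(legacy) == "caf\ufffd")  # True

# Decoding with the encoding the sender used recovers the text.
print(unquote(legacy, encoding="latin-1"))  # café
```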
3. Spaces: + vs %20
The + encoding for spaces is specific to application/x-www-form-urlencoded
(HTML form submissions). In URL paths, spaces must be %20:
Form query: ?search=hello+world (+ for space)
URL path: /hello%20world (%20 for space)
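In Python the two conventions map to different functions in urllib.parse, which makes it easy to pick the wrong one. A short sketch of the difference:

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus, urlencode

# Path segments: space becomes %20.
print(quote("hello world"))                 # hello%20world

# Form-style query strings: space becomes +.
print(quote_plus("hello world"))            # hello+world
print(urlencode({"search": "hello world"})) # search=hello+world

# Decoding must match: unquote() leaves "+" alone, unquote_plus() turns it into a space.
print(unquote("hello+world"))               # hello+world
print(unquote_plus("hello+world"))          # hello world
```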
4. Fragment Identifiers
The fragment (#section) is never sent to the server — it is processed entirely
by the browser. Percent-encoding in fragments is handled client-side.
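Client code that builds requests by hand needs to mirror this by stripping the fragment before sending. A small sketch using urllib.parse.urldefrag follows; the example URL is made up for illustration.

```python
from urllib.parse import urldefrag

result = urldefrag("https://example.com/guide#断片")
print(result.url)       # https://example.com/guide  (what goes to the server)
print(result.fragment)  # 断片  (stays on the client)
```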
5. Normalization
The same URL can be written multiple ways:
https://example.com/caf%C3%A9 (percent-encoded)
https://example.com/café (IRI form)
https://example.com/caf%c3%a9 (lowercase hex)
These are all equivalent. For caching and comparison, normalize to a canonical form (uppercase hex digits, NFC normalization for the decoded text).
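One way to sketch that canonical form in Python is to decode, apply NFC, and encode again. The function name normalize_path is hypothetical, and the sketch assumes the path contains no percent-encoded reserved characters such as %2F, because decoding those would change the path's structure.

```python
import unicodedata
from urllib.parse import quote, unquote

def normalize_path(path: str) -> str:
    """Canonicalize a URL path: NFC text, UTF-8, uppercase hex escapes."""
    decoded = unquote(path)                       # handles %c3%a9 and %C3%A9 alike
    nfc = unicodedata.normalize("NFC", decoded)   # compose e + combining accent
    return quote(nfc, safe="/")                   # quote() emits uppercase hex

variants = [
    "/caf%C3%A9",      # percent-encoded
    "/café",           # IRI form
    "/caf%c3%a9",      # lowercase hex
    "/cafe\u0301",     # decomposed: e + U+0301
]
print({normalize_path(v) for v in variants})  # {'/caf%C3%A9'}
```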
Quick Reference
| Task | Method |
|---|---|
| Encode path component | encodeURIComponent() / urllib.parse.quote(s, safe="") |
| Decode path component | decodeURIComponent() / urllib.parse.unquote() |
| Encode full URL | encodeURI() / urllib.parse.quote(s, safe=":/?#[]@!$&'()*+,;=") |
| Domain to Punycode | idna.encode() / URL API |
| Punycode to Unicode | idna.decode() / browser display |
| Check for valid URL | new URL(str) (throws on invalid input) / urllib.parse.urlparse() (lenient; check scheme and netloc yourself) |
The URL system's ASCII legacy is a historical constraint, not a permanent limitation. Modern standards (IRI, IDNA 2008, WHATWG URL) and modern browsers work together to make Unicode URLs seamless for users while maintaining backward compatibility with the ASCII-only DNS and HTTP infrastructure.