Unicode in URLs

URLs are technically restricted to ASCII characters, so non-ASCII text must be percent-encoded as UTF-8 bytes before being included in a URL path or query string. This guide explains percent-encoding, Internationalized Resource Identifiers (IRIs), Punycode for domains, and how to encode and decode Unicode URLs correctly.

URLs were designed in the early 1990s for a world that spoke ASCII. The original RFC 1738 (1994) restricted URLs to a small subset of ASCII characters: letters, digits, and a handful of special characters like /, ?, and #. But the web is global, and people need URLs that contain Chinese, Arabic, Cyrillic, emoji, and every other script. This guide explains how Unicode characters travel through the URL system — from the address bar to the server and back.

The Anatomy of a URL

A URL has several components, and each handles Unicode differently:

https://例え.jp/パス/ファイル?検索=値#断片
  │        │      │           │     │
  scheme   host   path      query  fragment

Component       Unicode Handling
Scheme          ASCII only (https, ftp, etc.)
Host (domain)   Punycode encoding (IDNA)
Path            Percent-encoding (UTF-8 bytes)
Query           Percent-encoding (UTF-8 bytes)
Fragment        Percent-encoding (UTF-8 bytes)
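
To see the components in practice, here is a minimal sketch that splits the example URL with Python's standard urllib.parse.urlsplit. The variable name parts is only illustrative. Note that urlsplit separates the pieces but does not encode them; the host and path come back as the Unicode text that went in.

```python
from urllib.parse import urlsplit

# Split the example IRI into its components (no encoding is applied here).
parts = urlsplit("https://例え.jp/パス/ファイル?検索=値#断片")

print(parts.scheme)    # https
print(parts.netloc)    # 例え.jp
print(parts.path)      # /パス/ファイル
print(parts.query)     # 検索=値
print(parts.fragment)  # 断片
```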

URIs vs. IRIs

The distinction between URIs and IRIs is fundamental:

  • URI (Uniform Resource Identifier, RFC 3986) — restricted to ASCII. Non-ASCII characters must be percent-encoded.
  • IRI (Internationalized Resource Identifier, RFC 3987) — extends URIs to allow Unicode characters directly.

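In practice, browsers display IRIs in the address bar (showing Unicode characters) but send URIs over the wire (with percent-encoding). The IRI-to-URI conversion is well-defined by RFC 3987. The reverse direction works whenever the percent-encoded bytes are valid UTF-8, although it is not guaranteed to reproduce the original IRI character for character.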

IRI:  https://ko.wikipedia.org/wiki/유니코드
URI:  https://ko.wikipedia.org/wiki/%EC%9C%A0%EB%8B%88%EC%BD%94%EB%93%9C
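
The conversion can be sketched in a few lines of Python. This is a simplified illustration, not a complete RFC 3987 implementation: the function name iri_to_uri is hypothetical, it ignores userinfo, it uses Python's built-in "idna" codec (which follows the older IDNA 2003 rules) for the host, and it keeps existing % signs so that already-encoded input is not encoded twice.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def iri_to_uri(iri: str) -> str:
    """Convert an IRI to an ASCII-only URI (simplified sketch)."""
    parts = urlsplit(iri)
    # Host: convert each label to its ASCII-compatible (Punycode) form.
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Path, query, fragment: percent-encode the UTF-8 bytes.
    # "%" stays in the safe set so existing escapes are left alone.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))

print(iri_to_uri("https://ko.wikipedia.org/wiki/유니코드"))
# https://ko.wikipedia.org/wiki/%EC%9C%A0%EB%8B%88%EC%BD%94%EB%93%9C
```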

Percent-Encoding (URL Encoding)

Percent-encoding converts each byte of a UTF-8 sequence into %XX format, where XX is the hexadecimal byte value.

How It Works

  1. Encode the Unicode string as UTF-8 bytes
  2. For each byte that is not an "unreserved" ASCII character, replace it with %XX

Unreserved characters (never encoded): A-Z a-z 0-9 - _ . ~

"café" → UTF-8 bytes: 63 61 66 C3 A9
       → percent-encoded: caf%C3%A9

"日本語" → UTF-8 bytes: E6 97 A5 E6 9C AC E8 AA 9E
         → percent-encoded: %E6%97%A5%E6%9C%AC%E8%AA%9E

"🐍" → UTF-8 bytes: F0 9F 90 8D
     → percent-encoded: %F0%9F%90%8D
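
The two steps above can be written out by hand. The sketch below is for illustration only (the names UNRESERVED and percent_encode are made up for this example); in real code, urllib.parse.quote does the same job.

```python
# Unreserved characters listed above: A-Z a-z 0-9 - _ . ~
UNRESERVED = set(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-_.~"
)

def percent_encode(text: str) -> str:
    """Percent-encode every byte that is not an unreserved ASCII character."""
    out = []
    for byte in text.encode("utf-8"):      # step 1: Unicode -> UTF-8 bytes
        if byte in UNRESERVED:
            out.append(chr(byte))          # keep unreserved characters as-is
        else:
            out.append(f"%{byte:02X}")     # step 2: byte -> %XX (uppercase hex)
    return "".join(out)

print(percent_encode("café"))    # caf%C3%A9
print(percent_encode("日本語"))   # %E6%97%A5%E6%9C%AC%E8%AA%9E
print(percent_encode("🐍"))      # %F0%9F%90%8D
```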

In Different Languages

# Python
from urllib.parse import quote, unquote
print(quote("café"))       # "caf%C3%A9"
print(unquote("caf%C3%A9"))  # "café"

# Full path encoding (preserves /)
print(quote("パス/ファイル", safe="/"))   # "%E3%83%91%E3%82%B9/%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB"

// JavaScript
encodeURIComponent("café")        // "caf%C3%A9"
decodeURIComponent("caf%C3%A9")   // "café"

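// encodeURI preserves URI structure characters (/, ?, #, etc.).
// Note: it percent-encodes the host as well and does not apply Punycode,
// so use the URL API when the domain itself is internationalized.
encodeURI("https://例え.jp/パス")
// "https://%E4%BE%8B%E3%81%88.jp/%E3%83%91%E3%82%B9"

// PHP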
echo rawurlencode("café");     // "caf%C3%A9"
echo rawurldecode("caf%C3%A9"); // "café"

// urlencode() uses + for spaces (form encoding), rawurlencode() uses %20
echo urlencode("hello world");     // "hello+world"
echo rawurlencode("hello world");  // "hello%20world"

Double Encoding — A Common Bug

Double encoding happens when an already-encoded string is encoded again:

"café" → first encode:  "caf%C3%A9"
       → double encode: "caf%25C3%25A9"  (% itself gets encoded to %25)

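This produces broken URLs. An encoded string and a raw string that happens to contain a % cannot be told apart reliably by inspection, so the safest approach is to keep values in decoded form internally and apply percent-encoding exactly once, at the point where the URL is built.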
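
A short Python demonstration of the bug, plus one possible workaround. The helper encode_once is hypothetical: it decodes first and then encodes, which makes repeated calls harmless, but it assumes the input never contains a literal % that is meant as data.

```python
from urllib.parse import quote, unquote

once = quote("café")    # correct: encoded one time
twice = quote(once)     # bug: the "%" characters are encoded again

print(once)    # caf%C3%A9
print(twice)   # caf%25C3%25A9

def encode_once(value: str) -> str:
    """Decode, then encode, so already-encoded input is not encoded twice.

    Assumption: value never contains a literal "%" intended as data.
    """
    return quote(unquote(value))

print(encode_once("café"))        # caf%C3%A9
print(encode_once("caf%C3%A9"))   # caf%C3%A9
```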

Internationalized Domain Names (IDN)

Domain names have stricter rules than URL paths. The Domain Name System (DNS) only supports ASCII labels (letters, digits, hyphens), so Unicode domain names must be converted to an ASCII-compatible encoding called Punycode.

Punycode and the xn-- Prefix

The IDNA (Internationalized Domain Names in Applications) standard defines how Unicode domain names are converted to Punycode:

Unicode domain:  例え.jp
Punycode:        xn--r8jz45g.jp

Unicode domain:  münchen.de
Punycode:        xn--mnchen-3ya.de

Unicode domain:  中文.com
Punycode:        xn--fiq228c.com

The xn-- prefix signals that the label is Punycode-encoded. Each label (separated by dots) is encoded independently.
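
The per-label behaviour can be illustrated with Python's built-in "punycode" codec. This sketch shows only the raw Punycode step plus the xn-- prefix; the function names are invented for the example, and the mapping and validation steps that IDNA performs before encoding (case folding, normalization, disallowed characters) are deliberately left out.

```python
def label_to_ascii(label: str) -> str:
    """Encode one domain label; ASCII labels pass through unchanged."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")

def domain_to_ascii(domain: str) -> str:
    """Encode each dot-separated label independently."""
    return ".".join(label_to_ascii(label) for label in domain.split("."))

print(domain_to_ascii("münchen.de"))   # xn--mnchen-3ya.de
print(domain_to_ascii("中文.com"))      # xn--fiq228c.com
```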

How Browsers Handle IDN

When you type münchen.de in a browser:

  1. The browser converts each label to Punycode: xn--mnchen-3ya.de
  2. DNS lookup uses the Punycode form
  3. The address bar displays the Unicode form if the domain passes the browser's IDN display checks; otherwise it shows the Punycode form

Browsers apply security filtering to IDN display. If a domain mixes scripts (e.g., Cyrillic and Latin characters that look identical), the browser may show the Punycode form instead to prevent phishing:

аpple.com  (Cyrillic "а" + Latin "pple")
→ displayed as: xn--pple-43d.com  (phishing protection)
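
The idea behind mixed-script detection can be sketched with the standard unicodedata module. This is a crude heuristic for illustration, not the algorithm any browser uses: it takes the first word of each letter's Unicode character name as a stand-in for its script, and the function names are hypothetical. It would also flag legitimate combinations such as kanji plus hiragana, which real browsers allow.

```python
import unicodedata

def scripts_in_label(label: str) -> set[str]:
    """Approximate the scripts in a label from Unicode character names."""
    scripts = set()
    for ch in label:
        if ch.isalpha():  # ignore digits and hyphens
            # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def looks_mixed(label: str) -> bool:
    """True if the label contains letters from more than one script."""
    return len(scripts_in_label(label)) > 1

print(looks_mixed("apple"))        # False (all Latin)
print(looks_mixed("\u0430pple"))   # True  (Cyrillic "а" + Latin "pple")
```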

IDNA 2003 vs. IDNA 2008

Two versions of the IDNA standard exist, and they disagree on some characters:

Character               IDNA 2003      IDNA 2008
ß (German sharp s)      Mapped to ss   Allowed as-is
ς (Greek final sigma)   Mapped to σ    Allowed as-is
ZWJ / ZWNJ              Removed        Allowed in specific contexts

Most modern browsers and registrars use IDNA 2008 (via the UTS #46 compatibility mapping), but inconsistencies still exist.
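
The IDNA 2003 behaviour is easy to observe in Python, because the built-in "idna" codec implements the older standard (RFC 3490). The snippet below is a small illustration; the variable names are arbitrary.

```python
# Python's built-in "idna" codec follows IDNA 2003,
# so the sharp s is mapped to "ss" before encoding.
sharp_s = "straße.de".encode("idna")
umlaut = "münchen.de".encode("idna")

print(sharp_s)   # b'strasse.de'
print(umlaut)    # b'xn--mnchen-3ya.de'
```

Under IDNA 2008 the ß would be kept and encoded as part of a Punycode label instead, so the two standards can resolve the same typed name to different domains.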

Working with IDN in Code

# Python (third-party "idna" package, IDNA 2008: pip install idna)
import idna
print(idna.encode("münchen.de"))   # b"xn--mnchen-3ya.de"
print(idna.decode("xn--mnchen-3ya.de"))  # "münchen.de"

// JavaScript (URL API handles IDN automatically)
const url = new URL("https://münchen.de/path");
console.log(url.hostname);   // "xn--mnchen-3ya.de" (Punycode)
console.log(url.href);       // "https://xn--mnchen-3ya.de/path"

How Browsers Display Unicode URLs

Modern browsers perform a complex dance with Unicode URLs:

  1. Address bar input: Accept Unicode, convert domain to Punycode, percent-encode path/query
  2. Address bar display: Show Unicode (decoded) for trusted domains, Punycode for suspicious mixed-script domains
  3. Copy from address bar: Some browsers copy the encoded form, others copy the Unicode form (varies by browser and OS)
  4. Link display on hover: Varies by browser; some show the percent-encoded form, others the decoded Unicode form

Example Flow

User types:     https://ko.wikipedia.org/wiki/유니코드
DNS lookup:     ko.wikipedia.org (already ASCII)
HTTP request:   GET /wiki/%EC%9C%A0%EB%8B%88%EC%BD%94%EB%93%9C HTTP/2
Address bar:    https://ko.wikipedia.org/wiki/유니코드  (decoded display)

Email Addresses and Unicode

Email addresses have their own Unicode extension: EAI (Email Address Internationalization, RFC 6531). It allows Unicode in both the local part and the domain:

用户@例え.jp        — fully internationalized
café@münchen.de   — Unicode local part + IDN domain

Support is still limited — many mail servers and services do not accept EAI addresses. Domain-only internationalization (using Punycode) is more widely supported.
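
Domain-only internationalization can be sketched as follows. The function email_domain_to_ascii is a hypothetical helper: it leaves the local part untouched and converts only the domain, using Python's built-in "idna" codec (IDNA 2003). It does not validate the address.

```python
def email_domain_to_ascii(address: str) -> str:
    """Convert only the domain of an email address to Punycode."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        raise ValueError(f"not an email address: {address!r}")
    ascii_domain = domain.encode("idna").decode("ascii")
    return f"{local}@{ascii_domain}"

print(email_domain_to_ascii("info@münchen.de"))   # info@xn--mnchen-3ya.de
print(email_domain_to_ascii("café@münchen.de"))   # café@xn--mnchen-3ya.de
```

The second result still has a Unicode local part, so it would only be deliverable through servers that support EAI.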

SEO and Unicode URLs

Search engines handle Unicode URLs well:

  • Google indexes and ranks Unicode URLs identically to their percent-encoded equivalents
  • Sitemap files must be UTF-8 encoded; the sitemaps.org protocol asks for URLs to be percent-encoded and for XML special characters such as & to be entity-escaped
  • Canonical URLs should be consistent: pick either the encoded or the decoded form and use it everywhere
  • Percent-encoded URLs are hard to read when shared on social platforms; use Unicode-friendly URL slugs when possible

<!-- Sitemap with Unicode URL -->
<url>
    <loc>https://example.com/%E6%97%A5%E6%9C%AC%E8%AA%9E</loc>
</url>

<!-- Or with IRI (less common, but valid) -->
<url>
    <loc>https://example.com/日本語</loc>
</url>

Common Pitfalls

1. Assuming URLs Are Unicode Strings

URLs on the wire are byte sequences, not Unicode text. Always encode before sending and decode after receiving.

2. Using the Wrong Encoding

Some legacy systems encode URL paths using Latin-1 or Shift-JIS instead of UTF-8. The web standard (WHATWG URL spec) mandates UTF-8 for percent-encoding, but older systems may not comply.
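
When a legacy client might send non-UTF-8 escapes, one defensive option is to try strict UTF-8 first and fall back to a legacy encoding only if that fails. The sketch below assumes Latin-1 as the fallback; the function name and the choice of fallback are illustrative and depend on which legacy systems you actually have to serve.

```python
from urllib.parse import unquote

def decode_with_fallback(encoded: str, fallback: str = "latin-1") -> str:
    """Decode percent-escapes as UTF-8, falling back to a legacy encoding."""
    try:
        # errors="strict" makes invalid UTF-8 raise instead of yielding U+FFFD.
        return unquote(encoded, encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        return unquote(encoded, encoding=fallback, errors="strict")

print(decode_with_fallback("caf%C3%A9"))   # café (UTF-8 escapes)
print(decode_with_fallback("caf%E9"))      # café (Latin-1 escape)
```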

3. Spaces: + vs %20

The + encoding for spaces is specific to application/x-www-form-urlencoded (HTML form submissions). In URL paths, spaces must be %20:

Form query:  ?search=hello+world      (+ for space)
URL path:    /hello%20world            (%20 for space)
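
In Python the two conventions map onto different functions in urllib.parse. A brief illustration (variable names are arbitrary):

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus, urlencode

path_segment = quote("hello world")              # for URL paths
form_value = quote_plus("hello world")           # for form-encoded values
query = urlencode({"search": "hello world"})     # builds a form-style query

print(path_segment)   # hello%20world
print(form_value)     # hello+world
print(query)          # search=hello+world

# Decoding must match the convention: unquote() leaves "+" untouched.
print(unquote("hello+world"))        # hello+world
print(unquote_plus("hello+world"))   # hello world
```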

4. Fragment Identifiers

The fragment (#section) is never sent to the server — it is processed entirely by the browser. Percent-encoding in fragments is handled client-side.

5. Normalization

The same URL can be written multiple ways:

https://example.com/caf%C3%A9     (percent-encoded)
https://example.com/café           (IRI form)
https://example.com/caf%c3%a9     (lowercase hex)

These are all equivalent. For caching and comparison, normalize to a canonical form (uppercase hex digits, NFC normalization for the decoded text).
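
One way to sketch this canonicalization for a single path segment is to decode, apply NFC, and re-encode. The function normalize_segment is a hypothetical helper; it works on one segment at a time so that an escaped slash (%2F) is not turned into a path separator, and it re-encodes every character outside the unreserved set, which is stricter than some applications need.

```python
import unicodedata
from urllib.parse import quote, unquote

def normalize_segment(segment: str) -> str:
    """Canonicalize one path segment: NFC text, uppercase %XX escapes."""
    decoded = unquote(segment, errors="strict")
    composed = unicodedata.normalize("NFC", decoded)
    return quote(composed, safe="")

for form in ["caf%C3%A9", "café", "caf%c3%a9", "cafe\u0301"]:
    print(normalize_segment(form))   # caf%C3%A9 each time
```

The last input uses "e" followed by a combining acute accent (NFD); without the NFC step it would encode to different bytes and compare as a different URL.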

Quick Reference

Task                    Method
Encode path component   encodeURIComponent() / urllib.parse.quote()
Decode path component   decodeURIComponent() / urllib.parse.unquote()
Encode full URL         encodeURI() / urllib.parse.quote(s, safe=":/?#[]@!$&'()*+,;=")
Domain to Punycode      idna.encode() / URL API
Punycode to Unicode     idna.decode() / browser display
Parse a URL             new URL(str) (throws on invalid input) / urllib.parse.urlparse() (lenient; does not validate)

The URL system's ASCII legacy is a historical constraint, not a permanent limitation. Modern standards (IRI, IDNA 2008, WHATWG URL) and modern browsers work together to make Unicode URLs seamless for users while maintaining backward compatibility with the ASCII-only DNS and HTTP infrastructure.
