Unicode for the Modern Web · Chapter 4
APIs and Unicode: JSON, URLs, and Headers
JSON's \uXXXX escapes, percent-encoding in URLs, IDN, Content-Type charset headers — APIs have many Unicode touchpoints. This chapter provides a comprehensive guide to handling Unicode in HTTP APIs.
APIs cross boundaries — between processes, machines, organizations, and programming languages. At each boundary, characters must be serialized into bytes and deserialized back. Get the encoding wrong at any step and data is silently corrupted. This chapter covers the Unicode requirements of the most common API formats and transport mechanisms: JSON, URLs, HTTP headers, email, GraphQL, and WebSockets.
JSON and Unicode
JSON is defined by RFC 8259 (2017), which mandates UTF-8 as the sole encoding for exchanged JSON:
"JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8."
Despite this, JSON has a native escape mechanism: \uXXXX, a backslash followed by four hexadecimal digits representing a UTF-16 code unit:
{
  "greeting": "Hello \u4e16\u754c",
  "emoji_name": "Grinning Face",
  "emoji": "\uD83D\uDE00"
}
The \uXXXX form can only represent code points U+0000–U+FFFF (the BMP). Supplementary characters require surrogate pairs in JSON:
{
  "emoji": "\uD83D\uDE00"
}
\uD83D\uDE00 is the JSON surrogate-pair encoding for U+1F600 (😀). This is an artifact of JSON's JavaScript heritage: JavaScript strings are sequences of UTF-16 code units, and the escape syntax mirrors them. RFC 8259 does not forbid unpaired (lone) surrogates in escapes, but warns that their behavior across implementations is unpredictable.
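To see the pair reassembled, the escaped form can be round-tripped through a parser; a minimal Python sketch:

```python
import json

# json.loads joins the surrogate pair \uD83D\uDE00 back into
# the single code point U+1F600 during decoding
decoded = json.loads('{"emoji": "\\uD83D\\uDE00"}')
assert decoded["emoji"] == "\U0001F600"   # 😀
assert len(decoded["emoji"]) == 1         # one code point in Python strings
```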
In practice: always use literal UTF-8 characters in JSON when possible. The \uXXXX escape is only required for control characters (U+0000–U+001F) that lack a shorthand escape such as \n or \t, though it can also be useful for making invisible characters visible while debugging:
import json
data = {"message": "Hello 😀", "city": "São Paulo"}
# Compact, human-readable (recommended)
json.dumps(data, ensure_ascii=False)
# '{"message": "Hello 😀", "city": "São Paulo"}'
# ASCII-safe escape (legacy compatibility)
json.dumps(data, ensure_ascii=True)
# '{"message": "Hello \\ud83d\\ude00", "city": "S\\u00e3o Paulo"}'
Python's json.dumps defaults to ensure_ascii=True for historical reasons. Set ensure_ascii=False for readable, compact JSON unless you need ASCII-only output.
URL Encoding: Percent-Encoding and IDNs
URLs are defined by RFC 3986 and allow only a specific set of ASCII characters in their structure. Non-ASCII characters and "reserved" ASCII characters that appear in query values must be percent-encoded:
// encodeURI — encodes everything except URI structural characters
encodeURI("https://example.com/path?q=Hello 世界")
// "https://example.com/path?q=Hello%20%E4%B8%96%E7%95%8C"
// encodeURIComponent — encodes everything except unreserved chars (use for values)
encodeURIComponent("Hello 世界 & more")
// "Hello%20%E4%B8%96%E7%95%8C%20%26%20more"
// Do NOT use escape() — it predates UTF-8 percent-encoding and is non-standard
escape("€") // "%u20AC" — non-standard, incorrect
encodeURIComponent("€") // "%E2%82%AC" — correct UTF-8 percent-encoding
The rule is simple: percent-encoding converts the character's UTF-8 byte sequence to %XX per byte. The string 世界 in UTF-8 is E4 B8 96 E7 95 8C, encoded as %E4%B8%96%E7%95%8C.
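That byte-per-byte rule is easy to check from Python's standard library; a small sketch:

```python
from urllib.parse import quote, unquote

# quote() percent-encodes each byte of the character's UTF-8 encoding
encoded = quote("世界")
assert encoded == "%E4%B8%96%E7%95%8C"
# unquote() reverses the process, decoding the bytes as UTF-8
assert unquote(encoded) == "世界"
```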
Internationalized Domain Names (IDN) use Punycode to encode non-ASCII hostnames into ASCII-compatible form for DNS:
münchen.de → xn--mnchen-3ya.de
日本語.jp → xn--wgv71a309e.jp
(Emoji are not permitted in domain labels under IDNA2008, though a few registries still accept them under the older IDNA2003 rules.)
Modern browsers display the Unicode form in the address bar but use Punycode internally. When constructing URLs in code, use a URL library that handles IDN normalization:
const url = new URL("https://münchen.de/path");
console.log(url.hostname); // "xn--mnchen-3ya.de" — automatically Punycode-encoded
console.log(url.href); // "https://xn--mnchen-3ya.de/path"
# Python's built-in idna codec implements the older IDNA 2003 rules;
# the third-party idna package on PyPI implements IDNA 2008
hostname = "münchen.de".encode('idna').decode('ascii')
# 'xn--mnchen-3ya.de'
HTTP Headers and Charset
HTTP/1.1 headers are defined in RFC 7230 (since obsoleted by RFCs 9110 and 9112) as sequences of printable US-ASCII characters (0x21–0x7E) plus horizontal tab and space. Non-ASCII characters are technically illegal in HTTP headers.
In practice there are two encoding strategies for non-ASCII in HTTP header values:
RFC 5987 / RFC 8187 encoding (charset'language'percent-encoded):
Content-Disposition: attachment; filename*=UTF-8''%C3%A9t%C3%A9.pdf
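Such a header value can be built with ordinary percent-encoding; a sketch in Python (the filename is an arbitrary example):

```python
from urllib.parse import quote

filename = "été.pdf"
# RFC 8187 value: charset, optional language tag, then percent-encoded UTF-8 bytes
value = "attachment; filename*=UTF-8''" + quote(filename, safe="")
assert value == "attachment; filename*=UTF-8''%C3%A9t%C3%A9.pdf"
```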
MIME encoded-words (RFC 2047) — used in email, sometimes in HTTP:
Subject: =?UTF-8?B?SGVsbG8gV29ybGQ=?=
For the Content-Type header, charset is a parameter:
Content-Type: text/html; charset=UTF-8
Content-Type: application/json; charset=UTF-8
Note: application/json technically does not accept a charset parameter per RFC 8259 (UTF-8 is the only option), but many servers send it anyway for compatibility.
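On the receiving side, the charset parameter can be extracted with the standard-library email.message machinery (a sketch; HTTP frameworks usually do this for you):

```python
from email.message import Message

# Message parses RFC-822-style headers, including Content-Type parameters
msg = Message()
msg["Content-Type"] = "text/html; charset=UTF-8"
assert msg.get_content_charset() == "utf-8"  # normalized to lowercase
```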
Email Headers: MIME Encoding (RFC 2047)
Email headers predate Unicode and must be ASCII. Non-ASCII content is encoded using MIME encoded-words:
=?charset?encoding?encoded_text?=
Two encodings are available:
- B — Base64
- Q — Quoted-Printable (similar to percent-encoding but with _ for space)
# "Subject: Hello World" → same ASCII, no encoding needed
Subject: Hello World
# "Subject: こんにちは" (Japanese greeting)
Subject: =?UTF-8?B?44GT44KT44Gr44Gh44Gv?=
# "Subject: Héllo" — Q encoding
Subject: =?UTF-8?Q?H=C3=A9llo?=
In Python, the email module handles this automatically:
from email.header import Header
from email.mime.text import MIMEText
msg = MIMEText("Body text", 'plain', 'utf-8')
msg['Subject'] = Header("こんにちは、World", 'utf-8')
msg['From'] = "[email protected]"
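Decoding on the receiving side goes through email.header as well; a minimal sketch using the Base64 example from above:

```python
from email.header import decode_header, make_header

raw = "=?UTF-8?B?44GT44KT44Gr44Gh44Gv?="
# decode_header yields (bytes, charset) pairs; make_header reassembles them
parts = decode_header(raw)
assert str(make_header(parts)) == "こんにちは"
```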
Email body content uses MIME Content-Transfer-Encoding. For UTF-8 bodies, use quoted-printable or base64:
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Hello =E4=B8=96=E7=95=8C
Modern email clients support Content-Transfer-Encoding: 8bit for UTF-8 bodies when the SMTP server advertises the 8BITMIME extension, allowing literal UTF-8 without encoding.
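The quoted-printable body above round-trips through Python's quopri module; a small sketch:

```python
import quopri

body = b"Hello =E4=B8=96=E7=95=8C"
# decodestring undoes the =XX escapes, yielding raw UTF-8 bytes
assert quopri.decodestring(body).decode("utf-8") == "Hello 世界"
```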
GraphQL and Unicode
The GraphQL specification defines a source document as a sequence of Unicode code points. String values use JSON-style \uXXXX escapes:
query {
  search(query: "caf\u00E9") {
    results {
      name
    }
  }
}
GraphQL responses are JSON, inheriting all of JSON's Unicode considerations. When building resolvers, normalize strings before comparison and be aware of emoji in user-supplied text:
// GraphQL resolver
const resolvers = {
  Query: {
    searchUsers: (_, { query }) => {
      // Normalize before comparison
      const normalized = query.normalize('NFC');
      return db.users.filter(u =>
        u.name.normalize('NFC').includes(normalized)
      );
    }
  }
};
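The same normalize-before-compare rule applies in any resolver language; a Python sketch with unicodedata:

```python
import unicodedata

# "café" typed two ways: precomposed é versus e + combining acute accent
composed = "caf\u00e9"
decomposed = "cafe\u0301"
assert composed != decomposed  # the raw strings differ
# After NFC normalization, both spellings compare equal
assert unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed)
```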
WebSocket Text Frames
The WebSocket protocol (RFC 6455) distinguishes text frames from binary frames. Text frames MUST contain valid UTF-8 — the spec requires that non-conforming UTF-8 text frames cause the connection to be closed with code 1007 (Invalid frame payload data):
const ws = new WebSocket('wss://example.com/ws');
ws.binaryType = 'arraybuffer'; // binary frames arrive as ArrayBuffer, not Blob
// Text frame — must be valid UTF-8 (JS strings are automatically encoded)
ws.send("Hello 😀 世界");
// Binary frame — use ArrayBuffer or a typed array
const encoder = new TextEncoder();
const bytes = encoder.encode("Hello 😀");
ws.send(bytes.buffer); // sends as binary frame
// Receiving
ws.onmessage = (event) => {
  if (typeof event.data === 'string') {
    // Text frame — already decoded from UTF-8 to JS string
    console.log(event.data);
  } else {
    // Binary frame (ArrayBuffer) — decode manually if needed
    const decoder = new TextDecoder('utf-8');
    console.log(decoder.decode(event.data));
  }
};
On the server side (Node.js with ws library):
const { WebSocketServer } = require('ws');
const wss = new WebSocketServer({ port: 8080 });
wss.on('connection', (ws) => {
  ws.on('message', (data, isBinary) => {
    if (!isBinary) {
      // data is a Buffer containing UTF-8 bytes
      const text = data.toString('utf8');
      console.log(text);
    }
  });
  // Send text — ws library handles UTF-8 encoding
  ws.send("Response 😀");
});
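The validation behind close code 1007 amounts to a strict UTF-8 decode; a Python sketch (a hypothetical helper; real WebSocket libraries perform this check internally):

```python
def text_frame_close_code(payload):
    """Return 1007 if a text frame's payload is not valid UTF-8, else None."""
    try:
        payload.decode("utf-8", errors="strict")
        return None    # valid UTF-8: keep the connection open
    except UnicodeDecodeError:
        return 1007    # RFC 6455 close code: invalid frame payload data

assert text_frame_close_code("Hello 😀 世界".encode("utf-8")) is None
assert text_frame_close_code(b"\xff\xfe not UTF-8") == 1007
```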
The TextEncoder and TextDecoder Web APIs (also available in Node.js) provide the canonical way to convert between JavaScript strings and UTF-8 byte arrays across all these contexts.