How to Handle Unicode in APIs and JSON
JSON is defined as Unicode text and must be encoded in UTF-8, UTF-16, or UTF-32, but many real-world APIs still produce encoding bugs, garbled characters, and incorrectly escaped sequences. This guide explains how to handle Unicode correctly in REST APIs and JSON, including proper escaping, content-type headers, and validation.
JSON has become the lingua franca of web APIs, and its Unicode handling is one of its greatest strengths — and one of its most misunderstood features. RFC 8259 (the current JSON specification) mandates UTF-8 as the default encoding, and every JSON parser must handle the full Unicode range. Yet encoding bugs at API boundaries remain one of the most common sources of garbled text in production systems. This guide covers JSON's Unicode model, escape sequences, encoding headers, and best practices for ensuring clean text across API boundaries.
JSON Is UTF-8 (RFC 8259)
The JSON specification has evolved through several RFCs:
| RFC | Year | Encoding Rule |
|---|---|---|
| RFC 4627 | 2006 | UTF-8, UTF-16, or UTF-32 |
| RFC 7159 | 2014 | UTF-8, UTF-16, or UTF-32 (with UTF-8 as default) |
| RFC 8259 | 2017 | UTF-8 required for open interchange; other encodings only within closed ecosystems |
RFC 8259 states: "JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8." In practice, this means: always use UTF-8 for JSON.
A valid JSON document is a sequence of Unicode code points encoded as UTF-8
bytes. The JSON grammar allows any Unicode character in string values except the
control characters U+0000 through U+001F, the double quote ("), and the
backslash (\), which must be escaped.
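A quick check with Python's stdlib json module shows exactly these characters being escaped:

```python
import json

# Quotes, backslashes, and control characters are escaped in the output;
# everything else may appear literally.
print(json.dumps('tab:\there "quoted" back\\slash'))
# "tab:\there \"quoted\" back\\slash"
```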
JSON Unicode Escape Sequences
JSON defines a Unicode escape syntax using \uXXXX where XXXX is exactly four
hexadecimal digits representing a UTF-16 code unit:
{
  "arrow": "\u2192",
  "cafe": "caf\u00E9",
  "cjk": "\u4E2D\u6587",
  "greeting": "\u3053\u3093\u306B\u3061\u306F"
}
This is equivalent to:
{
  "arrow": "→",
  "cafe": "café",
  "cjk": "中文",
  "greeting": "こんにちは"
}
Both forms are valid JSON and must be parsed identically. A conformant JSON
parser must accept both literal Unicode characters and \uXXXX escapes.
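This equivalence can be verified with any JSON library; in Python, both spellings parse to the same string:

```python
import json

# The escaped and literal forms decode to identical strings.
assert json.loads(r'"\u00E9"') == json.loads('"é"')
assert json.loads(r'"\u4E2D\u6587"') == "中文"
```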
Supplementary Characters (Emoji, Rare CJK)
Characters above U+FFFF cannot be represented by a single \uXXXX escape
because the escape only supports four hex digits (16-bit values). Instead, JSON
uses surrogate pairs — the same mechanism as UTF-16:
{
  "snake": "\uD83D\uDC0D",
  "snake_literal": "🐍"
}
The snake emoji (U+1F40D) is encoded as the surrogate pair \uD83D\uDC0D:
- High surrogate: \uD83D (0xD83D)
- Low surrogate: \uDC0D (0xDC0D)
- Decoded code point: 0x10000 + ((0xD83D - 0xD800) * 0x400) + (0xDC0D - 0xDC00) = 0x10000 + (0x3D * 0x400) + 0x0D = 0x1F40D
A lone surrogate (a \uD83D not followed by a low surrogate) does not represent a
valid Unicode string, and RFC 8259 warns that parser behavior for unpaired
surrogates is unpredictable. Many parsers accept one anyway and substitute
U+FFFD, the replacement character.
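The decoding arithmetic above can be sketched as a small helper (the function name is illustrative, not from any library):

```python
def decode_surrogate_pair(hi: int, lo: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into a code point."""
    assert 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

print(hex(decode_surrogate_pair(0xD83D, 0xDC0D)))  # 0x1f40d — U+1F40D, the snake
```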
Required Escapes
JSON requires these characters to be escaped in strings:
| Character | Escape | Code Point |
|---|---|---|
| Quotation mark | \" | U+0022 |
| Reverse solidus | \\ | U+005C |
| Control chars (U+0000–U+001F) | \uXXXX | — |
Additionally, these convenience escapes are defined:
| Escape | Character | Code Point |
|---|---|---|
| \n | Newline | U+000A |
| \r | Carriage return | U+000D |
| \t | Tab | U+0009 |
| \b | Backspace | U+0008 |
| \f | Form feed | U+000C |
| \/ | Forward slash | U+002F (optional) |
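A round-trip through Python's parser confirms these escapes, including the optional solidus escape:

```python
import json

# Each convenience escape decodes to its character; "\/" is
# accepted by every conformant parser but never required.
assert json.loads(r'"\n\t\b\f\r\/"') == "\n\t\b\f\r/"
```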
Encoding in Different Languages
Python
import json
data = {"name": "日本語", "emoji": "🐍", "arrow": "→"}
# Default: escapes all non-ASCII
print(json.dumps(data))
# {"name": "\u65e5\u672c\u8a9e", "emoji": "\ud83d\udc0d", "arrow": "\u2192"}
# Human-readable: keep Unicode characters as-is
print(json.dumps(data, ensure_ascii=False))
# {"name": "日本語", "emoji": "🐍", "arrow": "→"}
# Reading JSON with Unicode escapes
text = '{"name": "\\u65e5\\u672c\\u8a9e"}'
parsed = json.loads(text)
print(parsed["name"]) # 日本語
JavaScript
const data = { name: "日本語", emoji: "🐍", arrow: "→" };
// JSON.stringify preserves Unicode by default
console.log(JSON.stringify(data));
// {"name":"日本語","emoji":"🐍","arrow":"→"}
// JSON.parse handles \uXXXX escapes automatically
const parsed = JSON.parse('{"name":"\\u65e5\\u672c\\u8a9e"}');
console.log(parsed.name); // 日本語
Go
import (
    "encoding/json"
    "fmt"
)

type Data struct {
    Name  string `json:"name"`
    Emoji string `json:"emoji"`
}
d := Data{Name: "日本語", Emoji: "🐍"}
// Go's json.Marshal escapes some characters by default
b, _ := json.Marshal(d)
fmt.Println(string(b))
// {"name":"日本語","emoji":"🐍"}
// HTML-safe escaping (< > & are escaped)
// To disable: use json.Encoder with SetEscapeHTML(false)
PHP
$data = ['name' => '日本語', 'emoji' => '🐍'];
// Default: escapes non-ASCII
echo json_encode($data);
// {"name":"\u65e5\u672c\u8a9e","emoji":"\ud83d\udc0d"}
// Keep Unicode readable
echo json_encode($data, JSON_UNESCAPED_UNICODE);
// {"name":"日本語","emoji":"🐍"}
// Combined flags for pretty output
echo json_encode($data, JSON_UNESCAPED_UNICODE | JSON_PRETTY_PRINT);
REST API Best Practices
Content-Type Headers
Always include the charset in your Content-Type header — even though RFC 8259
says UTF-8 is the default, many systems still check for it:
Content-Type: application/json; charset=utf-8
Note: the IANA media type registration for application/json (defined in RFC 8259
itself) specifies no charset parameter. In practice, including charset=utf-8
causes no harm and improves interoperability with clients that expect it.
Accept Headers
Clients should request JSON with:
Accept: application/json
Some APIs also support Accept-Charset, but this header is largely obsolete —
modern APIs always serve UTF-8.
Request Bodies
When sending JSON in a request body, ensure your HTTP client sets the encoding correctly:
import requests
data = {"name": "日本語"}
# requests encodes JSON as UTF-8 by default
response = requests.post(
    "https://api.example.com/users",
    json=data,  # automatically sets Content-Type: application/json
)
// fetch API
const response = await fetch("https://api.example.com/users", {
  method: "POST",
  headers: { "Content-Type": "application/json; charset=utf-8" },
  body: JSON.stringify({ name: "日本語" }),
});
BOM (Byte Order Mark)
RFC 8259 explicitly forbids a UTF-8 BOM (the bytes EF BB BF) at the beginning
of a JSON document. However, some Windows-based systems produce JSON files with a
BOM. Robust API servers should strip a leading BOM before parsing:
def strip_bom(data: bytes) -> bytes:
    if data.startswith(b'\xef\xbb\xbf'):
        return data[3:]
    return data
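Alternatively, Python's built-in utf-8-sig codec performs the same strip during decoding:

```python
import json

# "utf-8-sig" removes a leading BOM if present and behaves
# exactly like plain UTF-8 otherwise.
raw = b'\xef\xbb\xbf{"ok": true}'
assert json.loads(raw.decode("utf-8-sig")) == {"ok": True}
```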
GraphQL and Unicode
GraphQL (specified at graphql.github.io/graphql-spec) requires all documents to be valid UTF-8 sequences. String values in GraphQL follow JSON's escape rules:
query {
  user(name: "café") {
    bio
  }
}
# Unicode escapes in GraphQL strings
query {
  user(name: "caf\u00E9") {
    bio
  }
}
GraphQL's transport layer (typically HTTP POST with application/json) inherits
all of JSON's Unicode handling. The query itself is a UTF-8 string inside a JSON
value:
{
  "query": "query { user(name: \"café\") { bio } }",
  "variables": { "name": "日本語" }
}
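Producing such a payload is plain JSON encoding; a minimal sketch in Python (the payload shape follows the standard GraphQL-over-HTTP convention):

```python
import json

payload = {
    "query": 'query { user(name: "café") { bio } }',
    "variables": {"name": "日本語"},
}
# Encode once, as UTF-8 bytes, ready for an HTTP POST body.
body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
assert b"caf\xc3\xa9" in body  # "café" travels as UTF-8, not as \u escapes
```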
Webhooks and Unicode
Webhooks — server-to-server HTTP callbacks — are a frequent source of encoding problems because:
- The sending server may not set Content-Type correctly
- The payload may be encoded as Latin-1 or another encoding instead of UTF-8
- Some webhook providers double-encode Unicode escapes
Defensive Parsing
import json

def parse_webhook(request):
    # Check Content-Type for encoding hint
    content_type = request.headers.get("Content-Type", "")
    # Determine encoding
    if "charset=utf-8" in content_type.lower():
        encoding = "utf-8"
    elif "charset=latin-1" in content_type.lower():
        encoding = "latin-1"
    else:
        encoding = "utf-8"  # default assumption
    # Decode body
    body = request.body.decode(encoding)
    # Strip BOM if present
    if body.startswith('\ufeff'):
        body = body[1:]
    return json.loads(body)
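For the double-encoding case, one defensive heuristic — a sketch, not a general solution, since it assumes the value contains no other backslash escapes or embedded quotes — is to decode a second time when literal \uXXXX sequences survive parsing:

```python
import json
import re

def fix_double_encoded(value: str) -> str:
    # If literal \uXXXX sequences survive a JSON parse, the producer
    # probably escaped the backslashes; decode the string once more.
    if re.search(r'\\u[0-9a-fA-F]{4}', value):
        return json.loads(f'"{value}"')
    return value

print(fix_double_encoded("\\u65e5\\u672c\\u8a9e"))  # 日本語
```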
Unicode Normalization in APIs
APIs that accept user input should normalize Unicode text to prevent duplicate entries and comparison failures:
import unicodedata
def normalize_input(text: str) -> str:
    # Normalize to NFC before storage.
    return unicodedata.normalize("NFC", text)
# Without normalization:
# "café" (NFC, 4 chars) != "café" (NFD, 5 chars) — different byte sequences
# With normalization:
# Both become "café" (NFC, 4 chars) — identical
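The length difference described in the comments is directly observable:

```python
import unicodedata

nfd = "cafe\u0301"                       # "e" + combining acute accent (5 code points)
nfc = unicodedata.normalize("NFC", nfd)  # precomposed "é" (4 code points)
assert len(nfd) == 5 and len(nfc) == 4
assert nfc == "caf\u00E9"
```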
API Design Recommendations
- Normalize on input — apply NFC normalization when receiving text from clients
- Validate encoding — reject requests that are not valid UTF-8
- Document your encoding — state "All text fields are UTF-8, NFC normalized" in your API documentation
- Use ensure_ascii=False — produce human-readable JSON responses
- Test with edge cases — emoji, RTL text, CJK characters, combining characters, zero-width joiners
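The validation step can be as small as a strict decode at the boundary (the helper name and the mapping to HTTP 400 are illustrative):

```python
def require_utf8(raw: bytes) -> str:
    """Strictly decode a request body, rejecting invalid UTF-8."""
    try:
        return raw.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as exc:
        # In an API server, map this to a 400 Bad Request response.
        raise ValueError(f"request body is not valid UTF-8: {exc}") from exc

print(require_utf8("日本語".encode("utf-8")))  # 日本語
```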
Common Mistakes
1. Assuming ASCII-Only JSON
# WRONG — breaks on non-ASCII characters
response_text = response.content.decode("ascii")
# CORRECT
response_text = response.content.decode("utf-8")
2. Double-Encoding Unicode
# WRONG — escapes are encoded as literal characters
text = json.dumps({"name": "日本語"}) # contains \u escapes
double = json.dumps({"data": text}) # escapes the backslashes
# The result has \\u65e5 instead of \u65e5
3. Ignoring Surrogate Pairs
# WRONG — lone surrogates in JSON
bad_json = '{"text": "\\ud83d"}' # lone high surrogate
# Some parsers will accept this; others will throw an error
# Always use complete surrogate pairs or literal characters
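Python illustrates the lenient side of this split: its stdlib parser accepts the lone surrogate, but the resulting string cannot be encoded back to UTF-8:

```python
import json

s = json.loads('{"text": "\\ud83d"}')["text"]  # parses without error
try:
    s.encode("utf-8")
except UnicodeEncodeError:
    print("lone surrogate cannot be encoded as UTF-8")
```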
4. Mixing Encodings Across Services
A microservices architecture where Service A sends Latin-1, Service B expects UTF-8, and Service C outputs Shift-JIS is a recipe for mojibake. Standardize on UTF-8 everywhere.
Quick Reference
| Task | Method |
|---|---|
| JSON encode (readable) | json.dumps(data, ensure_ascii=False) |
| JSON encode (escaped) | json.dumps(data) (default) |
| Set Content-Type | Content-Type: application/json; charset=utf-8 |
| Normalize input | unicodedata.normalize("NFC", text) |
| Validate UTF-8 | text.encode("utf-8").decode("utf-8") (round-trip) |
| Strip BOM | Remove leading \xEF\xBB\xBF bytes |
| Handle surrogate pairs | Use library JSON parser (handles automatically) |
JSON's UTF-8 foundation makes it an excellent format for international data exchange. The key is consistency: use UTF-8 everywhere, normalize on input, validate at boundaries, and test with text from every script your users might send. Most JSON encoding bugs are not in the format itself — they are in the assumptions developers make about the text flowing through it.