How to Handle Unicode in APIs and JSON
JSON is defined as Unicode text and must be encoded in UTF-8, UTF-16, or UTF-32, but many real-world APIs still produce encoding bugs, garbled characters, and incorrectly escaped sequences. This guide explains how to handle Unicode correctly in REST APIs and JSON, including proper escaping, content-type headers, and validation.
JSON has become the lingua franca of web APIs, and its Unicode handling is one of its greatest strengths — and one of its most misunderstood features. RFC 8259 (the current JSON specification) mandates UTF-8 as the default encoding, and every JSON parser must handle the full Unicode range. Yet encoding bugs at API boundaries remain one of the most common sources of garbled text in production systems. This guide covers JSON's Unicode model, escape sequences, encoding headers, and best practices for ensuring clean text across API boundaries.
JSON Is UTF-8 (RFC 8259)
The JSON specification has evolved through several RFCs:
| RFC | Year | Encoding Rule |
|---|---|---|
| RFC 4627 | 2006 | UTF-8, UTF-16, or UTF-32 |
| RFC 7159 | 2014 | UTF-8, UTF-16, or UTF-32 (with UTF-8 as default) |
| RFC 8259 | 2017 | UTF-8 required for open interchange; other encodings only within closed ecosystems |
RFC 8259 states: "JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8." In practice, this means: always use UTF-8 for JSON.
A valid JSON document is a sequence of Unicode code points encoded as UTF-8
bytes. The JSON grammar allows any Unicode character in string values except the
control characters U+0000 through U+001F, the double quote ("), and the
backslash (\), which must be escaped.
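A quick check with Python's stdlib json module shows exactly these characters being escaped:

```python
import json

# Quotes, backslashes, and control characters are escaped in the output;
# everything else may appear literally.
print(json.dumps('tab:\there "quoted" back\\slash'))
# "tab:\there \"quoted\" back\\slash"
```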
JSON Unicode Escape Sequences
JSON defines a Unicode escape syntax using \uXXXX where XXXX is exactly four
hexadecimal digits representing a UTF-16 code unit:
{
  "arrow": "\u2192",
  "cafe": "caf\u00E9",
  "cjk": "\u4E2D\u6587",
  "greeting": "\u3053\u3093\u306B\u3061\u306F"
}
This is equivalent to:
{
  "arrow": "→",
  "cafe": "café",
  "cjk": "中文",
  "greeting": "こんにちは"
}
Both forms are valid JSON and must be parsed identically. A conformant JSON
parser must accept both literal Unicode characters and \uXXXX escapes.
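This equivalence can be verified with any JSON library; in Python, both spellings parse to the same string:

```python
import json

# The escaped and literal forms decode to identical strings.
assert json.loads(r'"\u00E9"') == json.loads('"é"')
assert json.loads(r'"\u4E2D\u6587"') == "中文"
```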
Supplementary Characters (Emoji, Rare CJK)
Characters above U+FFFF cannot be represented by a single \uXXXX escape
because the escape only supports four hex digits (16-bit values). Instead, JSON
uses surrogate pairs — the same mechanism as UTF-16:
{
  "snake": "\uD83D\uDC0D",
  "snake_literal": "🐍"
}
The snake emoji (U+1F40D) is encoded as the surrogate pair \uD83D\uDC0D:
- High surrogate: \uD83D (0xD83D)
- Low surrogate: \uDC0D (0xDC0D)
- Decoded code point: 0x10000 + ((0xD83D - 0xD800) * 0x400) + (0xDC0D - 0xDC00) = 0x10000 + (0x3D * 0x400) + 0x0D = 0x1F40D
A lone surrogate (a \uD83D not followed by a low surrogate) does not represent a
valid Unicode string, and RFC 8259 warns that parser behavior for unpaired
surrogates is unpredictable. Many parsers accept one anyway and substitute
U+FFFD, the replacement character.
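The decoding arithmetic above can be sketched as a small helper (the function name is illustrative, not from any library):

```python
def decode_surrogate_pair(hi: int, lo: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into a code point."""
    assert 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

print(hex(decode_surrogate_pair(0xD83D, 0xDC0D)))  # 0x1f40d — U+1F40D, the snake
```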
Required Escapes
JSON requires these characters to be escaped in strings:
| Character | Escape | Code Point |
|---|---|---|
| Quotation mark | \" | U+0022 |
| Reverse solidus | \\ | U+005C |
| Control chars (U+0000–U+001F) | \uXXXX | — |
Additionally, these convenience escapes are defined:
| Escape | Character | Code Point |
|---|---|---|
| \n | Newline | U+000A |
| \r | Carriage return | U+000D |
| \t | Tab | U+0009 |
| \b | Backspace | U+0008 |
| \f | Form feed | U+000C |
| \/ | Forward slash | U+002F (optional) |
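A round-trip through Python's parser confirms these escapes, including the optional solidus escape:

```python
import json

# Each convenience escape decodes to its character; "\/" is
# accepted by every conformant parser but never required.
assert json.loads(r'"\n\t\b\f\r\/"') == "\n\t\b\f\r/"
```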
Encoding in Different Languages
Python
import json
data = {"name": "日本語", "emoji": "🐍", "arrow": "→"}
# Default: escapes all non-ASCII
print(json.dumps(data))
# {"name": "\u65e5\u672c\u8a9e", "emoji": "\ud83d\udc0d", "arrow": "\u2192"}
# Human-readable: keep Unicode characters as-is
print(json.dumps(data, ensure_ascii=False))
# {"name": "日本語", "emoji": "🐍", "arrow": "→"}
# Reading JSON with Unicode escapes
text = '{"name": "\\u65e5\\u672c\\u8a9e"}'
parsed = json.loads(text)
print(parsed["name"]) # 日本語
JavaScript
const data = { name: "日本語", emoji: "🐍", arrow: "→" };
// JSON.stringify preserves Unicode by default
console.log(JSON.stringify(data));
// {"name":"日本語","emoji":"🐍","arrow":"→"}
// JSON.parse handles \uXXXX escapes automatically
const parsed = JSON.parse('{"name":"\\u65e5\\u672c\\u8a9e"}');
console.log(parsed.name); // 日本語
Go
import (
    "encoding/json"
    "fmt"
)

type Data struct {
    Name  string `json:"name"`
    Emoji string `json:"emoji"`
}
d := Data{Name: "日本語", Emoji: "🐍"}
// Go's json.Marshal escapes some characters by default
b, _ := json.Marshal(d)
fmt.Println(string(b))
// {"name":"日本語","emoji":"🐍"}
// HTML-safe escaping (< > & are escaped)
// To disable: use json.Encoder with SetEscapeHTML(false)
PHP
$data = ['name' => '日本語', 'emoji' => '🐍'];
// Default: escapes non-ASCII
echo json_encode($data);
// {"name":"\u65e5\u672c\u8a9e","emoji":"\ud83d\udc0d"}
// Keep Unicode readable
echo json_encode($data, JSON_UNESCAPED_UNICODE);
// {"name":"日本語","emoji":"🐍"}
// Combined flags for pretty output
echo json_encode($data, JSON_UNESCAPED_UNICODE | JSON_PRETTY_PRINT);
REST API Best Practices
Content-Type Headers
Always include the charset in your Content-Type header — even though RFC 8259
says UTF-8 is the default, many systems still check for it:
Content-Type: application/json; charset=utf-8
Note: the IANA media type registration for application/json (defined in RFC 8259
itself) specifies no charset parameter. In practice, including charset=utf-8
causes no harm and improves interoperability with clients that expect it.
Accept Headers
Clients should request JSON with:
Accept: application/json
Some APIs also support Accept-Charset, but this header is largely obsolete —
modern APIs always serve UTF-8.
Request Bodies
When sending JSON in a request body, ensure your HTTP client sets the encoding correctly:
import requests
data = {"name": "日本語"}
# requests encodes JSON as UTF-8 by default
response = requests.post(
    "https://api.example.com/users",
    json=data,  # automatically sets Content-Type: application/json
)
// fetch API
const response = await fetch("https://api.example.com/users", {
  method: "POST",
  headers: { "Content-Type": "application/json; charset=utf-8" },
  body: JSON.stringify({ name: "日本語" }),
});
BOM (Byte Order Mark)
RFC 8259 explicitly forbids a UTF-8 BOM (the bytes EF BB BF) at the beginning
of a JSON document. However, some Windows-based systems produce JSON files with a
BOM. Robust API servers should strip a leading BOM before parsing:
def strip_bom(data: bytes) -> bytes:
    if data.startswith(b'\xef\xbb\xbf'):
        return data[3:]
    return data
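Alternatively, Python's built-in utf-8-sig codec performs the same strip during decoding:

```python
import json

# "utf-8-sig" removes a leading BOM if present and behaves
# exactly like plain UTF-8 otherwise.
raw = b'\xef\xbb\xbf{"ok": true}'
assert json.loads(raw.decode("utf-8-sig")) == {"ok": True}
```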
GraphQL and Unicode
GraphQL (specified at graphql.github.io/graphql-spec) requires all documents to be valid UTF-8 sequences. String values in GraphQL follow JSON's escape rules:
query {
  user(name: "café") {
    bio
  }
}
# Unicode escapes in GraphQL strings
query {
  user(name: "caf\u00E9") {
    bio
  }
}
GraphQL's transport layer (typically HTTP POST with application/json) inherits
all of JSON's Unicode handling. The query itself is a UTF-8 string inside a JSON
value:
{
  "query": "query { user(name: \"café\") { bio } }",
  "variables": { "name": "日本語" }
}
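Producing such a payload is plain JSON encoding; a minimal sketch in Python (the payload shape follows the standard GraphQL-over-HTTP convention):

```python
import json

payload = {
    "query": 'query { user(name: "café") { bio } }',
    "variables": {"name": "日本語"},
}
# Encode once, as UTF-8 bytes, ready for an HTTP POST body.
body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
assert b"caf\xc3\xa9" in body  # "café" travels as UTF-8, not as \u escapes
```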
Webhooks and Unicode
Webhooks — server-to-server HTTP callbacks — are a frequent source of encoding problems because:
- The sending server may not set Content-Type correctly
- The payload may be encoded as Latin-1 or another encoding instead of UTF-8
- Some webhook providers double-encode Unicode escapes
Defensive Parsing
import json

def parse_webhook(request):
    # Check Content-Type for encoding hint
    content_type = request.headers.get("Content-Type", "")
    # Determine encoding
    if "charset=utf-8" in content_type.lower():
        encoding = "utf-8"
    elif "charset=latin-1" in content_type.lower():
        encoding = "latin-1"
    else:
        encoding = "utf-8"  # default assumption
    # Decode body
    body = request.body.decode(encoding)
    # Strip BOM if present
    if body.startswith('\ufeff'):
        body = body[1:]
    return json.loads(body)
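For the double-encoding case, one defensive heuristic — a sketch, not a general solution, since it assumes the value contains no other backslash escapes or embedded quotes — is to decode a second time when literal \uXXXX sequences survive parsing:

```python
import json
import re

def fix_double_encoded(value: str) -> str:
    # If literal \uXXXX sequences survive a JSON parse, the producer
    # probably escaped the backslashes; decode the string once more.
    if re.search(r'\\u[0-9a-fA-F]{4}', value):
        return json.loads(f'"{value}"')
    return value

print(fix_double_encoded("\\u65e5\\u672c\\u8a9e"))  # 日本語
```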
Unicode Normalization in APIs
APIs that accept user input should normalize Unicode text to prevent duplicate entries and comparison failures:
import unicodedata
def normalize_input(text: str) -> str:
    # Normalize to NFC before storage.
    return unicodedata.normalize("NFC", text)
# Without normalization:
# "café" (NFC, 4 chars) != "café" (NFD, 5 chars) — different byte sequences
# With normalization:
# Both become "café" (NFC, 4 chars) — identical
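The length difference described in the comments is directly observable:

```python
import unicodedata

nfd = "cafe\u0301"                       # "e" + combining acute accent (5 code points)
nfc = unicodedata.normalize("NFC", nfd)  # precomposed "é" (4 code points)
assert len(nfd) == 5 and len(nfc) == 4
assert nfc == "caf\u00E9"
```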
API Design Recommendations
- Normalize on input — apply NFC normalization when receiving text from clients
- Validate encoding — reject requests that are not valid UTF-8
- Document your encoding — state "All text fields are UTF-8, NFC normalized" in your API documentation
- Use ensure_ascii=False — produce human-readable JSON responses
- Test with edge cases — emoji, RTL text, CJK characters, combining characters, zero-width joiners
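The validation step can be as small as a strict decode at the boundary (the helper name and the mapping to HTTP 400 are illustrative):

```python
def require_utf8(raw: bytes) -> str:
    """Strictly decode a request body, rejecting invalid UTF-8."""
    try:
        return raw.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as exc:
        # In an API server, map this to a 400 Bad Request response.
        raise ValueError(f"request body is not valid UTF-8: {exc}") from exc

print(require_utf8("日本語".encode("utf-8")))  # 日本語
```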
Common Mistakes
1. Assuming ASCII-Only JSON
# WRONG — breaks on non-ASCII characters
response_text = response.content.decode("ascii")
# CORRECT
response_text = response.content.decode("utf-8")
2. Double-Encoding Unicode
# WRONG — escapes are encoded as literal characters
text = json.dumps({"name": "日本語"}) # contains \u escapes
double = json.dumps({"data": text}) # escapes the backslashes
# The result has \\u65e5 instead of \u65e5
3. Ignoring Surrogate Pairs
# WRONG — lone surrogates in JSON
bad_json = '{"text": "\\ud83d"}' # lone high surrogate
# Some parsers will accept this; others will throw an error
# Always use complete surrogate pairs or literal characters
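Python illustrates the lenient side of this split: its stdlib parser accepts the lone surrogate, but the resulting string cannot be encoded back to UTF-8:

```python
import json

s = json.loads('{"text": "\\ud83d"}')["text"]  # parses without error
try:
    s.encode("utf-8")
except UnicodeEncodeError:
    print("lone surrogate cannot be encoded as UTF-8")
```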
4. Mixing Encodings Across Services
A microservices architecture where Service A sends Latin-1, Service B expects UTF-8, and Service C outputs Shift-JIS is a recipe for mojibake. Standardize on UTF-8 everywhere.
Quick Reference
| Task | Method |
|---|---|
| JSON encode (readable) | json.dumps(data, ensure_ascii=False) |
| JSON encode (escaped) | json.dumps(data) (default) |
| Set Content-Type | Content-Type: application/json; charset=utf-8 |
| Normalize input | unicodedata.normalize("NFC", text) |
| Validate UTF-8 | text.encode("utf-8").decode("utf-8") (round-trip) |
| Strip BOM | Remove leading \xEF\xBB\xBF bytes |
| Handle surrogate pairs | Use library JSON parser (handles automatically) |
JSON's UTF-8 foundation makes it an excellent format for international data exchange. The key is consistency: use UTF-8 everywhere, normalize on input, validate at boundaries, and test with text from every script your users might send. Most JSON encoding bugs are not in the format itself — they are in the assumptions developers make about the text flowing through it.