💻 Unicode in Code

How to Handle Unicode in APIs and JSON

JSON is defined as Unicode text, and the current specification (RFC 8259) requires UTF-8 for open interchange, yet many real-world APIs still produce encoding bugs, garbled characters, and incorrectly escaped sequences. This guide explains how to handle Unicode correctly in REST APIs and JSON, including proper escaping, content-type headers, and validation.


JSON has become the lingua franca of web APIs, and its Unicode handling is one of its greatest strengths as well as one of its most misunderstood features. RFC 8259 (the current JSON specification) requires UTF-8 for any JSON exchanged outside a closed ecosystem, and every conformant parser must handle the full Unicode range. Yet encoding bugs at API boundaries remain one of the most common sources of garbled text in production systems. This guide covers JSON's Unicode model, escape sequences, encoding headers, and best practices for ensuring clean text across API boundaries.

JSON Is UTF-8 (RFC 8259)

The JSON specification has evolved through several RFCs:

RFC        Year   Encoding Rule
RFC 4627   2006   UTF-8, UTF-16, or UTF-32
RFC 7159   2014   UTF-8, UTF-16, or UTF-32 (UTF-8 recommended as default)
RFC 8259   2017   UTF-8 required for open interchange; other encodings allowed only within closed ecosystems

RFC 8259 states: "JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8." In practice, this means: always use UTF-8 for JSON.

A valid JSON document is a sequence of Unicode code points encoded as UTF-8 bytes. The JSON grammar allows any Unicode character in string values except the control characters U+0000 through U+001F, the double quote ("), and the backslash (\), which must be escaped.
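
A strict parser rejects raw (unescaped) control characters inside strings. A quick check with Python's json module:

import json

# A raw U+000A inside the string value must be escaped as \n
try:
    json.loads('{"bad": "line1\nline2"}')
except json.JSONDecodeError as err:
    print(err)  # Invalid control character ...

json.loads('{"ok": "line1\\nline2"}')  # the escaped form parses fine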

JSON Unicode Escape Sequences

JSON defines a Unicode escape syntax using \uXXXX where XXXX is exactly four hexadecimal digits representing a UTF-16 code unit:

{
    "arrow": "\u2192",
    "cafe": "caf\u00E9",
    "cjk": "\u4E2D\u6587",
    "greeting": "\u3053\u3093\u306B\u3061\u306F"
}

This is equivalent to:

{
    "arrow": "→",
    "cafe": "café",
    "cjk": "中文",
    "greeting": "こんにちは"
}

Both forms are valid JSON and must be parsed identically. A conformant JSON parser must accept both literal Unicode characters and \uXXXX escapes.
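
This equivalence is easy to verify; for example, in Python:

import json

# The escaped and literal forms decode to the same string
assert json.loads('{"arrow": "\\u2192"}') == json.loads('{"arrow": "→"}')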

Supplementary Characters (Emoji, Rare CJK)

Characters above U+FFFF cannot be represented by a single \uXXXX escape because the escape only supports four hex digits (16-bit values). Instead, JSON uses surrogate pairs — the same mechanism as UTF-16:

{
    "snake": "\uD83D\uDC0D",
    "snake_literal": "🐍"
}

The snake emoji (U+1F40D) is encoded as the surrogate pair \uD83D\uDC0D:

  • High surrogate: \uD83D (0xD83D)
  • Low surrogate: \uDC0D (0xDC0D)
  • Decoded code point: 0x10000 + (0x3D * 0x400) + 0x0D = 0x1F40D

A lone surrogate (a \uD83D not followed by a low surrogate) does not encode a valid Unicode code point. RFC 8259 warns that parser behavior for unpaired surrogates is unpredictable: some parsers reject the input, while others substitute the replacement character (U+FFFD).
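
The decoding arithmetic above translates directly into code; a minimal sketch (the function name is illustrative):

def decode_surrogate_pair(high: int, low: int) -> int:
    # high in U+D800–U+DBFF, low in U+DC00–U+DFFF
    assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print(hex(decode_surrogate_pair(0xD83D, 0xDC0D)))  # 0x1f40d (🐍)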

Required Escapes

JSON requires these characters to be escaped in strings:

Character            Escape   Code Point
Quotation mark       \"       U+0022
Reverse solidus      \\       U+005C
Control characters   \uXXXX   U+0000–U+001F

Additionally, these convenience escapes are defined:

Escape   Character         Code Point
\n       Newline           U+000A
\r       Carriage return   U+000D
\t       Tab               U+0009
\b       Backspace         U+0008
\f       Form feed         U+000C
\/       Forward slash     U+002F (optional)
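
Standard library encoders apply these escapes automatically; for example, in Python:

import json

# Control characters come out as their short escapes
print(json.dumps({"note": "line1\nline2\ttabbed"}))
# {"note": "line1\nline2\ttabbed"}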

Encoding in Different Languages

Python

import json

data = {"name": "日本語", "emoji": "🐍", "arrow": "→"}

# Default: escapes all non-ASCII
print(json.dumps(data))
# {"name": "\u65e5\u672c\u8a9e", "emoji": "\ud83d\udc0d", "arrow": "\u2192"}

# Human-readable: keep Unicode characters as-is
print(json.dumps(data, ensure_ascii=False))
# {"name": "日本語", "emoji": "🐍", "arrow": "→"}

# Reading JSON with Unicode escapes
text = '{"name": "\\u65e5\\u672c\\u8a9e"}'
parsed = json.loads(text)
print(parsed["name"])   # 日本語

JavaScript

const data = { name: "日本語", emoji: "🐍", arrow: "→" };

// JSON.stringify preserves Unicode by default
console.log(JSON.stringify(data));
// {"name":"日本語","emoji":"🐍","arrow":"→"}

// JSON.parse handles \uXXXX escapes automatically
const parsed = JSON.parse('{"name":"\\u65e5\\u672c\\u8a9e"}');
console.log(parsed.name);   // 日本語

Go

import "encoding/json"

type Data struct {
    Name  string `json:"name"`
    Emoji string `json:"emoji"`
}

d := Data{Name: "日本語", Emoji: "🐍"}

// Go's json.Marshal escapes some characters by default
b, _ := json.Marshal(d)
fmt.Println(string(b))
// {"name":"日本語","emoji":"🐍"}

// HTML-safe escaping (< > & are escaped)
// To disable: use json.Encoder with SetEscapeHTML(false)

PHP

$data = ['name' => '日本語', 'emoji' => '🐍'];

// Default: escapes non-ASCII
echo json_encode($data);
// {"name":"\u65e5\u672c\u8a9e","emoji":"\ud83d\udc0d"}

// Keep Unicode readable
echo json_encode($data, JSON_UNESCAPED_UNICODE);
// {"name":"日本語","emoji":"🐍"}

// Combined flags for pretty output
echo json_encode($data, JSON_UNESCAPED_UNICODE | JSON_PRETTY_PRINT);

REST API Best Practices

Content-Type Headers

Always include the charset in your Content-Type header. Even though RFC 8259 makes UTF-8 mandatory, many systems still check for an explicit charset:

Content-Type: application/json; charset=utf-8

Note: the IANA media type registration for application/json (defined in RFC 8259 itself) specifies no charset parameter, so the parameter has no official meaning. In practice, including charset=utf-8 causes no harm and improves interoperability with clients that expect it.
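
On the server side, setting the header explicitly is one line in most frameworks; a minimal sketch using Flask (the route and payload are illustrative):

import json

from flask import Flask, Response

app = Flask(__name__)

@app.route("/users")
def users():
    body = json.dumps({"name": "日本語"}, ensure_ascii=False)
    return Response(body, content_type="application/json; charset=utf-8")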

Accept Headers

Clients should request JSON with:

Accept: application/json

Some APIs also support Accept-Charset, but this header is largely obsolete — modern APIs always serve UTF-8.
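
For example, with Python's requests library:

import requests

# Explicit Accept header; requests decodes the UTF-8 body transparently
resp = requests.get(
    "https://api.example.com/users",
    headers={"Accept": "application/json"},
)
data = resp.json()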

Request Bodies

When sending JSON in a request body, ensure your HTTP client sets the encoding correctly:

import requests

data = {"name": "日本語"}

# requests encodes JSON as UTF-8 by default
response = requests.post(
    "https://api.example.com/users",
    json=data,  # automatically sets Content-Type: application/json
)

// JavaScript: the same POST with the fetch API
const response = await fetch("https://api.example.com/users", {
    method: "POST",
    headers: { "Content-Type": "application/json; charset=utf-8" },
    body: JSON.stringify({ name: "日本語" }),
});

BOM (Byte Order Mark)

RFC 8259 explicitly forbids a UTF-8 BOM (the bytes EF BB BF) at the beginning of a JSON document. However, some Windows-based systems produce JSON files with a BOM. Robust API servers should strip a leading BOM before parsing:

def strip_bom(data: bytes) -> bytes:
    if data.startswith(b'\xef\xbb\xbf'):
        return data[3:]
    return data
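
Python's utf-8-sig codec does the same thing during decoding:

raw = b'\xef\xbb\xbf{"ok": true}'
text = raw.decode("utf-8-sig")  # a leading BOM is removed automatically
# text == '{"ok": true}'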

GraphQL and Unicode

GraphQL (specified at graphql.github.io/graphql-spec) defines documents as sequences of Unicode code points, conventionally transmitted as UTF-8. String values in GraphQL follow JSON-style escape rules:

query {
    user(name: "café") {
        bio
    }
}

# Unicode escapes in GraphQL strings
query {
    user(name: "caf\u00E9") {
        bio
    }
}

GraphQL's transport layer (typically HTTP POST with application/json) inherits all of JSON's Unicode handling. The query itself is a UTF-8 string inside a JSON value:

{
    "query": "query { user(name: \"café\") { bio } }",
    "variables": { "name": "日本語" }
}
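
A minimal sketch of sending such a request from Python (the endpoint and query are illustrative):

import requests

payload = {
    "query": "query($name: String!) { user(name: $name) { bio } }",
    "variables": {"name": "日本語"},
}

# requests serializes the payload as UTF-8 JSON
resp = requests.post("https://api.example.com/graphql", json=payload)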

Webhooks and Unicode

Webhooks — server-to-server HTTP callbacks — are a frequent source of encoding problems because:

  1. The sending server may not set Content-Type correctly
  2. The payload may be encoded as Latin-1 or another encoding instead of UTF-8
  3. Some webhook providers double-encode Unicode escapes

Defensive Parsing

import json

def parse_webhook(request):
    # Check Content-Type for encoding hint
    content_type = request.headers.get("Content-Type", "")

    # Determine encoding
    if "charset=utf-8" in content_type.lower():
        encoding = "utf-8"
    elif "charset=latin-1" in content_type.lower():
        encoding = "latin-1"
    else:
        encoding = "utf-8"  # default assumption

    # Decode body
    body = request.body.decode(encoding)

    # Strip BOM if present
    if body.startswith('\ufeff'):
        body = body[1:]

    return json.loads(body)

Unicode Normalization in APIs

APIs that accept user input should normalize Unicode text to prevent duplicate entries and comparison failures:

import unicodedata

def normalize_input(text: str) -> str:
    # Normalize to NFC before storage.
    return unicodedata.normalize("NFC", text)

# Without normalization:
# "café" (NFC, 4 chars) != "café" (NFD, 5 chars) — different byte sequences
# With normalization:
# Both become "café" (NFC, 4 chars) — identical

API Design Recommendations

  1. Normalize on input — apply NFC normalization when receiving text from clients
  2. Validate encoding — reject requests that are not valid UTF-8 (see the sketch after this list)
  3. Document your encoding — state "All text fields are UTF-8, NFC normalized" in your API documentation
  4. Use ensure_ascii=False — produce human-readable JSON responses
  5. Test with edge cases — emoji, RTL text, CJK characters, combining characters, zero-width joiners
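
A minimal sketch of the validation step from item 2 (the function name and error type are illustrative):

def validate_utf8(raw: bytes) -> str:
    # Reject bodies that are not valid UTF-8 instead of guessing an encoding
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        raise ValueError(f"request body is not valid UTF-8: {err}") from err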

Common Mistakes

1. Assuming ASCII-Only JSON

# WRONG — breaks on non-ASCII characters
response_text = response.content.decode("ascii")

# CORRECT
response_text = response.content.decode("utf-8")

2. Double-Encoding Unicode

# WRONG — encoding an already-encoded JSON string escapes its backslashes
text = json.dumps({"name": "日本語"})  # contains \u escapes
double = json.dumps({"data": text})     # escapes the backslashes

# The result has \\u65e5 instead of \u65e5
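
The fix is to build the full structure first and encode exactly once:

# CORRECT: nest the object itself, not its JSON string
payload = json.dumps({"data": {"name": "日本語"}}, ensure_ascii=False)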

3. Ignoring Surrogate Pairs

# WRONG — lone surrogates in JSON
bad_json = '{"text": "\\ud83d"}'       # lone high surrogate

# Some parsers will accept this; others will throw an error
# Always use complete surrogate pairs or literal characters

4. Mixing Encodings Across Services

A microservices architecture where Service A sends Latin-1, Service B expects UTF-8, and Service C outputs Shift-JIS is a recipe for mojibake. Standardize on UTF-8 everywhere.

Quick Reference

Task                     Method
JSON encode (readable)   json.dumps(data, ensure_ascii=False)
JSON encode (escaped)    json.dumps(data) (default)
Set Content-Type         Content-Type: application/json; charset=utf-8
Normalize input          unicodedata.normalize("NFC", text)
Validate UTF-8           raw_bytes.decode("utf-8") (raises UnicodeDecodeError on invalid input)
Strip BOM                Remove leading \xEF\xBB\xBF bytes (or decode with "utf-8-sig")
Handle surrogate pairs   Use a library JSON parser (handled automatically)

JSON's UTF-8 foundation makes it an excellent format for international data exchange. The key is consistency: use UTF-8 everywhere, normalize on input, validate at boundaries, and test with text from every script your users might send. Most JSON encoding bugs are not in the format itself — they are in the assumptions developers make about the text flowing through it.
