Unicode in XML and JSON
XML and JSON are the two dominant data interchange formats on the web and in enterprise systems. Both have explicit rules for how Unicode text is represented, escaped, and transmitted. Understanding these rules is critical for building systems that correctly handle multilingual data, special characters, and emoji. This guide covers the Unicode encoding requirements of XML and JSON, their escape mechanisms, and the common pitfalls developers encounter.
XML and Unicode
XML's encoding declaration
XML 1.0 supports UTF-8 and UTF-16 as its primary encodings (every conforming parser must accept both). The encoding is declared in the XML declaration:
```xml
<?xml version="1.0" encoding="UTF-8"?>
```
If no encoding is declared, XML parsers assume UTF-8 (or UTF-16 if a BOM is present).
| Encoding Declaration | Parser Behavior |
|---|---|
| `encoding="UTF-8"` | Parse as UTF-8 |
| `encoding="UTF-16"` | Parse as UTF-16 |
| `encoding="ISO-8859-1"` | Parse as Latin-1 |
| No declaration + no BOM | Assume UTF-8 |
| No declaration + UTF-16 BOM | Parse as UTF-16 |
Best practice: Always declare encoding="UTF-8" and actually encode the file as
UTF-8. This eliminates ambiguity and maximizes compatibility.
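A minimal sketch of this practice using Python's standard `xml.etree.ElementTree` (the element name is illustrative): `write()` emits the declaration and encodes the output in one step, so the declared and actual encodings cannot drift apart.

```python
import io
import xml.etree.ElementTree as ET

# Build a small document containing non-ASCII text.
root = ET.Element("city")
root.text = "München"

# xml_declaration=True emits <?xml version='1.0' encoding='UTF-8'?>;
# the bytes that follow are actually UTF-8, so declaration and
# content agree by construction.
buf = io.BytesIO()
ET.ElementTree(root).write(buf, encoding="UTF-8", xml_declaration=True)

data = buf.getvalue()
```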
Character references in XML
XML provides two types of character references for inserting Unicode characters that cannot (or should not) be typed directly:
| Type | Syntax | Example | Result |
|---|---|---|---|
| Decimal | `&#dddd;` | `&#169;` | © (copyright) |
| Hexadecimal | `&#xhhhh;` | `&#xA9;` | © (copyright) |
| Named entity | `&name;` | `&amp;` | & |
Decimal and hex references can express any Unicode code point:
| Reference | Code Point | Character |
|---|---|---|
| `&#65;` | U+0041 | A |
| `&#x2603;` | U+2603 | Snowman |
| `&#x1F600;` | U+1F600 | Grinning face |
| `&#x4E2D;` | U+4E2D | Chinese "middle" |
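To see the references resolved, a short sketch using Python's standard `xml.etree.ElementTree` (the `<msg>` element is illustrative); the parser replaces each reference with the actual character:

```python
import xml.etree.ElementTree as ET

# Decimal and hex references are resolved during parsing,
# so the resulting text contains the real Unicode characters.
xml = '<msg>&#65; &#x2603; &#x1F600; &#x4E2D;</msg>'
root = ET.fromstring(xml)
# root.text == 'A ☃ 😀 中'
```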
Predefined XML entities
XML defines only five named entities:
| Entity | Character | When Required |
|---|---|---|
| `&lt;` | `<` | Always in text content |
| `&gt;` | `>` | Recommended in text content |
| `&amp;` | `&` | Always |
| `&apos;` | `'` | In attribute values with single quotes |
| `&quot;` | `"` | In attribute values with double quotes |
Unlike HTML, XML does not define entities like `&copy;` or `&mdash;`. You must use numeric references (`&#169;`, `&#8212;`) or define custom entities in a DTD.
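Python's standard `xml.sax.saxutils` provides helpers that cover exactly these predefined entities; a brief sketch:

```python
from xml.sax.saxutils import escape, quoteattr

# escape() rewrites the always-required entities in text content:
escape('Tom & Jerry <cartoon>')   # 'Tom &amp; Jerry &lt;cartoon&gt;'

# quoteattr() returns a fully quoted attribute value, choosing a
# quote character and escaping the content as needed:
quoteattr('Tom & Jerry')          # '"Tom &amp; Jerry"'
```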
Restricted characters in XML
XML 1.0 restricts which Unicode characters can appear, even with escaping:
| Range | XML 1.0 | XML 1.1 |
|---|---|---|
| U+0000 (null) | Forbidden | Forbidden |
| U+0001-U+0008 | Forbidden | Allowed (as references) |
| U+0009 (tab) | Allowed | Allowed |
| U+000A (line feed) | Allowed | Allowed |
| U+000B-U+000C | Forbidden | Allowed (as references) |
| U+000D (carriage return) | Allowed | Allowed |
| U+000E-U+001F | Forbidden | Allowed (as references) |
| U+0020-U+D7FF | Allowed | Allowed |
| U+D800-U+DFFF | Forbidden (surrogates) | Forbidden |
| U+E000-U+FFFD | Allowed | Allowed |
| U+FFFE-U+FFFF | Forbidden | Forbidden |
| U+10000-U+10FFFF | Allowed | Allowed |
XML 1.1 relaxes restrictions on control characters (allowing them as numeric references) but is rarely used in practice. Most systems use XML 1.0.
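One way to sanitize text before serializing it as XML 1.0 is to strip the forbidden ranges from the table above. A sketch in Python; the helper name `strip_invalid_xml10` is ours, not from any library:

```python
import re

# Code points forbidden by XML 1.0 even as character references
# (taken from the table above).
_XML10_INVALID = re.compile(
    '[\x00-\x08\x0b\x0c\x0e-\x1f'   # C0 controls except tab, LF, CR
    '\ud800-\udfff'                  # surrogates
    '\ufffe\uffff]'                  # noncharacters at the end of the BMP
)

def strip_invalid_xml10(text: str) -> str:
    """Drop characters that cannot appear in an XML 1.0 document."""
    return _XML10_INVALID.sub('', text)
```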
CDATA sections
CDATA sections let you include text that would otherwise need escaping:
```xml
<![CDATA[
if (a < b && c > d) {
    // No escaping needed inside CDATA
}
]]>
```
CDATA sections contain raw text — no character references are processed inside them.
The only sequence that cannot appear inside CDATA is `]]>` (which ends the section).
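A common workaround when the payload itself contains `]]>` is to split that sequence across two CDATA sections. A sketch; the helper name `cdata_wrap` is ours:

```python
def cdata_wrap(text: str) -> str:
    """Wrap text in a CDATA section (helper name is ours).

    ']]>' would terminate the section, so split it: ']]' ends the
    first section and '>' begins a new one.
    """
    safe = text.replace(']]>', ']]]]><![CDATA[>')
    return '<![CDATA[' + safe + ']]>'

cdata_wrap('if (a < b && c > d) {}')   # no escaping of < or && needed
```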
JSON and Unicode
JSON's UTF-8 mandate
RFC 8259 (the current JSON standard, published 2017) states clearly:
"JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8."
Earlier versions (RFC 4627, RFC 7159) allowed UTF-8, UTF-16, and UTF-32. The current standard mandates UTF-8 for interoperability.
| Standard | Allowed Encodings |
|---|---|
| RFC 4627 (2006) | UTF-8, UTF-16, UTF-32 |
| RFC 7159 (2014) | UTF-8, UTF-16, UTF-32 |
| RFC 8259 (2017) | UTF-8 (MUST for interchange) |
Best practice: Always produce and consume JSON as UTF-8. If you encounter UTF-16 or UTF-32 JSON, convert to UTF-8 before processing.
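Python's `json` module illustrates both halves of this advice: `json.loads` accepts raw bytes and auto-detects UTF-8/-16/-32 (via `json.detect_encoding`), while output can be encoded to UTF-8 explicitly. A sketch:

```python
import json

# loads() accepts bytes and detects the encoding, but producing
# UTF-8 in the first place avoids the question entirely.
payload = '{"city": "München"}'.encode('utf-8')
obj = json.loads(payload)            # {'city': 'München'}

# When emitting JSON for the wire, encode explicitly as UTF-8:
wire = json.dumps(obj, ensure_ascii=False).encode('utf-8')
```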
Unicode escape sequences in JSON
JSON defines a single escape mechanism for Unicode characters:
\uXXXX
where XXXX is exactly four hexadecimal digits representing a UTF-16 code unit:
| Escape | Code Point | Character |
|---|---|---|
| `\u0041` | U+0041 | A |
| `\u00A9` | U+00A9 | Copyright sign |
| `\u2603` | U+2603 | Snowman |
| `\u4E2D` | U+4E2D | Chinese "middle" |
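A quick check with Python's `json` module confirms that an escape and the literal character decode to the same string:

```python
import json

# The \u escape and the literal character are interchangeable.
assert json.loads('"\\u2603"') == '☃'
assert json.loads('"\\u2603"') == json.loads('"☃"')
```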
Supplementary characters (surrogate pairs)
Characters above U+FFFF cannot be represented with a single `\uXXXX` escape, because each escape encodes exactly one UTF-16 code unit and supplementary characters need two. JSON therefore uses UTF-16 surrogate pairs:
\uD83D\uDE00 = U+1F600 (Grinning Face)
The surrogate pair encoding works as follows:

1. Subtract 0x10000 from the code point: 0x1F600 - 0x10000 = 0xF600
2. High surrogate: 0xD800 + (0xF600 >> 10) = 0xD800 + 0x3D = 0xD83D
3. Low surrogate: 0xDC00 + (0xF600 & 0x3FF) = 0xDC00 + 0x200 = 0xDE00
| Code Point | Surrogate Pair | Character |
|---|---|---|
| U+1F600 | `\uD83D\uDE00` | Grinning face |
| U+1F4A9 | `\uD83D\uDCA9` | Pile of poo |
| U+10348 | `\uD800\uDF48` | Gothic letter hwair |
| U+20000 | `\uD840\uDC00` | CJK Unified Ideograph Extension B |
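The three steps above can be sketched directly in code (the helper names are ours):

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into UTF-16 surrogates."""
    if cp <= 0xFFFF:
        raise ValueError('BMP code points need no surrogate pair')
    v = cp - 0x10000                 # 20-bit value
    high = 0xD800 + (v >> 10)        # top 10 bits
    low = 0xDC00 + (v & 0x3FF)       # bottom 10 bits
    return high, low

def json_escape_supplementary(cp: int) -> str:
    """Render a supplementary code point as a JSON surrogate pair."""
    high, low = to_surrogate_pair(cp)
    return '\\u%04X\\u%04X' % (high, low)

json_escape_supplementary(0x1F600)   # the escape text \uD83D\uDE00
```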
Important: A lone surrogate (`\uD83D` without a following low surrogate) is technically invalid JSON per RFC 8259 but is tolerated by many parsers.
Required escapes in JSON strings
JSON requires escaping for these characters:
| Character | Escape | Notes |
|---|---|---|
| `"` (quotation mark) | `\"` | String delimiter |
| `\` (backslash) | `\\` | Escape character |
| Control chars U+0000-U+001F | `\uXXXX` | Must be escaped |
| `/` (solidus) | `\/` | Optional but allowed |
| Backspace (U+0008) | `\b` | Shorthand |
| Form feed (U+000C) | `\f` | Shorthand |
| Line feed (U+000A) | `\n` | Shorthand |
| Carriage return (U+000D) | `\r` | Shorthand |
| Tab (U+0009) | `\t` | Shorthand |
All other Unicode characters (including CJK, emoji, accented letters) can appear literally in JSON strings — they do not need escaping if the file is UTF-8 encoded.
Escaping vs literal characters
Both of these JSON strings are valid and semantically identical:

```json
{"city": "München"}
{"city": "M\u00FCnchen"}
```
The first uses a literal UTF-8 character; the second uses a \u escape. Prefer literal
characters for readability, and let your JSON library handle escaping for control
characters.
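With Python's `json` module, the choice between the two forms is the `ensure_ascii` flag; a sketch:

```python
import json

# ensure_ascii=True (the default) escapes everything above U+007F;
# ensure_ascii=False emits literal UTF-8 characters.
json.dumps({'city': 'München'})                       # {"city": "M\u00fcnchen"}
json.dumps({'city': 'München'}, ensure_ascii=False)   # {"city": "München"}
```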
Comparing XML and JSON Unicode Handling
| Feature | XML | JSON |
|---|---|---|
| Default encoding | UTF-8 (assumed) | UTF-8 (mandated by RFC 8259) |
| Encoding declaration | Yes (`encoding="..."`) | No (always UTF-8) |
| Character references | `&#dddd;` / `&#xhhhh;` | `\uXXXX` |
| Supplementary characters | Direct: `&#x1F600;` | Surrogate pairs: `\uD83D\uDE00` |
| Named entities | 5 predefined + DTD-defined | None |
| Control characters | Mostly forbidden | Must be escaped (`\uXXXX`) |
| BOM | Allowed (UTF-16 detection) | Allowed but discouraged |
Key difference: supplementary character handling
XML's approach is simpler: you can reference any code point directly with a single reference such as `&#x1F600;`.
JSON requires UTF-16 surrogate pairs for code points above U+FFFF, which is more complex
and error-prone. In practice, most JSON libraries handle this transparently.
Common Pitfalls
Pitfall 1: Double encoding
```json
{"name": "M\\u00FCnchen"}
```

This is double-escaped: the backslash itself has been escaped, so the parser sees the literal string `M\u00FCnchen` instead of `München`. This usually happens when a JSON string is encoded twice (e.g., `json.dumps(json.dumps(data))`).
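The mistake is easy to reproduce with Python's `json` module:

```python
import json

data = {'name': 'München'}
once = json.dumps(data, ensure_ascii=False)   # {"name": "München"}
twice = json.dumps(once)                      # a JSON string wrapping JSON

json.loads(once)    # the object back: {'name': 'München'}
json.loads(twice)   # just a string: '{"name": "München"}' — must decode again
```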
Pitfall 2: XML with wrong encoding declaration
```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<name>München</name>
```

If the file is actually UTF-8 but declares ISO-8859-1, multibyte UTF-8 sequences will be misinterpreted, producing garbled text (mojibake).
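The misinterpretation can be reproduced directly in Python: the UTF-8 bytes for `ü` (0xC3 0xBC) decode as two separate Latin-1 characters:

```python
# UTF-8 bytes read with the wrong (Latin-1) decoder produce mojibake.
raw = 'München'.encode('utf-8')   # b'M\xc3\xbcnchen'
raw.decode('iso-8859-1')          # 'MÃ¼nchen'  (garbled)
raw.decode('utf-8')               # 'München'   (correct)
```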
Pitfall 3: Lone surrogates in JSON
```json
{"emoji": "\uD83D"}
```
A lone high surrogate without a matching low surrogate is invalid. Most parsers reject
this, but some (Python's json module with default settings) may produce a string
containing an unpaired surrogate, which causes problems downstream.
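Python's `json` module demonstrates both the lenient parsing and the downstream failure; a sketch:

```python
import json

# CPython's json parser accepts a lone surrogate escape...
s = json.loads('"\\uD83D"')       # one-character str: '\ud83d'

# ...but the unpaired surrogate cannot be encoded as UTF-8:
try:
    s.encode('utf-8')
except UnicodeEncodeError:
    pass    # this is the downstream failure the text warns about
```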
Pitfall 4: Null bytes
Both XML and JSON forbid the null byte (U+0000). If your data contains null bytes (e.g., from binary data incorrectly treated as text), serialization will fail or produce corrupt output.
Best Practices
- Always use UTF-8: For both XML and JSON, UTF-8 is the universal standard.
- Validate encoding at boundaries: When receiving XML/JSON from external sources, validate that the declared encoding matches the actual byte encoding.
- Let libraries handle escaping: Never manually construct XML or JSON strings with string concatenation. Use proper serialization libraries.
- Normalize Unicode: Apply NFC normalization before serialization if you need consistent comparison (see the Unicode normalization guide).
- Test with diverse characters: Include CJK, emoji, RTL text, and combining characters in your test data.
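As a concrete illustration of the normalization point above, a sketch with Python's `unicodedata` module:

```python
import unicodedata

# 'ü' may arrive precomposed (U+00FC) or decomposed (u + U+0308);
# the strings look identical but compare unequal until normalized.
decomposed = 'Mu\u0308nchen'
composed = 'M\u00FCnchen'
assert decomposed != composed
assert unicodedata.normalize('NFC', decomposed) == composed
```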
Key Takeaways
- XML character references (`&#dddd;`, `&#xhhhh;`) can address any Unicode code point directly, while JSON requires UTF-16 surrogate pairs (`\uD83D\uDE00`) for characters above U+FFFF.
- JSON must be UTF-8 per RFC 8259. XML defaults to UTF-8 but allows other encodings via the `encoding` declaration.
- XML forbids most control characters outright; JSON permits them but requires them to be escaped.
- Both formats handle the full range of Unicode when used correctly. The most common errors are encoding mismatches, double escaping, and lone surrogates.
- Use proper serialization libraries; never hand-craft XML or JSON string escaping.