Unicode in XML and JSON
XML and JSON are the two dominant data interchange formats on the web and in enterprise systems. Both have explicit rules for how Unicode text is represented, escaped, and transmitted. Understanding these rules is critical for building systems that correctly handle multilingual data, special characters, and emoji. This guide covers the Unicode encoding requirements of XML and JSON, their escape mechanisms, and the common pitfalls developers encounter.
XML and Unicode
XML's encoding declaration
XML 1.0 supports UTF-8 and UTF-16 as its primary encodings (every conforming parser must accept both). The encoding is declared in the XML declaration:
```xml
<?xml version="1.0" encoding="UTF-8"?>
```
If no encoding is declared, XML parsers assume UTF-8 (or UTF-16 if a BOM is present).
| Encoding Declaration | Parser Behavior |
|---|---|
| `encoding="UTF-8"` | Parse as UTF-8 |
| `encoding="UTF-16"` | Parse as UTF-16 |
| `encoding="ISO-8859-1"` | Parse as Latin-1 |
| No declaration + no BOM | Assume UTF-8 |
| No declaration + UTF-16 BOM | Parse as UTF-16 |
Best practice: Always declare encoding="UTF-8" and actually encode the file as
UTF-8. This eliminates ambiguity and maximizes compatibility.
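A minimal sketch of this practice using Python's standard `xml.etree.ElementTree` (the element name is illustrative): `write()` emits the declaration and encodes the output in one step, so the declared and actual encodings cannot drift apart.

```python
import io
import xml.etree.ElementTree as ET

# Build a small document containing non-ASCII text.
root = ET.Element("city")
root.text = "München"

# xml_declaration=True emits <?xml version='1.0' encoding='UTF-8'?>;
# the bytes that follow are actually UTF-8, so declaration and
# content agree by construction.
buf = io.BytesIO()
ET.ElementTree(root).write(buf, encoding="UTF-8", xml_declaration=True)

data = buf.getvalue()
```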
Character references in XML
XML provides two types of character references for inserting Unicode characters that cannot (or should not) be typed directly:
| Type | Syntax | Example | Result |
|---|---|---|---|
| Decimal | `&#dddd;` | `&#169;` | © (copyright) |
| Hexadecimal | `&#xhhhh;` | `&#xA9;` | © (copyright) |
| Named entity | `&name;` | `&amp;` | & |
Decimal and hex references can express any Unicode code point:
| Reference | Code Point | Character |
|---|---|---|
| `&#65;` | U+0041 | A |
| `&#x2603;` | U+2603 | Snowman |
| `&#x1F600;` | U+1F600 | Grinning face |
| `&#x4E2D;` | U+4E2D | Chinese "middle" |
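To see the references resolved, a short sketch using Python's standard `xml.etree.ElementTree` (the `<msg>` element is illustrative); the parser replaces each reference with the actual character:

```python
import xml.etree.ElementTree as ET

# Decimal and hex references are resolved during parsing,
# so the resulting text contains the real Unicode characters.
xml = '<msg>&#65; &#x2603; &#x1F600; &#x4E2D;</msg>'
root = ET.fromstring(xml)
# root.text == 'A ☃ 😀 中'
```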
Predefined XML entities
XML defines only five named entities:
| Entity | Character | When Required |
|---|---|---|
| `&lt;` | `<` | Always in text content |
| `&gt;` | `>` | Recommended in text content |
| `&amp;` | `&` | Always |
| `&apos;` | `'` | In attribute values with single quotes |
| `&quot;` | `"` | In attribute values with double quotes |
Unlike HTML, XML does not define entities like `&copy;` or `&mdash;`. You must use numeric references (`&#169;`, `&#8212;`) or define custom entities in a DTD.
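Python's standard `xml.sax.saxutils` provides helpers that cover exactly these predefined entities; a brief sketch:

```python
from xml.sax.saxutils import escape, quoteattr

# escape() rewrites the always-required entities in text content:
escape('Tom & Jerry <cartoon>')   # 'Tom &amp; Jerry &lt;cartoon&gt;'

# quoteattr() returns a fully quoted attribute value, choosing a
# quote character and escaping the content as needed:
quoteattr('Tom & Jerry')          # '"Tom &amp; Jerry"'
```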
Restricted characters in XML
XML 1.0 restricts which Unicode characters can appear, even with escaping:
| Range | XML 1.0 | XML 1.1 |
|---|---|---|
| U+0000 (null) | Forbidden | Forbidden |
| U+0001-U+0008 | Forbidden | Allowed (as references) |
| U+0009 (tab) | Allowed | Allowed |
| U+000A (line feed) | Allowed | Allowed |
| U+000B-U+000C | Forbidden | Allowed (as references) |
| U+000D (carriage return) | Allowed | Allowed |
| U+000E-U+001F | Forbidden | Allowed (as references) |
| U+0020-U+D7FF | Allowed | Allowed |
| U+D800-U+DFFF | Forbidden (surrogates) | Forbidden |
| U+E000-U+FFFD | Allowed | Allowed |
| U+FFFE-U+FFFF | Forbidden | Forbidden |
| U+10000-U+10FFFF | Allowed | Allowed |
XML 1.1 relaxes restrictions on control characters (allowing them as numeric references) but is rarely used in practice. Most systems use XML 1.0.
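One way to sanitize text before serializing it as XML 1.0 is to strip the forbidden ranges from the table above. A sketch in Python; the helper name `strip_invalid_xml10` is ours, not from any library:

```python
import re

# Code points forbidden by XML 1.0 even as character references
# (taken from the table above).
_XML10_INVALID = re.compile(
    '[\x00-\x08\x0b\x0c\x0e-\x1f'   # C0 controls except tab, LF, CR
    '\ud800-\udfff'                  # surrogates
    '\ufffe\uffff]'                  # noncharacters at the end of the BMP
)

def strip_invalid_xml10(text: str) -> str:
    """Drop characters that cannot appear in an XML 1.0 document."""
    return _XML10_INVALID.sub('', text)
```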
CDATA sections
CDATA sections let you include text that would otherwise need escaping:
```xml
<![CDATA[
if (a < b && c > d) {
    // No escaping needed inside CDATA
}
]]>
```
CDATA sections contain raw text — no character references are processed inside them.
The only sequence that cannot appear inside CDATA is `]]>` (which ends the section).
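A common workaround when the payload itself contains `]]>` is to split that sequence across two CDATA sections. A sketch; the helper name `cdata_wrap` is ours:

```python
def cdata_wrap(text: str) -> str:
    """Wrap text in a CDATA section (helper name is ours).

    ']]>' would terminate the section, so split it: ']]' ends the
    first section and '>' begins a new one.
    """
    safe = text.replace(']]>', ']]]]><![CDATA[>')
    return '<![CDATA[' + safe + ']]>'

cdata_wrap('if (a < b && c > d) {}')   # no escaping of < or && needed
```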
JSON and Unicode
JSON's UTF-8 mandate
RFC 8259 (the current JSON standard, published 2017) states clearly:
"JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8."
Earlier versions (RFC 4627, RFC 7159) allowed UTF-8, UTF-16, and UTF-32. The current standard mandates UTF-8 for interoperability.
| Standard | Allowed Encodings |
|---|---|
| RFC 4627 (2006) | UTF-8, UTF-16, UTF-32 |
| RFC 7159 (2014) | UTF-8, UTF-16, UTF-32 |
| RFC 8259 (2017) | UTF-8 (MUST for interchange) |
Best practice: Always produce and consume JSON as UTF-8. If you encounter UTF-16 or UTF-32 JSON, convert to UTF-8 before processing.
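Python's `json` module illustrates both halves of this advice: `json.loads` accepts raw bytes and auto-detects UTF-8/-16/-32 (via `json.detect_encoding`), while output can be encoded to UTF-8 explicitly. A sketch:

```python
import json

# loads() accepts bytes and detects the encoding, but producing
# UTF-8 in the first place avoids the question entirely.
payload = '{"city": "München"}'.encode('utf-8')
obj = json.loads(payload)            # {'city': 'München'}

# When emitting JSON for the wire, encode explicitly as UTF-8:
wire = json.dumps(obj, ensure_ascii=False).encode('utf-8')
```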
Unicode escape sequences in JSON
JSON defines a single escape mechanism for Unicode characters:
\uXXXX
where XXXX is exactly four hexadecimal digits representing a UTF-16 code unit:
| Escape | Code Point | Character |
|---|---|---|
| `\u0041` | U+0041 | A |
| `\u00A9` | U+00A9 | Copyright sign |
| `\u2603` | U+2603 | Snowman |
| `\u4E2D` | U+4E2D | Chinese "middle" |
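A quick check with Python's `json` module confirms that an escape and the literal character decode to the same string:

```python
import json

# The \u escape and the literal character are interchangeable.
assert json.loads('"\\u2603"') == '☃'
assert json.loads('"\\u2603"') == json.loads('"☃"')
```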
Supplementary characters (surrogate pairs)
Characters above U+FFFF cannot be represented with a single `\uXXXX` escape, because each escape encodes exactly one UTF-16 code unit and supplementary characters need two. JSON therefore uses UTF-16 surrogate pairs:
\uD83D\uDE00 = U+1F600 (Grinning Face)
The surrogate pair encoding works as follows:

1. Subtract 0x10000 from the code point: 0x1F600 - 0x10000 = 0xF600
2. High surrogate: 0xD800 + (0xF600 >> 10) = 0xD800 + 0x3D = 0xD83D
3. Low surrogate: 0xDC00 + (0xF600 & 0x3FF) = 0xDC00 + 0x200 = 0xDE00
| Code Point | Surrogate Pair | Character |
|---|---|---|
| U+1F600 | `\uD83D\uDE00` | Grinning face |
| U+1F4A9 | `\uD83D\uDCA9` | Pile of poo |
| U+10348 | `\uD800\uDF48` | Gothic letter hwair |
| U+20000 | `\uD840\uDC00` | CJK Unified Ideograph Extension B |
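The three steps above can be sketched directly in code (the helper names are ours):

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into UTF-16 surrogates."""
    if cp <= 0xFFFF:
        raise ValueError('BMP code points need no surrogate pair')
    v = cp - 0x10000                 # 20-bit value
    high = 0xD800 + (v >> 10)        # top 10 bits
    low = 0xDC00 + (v & 0x3FF)       # bottom 10 bits
    return high, low

def json_escape_supplementary(cp: int) -> str:
    """Render a supplementary code point as a JSON surrogate pair."""
    high, low = to_surrogate_pair(cp)
    return '\\u%04X\\u%04X' % (high, low)

json_escape_supplementary(0x1F600)   # the escape text \uD83D\uDE00
```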
Important: A lone surrogate (`\uD83D` without a following low surrogate) is technically invalid JSON per RFC 8259 but is tolerated by many parsers.
Required escapes in JSON strings
JSON requires escaping for these characters:
| Character | Escape | Notes |
|---|---|---|
| `"` (quotation mark) | `\"` | String delimiter |
| `\` (backslash) | `\\` | Escape character |
| Control chars U+0000-U+001F | `\uXXXX` | Must be escaped |
| `/` (solidus) | `\/` | Optional but allowed |
| Backspace (U+0008) | `\b` | Shorthand |
| Form feed (U+000C) | `\f` | Shorthand |
| Line feed (U+000A) | `\n` | Shorthand |
| Carriage return (U+000D) | `\r` | Shorthand |
| Tab (U+0009) | `\t` | Shorthand |
All other Unicode characters (including CJK, emoji, accented letters) can appear literally in JSON strings — they do not need escaping if the file is UTF-8 encoded.
Escaping vs literal characters
Both of these JSON strings are valid and semantically identical:

```json
{"city": "München"}
{"city": "M\u00FCnchen"}
```
The first uses a literal UTF-8 character; the second uses a \u escape. Prefer literal
characters for readability, and let your JSON library handle escaping for control
characters.
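With Python's `json` module, the choice between the two forms is the `ensure_ascii` flag; a sketch:

```python
import json

# ensure_ascii=True (the default) escapes everything above U+007F;
# ensure_ascii=False emits literal UTF-8 characters.
json.dumps({'city': 'München'})                       # {"city": "M\u00fcnchen"}
json.dumps({'city': 'München'}, ensure_ascii=False)   # {"city": "München"}
```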
Comparing XML and JSON Unicode Handling
| Feature | XML | JSON |
|---|---|---|
| Default encoding | UTF-8 (assumed) | UTF-8 (mandated by RFC 8259) |
| Encoding declaration | Yes (`encoding="..."`) | No (always UTF-8) |
| Character references | `&#dddd;` / `&#xhhhh;` | `\uXXXX` |
| Supplementary characters | Direct: `&#x1F600;` | Surrogate pairs: `\uD83D\uDE00` |
| Named entities | 5 predefined + DTD-defined | None |
| Control characters | Mostly forbidden | Must be escaped (`\uXXXX`) |
| BOM | Allowed (UTF-16 detection) | Allowed but discouraged |
Key difference: supplementary character handling
XML's approach is simpler: you can reference any code point directly with a single reference such as `&#x1F600;`.
JSON requires UTF-16 surrogate pairs for code points above U+FFFF, which is more complex
and error-prone. In practice, most JSON libraries handle this transparently.
Common Pitfalls
Pitfall 1: Double encoding
```json
{"name": "M\\u00FCnchen"}
```

This is double-escaped: the backslash itself has been escaped, so the parser sees the literal string `M\u00FCnchen` instead of `München`. This usually happens when a JSON string is encoded twice (e.g., `json.dumps(json.dumps(data))`).
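The mistake is easy to reproduce with Python's `json` module:

```python
import json

data = {'name': 'München'}
once = json.dumps(data, ensure_ascii=False)   # {"name": "München"}
twice = json.dumps(once)                      # a JSON string wrapping JSON

json.loads(once)    # the object back: {'name': 'München'}
json.loads(twice)   # just a string: '{"name": "München"}' — must decode again
```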
Pitfall 2: XML with wrong encoding declaration
```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<name>München</name>
```

If the file is actually UTF-8 but declares ISO-8859-1, multibyte UTF-8 sequences will be misinterpreted, producing garbled text (mojibake).
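The misinterpretation can be reproduced directly in Python: the UTF-8 bytes for `ü` (0xC3 0xBC) decode as two separate Latin-1 characters:

```python
# UTF-8 bytes read with the wrong (Latin-1) decoder produce mojibake.
raw = 'München'.encode('utf-8')   # b'M\xc3\xbcnchen'
raw.decode('iso-8859-1')          # 'MÃ¼nchen'  (garbled)
raw.decode('utf-8')               # 'München'   (correct)
```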
Pitfall 3: Lone surrogates in JSON
```json
{"emoji": "\uD83D"}
```
A lone high surrogate without a matching low surrogate is invalid. Most parsers reject
this, but some (Python's json module with default settings) may produce a string
containing an unpaired surrogate, which causes problems downstream.
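Python's `json` module demonstrates both the lenient parsing and the downstream failure; a sketch:

```python
import json

# CPython's json parser accepts a lone surrogate escape...
s = json.loads('"\\uD83D"')       # one-character str: '\ud83d'

# ...but the unpaired surrogate cannot be encoded as UTF-8:
try:
    s.encode('utf-8')
except UnicodeEncodeError:
    pass    # this is the downstream failure the text warns about
```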
Pitfall 4: Null bytes
Both XML and JSON forbid the null byte (U+0000). If your data contains null bytes (e.g., from binary data incorrectly treated as text), serialization will fail or produce corrupt output.
Best Practices
- Always use UTF-8: For both XML and JSON, UTF-8 is the universal standard.
- Validate encoding at boundaries: When receiving XML/JSON from external sources, validate that the declared encoding matches the actual byte encoding.
- Let libraries handle escaping: Never manually construct XML or JSON strings with string concatenation. Use proper serialization libraries.
- Normalize Unicode: Apply NFC normalization before serialization if you need consistent comparison (see the Unicode normalization guide).
- Test with diverse characters: Include CJK, emoji, RTL text, and combining characters in your test data.
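As a concrete illustration of the normalization point above, a sketch with Python's `unicodedata` module:

```python
import unicodedata

# 'ü' may arrive precomposed (U+00FC) or decomposed (u + U+0308);
# the strings look identical but compare unequal until normalized.
decomposed = 'Mu\u0308nchen'
composed = 'M\u00FCnchen'
assert decomposed != composed
assert unicodedata.normalize('NFC', decomposed) == composed
```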
Key Takeaways
- XML character references (`&#dddd;`, `&#xhhhh;`) can address any Unicode code point directly, while JSON requires UTF-16 surrogate pairs (`\uD83D\uDE00`) for characters above U+FFFF.
- JSON must be UTF-8 per RFC 8259. XML defaults to UTF-8 but allows other encodings via the `encoding` declaration.
- XML forbids most control characters outright; JSON permits them but requires them to be escaped.
- Both formats handle the full range of Unicode when used correctly. The most common errors are encoding mismatches, double escaping, and lone surrogates.
- Use proper serialization libraries; never hand-craft XML or JSON string escaping.