Unicode in XML and JSON

Both XML and JSON are defined to use Unicode text, but each has its own rules for encoding characters, escaping special code points, and declaring the document encoding. This guide explains Unicode in XML (including the XML declaration and character references) and JSON (including \uXXXX escape sequences and surrogate pair handling).

XML and JSON are the two dominant data interchange formats on the web and in enterprise systems. Both have explicit rules for how Unicode text is represented, escaped, and transmitted. Understanding these rules is critical for building systems that correctly handle multilingual data, special characters, and emoji. This guide covers the Unicode encoding requirements of XML and JSON, their escape mechanisms, and the common pitfalls developers encounter.

XML and Unicode

XML's encoding declaration

Every XML 1.0 processor must support UTF-8 and UTF-16; other encodings, such as ISO-8859-1, are optional. The encoding is declared in the XML declaration at the start of the document:

<?xml version="1.0" encoding="UTF-8"?>

If no encoding is declared, XML parsers assume UTF-8 (or UTF-16 if a BOM is present).

| Encoding Declaration | Parser Behavior |
| --- | --- |
| encoding="UTF-8" | Parse as UTF-8 |
| encoding="UTF-16" | Parse as UTF-16 |
| encoding="ISO-8859-1" | Parse as Latin-1 |
| No declaration + no BOM | Assume UTF-8 |
| No declaration + UTF-16 BOM | Parse as UTF-16 |

Best practice: Always declare encoding="UTF-8" and actually encode the file as UTF-8. This eliminates ambiguity and maximizes compatibility.
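As a minimal sketch of how the declaration drives decoding, the snippet below uses Python's standard xml.etree.ElementTree module (the choice of Python is an assumption for illustration; the guide itself is language-neutral). It parses the same text under two declared encodings and then serializes with an explicit UTF-8 declaration.

```python
import xml.etree.ElementTree as ET

# The same text serialized under two different declared encodings.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><name>München</name>'
).encode("iso-8859-1")
utf8_doc = (
    '<?xml version="1.0" encoding="UTF-8"?><name>München</name>'
).encode("utf-8")

# Passing bytes (not str) lets the parser honor the encoding declaration.
latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text

# Writing: request UTF-8 explicitly and include the declaration.
# The xml_declaration argument of tostring() needs Python 3.8 or later.
element = ET.Element("name")
element.text = "München"
serialized = ET.tostring(element, encoding="UTF-8", xml_declaration=True)
```

Both parses yield the same text because each byte sequence matches its declaration; the serialized output carries the non-ASCII character as raw UTF-8 bytes.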

Character references in XML

XML provides two types of numeric character references for inserting Unicode characters that cannot (or should not) be typed directly, alongside named entity references:

| Type | Syntax | Example | Result |
| --- | --- | --- | --- |
| Decimal | &#dddd; | &#169; | © (copyright) |
| Hexadecimal | &#xhhhh; | &#x00A9; | © (copyright) |
| Named entity | &name; | &amp; | & |

Decimal and hex references can express any Unicode code point:

| Reference | Code Point | Character |
| --- | --- | --- |
| &#65; | U+0041 | A |
| &#x2603; | U+2603 | Snowman |
| &#x1F600; | U+1F600 | Grinning face |
| &#x4E2D; | U+4E2D | Chinese "middle" |
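The references in the table can be checked with a short sketch. It assumes Python's standard library: ElementTree resolves the references on parsing, and the built-in "xmlcharrefreplace" error handler produces decimal references when encoding to a narrower character set.

```python
import xml.etree.ElementTree as ET

# Parsing: decimal and hexadecimal references resolve to the same characters
# they would if typed literally.
parsed = ET.fromstring("<t>&#65;&#x2603;&#x1F600;&#x4E2D;</t>").text

# Writing: the "xmlcharrefreplace" error handler emits decimal references
# for anything the target encoding (here ASCII) cannot represent.
ascii_safe = "A\u2603\U0001F600".encode("ascii", "xmlcharrefreplace")
```

Here `ascii_safe` is `b"A&#9731;&#128512;"`, since 0x2603 is 9731 and 0x1F600 is 128512 in decimal.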

Predefined XML entities

XML defines only five named entities:

| Entity | Character | When Required |
| --- | --- | --- |
| &lt; | < | Always in text content |
| &gt; | > | Recommended in text content |
| &amp; | & | Always |
| &apos; | ' | In attribute values with single quotes |
| &quot; | " | In attribute values with double quotes |

Unlike HTML, XML does not define entities like &copy; or &mdash;. You must use numeric references (&#169;, &#x2014;) or define custom entities in a DTD.
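A small sketch of both points, assuming Python's standard library: xml.sax.saxutils.escape handles the predefined entities for text content (and, given an extra mapping, for quotes), and a parser with no DTD rejects an HTML-style entity such as &copy;. The helper name `parses` is made up for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# escape() replaces &, < and > by default.
text_safe = escape("if (a < b && c > d)")

# Extra replacements can be passed as a mapping, e.g. for attribute values.
attr_safe = escape('He said "hi" & left', {'"': "&quot;"})


def parses(xml_text: str) -> bool:
    """Hypothetical helper: True if xml_text is accepted by ElementTree."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


html_entity_ok = parses("<t>&copy;</t>")   # undefined entity without a DTD
numeric_ref_ok = parses("<t>&#169;</t>")   # numeric reference always works
```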

Restricted characters in XML

XML 1.0 restricts which Unicode characters can appear, even with escaping:

| Range | XML 1.0 | XML 1.1 |
| --- | --- | --- |
| U+0000 (null) | Forbidden | Forbidden |
| U+0001-U+0008 | Forbidden | Allowed (as references) |
| U+0009 (tab) | Allowed | Allowed |
| U+000A (line feed) | Allowed | Allowed |
| U+000B-U+000C | Forbidden | Allowed (as references) |
| U+000D (carriage return) | Allowed | Allowed |
| U+000E-U+001F | Forbidden | Allowed (as references) |
| U+0020-U+D7FF | Allowed | Allowed |
| U+D800-U+DFFF | Forbidden (surrogates) | Forbidden |
| U+E000-U+FFFD | Allowed | Allowed |
| U+FFFE-U+FFFF | Forbidden | Forbidden |
| U+10000-U+10FFFF | Allowed | Allowed |

XML 1.1 relaxes restrictions on control characters (allowing them as numeric references) but is rarely used in practice. Most systems use XML 1.0.
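The XML 1.0 column of the table translates into a simple range check. The sketch below is one possible way to filter text before serialization; the function names are made up for this example, and silently dropping characters is only one policy (rejecting the input is another).

```python
def is_xml10_char(code_point: int) -> bool:
    """True if the code point may appear in an XML 1.0 document."""
    return (
        code_point in (0x9, 0xA, 0xD)           # tab, line feed, carriage return
        or 0x20 <= code_point <= 0xD7FF         # up to just before the surrogates
        or 0xE000 <= code_point <= 0xFFFD       # excludes U+FFFE and U+FFFF
        or 0x10000 <= code_point <= 0x10FFFF    # supplementary planes
    )


def strip_invalid_xml10(text: str) -> str:
    """Drop every character that XML 1.0 forbids (hypothetical policy)."""
    return "".join(ch for ch in text if is_xml10_char(ord(ch)))


cleaned = strip_invalid_xml10("a\x00b\x0bc\td\uffff")
```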

CDATA sections

CDATA sections let you include text that would otherwise need escaping:

<code><![CDATA[
    if (a < b && c > d) {
        // No escaping needed inside CDATA
    }
]]></code>

CDATA sections contain raw text — no character references are processed inside them. The only sequence that cannot appear inside CDATA is ]]> (which ends the section).
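Because ]]> cannot appear inside a CDATA section, text that contains it has to be split across two sections. The sketch below shows one common way to do that in Python; `to_cdata` is a hypothetical helper name, not a standard library function.

```python
import xml.etree.ElementTree as ET


def to_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting any "]]>" across two sections."""
    # "]]>" becomes "]]" (end of one section) + ">" (start of the next).
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


source = "if (a < b && c > d) { x[y[0]]>1 }"
document = "<code>" + to_cdata(source) + "</code>"

# The parser joins adjacent CDATA sections back into one text value.
round_tripped = ET.fromstring(document).text
```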

JSON and Unicode

JSON's UTF-8 mandate

RFC 8259 (the current JSON standard, published 2017) states clearly:

"JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8."

Earlier versions (RFC 4627, RFC 7159) allowed UTF-8, UTF-16, and UTF-32. The current standard mandates UTF-8 for interoperability.

| Standard | Allowed Encodings |
| --- | --- |
| RFC 4627 (2006) | UTF-8, UTF-16, UTF-32 |
| RFC 7159 (2014) | UTF-8, UTF-16, UTF-32 |
| RFC 8259 (2017) | UTF-8 (MUST for interchange) |

Best practice: Always produce and consume JSON as UTF-8. If you encounter UTF-16 or UTF-32 JSON, convert to UTF-8 before processing.
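One way to follow this advice in Python is sketched below. It relies on json.loads accepting bytes and detecting UTF-8, UTF-16, or UTF-32 input; `to_utf8_json` is a hypothetical helper name. Note that it re-serializes the value, so insignificant whitespace in the original is not preserved.

```python
import json


def to_utf8_json(raw: bytes) -> bytes:
    """Re-encode JSON bytes (UTF-8, UTF-16, or UTF-32) as UTF-8."""
    value = json.loads(raw)  # bytes input: the Unicode encoding is detected
    # ensure_ascii=False keeps non-ASCII characters literal in the output.
    return json.dumps(value, ensure_ascii=False).encode("utf-8")


utf16_input = '{"city": "München"}'.encode("utf-16")
utf8_output = to_utf8_json(utf16_input)
```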

Unicode escape sequences in JSON

JSON defines a single escape mechanism for Unicode characters:

\uXXXX

where XXXX is exactly four hexadecimal digits representing a UTF-16 code unit:

| Escape | Code Point | Character |
| --- | --- | --- |
| \u0041 | U+0041 | A |
| \u00A9 | U+00A9 | Copyright sign |
| \u2603 | U+2603 | Snowman |
| \u4E2D | U+4E2D | Chinese "middle" |

Supplementary characters (surrogate pairs)

Characters above U+FFFF cannot be represented in a single \uXXXX escape because they need more than 4 hex digits. JSON uses UTF-16 surrogate pairs:

\uD83D\uDE00  =  U+1F600 (Grinning Face)

The surrogate pair encoding works as follows:

  1. Subtract 0x10000 from the code point: 0x1F600 - 0x10000 = 0xF600
  2. High surrogate: 0xD800 + (0xF600 >> 10) = 0xD800 + 0x3D = 0xD83D
  3. Low surrogate: 0xDC00 + (0xF600 & 0x3FF) = 0xDC00 + 0x200 = 0xDE00

| Code Point | Surrogate Pair | Character |
| --- | --- | --- |
| U+1F600 | \uD83D\uDE00 | Grinning face |
| U+1F4A9 | \uD83D\uDCA9 | Pile of poo |
| U+10348 | \uD800\uDF48 | Gothic letter hwair |
| U+20000 | \uD840\uDC00 | CJK Unified Ideograph extension B |
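The arithmetic for surrogate pairs can be written as a few lines of Python. This is a sketch for illustration; the function names are made up here, and in real code a JSON library performs this conversion for you.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Inverse of to_surrogate_pair."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def json_escape(code_point: int) -> str:
    """Format a supplementary code point as two \\uXXXX escapes."""
    high, low = to_surrogate_pair(code_point)
    return "\\u{:04X}\\u{:04X}".format(high, low)


grinning_face = json_escape(0x1F600)
```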

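Important: RFC 8259's grammar does not forbid a lone surrogate escape (\uD83D without a following low surrogate), but such a string cannot be encoded as valid Unicode text, and the RFC warns that the behavior of software receiving it is unpredictable. Some parsers reject it; others pass it through or substitute a replacement character.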

Required escapes in JSON strings

JSON requires escaping for these characters:

| Character | Escape | Notes |
| --- | --- | --- |
| " (quotation mark) | \" | String delimiter |
| \ (backslash) | \\ | Escape character |
| Control chars U+0000-U+001F | \uXXXX | Must be escaped |
| / (solidus) | \/ | Optional but allowed |
| Backspace (U+0008) | \b | Shorthand |
| Form feed (U+000C) | \f | Shorthand |
| Line feed (U+000A) | \n | Shorthand |
| Carriage return (U+000D) | \r | Shorthand |
| Tab (U+0009) | \t | Shorthand |

All other Unicode characters (including CJK, emoji, accented letters) can appear literally in JSON strings — they do not need escaping if the file is UTF-8 encoded.
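A quick way to see these rules applied is to let a serializer do the work. The sketch below uses Python's json module; the sample strings are arbitrary.

```python
import json

# Quotation mark, backslash, line feed, and U+0001 all need escaping.
raw = 'a"b\\c\n\x01'
escaped = json.dumps(raw)

# With ensure_ascii=False, non-ASCII characters are written literally;
# only the mandatory escapes are applied.
literal = json.dumps("中 \u00e9 \U0001F600", ensure_ascii=False)
```

The mandatory escapes are applied regardless of the ensure_ascii setting; that flag only controls whether non-ASCII characters are additionally written as \uXXXX.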

Escaping vs literal characters

Both of these JSON strings are valid and semantically identical:

{"city": "München"}
{"city": "M\u00FCnchen"}

The first uses a literal UTF-8 character; the second uses a \u escape. Prefer literal characters for readability, and let your JSON library handle escaping for control characters.
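The equivalence is easy to confirm with a parser. This sketch assumes Python's json module, whose ensure_ascii flag chooses between the two output styles (it defaults to escaping, and writes hex digits in lowercase).

```python
import json

# Both spellings parse to the same value.
same = json.loads('{"city": "München"}') == json.loads('{"city": "M\\u00FCnchen"}')

# Default output escapes non-ASCII characters; ensure_ascii=False keeps them.
ascii_only = json.dumps({"city": "München"})
readable = json.dumps({"city": "München"}, ensure_ascii=False)
```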

Comparing XML and JSON Unicode Handling

| Feature | XML | JSON |
| --- | --- | --- |
| Default encoding | UTF-8 (assumed) | UTF-8 (mandated by RFC 8259) |
| Encoding declaration | Yes (encoding="...") | No (always UTF-8) |
| Character references | &#dddd; / &#xhhhh; | \uXXXX |
| Supplementary characters | Direct: &#x1F600; | Literal, or escaped as a surrogate pair: \uD83D\uDE00 |
| Named entities | 5 predefined + DTD-defined | None |
| Control characters | Mostly forbidden | Must be escaped (\uXXXX) |
| BOM | Allowed (UTF-16 detection) | Must not be added by producers; parsers may ignore one |

Key difference: supplementary character handling

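XML's approach is simpler: you can reference any code point directly with &#x1F600;. JSON's \uXXXX escape is limited to 16 bits, so an escaped code point above U+FFFF must be written as a UTF-16 surrogate pair, which is more complex and error-prone. (The character itself can still appear literally in UTF-8 JSON text.) In practice, most JSON libraries handle this transparently.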
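The following sketch, assuming Python's standard xml.etree.ElementTree and json modules, shows the two notations decoding to the same single character and a JSON library producing the surrogate pair on its own.

```python
import json
import xml.etree.ElementTree as ET

from_xml = ET.fromstring("<e>&#x1F600;</e>").text      # direct code point
from_json_escape = json.loads('"\\uD83D\\uDE00"')      # surrogate pair escape
from_json_literal = json.loads('"\U0001F600"')         # literal character

# With the default ensure_ascii=True, the library emits the pair itself.
escaped_output = json.dumps("\U0001F600")
```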

Common Pitfalls

Pitfall 1: Double encoding

{"name": "M\\u00FCnchen"}

This is double-escaped: the backslash itself is escaped, so the parser sees the literal string M\u00FCnchen instead of München. This usually happens when a JSON string is encoded twice (e.g., json.dumps(json.dumps(data))).
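The double-encoding pattern can be reproduced in a few lines of Python. The variable names are illustrative only.

```python
import json

data = {"name": "München"}

once = json.dumps(data)    # correct: a JSON object
twice = json.dumps(once)   # bug: a JSON string that contains JSON text

# Decoding once gives back a str, not a dict: a telltale sign of the bug.
decoded_once = json.loads(twice)
decoded_twice = json.loads(decoded_once)
```

A cheap check at system boundaries is to verify that the decoded value has the expected type (here a dict) rather than a string.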

Pitfall 2: XML with wrong encoding declaration

<?xml version="1.0" encoding="ISO-8859-1"?>
<name>München</name>

If the file is actually UTF-8 but declares ISO-8859-1, multibyte UTF-8 sequences will be misinterpreted, producing garbled text.
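The garbling can be reproduced with a short sketch, assuming Python's ElementTree. The last line shows a repair that works only in this specific case, where the text was really UTF-8 read as ISO-8859-1.

```python
import xml.etree.ElementTree as ET

# UTF-8 bytes behind a declaration that claims ISO-8859-1.
utf8_body = "<name>München</name>".encode("utf-8")
mislabeled = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + utf8_body

# The parser trusts the declaration, so each UTF-8 byte becomes one
# Latin-1 character.
garbled = ET.fromstring(mislabeled).text

# Reversing the wrong decoding step recovers the original text.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
```

The two bytes of the UTF-8 sequence for ü (0xC3 0xBC) come out as the two Latin-1 characters Ã and ¼.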

Pitfall 3: Lone surrogates in JSON

{"emoji": "\uD83D"}

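A lone high surrogate without a matching low surrogate does not represent a valid Unicode character. Parser behavior varies: some reject the input, some substitute U+FFFD, and some (Python's json module with default settings) produce a string containing an unpaired surrogate, which causes problems downstream, for example when the string is later encoded as UTF-8.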
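The Python behavior mentioned above can be observed directly. In this sketch, `has_lone_surrogate` is a hypothetical helper; it works because Python strings store a properly paired escape as a single code point, so any surrogate code point left in a string is unpaired.

```python
import json

lone = json.loads('"\\uD83D"')   # accepted: a one-character string

# The problem surfaces later, when the string has to become bytes.
try:
    lone.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False


def has_lone_surrogate(text: str) -> bool:
    """True if text contains any code point in U+D800-U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


paired = json.loads('"\\uD83D\\uDE00"')
```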

Pitfall 4: Null bytes

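XML forbids U+0000 entirely, even as a character reference (&#0;). JSON forbids the raw character inside a string but permits the escape \u0000, although some consumers reject it anyway. If your data contains null bytes (e.g., from binary data incorrectly treated as text), XML serialization will fail or produce a document that does not parse, and JSON output may be rejected downstream.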
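A short check of how U+0000 is treated, sketched with Python's standard json and xml.etree.ElementTree modules; other parsers may behave differently.

```python
import json
import xml.etree.ElementTree as ET

# JSON: the serializer escapes U+0000, and the escape round-trips.
json_escaped = json.dumps("a\x00b")
json_round_trip = json.loads(json_escaped)

# JSON: a raw, unescaped U+0000 inside a string is rejected.
try:
    json.loads('"a\x00b"')
    raw_nul_accepted = True
except json.JSONDecodeError:
    raw_nul_accepted = False

# XML 1.0: even the character reference form is rejected.
try:
    ET.fromstring("<t>&#0;</t>")
    xml_nul_accepted = True
except ET.ParseError:
    xml_nul_accepted = False
```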

Best Practices

  1. Always use UTF-8: For both XML and JSON, UTF-8 is the universal standard.
  2. Validate encoding at boundaries: When receiving XML/JSON from external sources, validate that the declared encoding matches the actual byte encoding.
  3. Let libraries handle escaping: Never manually construct XML or JSON strings with string concatenation. Use proper serialization libraries.
  4. Normalize Unicode: Apply NFC normalization before serialization if you need consistent comparison (see the Unicode normalization guide).
  5. Test with diverse characters: Include CJK, emoji, RTL text, and combining characters in your test data.
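As an illustration of practice 4, the sketch below normalizes to NFC before serializing, using Python's unicodedata module. `dumps_nfc` is a hypothetical helper name, and whether to normalize at all depends on your application.

```python
import json
import unicodedata

composed = "M\u00fcnchen"      # ü as one code point
decomposed = "Mu\u0308nchen"   # u followed by a combining diaeresis

# The two strings look identical but compare unequal.
equal_raw = composed == decomposed


def dumps_nfc(value: str) -> str:
    """Serialize a string to JSON after NFC normalization."""
    normalized = unicodedata.normalize("NFC", value)
    return json.dumps(normalized, ensure_ascii=False)


equal_after_nfc = dumps_nfc(composed) == dumps_nfc(decomposed)
```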

Key Takeaways

  • XML supports character references (&#xhhhh;) for any Unicode code point directly, while JSON's \uXXXX escapes need a UTF-16 surrogate pair (\uD83D\uDE00) for characters above U+FFFF.
  • JSON must be UTF-8 per RFC 8259. XML defaults to UTF-8 but allows other encodings via the encoding declaration.
  • XML forbids most control characters; JSON requires them to be escaped but permits them.
  • Both formats handle the full range of Unicode when used correctly. The most common errors are encoding mismatches, double escaping, and lone surrogates.
  • Use proper serialization libraries — never hand-craft XML or JSON string escaping.
