📚 Unicode Fundamentals

Surrogate Pairs Explained

Surrogate pairs are a mechanism in UTF-16 that allows code points outside the BMP to be represented using two 16-bit code units. This guide explains how surrogate pairs work, why they exist, and the bugs they can cause in JavaScript and other languages.

Published 2021-06-14 · Updated 2024-05-28

Surrogate pairs are one of the most misunderstood corners of Unicode — and one of the most common sources of bugs in JavaScript, Java, C#, and any language that uses UTF-16 internally. If you have ever seen a mysterious \uD83D appear in a log file, watched .length return 2 for a single emoji, or truncated a string only to produce garbled output, you have already met surrogate pairs. This guide explains what they are, why they exist, exactly how the math works, and how to handle them correctly in real code.

Why Surrogate Pairs Exist

When Unicode was first designed in the late 1980s, the architects believed that 65,536 code points (a 16-bit space) would be enough to cover every character humanity would ever need. Early Unicode was a fixed-width 16-bit encoding, and the Basic Multilingual Plane (BMP) — code points U+0000 through U+FFFF — was the entire standard.

That assumption proved wrong. As more scripts, historical writing systems, musical notation, mathematical symbols, and eventually emoji were added, Unicode expanded to 1,114,112 code points (U+0000 through U+10FFFF), organized into 17 planes. The code points above U+FFFF are called supplementary characters and live on planes 1 through 16.

The problem: billions of lines of code, operating systems, and file formats (notably Windows, Java, JavaScript, and .NET) had already committed to 16-bit code units. A single 16-bit integer cannot represent a value larger than 65,535, so there was no room for supplementary characters.

The solution was surrogate pairs — a clever encoding trick that uses two 16-bit code units to represent one supplementary code point. A permanent block of 2,048 code points in the BMP (U+D800 through U+DFFF) was set aside exclusively for this mechanism. These code points are never assigned to characters and are illegal in UTF-8 and UTF-32.

The Surrogate Ranges

The 2,048 reserved surrogate code points are split into two halves:

Range	Name	Count	Purpose
U+D800 – U+DBFF	High surrogates (lead)	1,024	First code unit of the pair
U+DC00 – U+DFFF	Low surrogates (trail)	1,024	Second code unit of the pair

A valid surrogate pair always consists of one high surrogate followed immediately by one low surrogate. Any other combination — a high surrogate alone, a low surrogate alone, or a low followed by a high — is called a lone surrogate and is invalid.

The Encoding Math

Surrogate pairs encode the 20-bit value that distinguishes supplementary characters. Here is the step-by-step algorithm:

Encoding: Code Point to Surrogate Pair

Given a code point cp in the range U+10000 to U+10FFFF:

Subtract the BMP offset: offset = cp - 0x10000 This produces a 20-bit value (0x00000 through 0xFFFFF).
Split into two 10-bit halves:
High 10 bits: hi = (offset >> 10) & 0x3FF
Low 10 bits: lo = offset & 0x3FF
Add the surrogate base values:
High surrogate: lead = 0xD800 + hi
Low surrogate: trail = 0xDC00 + lo

Decoding: Surrogate Pair to Code Point

Given a high surrogate lead and low surrogate trail:

cp = (lead - 0xD800) * 0x400 + (trail - 0xDC00) + 0x10000

Worked Example: Grinning Face U+1F600

Code point:  0x1F600
Offset:      0x1F600 - 0x10000 = 0xF600
High 10 bits: 0xF600 >> 10 = 0x3D  → high surrogate = 0xD800 + 0x3D = 0xD83D
Low 10 bits:  0xF600 & 0x3FF = 0x200 → low surrogate = 0xDC00 + 0x200 = 0xDE00

Surrogate pair: U+D83D U+DE00

Verify the reverse:

(0xD83D - 0xD800) * 0x400 + (0xDE00 - 0xDC00) + 0x10000
= 0x3D * 0x400 + 0x200 + 0x10000
= 0xF400 + 0x200 + 0x10000
= 0x1F600  ✓

Common Supplementary Characters and Their Surrogates

Character	Code Point	High	Low	Name
😀	U+1F600	D83D	DE00	Grinning Face
🐍	U+1F40D	D83D	DC0D	Snake
𝕳	U+1D573	D835	DD73	Mathematical Fraktur Capital H
🎵	U+1F3B5	D83C	DFB5	Musical Note
𐐀	U+10400	D801	DC00	Deseret Capital Long I
🏴	U+1F3F4	D83C	DFF4	Black Flag
𝄞	U+1D11E	D834	DD1E	Musical Symbol G Clef

Where Surrogate Pairs Appear

Surrogate pairs are a UTF-16 mechanism only. They do not exist in UTF-8 or UTF-32:

Encoding	How it handles U+1F600
UTF-8	4 bytes: `F0 9F 98 80` — no surrogates involved
UTF-16	2 code units: `D83D DE00` — surrogate pair
UTF-32	1 code unit: `0001F600` — direct representation

Languages and runtimes that use UTF-16 internally include:

JavaScript — strings are sequences of UTF-16 code units
Java — char is 16-bit, String stores UTF-16
C#/.NET — char is 16-bit UTF-16
Windows API — wchar_t is 16-bit, all W functions use UTF-16
Objective-C/Swift — NSString/String use UTF-16 under the hood
Qt (C++) — QString stores UTF-16

Surrogate Pair Bugs in JavaScript

JavaScript is where surrogate pair bugs are most visible, because the language exposes UTF-16 code units directly through the string API.

Bug 1: `.length` Lies

const grin = "\u{1F600}";
console.log(grin);          // 😀
console.log(grin.length);   // 2  (two UTF-16 code units)

The .length property counts code units, not characters or code points. Every supplementary character adds 2 to the length.

Fix: Use the spread operator or Array.from to iterate by code point:

[...grin].length;              // 1
Array.from(grin).length;       // 1

Bug 2: Indexing Splits the Pair

const snake = "\u{1F40D}";
snake[0];                    // "\uD83D" — lone high surrogate (broken)
snake[1];                    // "\uDC0D" — lone low surrogate (broken)

Fix: Spread first, then index:

[...snake][0];               // "🐍" — correct
snake.codePointAt(0);        // 128013 (0x1F40D) — correct code point

Bug 3: String Truncation Produces Lone Surrogates

const msg = "Hello 😀 World";
const cut = msg.slice(0, 7);
// cut === "Hello \uD83D" — the high surrogate is orphaned

Fix: Use Intl.Segmenter or test for well-formedness:

// ES2024: check and fix lone surrogates
msg.slice(0, 7).isWellFormed();    // false
msg.slice(0, 7).toWellFormed();    // "Hello \uFFFD" — replaced with U+FFFD

Bug 4: Regex Without `/u` Flag

/^.$/.test("😀");     // false — regex sees two code units
/^.$/u.test("😀");    // true  — /u flag enables Unicode-aware matching

Bug 5: `charCodeAt` vs `codePointAt`

"😀".charCodeAt(0);      // 55357 (0xD83D) — high surrogate, wrong
"😀".codePointAt(0);     // 128512 (0x1F600) — correct code point

String.fromCharCode(0x1F600);     // gibberish — value overflows 16 bits
String.fromCodePoint(0x1F600);    // "😀" — correct

Surrogate Pairs in Java

Java's char type is 16 bits, so supplementary characters require two char values:

String emoji = "\uD83D\uDE00";    // 😀 via surrogate pair
String emoji2 = new String(Character.toChars(0x1F600));  // cleaner

emoji.length();                    // 2 (code units)
emoji.codePointCount(0, emoji.length());  // 1 (code points)

emoji.charAt(0);                   // '\uD83D' — high surrogate
emoji.codePointAt(0);              // 128512 (0x1F600) — correct

Best practices in Java:

Use codePointAt() instead of charAt() for character access
Use codePointCount() instead of length() for character counting
Use codePoints() stream for iteration:

"Hello 😀".codePoints().forEach(cp ->
    System.out.println(Character.toString(cp))
);

Surrogate Pairs in Python

Python 3 strings are sequences of code points, not code units. This means Python handles supplementary characters correctly by default:

emoji = "\U0001F600"
len(emoji)           # 1 — correct, Python counts code points
emoji[0]             # '😀' — no surrogate pair exposure
ord(emoji)           # 128512 — correct code point value

However, surrogates can still appear when working with external data:

# Encoding to UTF-16 exposes surrogates at the byte level
"😀".encode("utf-16-le")    # b'=\xd8\x00\xde' — D83D DE00 in little-endian

# The surrogatepass error handler allows encoding/decoding lone surrogates
b"\xed\xa0\x80".decode("utf-8", errors="surrogatepass")  # '\ud800'

Lone Surrogates: The Real Danger

A lone surrogate is a high surrogate without a following low surrogate, or a low surrogate without a preceding high surrogate. Lone surrogates are invalid in all Unicode encoding forms and cause real-world problems:

Problem	Effect
Database storage	PostgreSQL and MySQL in strict mode reject lone surrogates
JSON serialization	`JSON.stringify` may produce invalid JSON with `\uD800` escapes
File names	NTFS allows lone surrogates in file names, but other systems reject them
Network transmission	HTTP/2 and WebSocket frames must be well-formed UTF-8 — lone surrogates crash the connection
Security	Attackers use overlong encodings of surrogates to bypass input filters

Detecting Lone Surrogates

// JavaScript (ES2024)
"hello".isWellFormed();       // true
"\uD800".isWellFormed();      // false

// Regex approach (older environments)
function hasLoneSurrogate(str) {
    return /(?:[\uD800-\uDBFF](?![\uDC00-\uDFFF]))|(?:(?<![\uD800-\uDBFF])[\uDC00-\uDFFF])/.test(str);
}

# Python: detect surrogates in a string
import re
def has_surrogates(s: str) -> bool:
    return bool(re.search("[\ud800-\udfff]", s))

Summary Table

Concept	Value
Surrogate range	U+D800 – U+DFFF (2,048 code points)
High surrogates	U+D800 – U+DBFF (1,024)
Low surrogates	U+DC00 – U+DFFF (1,024)
Characters encodable	1,024 x 1,024 = 1,048,576 (planes 1–16)
Encoding affected	UTF-16 only
Not affected	UTF-8, UTF-32
Used by	JavaScript, Java, C#, Windows API, Qt
Not used by	Python 3 (code point strings), Rust, Go (UTF-8 native)

Best Practices

Use code-point-aware APIs: codePointAt() over charCodeAt() in JavaScript, codePointAt() over charAt() in Java.
Count correctly: Use [...str].length in JavaScript or codePointCount() in Java instead of .length.
Never truncate blindly: Check that your truncation point does not land between the two halves of a surrogate pair.
Validate input: Reject or replace lone surrogates at your system boundary — when reading files, accepting network input, or parsing JSON.
Prefer UTF-8 for storage and transport: UTF-8 does not use surrogates. Converting from UTF-16 to UTF-8 at the boundary eliminates the entire class of surrogate bugs.
Use the /u flag in JavaScript regex: Without it, . matches one code unit and character classes do not understand supplementary characters.
Test with emoji: The easiest way to catch surrogate-related bugs is to include emoji and other supplementary characters in your test data.