📚 Unicode Fundamentals

Surrogate Pairs Explained

Surrogate pairs are a mechanism in UTF-16 that allows code points outside the BMP to be represented using two 16-bit code units. This guide explains how surrogate pairs work, why they exist, and the bugs they can cause in JavaScript and other languages.

·

Surrogate pairs are one of the most misunderstood corners of Unicode — and one of the most common sources of bugs in JavaScript, Java, C#, and any language that uses UTF-16 internally. If you have ever seen a mysterious \uD83D appear in a log file, watched .length return 2 for a single emoji, or truncated a string only to produce garbled output, you have already met surrogate pairs. This guide explains what they are, why they exist, exactly how the math works, and how to handle them correctly in real code.

Why Surrogate Pairs Exist

When Unicode was first designed in the late 1980s, the architects believed that 65,536 code points (a 16-bit space) would be enough to cover every character humanity would ever need. Early Unicode was a fixed-width 16-bit encoding, and the Basic Multilingual Plane (BMP) — code points U+0000 through U+FFFF — was the entire standard.

That assumption proved wrong. As more scripts, historical writing systems, musical notation, mathematical symbols, and eventually emoji were added, Unicode expanded to 1,114,112 code points (U+0000 through U+10FFFF), organized into 17 planes. The code points above U+FFFF are called supplementary characters and live on planes 1 through 16.

The problem: billions of lines of code, operating systems, and file formats (notably Windows, Java, JavaScript, and .NET) had already committed to 16-bit code units. A single 16-bit integer cannot represent a value larger than 65,535, so there was no room for supplementary characters.

The solution was surrogate pairs — a clever encoding trick that uses two 16-bit code units to represent one supplementary code point. A permanent block of 2,048 code points in the BMP (U+D800 through U+DFFF) was set aside exclusively for this mechanism. These code points are never assigned to characters and are illegal in UTF-8 and UTF-32.

The Surrogate Ranges

The 2,048 reserved surrogate code points are split into two halves:

Range Name Count Purpose
U+D800 – U+DBFF High surrogates (lead) 1,024 First code unit of the pair
U+DC00 – U+DFFF Low surrogates (trail) 1,024 Second code unit of the pair

A valid surrogate pair always consists of one high surrogate followed immediately by one low surrogate. Any other combination — a high surrogate alone, a low surrogate alone, or a low followed by a high — is called a lone surrogate and is invalid.

The Encoding Math

Surrogate pairs encode the 20-bit value that distinguishes supplementary characters. Here is the step-by-step algorithm:

Encoding: Code Point to Surrogate Pair

Given a code point cp in the range U+10000 to U+10FFFF:

  1. Subtract the BMP offset: offset = cp - 0x10000 This produces a 20-bit value (0x00000 through 0xFFFFF).

  2. Split into two 10-bit halves:

  3. High 10 bits: hi = (offset >> 10) & 0x3FF
  4. Low 10 bits: lo = offset & 0x3FF

  5. Add the surrogate base values:

  6. High surrogate: lead = 0xD800 + hi
  7. Low surrogate: trail = 0xDC00 + lo

Decoding: Surrogate Pair to Code Point

Given a high surrogate lead and low surrogate trail:

cp = (lead - 0xD800) * 0x400 + (trail - 0xDC00) + 0x10000

Worked Example: Grinning Face U+1F600

Code point:  0x1F600
Offset:      0x1F600 - 0x10000 = 0xF600
High 10 bits: 0xF600 >> 10 = 0x3D  → high surrogate = 0xD800 + 0x3D = 0xD83D
Low 10 bits:  0xF600 & 0x3FF = 0x200 → low surrogate = 0xDC00 + 0x200 = 0xDE00

Surrogate pair: U+D83D U+DE00

Verify the reverse:

(0xD83D - 0xD800) * 0x400 + (0xDE00 - 0xDC00) + 0x10000
= 0x3D * 0x400 + 0x200 + 0x10000
= 0xF400 + 0x200 + 0x10000
= 0x1F600  ✓

Common Supplementary Characters and Their Surrogates

Character Code Point High Low Name
😀 U+1F600 D83D DE00 Grinning Face
🐍 U+1F40D D83D DC0D Snake
𝕳 U+1D573 D835 DD73 Mathematical Fraktur Capital H
🎵 U+1F3B5 D83C DFB5 Musical Note
𐐀 U+10400 D801 DC00 Deseret Capital Long I
🏴 U+1F3F4 D83C DFF4 Black Flag
𝄞 U+1D11E D834 DD1E Musical Symbol G Clef

Where Surrogate Pairs Appear

Surrogate pairs are a UTF-16 mechanism only. They do not exist in UTF-8 or UTF-32:

Encoding How it handles U+1F600
UTF-8 4 bytes: F0 9F 98 80 — no surrogates involved
UTF-16 2 code units: D83D DE00 — surrogate pair
UTF-32 1 code unit: 0001F600 — direct representation

Languages and runtimes that use UTF-16 internally include:

  • JavaScript — strings are sequences of UTF-16 code units
  • Javachar is 16-bit, String stores UTF-16
  • C#/.NETchar is 16-bit UTF-16
  • Windows APIwchar_t is 16-bit, all W functions use UTF-16
  • Objective-C/SwiftNSString/String use UTF-16 under the hood
  • Qt (C++)QString stores UTF-16

Surrogate Pair Bugs in JavaScript

JavaScript is where surrogate pair bugs are most visible, because the language exposes UTF-16 code units directly through the string API.

Bug 1: .length Lies

const grin = "\u{1F600}";
console.log(grin);          // 😀
console.log(grin.length);   // 2  (two UTF-16 code units)

The .length property counts code units, not characters or code points. Every supplementary character adds 2 to the length.

Fix: Use the spread operator or Array.from to iterate by code point:

[...grin].length;              // 1
Array.from(grin).length;       // 1

Bug 2: Indexing Splits the Pair

const snake = "\u{1F40D}";
snake[0];                    // "\uD83D" — lone high surrogate (broken)
snake[1];                    // "\uDC0D" — lone low surrogate (broken)

Fix: Spread first, then index:

[...snake][0];               // "🐍" — correct
snake.codePointAt(0);        // 128013 (0x1F40D) — correct code point

Bug 3: String Truncation Produces Lone Surrogates

const msg = "Hello 😀 World";
const cut = msg.slice(0, 7);
// cut === "Hello \uD83D" — the high surrogate is orphaned

Fix: Use Intl.Segmenter or test for well-formedness:

// ES2024: check and fix lone surrogates
msg.slice(0, 7).isWellFormed();    // false
msg.slice(0, 7).toWellFormed();    // "Hello \uFFFD" — replaced with U+FFFD

Bug 4: Regex Without /u Flag

/^.$/.test("😀");     // false — regex sees two code units
/^.$/u.test("😀");    // true  — /u flag enables Unicode-aware matching

Bug 5: charCodeAt vs codePointAt

"😀".charCodeAt(0);      // 55357 (0xD83D) — high surrogate, wrong
"😀".codePointAt(0);     // 128512 (0x1F600) — correct code point

String.fromCharCode(0x1F600);     // gibberish — value overflows 16 bits
String.fromCodePoint(0x1F600);    // "😀" — correct

Surrogate Pairs in Java

Java's char type is 16 bits, so supplementary characters require two char values:

String emoji = "\uD83D\uDE00";    // 😀 via surrogate pair
String emoji2 = new String(Character.toChars(0x1F600));  // cleaner

emoji.length();                    // 2 (code units)
emoji.codePointCount(0, emoji.length());  // 1 (code points)

emoji.charAt(0);                   // '\uD83D' — high surrogate
emoji.codePointAt(0);              // 128512 (0x1F600) — correct

Best practices in Java:

  • Use codePointAt() instead of charAt() for character access
  • Use codePointCount() instead of length() for character counting
  • Use codePoints() stream for iteration:
"Hello 😀".codePoints().forEach(cp ->
    System.out.println(Character.toString(cp))
);

Surrogate Pairs in Python

Python 3 strings are sequences of code points, not code units. This means Python handles supplementary characters correctly by default:

emoji = "\U0001F600"
len(emoji)           # 1 — correct, Python counts code points
emoji[0]             # '😀' — no surrogate pair exposure
ord(emoji)           # 128512 — correct code point value

However, surrogates can still appear when working with external data:

# Encoding to UTF-16 exposes surrogates at the byte level
"😀".encode("utf-16-le")    # b'=\xd8\x00\xde' — D83D DE00 in little-endian

# The surrogatepass error handler allows encoding/decoding lone surrogates
b"\xed\xa0\x80".decode("utf-8", errors="surrogatepass")  # '\ud800'

Lone Surrogates: The Real Danger

A lone surrogate is a high surrogate without a following low surrogate, or a low surrogate without a preceding high surrogate. Lone surrogates are invalid in all Unicode encoding forms and cause real-world problems:

Problem Effect
Database storage PostgreSQL and MySQL in strict mode reject lone surrogates
JSON serialization JSON.stringify may produce invalid JSON with \uD800 escapes
File names NTFS allows lone surrogates in file names, but other systems reject them
Network transmission HTTP/2 and WebSocket frames must be well-formed UTF-8 — lone surrogates crash the connection
Security Attackers use overlong encodings of surrogates to bypass input filters

Detecting Lone Surrogates

// JavaScript (ES2024)
"hello".isWellFormed();       // true
"\uD800".isWellFormed();      // false

// Regex approach (older environments)
function hasLoneSurrogate(str) {
    return /(?:[\uD800-\uDBFF](?![\uDC00-\uDFFF]))|(?:(?<![\uD800-\uDBFF])[\uDC00-\uDFFF])/.test(str);
}
# Python: detect surrogates in a string
import re
def has_surrogates(s: str) -> bool:
    return bool(re.search("[\ud800-\udfff]", s))

Summary Table

Concept Value
Surrogate range U+D800 – U+DFFF (2,048 code points)
High surrogates U+D800 – U+DBFF (1,024)
Low surrogates U+DC00 – U+DFFF (1,024)
Characters encodable 1,024 x 1,024 = 1,048,576 (planes 1–16)
Encoding affected UTF-16 only
Not affected UTF-8, UTF-32
Used by JavaScript, Java, C#, Windows API, Qt
Not used by Python 3 (code point strings), Rust, Go (UTF-8 native)

Best Practices

  1. Use code-point-aware APIs: codePointAt() over charCodeAt() in JavaScript, codePointAt() over charAt() in Java.

  2. Count correctly: Use [...str].length in JavaScript or codePointCount() in Java instead of .length.

  3. Never truncate blindly: Check that your truncation point does not land between the two halves of a surrogate pair.

  4. Validate input: Reject or replace lone surrogates at your system boundary — when reading files, accepting network input, or parsing JSON.

  5. Prefer UTF-8 for storage and transport: UTF-8 does not use surrogates. Converting from UTF-16 to UTF-8 at the boundary eliminates the entire class of surrogate bugs.

  6. Use the /u flag in JavaScript regex: Without it, . matches one code unit and character classes do not understand supplementary characters.

  7. Test with emoji: The easiest way to catch surrogate-related bugs is to include emoji and other supplementary characters in your test data.

Unicode Fundamentals のその他のガイド

What is Unicode? A Complete Guide

Unicode is the universal character encoding standard that assigns a unique number …

UTF-8 Encoding Explained

UTF-8 is the dominant character encoding on the web, capable of representing …

UTF-8 vs UTF-16 vs UTF-32: When to Use Each

UTF-8, UTF-16, and UTF-32 are three encodings of Unicode, each with different …

What is a Unicode Code Point?

A Unicode code point is the unique number assigned to each character …

Unicode Planes and the BMP

Unicode is divided into 17 planes, each containing up to 65,536 code …

Understanding Byte Order Mark (BOM)

The Byte Order Mark (BOM) is a special Unicode character used at …

ASCII to Unicode: The Evolution of Character Encoding

ASCII defined 128 characters for the English alphabet and was the foundation …

Unicode Normalization: NFC, NFD, NFKC, NFKD

The same visible character can be represented by multiple different byte sequences …

The Unicode Bidirectional Algorithm

The Unicode Bidirectional Algorithm (UBA) determines how text containing a mix of …

Unicode General Categories Explained

Every Unicode character belongs to a general category such as Letter, Number, …

Understanding Unicode Blocks

Unicode blocks are contiguous ranges of code points grouped by script or …

Unicode Scripts: How Writing Systems are Organized

Unicode assigns every character to a script property that identifies the writing …

What are Combining Characters?

Combining characters are Unicode code points that attach to a preceding base …

Grapheme Clusters vs Code Points

A single visible character on screen — called a grapheme cluster — …

Unicode Confusables: A Security Guide

Unicode confusables are characters that look identical or nearly identical to others, …

Zero Width Characters: What They Are and Why They Matter

Zero-width characters are invisible Unicode code points that affect text layout, joining, …

Unicode Whitespace Characters Guide

Unicode defines over two dozen whitespace characters beyond the ordinary space, including …

History of Unicode

Unicode began in 1987 as a collaboration between engineers at Apple and …

Unicode Versions Timeline

Unicode has released major versions regularly since 1.0 in 1991, with each …