Surrogate Pairs Explained
Surrogate pairs are a UTF-16 mechanism that represents each code point outside the Basic Multilingual Plane (BMP) as two 16-bit code units.
Surrogate pairs are one of the most misunderstood corners of Unicode — and one of the most
common sources of bugs in JavaScript, Java, C#, and any language that uses UTF-16 internally.
If you have ever seen a mysterious \uD83D appear in a log file, watched .length return 2
for a single emoji, or truncated a string only to produce garbled output, you have already met
surrogate pairs. This guide explains what they are, why they exist, exactly how the math works,
and how to handle them correctly in real code.
Why Surrogate Pairs Exist
When Unicode was first designed in the late 1980s, the architects believed that 65,536 code points (a 16-bit space) would be enough to cover every character humanity would ever need. Early Unicode was a fixed-width 16-bit encoding, and the Basic Multilingual Plane (BMP) — code points U+0000 through U+FFFF — was the entire standard.
That assumption proved wrong. As more scripts, historical writing systems, musical notation, mathematical symbols, and eventually emoji were added, Unicode expanded to 1,114,112 code points (U+0000 through U+10FFFF), organized into 17 planes. The code points above U+FFFF are called supplementary characters and live on planes 1 through 16.
The problem: billions of lines of code, operating systems, and file formats (notably Windows, Java, JavaScript, and .NET) had already committed to 16-bit code units. A single 16-bit integer cannot represent a value larger than 65,535, so there was no room for supplementary characters.
The solution was surrogate pairs — a clever encoding trick that uses two 16-bit code units to represent one supplementary code point. A permanent block of 2,048 code points in the BMP (U+D800 through U+DFFF) was set aside exclusively for this mechanism. These code points are never assigned to characters and are illegal in UTF-8 and UTF-32.
The Surrogate Ranges
The 2,048 reserved surrogate code points are split into two halves:
| Range | Name | Count | Purpose |
|---|---|---|---|
| U+D800 – U+DBFF | High surrogates (lead) | 1,024 | First code unit of the pair |
| U+DC00 – U+DFFF | Low surrogates (trail) | 1,024 | Second code unit of the pair |
A valid surrogate pair always consists of one high surrogate followed immediately by one low surrogate. Any other combination — a high surrogate alone, a low surrogate alone, or a low followed by a high — is called a lone surrogate and is invalid.
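The two ranges translate directly into range checks on a code unit. The following is a minimal sketch; the helper names isHighSurrogate, isLowSurrogate, and isSurrogatePair are illustrative and not part of any standard API.

```javascript
// Classify a single UTF-16 code unit (a number in 0..0xFFFF).
function isHighSurrogate(unit) {
  return unit >= 0xD800 && unit <= 0xDBFF;
}

function isLowSurrogate(unit) {
  return unit >= 0xDC00 && unit <= 0xDFFF;
}

// A valid pair is a high surrogate immediately followed by a low surrogate.
function isSurrogatePair(first, second) {
  return isHighSurrogate(first) && isLowSurrogate(second);
}

const grinningFace = "\u{1F600}";
console.log(isSurrogatePair(grinningFace.charCodeAt(0), grinningFace.charCodeAt(1))); // true
console.log(isSurrogatePair(0xDE00, 0xD83D)); // false: low followed by high
```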
The Encoding Math
Surrogate pairs encode the 20-bit value that distinguishes supplementary characters. Here is the step-by-step algorithm:
Encoding: Code Point to Surrogate Pair
Given a code point cp in the range U+10000 to U+10FFFF:
- Subtract the BMP offset: offset = cp - 0x10000. This produces a 20-bit value (0x00000 through 0xFFFFF).
- Split into two 10-bit halves:
  - High 10 bits: hi = (offset >> 10) & 0x3FF
  - Low 10 bits: lo = offset & 0x3FF
- Add the surrogate base values:
  - High surrogate: lead = 0xD800 + hi
  - Low surrogate: trail = 0xDC00 + lo
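The three steps above can be sketched as a small function. The name toSurrogatePair is hypothetical. In real code String.fromCodePoint already performs this split, so the sketch is only meant to make the arithmetic concrete.

```javascript
// Hypothetical helper: split a supplementary code point into [lead, trail].
function toSurrogatePair(cp) {
  if (!Number.isInteger(cp) || cp < 0x10000 || cp > 0x10FFFF) {
    throw new RangeError("expected a code point in U+10000..U+10FFFF");
  }
  const offset = cp - 0x10000;        // step 1: 20-bit value
  const hi = (offset >> 10) & 0x3FF;  // step 2: top 10 bits
  const lo = offset & 0x3FF;          //         bottom 10 bits
  return [0xD800 + hi, 0xDC00 + lo];  // step 3: add the surrogate bases
}

const [snakeLead, snakeTrail] = toSurrogatePair(0x1F40D); // SNAKE
console.log(snakeLead.toString(16), snakeTrail.toString(16)); // d83d dc0d
```

The result can be cross-checked against the engine: String.fromCodePoint(0x1F40D).charCodeAt(0) should equal snakeLead.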
Decoding: Surrogate Pair to Code Point
Given a high surrogate lead and low surrogate trail:
cp = (lead - 0xD800) * 0x400 + (trail - 0xDC00) + 0x10000
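As a rough sketch, the decoding formula maps onto one line of arithmetic plus a range check. The function name fromSurrogatePair is made up for illustration; codePointAt does the same work on a real string.

```javascript
// Hypothetical helper: combine a high and a low surrogate into a code point.
function fromSurrogatePair(lead, trail) {
  const leadOk = lead >= 0xD800 && lead <= 0xDBFF;
  const trailOk = trail >= 0xDC00 && trail <= 0xDFFF;
  if (!leadOk || !trailOk) {
    throw new RangeError("not a valid surrogate pair");
  }
  return (lead - 0xD800) * 0x400 + (trail - 0xDC00) + 0x10000;
}

console.log(fromSurrogatePair(0xD834, 0xDD1E).toString(16)); // 1d11e (G clef)
```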
Worked Example: Grinning Face U+1F600
Code point: 0x1F600
Offset: 0x1F600 - 0x10000 = 0xF600
High 10 bits: 0xF600 >> 10 = 0x3D → high surrogate = 0xD800 + 0x3D = 0xD83D
Low 10 bits: 0xF600 & 0x3FF = 0x200 → low surrogate = 0xDC00 + 0x200 = 0xDE00
Surrogate pair: U+D83D U+DE00
Verify the reverse:
(0xD83D - 0xD800) * 0x400 + (0xDE00 - 0xDC00) + 0x10000
= 0x3D * 0x400 + 0x200 + 0x10000
= 0xF400 + 0x200 + 0x10000
= 0x1F600 ✓
Common Supplementary Characters and Their Surrogates
| Character | Code Point | High | Low | Name |
|---|---|---|---|---|
| 😀 | U+1F600 | D83D | DE00 | Grinning Face |
| 🐍 | U+1F40D | D83D | DC0D | Snake |
| 𝕳 | U+1D573 | D835 | DD73 | Mathematical Bold Fraktur Capital H |
| 🎵 | U+1F3B5 | D83C | DFB5 | Musical Note |
| 𐐀 | U+10400 | D801 | DC00 | Deseret Capital Letter Long I |
| 🏴 | U+1F3F4 | D83C | DFF4 | Black Flag |
| 𝄞 | U+1D11E | D834 | DD1E | Musical Symbol G Clef |
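Rows like these can be regenerated from the string itself, which is a handy way to check a table by machine. A small sketch, where surrogateRow is an illustrative name:

```javascript
// Build one table row (code point, high, low) for a supplementary character.
function surrogateRow(ch) {
  const hex = (n) => n.toString(16).toUpperCase();
  return {
    codePoint: "U+" + hex(ch.codePointAt(0)),
    high: hex(ch.charCodeAt(0)), // first UTF-16 code unit
    low: hex(ch.charCodeAt(1)),  // second UTF-16 code unit
  };
}

for (const cp of [0x1F600, 0x1F40D, 0x1D11E]) {
  const row = surrogateRow(String.fromCodePoint(cp));
  console.log(row.codePoint, row.high, row.low);
}
// First line printed: U+1F600 D83D DE00
```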
Where Surrogate Pairs Appear
Surrogate pairs are a UTF-16 mechanism only. They do not exist in UTF-8 or UTF-32:
| Encoding | How it handles U+1F600 |
|---|---|
| UTF-8 | 4 bytes: F0 9F 98 80 — no surrogates involved |
| UTF-16 | 2 code units: D83D DE00 — surrogate pair |
| UTF-32 | 1 code unit: 0001F600 — direct representation |
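The comparison can be reproduced in JavaScript. This sketch assumes a runtime that provides the standard TextEncoder global (browsers and current Node.js do); describeEncodings is a hypothetical name.

```javascript
// Show how one string looks in each encoding form, as uppercase hex.
function describeEncodings(text) {
  const hex = (n, width) => n.toString(16).toUpperCase().padStart(width, "0");

  // UTF-8: TextEncoder always produces UTF-8 bytes.
  const utf8 = Array.from(new TextEncoder().encode(text), (b) => hex(b, 2));

  // UTF-16: index by code unit, which is what charCodeAt exposes.
  const utf16 = [];
  for (let i = 0; i < text.length; i++) {
    utf16.push(hex(text.charCodeAt(i), 4));
  }

  // UTF-32: string iteration yields one element per code point.
  const utf32 = Array.from(text, (c) => hex(c.codePointAt(0), 8));

  return { utf8, utf16, utf32 };
}

const forms = describeEncodings("\u{1F600}");
console.log(forms.utf8.join(" "));  // F0 9F 98 80
console.log(forms.utf16.join(" ")); // D83D DE00
console.log(forms.utf32.join(" ")); // 0001F600
```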
Languages and runtimes built around UTF-16 code units include:
- JavaScript: strings are sequences of UTF-16 code units
- Java: char is 16-bit, and the String API is indexed by UTF-16 code units
- C#/.NET: char is a 16-bit UTF-16 code unit
- Windows API: wchar_t is 16-bit, and all W functions use UTF-16
- Objective-C: NSString is indexed by UTF-16 code units (Swift's native String has used UTF-8 storage since Swift 5, but it still offers a utf16 view and bridges to NSString)
- Qt (C++): QString stores UTF-16
Surrogate Pair Bugs in JavaScript
JavaScript is where surrogate pair bugs are most visible, because the language exposes UTF-16 code units directly through the string API.
Bug 1: .length Lies
const grin = "\u{1F600}";
console.log(grin); // 😀
console.log(grin.length); // 2 (two UTF-16 code units)
The .length property counts code units, not characters or code points. Every supplementary
character adds 2 to the length.
Fix: Use the spread operator or Array.from to iterate by code point:
[...grin].length; // 1
Array.from(grin).length; // 1
Bug 2: Indexing Splits the Pair
const snake = "\u{1F40D}";
snake[0]; // "\uD83D" — lone high surrogate (broken)
snake[1]; // "\uDC0D" — lone low surrogate (broken)
Fix: Spread first, then index:
[...snake][0]; // "🐍" — correct
snake.codePointAt(0); // 128013 (0x1F40D) — correct code point
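Spreading copies the whole string into an array just to read one element. For long strings, a loop over the string iterator can stop early. This is a sketch only, and codePointAtIndex is an illustrative name:

```javascript
// Return the n-th code point (0-based) of str as a string, or undefined.
function codePointAtIndex(str, n) {
  let i = 0;
  for (const ch of str) { // string iteration yields whole code points
    if (i === n) return ch;
    i++;
  }
  return undefined;
}

const mixed = "a\u{1F40D}b";
console.log(codePointAtIndex(mixed, 1)); // 🐍
console.log(codePointAtIndex(mixed, 2)); // b
console.log(mixed[2] === "b");           // false: index 2 is the low surrogate
```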
Bug 3: String Truncation Produces Lone Surrogates
const msg = "Hello 😀 World";
const cut = msg.slice(0, 7);
// cut === "Hello \uD83D" — the high surrogate is orphaned
Fix: Use Intl.Segmenter or test for well-formedness:
// ES2024: check and fix lone surrogates
msg.slice(0, 7).isWellFormed(); // false
msg.slice(0, 7).toWellFormed(); // "Hello \uFFFD" — replaced with U+FFFD
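Checking after the fact only detects the damage. To avoid creating the lone surrogate in the first place, the cut point can be moved back by one code unit when it falls inside a pair. A minimal sketch, with truncateSafely as a hypothetical helper name:

```javascript
// Truncate to at most maxUnits UTF-16 code units without splitting a pair.
function truncateSafely(str, maxUnits) {
  if (maxUnits >= str.length) return str;
  if (maxUnits <= 0) return "";

  const last = str.charCodeAt(maxUnits - 1); // last unit that would be kept
  const next = str.charCodeAt(maxUnits);     // first unit that would be dropped
  const splitsPair =
    last >= 0xD800 && last <= 0xDBFF &&
    next >= 0xDC00 && next <= 0xDFFF;

  // If the cut lands between the two halves, drop the high surrogate too.
  return str.slice(0, splitsPair ? maxUnits - 1 : maxUnits);
}

const greeting = "Hello \u{1F600} World";
console.log(JSON.stringify(truncateSafely(greeting, 7))); // "Hello "
console.log(truncateSafely(greeting, 8));                 // Hello 😀
```

This protects surrogate pairs only. A cut can still fall inside a multi-code-point grapheme cluster, which is the case Intl.Segmenter is meant for.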
Bug 4: Regex Without /u Flag
/^.$/.test("😀"); // false — regex sees two code units
/^.$/u.test("😀"); // true — /u flag enables Unicode-aware matching
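The same difference shows up in character classes and in match counts. The following sketch uses only standard RegExp syntax; the constant names are illustrative.

```javascript
const grinFace = "\u{1F600}";

// Without /u, a class holding the two surrogates matches ONE code unit,
// so it cannot match the whole two-unit string.
const classWithoutU = /^[\uD83D\uDE00]$/.test(grinFace);
// With /u, \u{...} denotes a whole code point.
const classWithU = /^[\u{1F600}]$/u.test(grinFace);
console.log(classWithoutU, classWithU); // false true

// Ranges over supplementary code points need /u as well.
const emoticons = /[\u{1F600}-\u{1F64F}]/u;
console.log(emoticons.test("ok \u{1F642}")); // true

// Counting matches of . shows code units versus code points.
const unitCount = "a\u{1F600}b".match(/./g).length;
const pointCount = "a\u{1F600}b".match(/./gu).length;
console.log(unitCount, pointCount); // 4 3
```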
Bug 5: charCodeAt vs codePointAt
"😀".charCodeAt(0); // 55357 (0xD83D) — high surrogate, wrong
"😀".codePointAt(0); // 128512 (0x1F600) — correct code point
String.fromCharCode(0x1F600); // "\uF600": wrong, the value is truncated to 16 bits
String.fromCodePoint(0x1F600); // "😀" — correct
Surrogate Pairs in Java
Java's char type is 16 bits, so supplementary characters require two char values:
String emoji = "\uD83D\uDE00"; // 😀 via surrogate pair
String emoji2 = new String(Character.toChars(0x1F600)); // cleaner
emoji.length(); // 2 (code units)
emoji.codePointCount(0, emoji.length()); // 1 (code points)
emoji.charAt(0); // '\uD83D' — high surrogate
emoji.codePointAt(0); // 128512 (0x1F600) — correct
Best practices in Java:
- Use codePointAt() instead of charAt() for character access
- Use codePointCount() instead of length() for character counting
- Use codePoints() stream for iteration:
"Hello 😀".codePoints().forEach(cp ->
System.out.println(Character.toString(cp))
);
Surrogate Pairs in Python
Python 3 strings (since Python 3.3) are sequences of code points, not code units. This means Python handles supplementary characters correctly by default:
emoji = "\U0001F600"
len(emoji) # 1 — correct, Python counts code points
emoji[0] # '😀' — no surrogate pair exposure
ord(emoji) # 128512 — correct code point value
However, surrogates can still appear when working with external data:
# Encoding to UTF-16 exposes surrogates at the byte level
"😀".encode("utf-16-le") # b'=\xd8\x00\xde' — D83D DE00 in little-endian
# The surrogatepass error handler allows encoding/decoding lone surrogates
b"\xed\xa0\x80".decode("utf-8", errors="surrogatepass") # '\ud800'
Lone Surrogates: The Real Danger
A lone surrogate is a high surrogate without a following low surrogate, or a low surrogate without a preceding high surrogate. Lone surrogates are invalid in all Unicode encoding forms and cause real-world problems:
| Problem | Effect |
|---|---|
| Database storage | PostgreSQL and MySQL in strict mode reject lone surrogates |
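| JSON serialization | Since ES2019, JSON.stringify emits lone surrogates as \uD800-style escapes; the output parses, but RFC 8259 warns that receivers handle such strings unpredictably |
| File names | NTFS allows lone surrogates in file names, but other systems reject them |
| Network transmission | WebSocket text frames must be valid UTF-8; an endpoint that receives invalid UTF-8 must fail the connection |
| Security | Ill-formed sequences, such as surrogates encoded directly as UTF-8 bytes, may be interpreted differently by an input filter and the code behind it, which lets input slip past validation |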
Detecting Lone Surrogates
// JavaScript (ES2024)
"hello".isWellFormed(); // true
"\uD800".isWellFormed(); // false
// Regex approach (older environments)
function hasLoneSurrogate(str) {
  return /(?:[\uD800-\uDBFF](?![\uDC00-\uDFFF]))|(?:(?<![\uD800-\uDBFF])[\uDC00-\uDFFF])/.test(str);
}
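The regex above relies on lookbehind, which some older engines lack, and it only answers yes or no. As an alternative sketch, a plain loop reports where each lone surrogate sits. The function name findLoneSurrogates is an assumption for illustration.

```javascript
// Return the indices of all lone surrogates in str (empty if well-formed).
function findLoneSurrogates(str) {
  const indices = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      const next = str.charCodeAt(i + 1); // NaN past the end, fails both tests
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // valid pair: skip its low surrogate
      } else {
        indices.push(i); // high surrogate with no low after it
      }
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      indices.push(i); // low surrogate with no high before it
    }
  }
  return indices;
}

console.log(findLoneSurrogates("hello \u{1F600}"));      // []
console.log(findLoneSurrogates("ab\uD83D"));             // [ 2 ]
console.log(findLoneSurrogates("\uDE00\uD83D\uDE00"));   // [ 0 ]
```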
# Python: detect surrogates in a string
import re
def has_surrogates(s: str) -> bool:
    return bool(re.search("[\ud800-\udfff]", s))
Summary Table
| Concept | Value |
|---|---|
| Surrogate range | U+D800 – U+DFFF (2,048 code points) |
| High surrogates | U+D800 – U+DBFF (1,024) |
| Low surrogates | U+DC00 – U+DFFF (1,024) |
| Characters encodable | 1,024 x 1,024 = 1,048,576 (planes 1–16) |
| Encoding affected | UTF-16 only |
| Not affected | UTF-8, UTF-32 |
| Used by | JavaScript, Java, C#, Windows API, Qt |
| Not used by | Python 3 (code point strings), Rust, Go (UTF-8 native) |
Best Practices
- Use code-point-aware APIs: codePointAt() over charCodeAt() in JavaScript, codePointAt() over charAt() in Java.
- Count correctly: Use [...str].length in JavaScript or codePointCount() in Java instead of .length.
- Never truncate blindly: Check that your truncation point does not land between the two halves of a surrogate pair.
- Validate input: Reject or replace lone surrogates at your system boundary, whether you are reading files, accepting network input, or parsing JSON.
- Prefer UTF-8 for storage and transport: UTF-8 does not use surrogates. Converting from UTF-16 to UTF-8 at the boundary eliminates the entire class of surrogate bugs.
- Use the /u flag in JavaScript regex: Without it, . matches one code unit and character classes do not understand supplementary characters.
- Test with emoji: The easiest way to catch surrogate-related bugs is to include emoji and other supplementary characters in your test data.
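The validation and UTF-8 conversion practices can be combined into a small boundary helper. This is a sketch under assumptions: replaceLoneSurrogates and sanitizeText are hypothetical names, replacement with U+FFFD is one policy among several (rejecting the input is another), and TextEncoder is assumed to be available as a global.

```javascript
// Fallback for runtimes without String.prototype.toWellFormed:
// keep valid pairs, swap every other surrogate for U+FFFD.
function replaceLoneSurrogates(str) {
  return str.replace(
    /[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDFFF]/g,
    (match) => (match.length === 2 ? match : "\uFFFD")
  );
}

// Return a well-formed copy of the input, using the built-in when present.
function sanitizeText(input) {
  return typeof input.toWellFormed === "function"
    ? input.toWellFormed()
    : replaceLoneSurrogates(input);
}

// Convert to UTF-8 bytes only after sanitizing.
const utf8Bytes = new TextEncoder().encode(sanitizeText("Hi \uD83D"));
console.log(Array.from(utf8Bytes)); // [ 72, 105, 32, 239, 191, 189 ]
```

The last three bytes (EF BF BD) are the UTF-8 encoding of U+FFFD, the replacement character.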