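Surrogate

Code points U+D800–U+DFFF, reserved for UTF-16 surrogate pairs. They are not valid Unicode scalar values and should not appear as standalone characters.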

2021-08-30 · Updated 2024-08-28

What is a Surrogate?

Surrogates are 2,048 code points in the range U+D800–U+DFFF that UTF-16 uses as a mechanism to encode supplementary characters (those above U+FFFF). They are permanently reserved for this purpose and do not represent any characters themselves. The Unicode Standard categorizes surrogates with General Category Cs (Surrogate).

There are two types:

- High surrogates (U+D800–U+DBFF): 1,024 code points, the first half of a pair
- Low surrogates (U+DC00–U+DFFF): 1,024 code points, the second half of a pair
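As a minimal sketch of the two ranges, the check below classifies a code point by comparing it against the boundaries listed above. The function name `surrogate_kind` is a hypothetical helper introduced here for illustration, not part of any library.

```python
def surrogate_kind(code_point: int) -> str:
    """Classify a code point as 'high', 'low', or 'none' (not a surrogate)."""
    if 0xD800 <= code_point <= 0xDBFF:
        return "high"
    if 0xDC00 <= code_point <= 0xDFFF:
        return "low"
    return "none"


print(surrogate_kind(0xD83D))  # high
print(surrogate_kind(0xDE00))  # low
print(surrogate_kind(0x0041))  # none
```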

How Surrogate Pairs Work

A supplementary character (code point U+10000–U+10FFFF) cannot fit in a single UTF-16 code unit (16 bits). UTF-16 encodes it as two consecutive 16-bit code units — a surrogate pair — using the following algorithm:

Encoding U+1F600 (😀 GRINNING FACE):

1. Subtract 0x10000:          0x1F600 - 0x10000 = 0x0F600
2. Convert to 20 bits:        0000 1111 0110 0000 0000
3. High 10 bits (bits 19-10): 0000 1111 01 = 0x3D
4. Low 10 bits  (bits  9- 0): 10 0000 0000 = 0x200
5. High surrogate:            0xD800 + 0x3D  = 0xD83D
6. Low surrogate:             0xDC00 + 0x200 = 0xDE00

UTF-16: D83D DE00

Decoding:
  H = 0xD83D - 0xD800 = 0x3D   (high 10 bits)
  L = 0xDE00 - 0xDC00 = 0x200  (low 10 bits)
  Code point = ((H << 10) | L) + 0x10000
             = (0x3D << 10 | 0x200) + 0x10000
             = 0xF600 + 0x10000 = 0x1F600
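
The arithmetic above can be sketched in Python as follows. The function names `encode_surrogate_pair` and `decode_surrogate_pair` are hypothetical helpers chosen for this illustration. In practice the built-in `utf-16` codecs do this work.

```python
def encode_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return (((high - 0xD800) << 10) | (low - 0xDC00)) + 0x10000


high, low = encode_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")                # D83D DE00
print(hex(decode_surrogate_pair(high, low)))  # 0x1f600
```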

Why This Range Was Chosen

The range U+D800–U+DFFF had no characters assigned in the early versions of Unicode, so it could be reserved for the surrogate mechanism when UTF-16 was designed without displacing any existing characters. This is why the surrogate range contains no characters.

Surrogates in Programming Languages

JavaScript/ECMAScript: JavaScript strings are sequences of UTF-16 code units. Surrogate pairs are common for emoji:

const emoji = "😀";
console.log(emoji.length);         // 2 (two UTF-16 code units)
console.log(emoji.charCodeAt(0));  // 55357 = 0xD83D (high surrogate)
console.log(emoji.charCodeAt(1));  // 56832 = 0xDE00 (low surrogate)
console.log(emoji.codePointAt(0)); // 128512 = 0x1F600 (correct code point)

// Safe iteration (code points, not code units)
for (const char of emoji) {
  console.log(char.codePointAt(0)); // 128512
}

Python: Python 3 strings are sequences of Unicode code points, not UTF-16 code units. You rarely see raw surrogates, but they can appear when interoperating with Windows APIs:

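# Lone surrogates are ill-formed, so strict decoding rejects them;
# the "surrogatepass" error handler lets them through.
b = b'\xd8\x3d'  # A lone high surrogate (0xD83D) in UTF-16BE
try:
    b.decode("utf-16-be")
except UnicodeDecodeError as exc:
    print("strict decoding failed:", exc.reason)

s = b.decode("utf-16-be", errors="surrogatepass")
print(repr(s))  # '\ud83d' (lone surrogate code point)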
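
Because Python counts code points while JavaScript counts UTF-16 code units, the two languages report different lengths for the same emoji. The sketch below estimates the UTF-16 length of a Python string. `utf16_length` is a hypothetical helper name, and it assumes that each code point above U+FFFF takes two code units and every other code point takes one.

```python
def utf16_length(text: str) -> int:
    """Count the UTF-16 code units needed to represent text."""
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


emoji = "\U0001F600"
print(len(emoji))           # 1 (one code point)
print(utf16_length(emoji))  # 2 (two UTF-16 code units)
```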

Java: Java uses UTF-16 internally; String.charAt() returns code units including surrogates:

String emoji = "\uD83D\uDE00";  // 😀 as surrogate pair
char high = emoji.charAt(0);    // 0xD83D (high surrogate)
char low  = emoji.charAt(1);    // 0xDE00 (low surrogate)
int cp = emoji.codePointAt(0);  // 128512 (correct)

Lone Surrogates

A lone surrogate is a high or low surrogate code unit that is not paired with its complement. Lone surrogates produce ill-formed UTF-16 sequences and are problematic:

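- They cannot be losslessly converted to well-formed UTF-8
- They usually indicate data corruption or a programming error, such as splitting a string in the middle of a surrogate pair
- WTF-8, a superset of UTF-8 defined outside the Unicode Standard, was proposed to represent them in legacy scenarios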

import unicodedata
print(unicodedata.category("\uD800"))  # Cs — surrogate (lone)
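
One way to spot the problem is to scan a sequence of UTF-16 code units and report every surrogate that lacks its partner. The sketch below assumes the input is a list of integers, one per code unit. `find_lone_surrogates` is a hypothetical helper written for this illustration. The final lines show that Python's strict UTF-8 encoder refuses a lone surrogate.

```python
def find_lone_surrogates(units: list[int]) -> list[int]:
    """Return the indexes of unpaired surrogates in a list of UTF-16 code units."""
    lone = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed immediately by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # well-formed pair, skip both units
                continue
            lone.append(i)
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate that was not consumed as part of a pair.
            lone.append(i)
        i += 1
    return lone


print(find_lone_surrogates([0x0041, 0xD83D, 0xDE00]))  # []
print(find_lone_surrogates([0xD83D, 0x0041, 0xDE00]))  # [0, 2]

try:
    "\ud83d".encode("utf-8")
except UnicodeEncodeError:
    print("lone surrogate cannot be encoded as UTF-8")
```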

Quick Facts

Property	Value
High surrogate range	U+D800–U+DBFF (1,024 code points)
Low surrogate range	U+DC00–U+DFFF (1,024 code points)
Total surrogates	2,048
General category	Cs
Purpose	Encode supplementary characters in UTF-16
Valid in UTF-8?	No — ill-formed
Valid in UTF-32?	Technically no (but often tolerated)
Languages and platforms using UTF-16 internally	JavaScript, Java, C#, Windows APIs