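Surrogate
Code points in the range U+D800–U+DFFF, reserved exclusively for UTF-16 surrogate pairs. They are not valid Unicode scalar values and must not appear as standalone characters.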
What is a Surrogate?
Surrogates are 2,048 code points in the range U+D800–U+DFFF that UTF-16 uses as a mechanism to encode supplementary characters (those above U+FFFF). They are permanently reserved for this purpose and do not represent any characters themselves. The Unicode Standard categorizes surrogates with General Category Cs (Surrogate).
There are two types:
- High surrogates (U+D800–U+DBFF): 1,024 code points, the first half of a pair
- Low surrogates (U+DC00–U+DFFF): 1,024 code points, the second half of a pair
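The two ranges can be checked with simple comparisons. The following is a minimal sketch; the function names `is_high_surrogate`, `is_low_surrogate`, and `is_surrogate` are illustrative and not taken from any library.

```python
# Range boundaries as given in the text above.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF


def is_high_surrogate(cp: int) -> bool:
    """True if cp is in the high (leading) surrogate range."""
    return HIGH_START <= cp <= HIGH_END


def is_low_surrogate(cp: int) -> bool:
    """True if cp is in the low (trailing) surrogate range."""
    return LOW_START <= cp <= LOW_END


def is_surrogate(cp: int) -> bool:
    """True if cp is any surrogate code point."""
    return is_high_surrogate(cp) or is_low_surrogate(cp)


print(is_high_surrogate(0xD83D))  # True
print(is_low_surrogate(0xDE00))   # True
print(is_surrogate(0x0041))       # False
```

Counting the code points for which `is_surrogate` returns true over the whole code space should give 2,048, matching the total stated above.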
How Surrogate Pairs Work
A supplementary character (code point U+10000–U+10FFFF) cannot fit in a single UTF-16 code unit (16 bits). UTF-16 encodes it as two consecutive 16-bit code units — a surrogate pair — using the following algorithm:
Encoding U+1F600 (😀 GRINNING FACE):
1. Subtract 0x10000: 0x1F600 - 0x10000 = 0x0F600
2. Write the result as 20 bits: 0000 1111 0110 0000 0000
3. High 10 bits (bits 19-10): 0000 1111 01 = 0x3D
4. Low 10 bits (bits 9-0): 10 0000 0000 = 0x200
5. High surrogate: 0xD800 + 0x3D = 0xD83D
6. Low surrogate: 0xDC00 + 0x200 = 0xDE00
7. UTF-16: D83D DE00
Decoding:
H = 0xD83D - 0xD800 = 0x3D (high 10 bits)
L = 0xDE00 - 0xDC00 = 0x200 (low 10 bits)
Code point = ((H << 10) | L) + 0x10000
= (0x3D << 10 | 0x200) + 0x10000
= 0xF600 + 0x10000 = 0x1F600
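The encoding and decoding arithmetic can be written as two small functions. This is a sketch of the calculation described in this section, not a replacement for a real codec; the names `encode_surrogate_pair` and `decode_surrogate_pair` are hypothetical.

```python
def encode_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = cp - 0x10000                # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a well-formed surrogate pair")
    return (((high - 0xD800) << 10) | (low - 0xDC00)) + 0x10000


high, low = encode_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")                # D83D DE00
print(hex(decode_surrogate_pair(high, low)))  # 0x1f600
```

The result can be cross-checked against Python's built-in codec: `"\U0001F600".encode("utf-16-be").hex()` should give the same four bytes.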
Why This Range Was Chosen
The range U+D800–U+DFFF had no characters assigned in the early versions of Unicode, which made it safe to reserve for the surrogate mechanism when UTF-16 was introduced. This is why the surrogate range contains no characters.
Surrogates in Programming Languages
JavaScript/ECMAScript: JavaScript strings are sequences of UTF-16 code units. Surrogate pairs are common for emoji:
const emoji = "😀";
console.log(emoji.length); // 2 (two UTF-16 code units)
console.log(emoji.charCodeAt(0)); // 55357 = 0xD83D (high surrogate)
console.log(emoji.charCodeAt(1)); // 56832 = 0xDE00 (low surrogate)
console.log(emoji.codePointAt(0)); // 128512 = 0x1F600 (correct code point)
// Safe iteration (code points, not code units)
for (const char of emoji) {
  console.log(char.codePointAt(0)); // 128512
}
Python: Python 3 strings are sequences of Unicode code points, not UTF-16 code units. You rarely see raw surrogates, but they can appear when interoperating with Windows APIs:
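# A lone surrogate is ill-formed UTF-16, so the strict decoder rejects it.
# The "surrogatepass" error handler lets it through as a surrogate code point.
b = b'\xd8\x3d'  # High surrogate 0xD83D as UTF-16BE bytes
try:
    b.decode("utf-16-be")
except UnicodeDecodeError:
    print("strict decoding rejects a lone surrogate")
s = b.decode("utf-16-be", errors="surrogatepass")
print(repr(s))  # '\ud83d'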
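Because Python counts code points, `len()` on a string gives a different number than JavaScript's `length` for the same text. A rough way to get the UTF-16 code unit count is to encode and divide the byte length by two. The helper name `utf16_length` is an assumption made for this sketch.

```python
def utf16_length(text: str) -> int:
    """Count the UTF-16 code units needed for text.

    "surrogatepass" is used so that strings already containing
    lone surrogates are counted instead of raising an error.
    """
    return len(text.encode("utf-16-le", errors="surrogatepass")) // 2


emoji = "\U0001F600"         # 😀
print(len(emoji))            # 1 (one code point)
print(utf16_length(emoji))   # 2 (one surrogate pair)
print(utf16_length("A"))     # 1 (BMP character, single code unit)
```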
Java: Java uses UTF-16 internally; String.charAt() returns code units including surrogates:
String emoji = "\uD83D\uDE00"; // 😀 as surrogate pair
char high = emoji.charAt(0); // 0xD83D (high surrogate)
char low = emoji.charAt(1); // 0xDE00 (low surrogate)
int cp = emoji.codePointAt(0); // 128512 (correct)
Lone Surrogates
A lone surrogate is a high or low surrogate code unit that is not paired with its complement. Lone surrogates produce ill-formed UTF-16 sequences and are problematic:
- They cannot be losslessly converted to UTF-8
- They indicate data corruption or programming errors
- WTF-8, a superset of UTF-8 that is not part of the Unicode Standard, was proposed to handle them in legacy scenarios
import unicodedata
print(unicodedata.category("\uD800")) # Cs — surrogate (lone)
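Unpaired surrogates can be located by scanning a sequence of UTF-16 code units and checking that every high surrogate is immediately followed by a low surrogate. The function below is a minimal sketch under that definition; `find_lone_surrogates` is a hypothetical name, and the input is assumed to be a list of integer code units.

```python
def find_lone_surrogates(units: list[int]) -> list[int]:
    """Return the indexes of unpaired surrogate code units."""
    lone = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # well-formed pair, skip both units
                continue
            lone.append(i)
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate that was not consumed as part of a pair.
            lone.append(i)
        i += 1
    return lone


print(find_lone_surrogates([0xD83D, 0xDE00]))                  # []
print(find_lone_surrogates([0x0041, 0xD83D, 0x0042, 0xDE00]))  # [1, 3]

# Python's strict UTF-8 encoder refuses a lone surrogate.
try:
    "\ud83d".encode("utf-8")
except UnicodeEncodeError:
    print("strict UTF-8 encoding rejected the lone surrogate")
```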
Quick Facts
| Property | Value |
|---|---|
| High surrogate range | U+D800–U+DBFF (1,024 code points) |
| Low surrogate range | U+DC00–U+DFFF (1,024 code points) |
| Total surrogates | 2,048 |
| General category | Cs |
| Purpose | Encode supplementary characters in UTF-16 |
| Valid in UTF-8? | No — ill-formed |
| Valid in UTF-32? | Technically no (but often tolerated) |
| Languages and platforms using UTF-16 internally | JavaScript, Java, C#, Windows APIs |