서로게이트 쌍
UTF-16에서 보충 문자를 인코딩하기 위해 함께 사용되는 두 개의 16비트 코드 단위(상위 서로게이트 U+D800~U+DBFF + 하위 서로게이트 U+DC00~U+DFFF). 😀 = D83D DE00.
What Is a Surrogate Pair?
A surrogate pair is a pair of 16-bit code units in UTF-16 encoding that together represent a single Unicode character with a code point above U+FFFF (the supplementary planes). UTF-16 can directly represent the 65,536 code points of the Basic Multilingual Plane (U+0000–U+FFFF) using single 16-bit values. For the remaining ~1 million code points (U+10000–U+10FFFF), it uses two 16-bit values called a high surrogate (U+D800–U+DBFF) and a low surrogate (U+DC00–U+DFFF).
The surrogate range (U+D800–U+DFFF) is reserved exclusively for this purpose and is not a valid Unicode scalar value in any other context.
The Encoding Algorithm
To encode supplementary code point U+XXXXX as a surrogate pair:
- Subtract
0x10000from the code point, giving a 20-bit valueV(range 0x00000–0xFFFFF). - High surrogate =
0xD800 + (V >> 10)— top 10 bits. - Low surrogate =
0xDC00 + (V & 0x3FF)— bottom 10 bits.
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
assert 0x10000 <= code_point <= 0x10FFFF
v = code_point - 0x10000
high = 0xD800 + (v >> 10)
low = 0xDC00 + (v & 0x3FF)
return high, low
to_surrogate_pair(0x1F600) # (0xD83D, 0xDE00) → ("\uD83D", "\uDE00")
def from_surrogate_pair(high: int, low: int) -> int:
return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000
from_surrogate_pair(0xD83D, 0xDE00) # 0x1F600 = 128512 = 😀
JavaScript and Surrogates
JavaScript strings are UTF-16, so supplementary characters appear as surrogate pairs in the string's internal representation:
const emoji = "😀";
emoji.length; // 2 (two UTF-16 code units)
emoji.charCodeAt(0); // 55357 = 0xD83D (high surrogate)
emoji.charCodeAt(1); // 56832 = 0xDE00 (low surrogate)
// Safe code point access
emoji.codePointAt(0); // 128512 = 0x1F600 ✓
emoji.codePointAt(1); // 56832 = 0xDE00 (low surrogate alone — danger!)
// Correct iteration (ES6)
[...emoji].length; // 1
for (const char of emoji) {
console.log(char); // "😀" as one unit
}
// Splitting can break surrogate pairs!
emoji.slice(0, 1); // "\uD83D" — broken high surrogate!
[...emoji].slice(0, 1).join(""); // "😀" — safe
Lone Surrogates (Invalid)
A high surrogate without a following low surrogate (or vice versa) is an unpaired surrogate or lone surrogate. This is technically invalid UTF-16 and causes problems:
"\uD83D".length; // 1 — lone high surrogate, invalid
"\uD83D" + "x"; // broken string — "x" follows an unpaired surrogate
// encodeURIComponent throws for lone surrogates
try {
encodeURIComponent("\uD83D");
} catch(e) {
console.log("URIError: lone surrogate");
}
Python's UTF-16 codec rejects lone surrogates by default:
b"\xD8\x3D".decode("utf-16-le")
# UnicodeDecodeError: 'utf-16-le' codec can't decode bytes: ...
UTF-8, UTF-32, and Python
UTF-8 and UTF-32 do not use surrogates. Each code point is directly encoded:
# Python str — no surrogates; uses code points directly
"😀".encode("utf-8") # b"\xf0\x9f\x98\x80" (4 bytes)
"😀".encode("utf-32") # b"\xff\xfe\x00\x00\x00\xf6\x01\x00" (BOM + 4 bytes)
# ord() returns the code point, never a surrogate
ord("😀") # 128512 = 0x1F600
# Only UTF-16 encodes as surrogates:
"😀".encode("utf-16-le") # b"\x3d\xd8\x00\xde"
Quick Facts
| Property | Value |
|---|---|
| High surrogate range | U+D800–U+DBFF (1,024 values) |
| Low surrogate range | U+DC00–U+DFFF (1,024 values) |
| Total combinations | 1,024 × 1,024 = 1,048,576 (covers all supplementary code points) |
| Languages using UTF-16 | JavaScript, Java, C#, Windows APIs |
| Valid surrogates | Always in pairs; lone surrogates are ill-formed |
| Python | No surrogates — str uses code points directly |
| UTF-8/32 | No surrogates needed; encode supplementary chars directly |
관련 용어
프로그래밍 & 개발의 더 많은 용어
Java strings use UTF-16 internally. char is 16-bit (only BMP). For supplementary …
Python 3 uses Unicode strings by default (str = UTF-8 internally via …
Rust strings (str/String) are guaranteed valid UTF-8. char type represents a Unicode …
U+0000(NUL). 첫 번째 유니코드/ASCII 문자로, C/C++에서 문자열 종료자로 사용됩니다. 보안 위험: 널 …
U+FFFD(). 디코더가 유효하지 않은 바이트 시퀀스를 만났을 때 표시되는 문자 — '디코딩에 …
잘못된 인코딩으로 바이트를 디코딩할 때 생기는 깨진 텍스트. 일본어 용어(文字化け). 예: 'café'를 …
프로그래밍 언어에서 문자의 시퀀스. 내부 표현은 다양합니다: UTF-8(Go, Rust, 최신 Python), UTF-16(Java, …
유니코드 문자열의 '길이'는 단위에 따라 다릅니다: 코드 단위(JavaScript .length), 코드 포인트(Python len()), …
눈에 보이는 글리프가 없는 문자: 공백, 너비 없는 문자, 제어 문자, 서식 …
소스 코드에서 유니코드 문자를 나타내는 구문. 언어마다 다릅니다: \u2713(Python/Java/JS), \u{2713}(JS/Ruby/Rust), \U00012345(Python/C).