A variable-length Unicode encoding that uses one or two 16-bit code units (2 or 4 bytes) per code point. Used internally by Java, JavaScript, and Windows.

What is a surrogate?

Code points U+D800–U+DFFF, reserved exclusively for UTF-16 surrogate pairs. They are not valid Unicode scalar values and must not appear as standalone characters.

What is a code unit?

The smallest unit of an encoding: an 8-bit byte in UTF-8, a 16-bit word in UTF-16, and a 32-bit word in UTF-32. A single character may require multiple code units.

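What is a supplementary plane / astral plane?

Planes 1–16 (U+10000–U+10FFFF), which contain emoji, historic scripts, CJK extensions, musical notation, and more. Characters in these planes require a surrogate pair in UTF-16.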

Programming & Development

Surrogate Pair

Two 16-bit code units used together to encode a supplementary character in UTF-16: a high surrogate (U+D800–U+DBFF) followed by a low surrogate (U+DC00–U+DFFF). Example: 😀 = D83D DE00.

2024-03-18 · Updated 2024-12-12

What Is a Surrogate Pair?

A surrogate pair is two 16-bit code units in UTF-16 that together represent a single Unicode character with a code point above U+FFFF (the supplementary planes). UTF-16 encodes each code point of the Basic Multilingual Plane (U+0000–U+FFFF, excluding the surrogate range) as a single 16-bit value. For the remaining 1,048,576 code points (U+10000–U+10FFFF), it uses two 16-bit values: a high surrogate (U+D800–U+DBFF) followed by a low surrogate (U+DC00–U+DFFF).

The surrogate range (U+D800–U+DFFF) is reserved exclusively for this purpose. Code points in this range are never Unicode scalar values, so no character is ever assigned to them.
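
The ranges above translate directly into range checks. The following is a minimal sketch; the helper names (`is_high_surrogate`, `is_low_surrogate`, `is_scalar_value`, `needs_surrogate_pair`) are illustrative and not taken from any library.

```python
# Range boundaries as described above.
HIGH_START, HIGH_END = 0xD800, 0xDBFF  # high (leading) surrogates
LOW_START, LOW_END = 0xDC00, 0xDFFF    # low (trailing) surrogates


def is_high_surrogate(unit: int) -> bool:
    """True if a 16-bit code unit is a high surrogate."""
    return HIGH_START <= unit <= HIGH_END


def is_low_surrogate(unit: int) -> bool:
    """True if a 16-bit code unit is a low surrogate."""
    return LOW_START <= unit <= LOW_END


def is_scalar_value(code_point: int) -> bool:
    """True for U+0000..U+10FFFF, excluding the surrogate range."""
    in_unicode_range = 0 <= code_point <= 0x10FFFF
    in_surrogate_range = HIGH_START <= code_point <= LOW_END
    return in_unicode_range and not in_surrogate_range


def needs_surrogate_pair(code_point: int) -> bool:
    """True if UTF-16 must encode this code point as two code units."""
    return 0x10000 <= code_point <= 0x10FFFF


print(is_scalar_value(0xD800))        # False
print(needs_surrogate_pair(0x1F600))  # True
```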

The Encoding Algorithm

To encode supplementary code point U+XXXXX as a surrogate pair:

1. Subtract 0x10000 from the code point, giving a 20-bit value V (range 0x00000–0xFFFFF).
2. High surrogate = 0xD800 + (V >> 10), which uses the top 10 bits of V.
3. Low surrogate = 0xDC00 + (V & 0x3FF), which uses the bottom 10 bits of V.

def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    assert 0x10000 <= code_point <= 0x10FFFF
    v = code_point - 0x10000
    high = 0xD800 + (v >> 10)
    low  = 0xDC00 + (v & 0x3FF)
    return high, low

to_surrogate_pair(0x1F600)  # (55357, 56832) == (0xD83D, 0xDE00), i.e. "\uD83D\uDE00" in UTF-16

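def from_surrogate_pair(high: int, low: int) -> int:
    # Reject anything that is not a high surrogate followed by a low surrogate.
    assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
    return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000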

from_surrogate_pair(0xD83D, 0xDE00)  # 0x1F600 = 128512 = 😀
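
To see the algorithm applied to a whole string rather than a single code point, the sketch below converts a Python `str` into a list of UTF-16 code units and checks the result against Python's built-in `utf-16-be` codec. The function names `utf16_code_units` and `utf16_be_bytes` are hypothetical helpers for illustration.

```python
def utf16_code_units(text: str) -> list[int]:
    """Return the UTF-16 code units for text, one int per 16-bit unit."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Supplementary code point: split the 20-bit offset into two halves.
            v = cp - 0x10000
            units.append(0xD800 + (v >> 10))    # high surrogate
            units.append(0xDC00 + (v & 0x3FF))  # low surrogate
        else:
            # BMP code point: a single code unit with the same value.
            units.append(cp)
    return units


def utf16_be_bytes(text: str) -> bytes:
    """Serialize the code units in big-endian byte order, without a BOM."""
    return b"".join(u.to_bytes(2, "big") for u in utf16_code_units(text))


sample = "a😀"
print([f"{u:04X}" for u in utf16_code_units(sample)])        # ['0061', 'D83D', 'DE00']
print(utf16_be_bytes(sample) == sample.encode("utf-16-be"))  # True
```

Unlike the built-in codec, this sketch does not reject surrogate code points already present in the input; it passes them through unchanged.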

JavaScript and Surrogates

JavaScript strings are UTF-16, so supplementary characters appear as surrogate pairs in the string's internal representation:

const emoji = "😀";
emoji.length;               // 2 (two UTF-16 code units)
emoji.charCodeAt(0);        // 55357 = 0xD83D (high surrogate)
emoji.charCodeAt(1);        // 56832 = 0xDE00 (low surrogate)

// Safe code point access
emoji.codePointAt(0);       // 128512 = 0x1F600 ✓
emoji.codePointAt(1);       // 56832 = 0xDE00 (low surrogate alone — danger!)

// Correct iteration (ES6)
[...emoji].length;          // 1
for (const char of emoji) {
  console.log(char);        // "😀" as one unit
}

// Splitting can break surrogate pairs!
emoji.slice(0, 1);          // "\uD83D" — broken high surrogate!
[...emoji].slice(0, 1).join(""); // "😀" — safe
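
Python code that exchanges string lengths or offsets with JavaScript (or another UTF-16 based API) has to account for the same difference, because Python indexes by code point. The following sketch assumes two hypothetical helpers, `utf16_length` and `utf16_offset_to_index`; neither is part of the standard library.

```python
def utf16_length(text: str) -> int:
    """Number of UTF-16 code units, matching JavaScript's .length."""
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


def utf16_offset_to_index(text: str, offset: int) -> int:
    """Map a UTF-16 code unit offset to a Python str index.

    Raises ValueError if the offset falls inside a surrogate pair
    or lies past the end of the string.
    """
    units = 0
    for index, ch in enumerate(text):
        if units == offset:
            return index
        units += 2 if ord(ch) > 0xFFFF else 1
        if units > offset:
            raise ValueError("offset splits a surrogate pair")
    if units == offset:
        return len(text)
    raise ValueError("offset is past the end of the string")


text = "a😀b"
print(len(text))                       # 3 code points
print(utf16_length(text))              # 4 code units
print(utf16_offset_to_index(text, 3))  # 2, the index of "b"
```

Offset 2 in this example lands between the two halves of 😀, so the function raises instead of returning an index that would cut the character in half.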

Lone Surrogates (Invalid)

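A high surrogate without a following low surrogate (or a low surrogate without a preceding high surrogate) is called an unpaired or lone surrogate. A string containing one is ill-formed UTF-16. JavaScript still allows it as a string value, but APIs that require well-formed Unicode can fail on it:

"\uD83D".length;       // 1: a lone high surrogate (ill-formed)
"\uD83D" + "x";        // "\uD83Dx": still ill-formed, because "x" is not a low surrogate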

// encodeURIComponent throws for lone surrogates
try {
  encodeURIComponent("\uD83D");
} catch(e) {
  console.log("URIError: lone surrogate");
}

Python's UTF-16 codec rejects lone surrogates by default:

b"\x3D\xD8".decode("utf-16-le")  # the bytes of 0xD83D in little-endian order
# UnicodeDecodeError: 'utf-16-le' codec can't decode bytes: ...
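
When ill-formed input has to be processed anyway, Python's standard error handlers offer two options: `errors="replace"` substitutes U+FFFD, and `errors="surrogatepass"` keeps the lone surrogate as a code point in the resulting `str`. The sketch below shows both; the wrapper function names are illustrative only.

```python
# A lone high surrogate (0xD83D) followed by "A", in little-endian byte order.
ill_formed = b"\x3d\xd8\x41\x00"


def decode_utf16le_lossy(data: bytes) -> str:
    """Decode, replacing ill-formed sequences with U+FFFD."""
    return data.decode("utf-16-le", errors="replace")


def decode_utf16le_preserving(data: bytes) -> str:
    """Decode, keeping lone surrogates as surrogate code points."""
    return data.decode("utf-16-le", errors="surrogatepass")


def has_surrogates(text: str) -> bool:
    """True if the str contains any code point in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


# The default (strict) handler raises.
try:
    ill_formed.decode("utf-16-le")
except UnicodeDecodeError as exc:
    print("strict decode failed:", exc.reason)

lossy = decode_utf16le_lossy(ill_formed)
kept = decode_utf16le_preserving(ill_formed)
print(has_surrogates(lossy))  # False: the surrogate became U+FFFD
print(has_surrogates(kept))   # True: the lone surrogate is still there
```

A string produced with `surrogatepass` cannot be encoded to UTF-8 with the default strict handler, so checking with something like `has_surrogates` before writing it out is a reasonable precaution.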

UTF-8, UTF-32, and Python

UTF-8 and UTF-32 do not use surrogates. Each code point is directly encoded:

# A Python str is a sequence of code points, so "😀" is one element, not a pair
"😀".encode("utf-8")   # b"\xf0\x9f\x98\x80" (4 bytes)
"😀".encode("utf-32")  # b"\xff\xfe\x00\x00\x00\xf6\x01\x00" (4-byte BOM + 4 bytes, little-endian host)

# ord() returns the full code point, not half of a surrogate pair
ord("😀")  # 128512 = 0x1F600

# Only UTF-16 encodes as surrogates:
"😀".encode("utf-16-le")  # b"\x3d\xd8\x00\xde"

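To compare the three encodings side by side, the sketch below reports the encoded byte length of a string in each one. It uses the explicit little-endian codec names so that no BOM is added; `encoded_sizes` is a hypothetical helper name.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Byte length of text in each encoding (explicit byte order, so no BOM)."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(text.encode(name)) for name in encodings}


print(encoded_sizes("A"))   # {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
print(encoded_sizes("한"))  # {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
print(encoded_sizes("😀"))  # {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```

Only the last case involves a surrogate pair: UTF-16 needs two code units for 😀, while UTF-8 and UTF-32 encode the code point directly.
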
Quick Facts

Property	Value
High surrogate range	U+D800–U+DBFF (1,024 values)
Low surrogate range	U+DC00–U+DFFF (1,024 values)
Total combinations	1,024 × 1,024 = 1,048,576 (covers all supplementary code points)
Languages and platforms using UTF-16	JavaScript, Java, C#, Windows APIs
Valid surrogates	Always in pairs; lone surrogates are ill-formed
Python	`str` is a sequence of code points; supplementary characters are never split into pairs
UTF-8/32	No surrogates needed; encode supplementary chars directly
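
The counts in the table can be checked with a few lines of arithmetic. This is a quick sanity check, with constant names chosen for illustration.

```python
HIGH_COUNT = 0xDBFF - 0xD800 + 1        # number of high surrogates
LOW_COUNT = 0xDFFF - 0xDC00 + 1         # number of low surrogates
SUPPLEMENTARY = 0x10FFFF - 0x10000 + 1  # code points in planes 1-16

print(HIGH_COUNT, LOW_COUNT)                    # 1024 1024
print(HIGH_COUNT * LOW_COUNT)                   # 1048576
print(HIGH_COUNT * LOW_COUNT == SUPPLEMENTARY)  # True
```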