Surrogate Pair
Two 16-bit code units (a high surrogate, U+D800–U+DBFF, followed by a low surrogate, U+DC00–U+DFFF) that together encode one supplementary character in UTF-16. For example, 😀 = D83D DE00.
What Is a Surrogate Pair?
A surrogate pair is two 16-bit code units in UTF-16 that together represent a single Unicode character with a code point above U+FFFF (in the supplementary planes). UTF-16 represents code points in the Basic Multilingual Plane (U+0000–U+FFFF) directly as single 16-bit values. For the remaining 1,048,576 code points (U+10000–U+10FFFF), it uses two 16-bit values: a high surrogate (U+D800–U+DBFF) followed by a low surrogate (U+DC00–U+DFFF).
The surrogate range (U+D800–U+DFFF) is reserved exclusively for this purpose. Code points in this range are not Unicode scalar values and are never assigned to characters.
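The ranges above translate directly into range checks. The following is a minimal sketch, not part of any library: the helper names `classify_code_unit` and `needs_surrogate_pair` are made up for illustration.

```python
def classify_code_unit(unit: int) -> str:
    """Classify a single 16-bit UTF-16 code unit (hypothetical helper)."""
    if not 0 <= unit <= 0xFFFF:
        raise ValueError("not a 16-bit code unit")
    if 0xD800 <= unit <= 0xDBFF:
        return "high surrogate"
    if 0xDC00 <= unit <= 0xDFFF:
        return "low surrogate"
    return "BMP scalar value"


def needs_surrogate_pair(code_point: int) -> bool:
    """True if the code point lies in a supplementary plane (hypothetical helper)."""
    return 0x10000 <= code_point <= 0x10FFFF


print(classify_code_unit(0xD83D))       # high surrogate
print(classify_code_unit(0xDE00))       # low surrogate
print(classify_code_unit(ord("A")))     # BMP scalar value
print(needs_surrogate_pair(ord("😀")))  # True
print(needs_surrogate_pair(ord("€")))   # False
```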
The Encoding Algorithm
To encode supplementary code point U+XXXXX as a surrogate pair:
- Subtract 0x10000 from the code point, giving a 20-bit value V (range 0x00000–0xFFFFF).
- High surrogate = 0xD800 + (V >> 10), which uses the top 10 bits.
- Low surrogate = 0xDC00 + (V & 0x3FF), which uses the bottom 10 bits.
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    # Only supplementary code points are encoded as surrogate pairs.
    assert 0x10000 <= code_point <= 0x10FFFF
    v = code_point - 0x10000    # 20-bit value
    high = 0xD800 + (v >> 10)   # top 10 bits
    low = 0xDC00 + (v & 0x3FF)  # bottom 10 bits
    return high, low

to_surrogate_pair(0x1F600)  # (55357, 56832) == (0xD83D, 0xDE00), i.e. "\uD83D\uDE00" in JavaScript

def from_surrogate_pair(high: int, low: int) -> int:
    return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000

from_surrogate_pair(0xD83D, 0xDE00)  # 0x1F600 = 128512 = 😀
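To see the algorithm applied to a whole string, the sketch below encodes text into a list of UTF-16 code units by hand and cross-checks the result against Python's built-in `utf-16-be` codec. The function names are illustrative assumptions, and the input is assumed to contain no lone surrogates.

```python
def utf16_code_units(text: str) -> list[int]:
    """Encode text into UTF-16 code units using the surrogate-pair algorithm."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            v = cp - 0x10000                    # 20-bit value
            units.append(0xD800 + (v >> 10))    # high surrogate
            units.append(0xDC00 + (v & 0x3FF))  # low surrogate
        else:
            units.append(cp)                    # BMP: one code unit
    return units


def utf16_code_units_via_codec(text: str) -> list[int]:
    """Same result, obtained from the built-in big-endian UTF-16 codec."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


sample = "a€😀"
assert utf16_code_units(sample) == utf16_code_units_via_codec(sample)
print([hex(u) for u in utf16_code_units(sample)])
# ['0x61', '0x20ac', '0xd83d', '0xde00']
```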
JavaScript and Surrogates
JavaScript strings are UTF-16, so supplementary characters appear as surrogate pairs in the string's internal representation:
const emoji = "😀";
emoji.length; // 2 (two UTF-16 code units)
emoji.charCodeAt(0); // 55357 = 0xD83D (high surrogate)
emoji.charCodeAt(1); // 56832 = 0xDE00 (low surrogate)
// Safe code point access
emoji.codePointAt(0); // 128512 = 0x1F600 ✓
emoji.codePointAt(1); // 56832 = 0xDE00 (low surrogate alone — danger!)
// Correct iteration (ES6)
[...emoji].length; // 1
for (const char of emoji) {
  console.log(char); // "😀" as one unit
}
// Splitting can break surrogate pairs!
emoji.slice(0, 1); // "\uD83D" — broken high surrogate!
[...emoji].slice(0, 1).join(""); // "😀" — safe
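When a Python program exchanges string lengths or offsets with JavaScript, the two sides count differently: Python counts code points, JavaScript counts UTF-16 code units. A minimal sketch of the conversion, assuming a hypothetical helper named `utf16_length`:

```python
def utf16_length(text: str) -> int:
    """Number of UTF-16 code units, which is what JavaScript's .length reports."""
    # Each code unit is 2 bytes; "utf-16-le" adds no BOM.
    return len(text.encode("utf-16-le")) // 2


s = "hi😀"
print(len(s))           # 3 (code points)
print(utf16_length(s))  # 4 (UTF-16 code units)
```

The difference between the two numbers equals the count of supplementary characters in the string, since each one contributes exactly one extra code unit.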
Lone Surrogates (Invalid)
A high surrogate without a following low surrogate (or a low surrogate without a preceding high surrogate) is called an unpaired or lone surrogate. JavaScript strings can hold lone surrogates, but the resulting sequence is ill-formed UTF-16 and causes problems:
"\uD83D".length; // 1 — lone high surrogate, invalid
"\uD83D" + "x"; // broken string — "x" follows an unpaired surrogate
// encodeURIComponent throws for lone surrogates
try {
  encodeURIComponent("\uD83D");
} catch (e) {
  console.log("URIError: lone surrogate");
}
Python's UTF-16 codec rejects lone surrogates by default:
b"\x3D\xD8".decode("utf-16-le")  # little-endian bytes of code unit 0xD83D, a lone high surrogate
# UnicodeDecodeError: 'utf-16-le' codec can't decode bytes: ...
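Lone surrogates can also be located before decoding by scanning the code units. The following is a minimal sketch under the pairing rule described above; `find_lone_surrogates` is a hypothetical helper, not a standard-library function.

```python
def find_lone_surrogates(units: list[int]) -> list[int]:
    """Return the indexes of unpaired surrogates in a list of UTF-16 code units."""
    lone = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate is valid only if a low surrogate follows it.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            lone.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # A low surrogate reached here was not preceded by a high surrogate.
            lone.append(i)
        i += 1
    return lone


print(find_lone_surrogates([0xD83D, 0xDE00]))  # []
print(find_lone_surrogates([0xD83D, 0x0078]))  # [0]
print(find_lone_surrogates([0xDE00, 0xD83D]))  # [0, 1]
```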
UTF-8, UTF-32, and Python
UTF-8 and UTF-32 do not use surrogates. Each code point is directly encoded:
# Python str has no surrogate pairs; it is a sequence of code points
"😀".encode("utf-8") # b"\xf0\x9f\x98\x80" (4 bytes)
"😀".encode("utf-32") # b"\xff\xfe\x00\x00\x00\xf6\x01\x00" (BOM + 4 bytes; byte order shown is for a little-endian machine)
# ord() returns the code point, never a surrogate
ord("😀") # 128512 = 0x1F600
# Only UTF-16 encodes as surrogates:
"😀".encode("utf-16-le") # b"\x3d\xd8\x00\xde"
Quick Facts
| Property | Value |
|---|---|
| High surrogate range | U+D800–U+DBFF (1,024 values) |
| Low surrogate range | U+DC00–U+DFFF (1,024 values) |
| Total combinations | 1,024 × 1,024 = 1,048,576 (covers all supplementary code points) |
| Languages and platforms using UTF-16 | JavaScript, Java, C#, Windows APIs |
| Valid surrogates | Always in pairs; lone surrogates are ill-formed |
| Python | No surrogates — str uses code points directly |
| UTF-8/32 | No surrogates needed; encode supplementary chars directly |
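Several rows of the table can be checked directly in Python. The sketch below is illustrative only; the constant names and the `encoded_sizes` helper are assumptions introduced here.

```python
HIGH = range(0xD800, 0xDC00)              # U+D800–U+DBFF
LOW = range(0xDC00, 0xE000)               # U+DC00–U+DFFF
SUPPLEMENTARY = range(0x10000, 0x110000)  # U+10000–U+10FFFF

print(len(HIGH), len(LOW))                         # 1024 1024
print(len(HIGH) * len(LOW) == len(SUPPLEMENTARY))  # True


def encoded_sizes(ch: str) -> dict[str, int]:
    """Byte length of one character in each encoding (no BOM)."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


print(encoded_sizes("A"))   # {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
print(encoded_sizes("€"))   # {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
print(encoded_sizes("😀"))  # {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```

Only the UTF-16 size doubles at the U+FFFF boundary, which is the surrogate pair at work.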