परिवर्तनशील-लंबाई वाली Unicode एन्कोडिंग जो 2 या 4 bytes (16 bits के 1 या 2 code units) का उपयोग करती है। Java, JavaScript और Windows द्वारा आंतरिक रूप से उपयोग की जाती है।

What is सरोगेट?

Code points U+D800–U+DFFF जो विशेष रूप से UTF-16 surrogate pairs के लिए आरक्षित हैं। वैध Unicode scalar values नहीं हैं और स्वतंत्र अक्षरों के रूप में कभी प्रकट नहीं होने चाहिए।

What is कोड इकाई?

एन्कोडिंग की न्यूनतम इकाई: UTF-8 में 8-bit byte, UTF-16 में 16-bit word, UTF-32 में 32-bit word। एक अक्षर के लिए कई code units की आवश्यकता हो सकती है।

What is पूरक तल?

Planes 1–16 (U+10000–U+10FFFF), जिसमें emoji, ऐतिहासिक लिपियां, CJK extensions और संगीत notation शामिल हैं। UTF-16 में surrogate pairs की आवश्यकता होती है।

प्रोग्रामिंग और विकास

सरोगेट जोड़ी

दो 16-bit code units (एक high surrogate U+D800–U+DBFF + low surrogate U+DC00–U+DFFF) जो मिलकर UTF-16 में एक supplementary character को एनकोड करते हैं। 😀 = D83D DE00।

2024-03-18 · Updated 2024-12-12

What Is a Surrogate Pair?

A surrogate pair is a pair of 16-bit code units in UTF-16 encoding that together represent a single Unicode character with a code point above U+FFFF (the supplementary planes). UTF-16 can directly represent the 65,536 code points of the Basic Multilingual Plane (U+0000–U+FFFF) using single 16-bit values. For the remaining ~1 million code points (U+10000–U+10FFFF), it uses two 16-bit values called a high surrogate (U+D800–U+DBFF) and a low surrogate (U+DC00–U+DFFF).

The surrogate range (U+D800–U+DFFF) is reserved exclusively for this purpose and is not a valid Unicode scalar value in any other context.

The Encoding Algorithm

To encode supplementary code point U+XXXXX as a surrogate pair:

Subtract 0x10000 from the code point, giving a 20-bit value V (range 0x00000–0xFFFFF).
High surrogate = 0xD800 + (V >> 10) — top 10 bits.
Low surrogate = 0xDC00 + (V & 0x3FF) — bottom 10 bits.

def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    assert 0x10000 <= code_point <= 0x10FFFF
    v = code_point - 0x10000
    high = 0xD800 + (v >> 10)
    low  = 0xDC00 + (v & 0x3FF)
    return high, low

to_surrogate_pair(0x1F600)  # (0xD83D, 0xDE00) → ("\uD83D", "\uDE00")

def from_surrogate_pair(high: int, low: int) -> int:
    return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000

from_surrogate_pair(0xD83D, 0xDE00)  # 0x1F600 = 128512 = 😀

JavaScript and Surrogates

JavaScript strings are UTF-16, so supplementary characters appear as surrogate pairs in the string's internal representation:

const emoji = "😀";
emoji.length;               // 2 (two UTF-16 code units)
emoji.charCodeAt(0);        // 55357 = 0xD83D (high surrogate)
emoji.charCodeAt(1);        // 56832 = 0xDE00 (low surrogate)

// Safe code point access
emoji.codePointAt(0);       // 128512 = 0x1F600 ✓
emoji.codePointAt(1);       // 56832 = 0xDE00 (low surrogate alone — danger!)

// Correct iteration (ES6)
[...emoji].length;          // 1
for (const char of emoji) {
  console.log(char);        // "😀" as one unit
}

// Splitting can break surrogate pairs!
emoji.slice(0, 1);          // "\uD83D" — broken high surrogate!
[...emoji].slice(0, 1).join(""); // "😀" — safe

Lone Surrogates (Invalid)

A high surrogate without a following low surrogate (or vice versa) is an unpaired surrogate or lone surrogate. This is technically invalid UTF-16 and causes problems:

"\uD83D".length;       // 1 — lone high surrogate, invalid
"\uD83D" + "x";        // broken string — "x" follows an unpaired surrogate

// encodeURIComponent throws for lone surrogates
try {
  encodeURIComponent("\uD83D");
} catch(e) {
  console.log("URIError: lone surrogate");
}

Python's UTF-16 codec rejects lone surrogates by default:

b"\xD8\x3D".decode("utf-16-le")
# UnicodeDecodeError: 'utf-16-le' codec can't decode bytes: ...

UTF-8, UTF-32, and Python

UTF-8 and UTF-32 do not use surrogates. Each code point is directly encoded:

# Python str — no surrogates; uses code points directly
"😀".encode("utf-8")   # b"\xf0\x9f\x98\x80" (4 bytes)
"😀".encode("utf-32")  # b"\xff\xfe\x00\x00\x00\xf6\x01\x00" (BOM + 4 bytes)

# ord() returns the code point, never a surrogate
ord("😀")  # 128512 = 0x1F600

# Only UTF-16 encodes as surrogates:
"😀".encode("utf-16-le")  # b"\x3d\xd8\x00\xde"

Quick Facts

Property	Value
High surrogate range	U+D800–U+DBFF (1,024 values)
Low surrogate range	U+DC00–U+DFFF (1,024 values)
Total combinations	1,024 × 1,024 = 1,048,576 (covers all supplementary code points)
Languages using UTF-16	JavaScript, Java, C#, Windows APIs
Valid surrogates	Always in pairs; lone surrogates are ill-formed
Python	No surrogates — `str` uses code points directly
UTF-8/32	No surrogates needed; encode supplementary chars directly