What is a surrogate pair?

Two 16-bit code units (a high surrogate U+D800–U+DBFF followed by a low surrogate U+DC00–U+DFFF) that together encode one supplementary character in UTF-16. For example, 😀 = D83D DE00.

What is UTF-16?

A variable-length Unicode encoding that uses 2 or 4 bytes (1 or 2 16-bit code units) per character. It is used internally by Java, JavaScript, and Windows.

What is a code point?

A numeric value in the Unicode codespace (U+0000 to U+10FFFF), written in the form U+XXXX. Not every code point is assigned to a character.

Unicode Standard

Surrogate

The code points U+D800–U+DFFF, reserved exclusively for UTF-16 surrogate pairs. They are not valid Unicode scalar values and should not appear on their own as characters.

2021-08-30 · Updated 2024-08-28

What is a Surrogate?

Surrogates are 2,048 code points in the range U+D800–U+DFFF that UTF-16 uses as a mechanism to encode supplementary characters (those above U+FFFF). They are permanently reserved for this purpose and do not represent any characters themselves. The Unicode Standard categorizes surrogates with General Category Cs (Surrogate).

There are two types:
- High surrogates (U+D800–U+DBFF): 1,024 code points, the first half of a pair
- Low surrogates (U+DC00–U+DFFF): 1,024 code points, the second half of a pair
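
The two ranges can be checked with plain integer comparisons. The following is a minimal sketch; the function name `surrogate_kind` and the constant names are illustrative and do not come from any library.

```python
# Range boundaries as given in the text above.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF


def surrogate_kind(code_point: int) -> str:
    """Classify a code point value as 'high', 'low', or 'none' (hypothetical helper)."""
    if HIGH_START <= code_point <= HIGH_END:
        return "high"
    if LOW_START <= code_point <= LOW_END:
        return "low"
    return "none"


print(surrogate_kind(0xD83D))  # high
print(surrogate_kind(0xDE00))  # low
print(surrogate_kind(0x0041))  # none

# Each range holds 1,024 code points, 2,048 in total.
print(HIGH_END - HIGH_START + 1, LOW_END - LOW_START + 1)  # 1024 1024
```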

How Surrogate Pairs Work

A supplementary character (code point U+10000–U+10FFFF) cannot fit in a single UTF-16 code unit (16 bits). UTF-16 encodes it as two consecutive 16-bit code units — a surrogate pair — using the following algorithm:

Encoding U+1F600 (😀 GRINNING FACE):

1. Subtract 0x10000:          0x1F600 - 0x10000 = 0x0F600
2. Convert to 20 bits:        0000 1111 0110 0000 0000
3. High 10 bits (bits 19-10): 0000 1111 01 = 0x3D
4. Low 10 bits  (bits  9- 0): 10 0000 0000 = 0x200
5. High surrogate:            0xD800 + 0x3D  = 0xD83D
6. Low surrogate:             0xDC00 + 0x200 = 0xDE00

UTF-16: D83D DE00

Decoding:
  H = 0xD83D - 0xD800 = 0x3D   (high 10 bits)
  L = 0xDE00 - 0xDC00 = 0x200  (low 10 bits)
  Code point = ((H << 10) | L) + 0x10000
             = (0x3D << 10 | 0x200) + 0x10000
             = 0xF600 + 0x10000 = 0x1F600
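
The encoding and decoding steps above can be sketched in Python as follows. The function names `encode_surrogate_pair` and `decode_surrogate_pair` are hypothetical helpers written for this illustration, not part of the standard library.

```python
from typing import Tuple


def encode_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into (high, low) surrogate code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a well-formed surrogate pair")
    return (((high - 0xD800) << 10) | (low - 0xDC00)) + 0x10000


high, low = encode_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")                # D83D DE00
print(hex(decode_surrogate_pair(high, low)))  # 0x1f600
```

The range checks are a design choice for this sketch: rejecting out-of-range input early keeps BMP code points and mismatched halves from silently producing wrong values.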

Why This Range Was Chosen

The range U+D800–U+DFFF had no characters assigned in the early versions of Unicode, which made it safe to reserve for the surrogate mechanism when UTF-16 was introduced with Unicode 2.0 (1996). This is why the surrogate range contains no characters: the code points were still free when the mechanism was designed, and they have been permanently reserved ever since.

Surrogates in Programming Languages

JavaScript/ECMAScript: JavaScript strings are sequences of UTF-16 code units. Surrogate pairs are common for emoji:

const emoji = "😀";
console.log(emoji.length);         // 2 (two UTF-16 code units)
console.log(emoji.charCodeAt(0));  // 55357 = 0xD83D (high surrogate)
console.log(emoji.charCodeAt(1));  // 56832 = 0xDE00 (low surrogate)
console.log(emoji.codePointAt(0)); // 128512 = 0x1F600 (correct code point)

// Safe iteration (code points, not code units)
for (const char of emoji) {
  console.log(char.codePointAt(0)); // 128512
}

Python: Python 3 strings are sequences of Unicode code points, not UTF-16 code units. You rarely see raw surrogates, but they can appear when interoperating with Windows APIs:

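# Lone surrogates are ill-formed, so strict decoding rejects them;
# the "surrogatepass" error handler lets them through
b = b'\xd8\x3d'  # High surrogate 0xD83D as UTF-16BE bytes, with no low surrogate
try:
    b.decode("utf-16-be")  # strict mode raises on the unpaired surrogate
except UnicodeDecodeError as exc:
    print("strict decode failed:", exc.reason)
s = b.decode("utf-16-be", errors="surrogatepass")
print(repr(s))  # '\ud83d' (lone surrogate code point)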
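
For comparison with the JavaScript example, here is a minimal sketch of how the same emoji looks in Python: one code point in the string, two code units once encoded as UTF-16. The helper name `utf16_code_units` is an assumption made for this illustration.

```python
from typing import List


def utf16_code_units(text: str) -> List[int]:
    """Return the UTF-16 code units of text as integers (hypothetical helper)."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


emoji = "😀"
print(len(emoji))                      # 1 (one code point)
units = utf16_code_units(emoji)
print(len(units))                      # 2 (two UTF-16 code units)
print([f"{u:04X}" for u in units])     # ['D83D', 'DE00']
```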

Java: Java uses UTF-16 internally; String.charAt() returns code units including surrogates:

String emoji = "\uD83D\uDE00";  // 😀 as surrogate pair
char high = emoji.charAt(0);    // 0xD83D (high surrogate)
char low  = emoji.charAt(1);    // 0xDE00 (low surrogate)
int cp = emoji.codePointAt(0);  // 128512 (correct)

Lone Surrogates

A lone surrogate is a high or low surrogate code unit that is not paired with its complement. Lone surrogates produce ill-formed UTF-16 sequences and are problematic:

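- They cannot be represented in well-formed UTF-8, so a conversion either fails or replaces them (typically with U+FFFD)
- They usually indicate data corruption or a programming error, such as a string truncated between the two halves of a pair
- WTF-8, an extension of UTF-8 that is not part of the Unicode Standard, was proposed to represent them when round-tripping potentially ill-formed UTF-16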

import unicodedata
print(unicodedata.category("\uD800"))  # Cs — surrogate (lone)
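
A scan for lone surrogates only needs to check that every high surrogate is immediately followed by a low surrogate. The sketch below works on a list of UTF-16 code units; `find_lone_surrogates` is a hypothetical helper name, and the final lines show that Python's strict UTF-8 encoder refuses a lone surrogate.

```python
from typing import List


def find_lone_surrogates(units: List[int]) -> List[int]:
    """Return the indexes of surrogate code units that are not part of a valid pair."""
    lone = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed directly by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # well-formed pair, skip both halves
                continue
            lone.append(i)
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            lone.append(i)
        i += 1
    return lone


print(find_lone_surrogates([0xD83D, 0xDE00]))          # []
print(find_lone_surrogates([0x0041, 0xD83D, 0x0042]))  # [1]
print(find_lone_surrogates([0xDE00, 0xD83D]))          # [0, 1]

# Strict UTF-8 encoding rejects a lone surrogate.
try:
    "\ud800".encode("utf-8")
except UnicodeEncodeError as exc:
    print("UTF-8 encoding rejected:", exc.reason)
```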

Quick Facts

Property	Value
High surrogate range	U+D800–U+DBFF (1,024 code points)
Low surrogate range	U+DC00–U+DFFF (1,024 code points)
Total surrogates	2,048
General category	Cs
Purpose	Encode supplementary characters in UTF-16
Valid in UTF-8?	No — ill-formed
Valid in UTF-32?	Technically no (but often tolerated)
Languages and platforms using UTF-16 internally	JavaScript, Java, C#, Windows

Related terms

Surrogate pair, UTF-16, Code point

More in Unicode Standard

Basic Multilingual Plane (BMP)

Plane 0 (U+0000–U+FFFF), which contains the most commonly used characters, including Latin, Greek, Cyrillic, CJK, Arabic, and most symbols. Characters in this plane fit in a single code unit …

CJK

Chinese, Japanese, and Korean: a collective term for the unified Han ideograph blocks and related scripts in Unicode. CJK Unified Ideographs contains more than 20,992 characters …

Han Unification

The process of mapping Chinese, Japanese, and Korean ideographs that share a …

Hangul Jamo

The individual consonant and vowel components (jamo) of the Korean Hangul writing …

ISO 10646 / Universal Character Set

An international standard (ISO/IEC 10646) that is synchronized with Unicode. It defines the same character set and code points but does not include Unicode's additional algorithms and properties.

Unicode

A universal character encoding standard that assigns a unique number (code point) to every character in every writing system. Version 16.0 contains 154,998 assigned characters.

Unicode Character Database (UCD)

A collection of machine-readable data files that define all Unicode character properties, including UnicodeData.txt, Blocks.txt, Scripts.txt, and others.

Unicode Standard Annex (UAX)

Normative or informative documents that are integral parts of the Unicode Standard. …

Unicode Technical Report (UTR)

Informational documents published by the Unicode Consortium covering specific topics like security …

Unicode scalar value

Any code point except the surrogate code points (U+D800–U+DFFF). This is the set of valid values that can represent actual characters, 1,112,064 values in total.
