UTF-16 is a variable-length Unicode encoding that uses 2 or 4 bytes (1 or 2 16-bit code units). It is used internally by Java, JavaScript, and Windows.

What is the Basic Multilingual Plane (BMP)?

Plane 0 (U+0000–U+FFFF) contains the most frequently used characters, including Latin, Greek, Cyrillic, CJK, Arabic, and most symbols. Characters in this plane fit in a single UTF-16 code unit.

What is a surrogate pair?

Two 16-bit code units (high surrogate U+D800–U+DBFF + low surrogate U+DC00–U+DFFF) that together encode a supplementary character in UTF-16. 😀 = D83D DE00

Encoding

UCS-2

An obsolete fixed-length 2-byte encoding that covers only the BMP (U+0000–U+FFFF). It is the predecessor of UTF-16 and cannot represent supplementary characters.

2021-03-01 · Updated 2024-04-18

What is UCS-2?

UCS-2 (Universal Character Set coded in 2 octets) is a fixed-width 2-byte character encoding that was the original "wide character" format of the early Unicode standard. It maps each code point directly to a 16-bit integer, covering only the Basic Multilingual Plane (U+0000–U+FFFF) — the first 65,536 Unicode code points.

UCS-2 is now obsolete and has been superseded by UTF-16, but it remains historically significant. Windows NT, Java, and JavaScript were all designed with UCS-2 in mind, and the legacy of that choice continues to shape how modern programmers encounter surrogate pairs and the quirky length behavior of JavaScript strings.

History and Context

In the late 1980s and early 1990s, the Unicode Consortium originally envisioned Unicode as a fixed-width 16-bit encoding: 65,536 positions should be enough for all the world's characters. This was the UCS-2 design. The appeal was simplicity — a fixed 2-byte-per-character encoding eliminates all the complexity of variable-length parsing.

By 1996, it became clear that 65,536 was not enough. Chinese, Japanese, and Korean needed more characters than the BMP could hold, so Unicode 2.0 introduced the surrogate mechanism and extended the code space to U+10FFFF, just over a million code points. UCS-2 became a dead end.

UTF-16 was defined as the backward-compatible successor: for code points in the BMP, UTF-16 and UCS-2 are identical. For supplementary characters (U+10000 and above), UTF-16 adds surrogate pairs — a mechanism UCS-2 has no provision for.
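The surrogate-pair mechanism is plain bit arithmetic, which a short sketch can make concrete. The function names below are illustrative and not part of any library; the sketch assumes Python 3.9 or later.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00
```

A UCS-2 decoder has no equivalent of from_surrogate_pair: it would read D83D and DE00 as two separate 16-bit values, neither of which is a character.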

How UCS-2 Works

Each character is stored as a 16-bit big-endian or little-endian integer equal to its Unicode code point value:

Character	Code Point	UCS-2 BE	UCS-2 LE
A	U+0041	`00 41`	`41 00`
é	U+00E9	`00 E9`	`E9 00`
中	U+4E2D	`4E 2D`	`2D 4E`
😀	U+1F600	NOT REPRESENTABLE	NOT REPRESENTABLE

The 😀 emoji at U+1F600 cannot be encoded in UCS-2 at all. A UCS-2 encoder faced with a supplementary character has no valid output — it must either skip the character, substitute a replacement character (U+FFFD), or throw an error.
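The three options (skip, substitute, or raise) might look like the following minimal sketch. The function encode_ucs2 and its errors parameter are hypothetical, modeled loosely on Python's codec error handlers; Python itself ships no UCS-2 codec. Surrogate code points (U+D800–U+DFFF) are not treated specially here.

```python
import struct

REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def encode_ucs2(text: str, errors: str = "strict", big_endian: bool = True) -> bytes:
    """Hypothetical UCS-2 encoder: one 16-bit unit per code point, BMP only."""
    if errors not in ("strict", "ignore", "replace"):
        raise ValueError(f"unknown error mode: {errors!r}")
    fmt = ">H" if big_endian else "<H"
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:                  # supplementary: not representable
            if errors == "strict":
                raise ValueError(f"U+{cp:04X} is outside the BMP")
            if errors == "ignore":
                continue                 # skip the character
            cp = REPLACEMENT             # errors == "replace"
        out += struct.pack(fmt, cp)
    return bytes(out)


print(encode_ucs2("中").hex(" "))                      # 4e 2d
print(encode_ucs2("A😀", errors="replace").hex(" "))   # 00 41 ff fd
```

Whichever mode is chosen, information is lost or the call fails; only UTF-16's surrogate pairs preserve the character.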

Where UCS-2 Left Its Fingerprints

The most widespread legacy of UCS-2 is in JavaScript. The language was designed in 1995, when Unicode was still expected to fit in 16 bits. JavaScript strings are sequences of UTF-16 code units, and their .length property counts code units, not characters, a direct inheritance from UCS-2 thinking:

// The UCS-2 legacy in JavaScript
'😀'.length;           // 2 — counts 16-bit code units (UCS-2 style)
[...'😀'].length;      // 1 — counts Unicode code points (correct)

// String methods that break on supplementary chars
'😀'.charAt(0);        // '\uD83D' — high surrogate alone
'😀'.charCodeAt(0);    // 55357 (0xD83D) — not the emoji's code point!
'😀'.codePointAt(0);   // 128512 (0x1F600) — correct (ES6+)

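Java's char type is similarly a single 16-bit code unit, a holdover from UCS-2: one char cannot hold a supplementary code point such as the 😀 emoji at U+1F600, which needs two chars (a surrogate pair).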
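The gap between the two counts can be reproduced outside JavaScript and Java. The sketch below uses Python, whose len() counts code points, to compute what a UTF-16-based length would report; the helper name utf16_length is illustrative.

```python
def utf16_length(text: str) -> int:
    """Number of UTF-16 code units, i.e. what JavaScript's .length reports."""
    # BMP characters take one 16-bit unit; supplementary ones take two.
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


for sample in ["A", "中", "😀", "A😀"]:
    # len() in Python 3 counts code points, not code units
    print(sample, utf16_length(sample), len(sample))
# A 1 1
# 中 1 1
# 😀 2 1
# A😀 3 2
```

The two numbers agree for BMP-only text, which is why length bugs of this kind tend to stay hidden until a supplementary character shows up.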

Code Examples

# Python has no codec named 'ucs-2', but for BMP-only text
# UTF-16 produces the same bytes that UCS-2 would
text = 'Hello'
ucs2_equivalent = text.encode('utf-16-le')
print(ucs2_equivalent)  # b'H\x00e\x00l\x00l\x00o\x00'

# Supplementary characters cannot be stored in genuine UCS-2.
# Python's UTF-16 codec never raises here; it emits a surrogate pair instead.
emoji = '😀'
print(emoji.encode('utf-16-le'))  # b'=\xd8\x00\xde' (D83D DE00, little-endian)

# A strict UCS-2 encoder has to detect and reject such input itself
if any(ord(ch) > 0xFFFF for ch in emoji):
    print("Cannot encode in UCS-2")

Quick Facts

Property	Value
Full Name	Universal Character Set coded in 2 octets
Code unit size	16 bits (2 bytes)
Characters covered	U+0000–U+FFFF (BMP only)
Supplementary chars	Not supported
Status	Obsolete — superseded by UTF-16
Legacy in	JavaScript `.length`, Java `char`, early Windows NT
Fixed-width	Yes

Common Pitfalls

Treating UTF-16 as UCS-2. In the BMP, UTF-16 and UCS-2 produce identical bytes. Code written for BMP-only text may work fine for years, then suddenly fail when an emoji or rare CJK character appears. The subtle bug: treating surrogate code units (U+D800–U+DFFF) as valid characters when they are only meaningful as surrogate pairs in UTF-16.
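One way to catch this early is to scan 16-bit data for surrogates before handing it to code that assumes UCS-2. The following is a minimal sketch, assuming little-endian input without a byte order mark; the function name and its three return labels are illustrative.

```python
import struct


def classify_utf16(data: bytes) -> str:
    """Classify little-endian 16-bit data as 'bmp-only', 'has-pairs', or 'malformed'."""
    if len(data) % 2:
        return "malformed"               # odd byte count cannot be 16-bit units
    units = struct.unpack(f"<{len(data) // 2}H", data)
    result = "bmp-only"
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:     # high surrogate: a low one must follow
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                result = "has-pairs"
                i += 2
                continue
            return "malformed"
        if 0xDC00 <= unit <= 0xDFFF:     # low surrogate with no high before it
            return "malformed"
        i += 1
    return result


print(classify_utf16("Hello".encode("utf-16-le")))  # bmp-only
print(classify_utf16("A😀".encode("utf-16-le")))     # has-pairs
```

Only data classified as bmp-only is safe to process with UCS-2 assumptions such as "one unit equals one character".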

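JavaScript string length bugs. Any JavaScript code that uses str.length, str.slice(), str.substr(), or str[i] works in code units and may silently split a surrogate pair in strings containing emoji or other supplementary characters. Safer alternatives are codePointAt(), iteration with spread syntax or for...of (both walk code points), and Intl.Segmenter (which walks user-perceived characters).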

Database character set issues. Some older MySQL configurations use utf8 (an alias for utf8mb3, which stores at most 3 bytes per character and so cannot hold 4-byte supplementary characters) rather than utf8mb4. This is effectively a database-level BMP-only limitation, the same restriction as UCS-2 in disguise.
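Because UTF-8 needs 4 bytes exactly when a code point lies outside the BMP, an application can check a string before writing it to a 3-byte column. A minimal sketch, with the helper name needs_utf8mb4 chosen for illustration:

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character takes 4 bytes in UTF-8, i.e. lies outside the BMP."""
    return any(ord(ch) > 0xFFFF for ch in text)


print(len("中".encode("utf-8")))    # 3
print(len("😀".encode("utf-8")))    # 4
print(needs_utf8mb4("Hello 中"))    # False
print(needs_utf8mb4("Hello 😀"))    # True
```

Such a check only flags the problem; the durable fix is to migrate the column and connection to utf8mb4.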

Related terms

UTF-16, Basic Multilingual Plane (BMP), surrogate pair

More in Encoding

ASCII

American Standard Code for Information Interchange. A 7-bit encoding covering 128 characters …

ASCII Art

Visual art created from text characters, originally limited to the 95 printable …

Base64

Binary-to-text encoding that represents binary data using 64 ASCII characters (A–Z, a–z, …

Big5

A Traditional Chinese character encoding used mainly in Taiwan and Hong Kong. It encodes about 13,000 CJK characters.

EBCDIC

Extended Binary Coded Decimal Interchange Code. An IBM mainframe encoding with non-contiguous letter ranges, still used in banking and enterprise mainframes.

EUC-KR

A Korean character encoding based on KS X 1001 that maps Hangul and Hanja to two-byte sequences.

GB2312 / GB18030

A family of Simplified Chinese encodings: GB2312 (6,763 characters) evolved into GBK and then GB18030, the mandatory Chinese national standard, which is compatible with Unicode.

ISO 8859

A family of 8-bit single-byte encodings for different language groups. ISO 8859-1 (Latin-1) is the basis of the first 256 Unicode code points.

Shift JIS

A Japanese character encoding that mixes single-byte ASCII/JIS Roman with double-byte JIS X 0208 kanji. It is still in use in legacy Japanese systems.

UTF-16

A variable-length Unicode encoding that uses 2 or 4 bytes (1 or 2 code units of 16 …
