What is ตัวแทน?

จุดรหัส U+D800–U+DFFF ที่สงวนไว้เฉพาะสำหรับคู่ surrogate ของ UTF-16 ไม่ใช่ค่าสเกลาร์ Unicode ที่ถูกต้องและไม่ควรปรากฏเป็นอักขระเดี่ยวๆ

มาตรฐาน Unicode

ค่าสเกลาร์ Unicode

จุดรหัสใดๆ ยกเว้นจุดรหัส surrogate (U+D800–U+DFFF) ชุดค่าที่ถูกต้องซึ่งสามารถแทนอักขระจริงได้ รวมทั้งสิ้น 1,112,064 ค่า

2021-10-04 · อัปเดต 2024-07-09

What is a Unicode Scalar Value?

A Unicode scalar value is any Unicode code point except the 2,048 surrogate code points (U+D800–U+DFFF). In other words, scalar values are all code points in the ranges U+0000–U+D7FF and U+E000–U+10FFFF, totaling 1,112,064 values.

The concept was introduced because UTF-8 and UTF-32 cannot encode surrogate code points — they are only meaningful in UTF-16 as part of surrogate pairs. A value that can validly be encoded in all three UTF forms is called a scalar value.

The Rust programming language uses char to represent exactly a Unicode scalar value, making this concept central to Rust's string semantics.

Why Exclude Surrogates?

In UTF-16, surrogates only have meaning as paired code units. A lone surrogate (not part of a valid pair) represents a malformed sequence. Including surrogates in "valid Unicode scalars" would mean applications must handle them as standalone code points — but there is nothing meaningful they could represent.

By defining scalar values as the set of code points that can be independently encoded, Unicode avoids the ambiguity of "what does a lone surrogate mean outside UTF-16?"

All code points:       U+0000–U+10FFFF   = 1,114,112
Surrogate code points: U+D800–U+DFFF     =     2,048
Unicode scalar values: all minus surrogates = 1,112,064

Scalar Values in Rust

Rust's char type is defined as "a Unicode scalar value" — exactly 4 bytes, guaranteed to be in the range U+0000–U+D7FF or U+E000–U+10FFFF:

// Rust: char = Unicode scalar value
let a: char = 'A';               // U+0041
let emoji: char = '😀';          // U+1F600
let max: char = '\u{10FFFF}';    // Maximum scalar value

// This would be a compile error (surrogates are not scalar values):
// let bad: char = '\u{D800}';   // ERROR: not a valid Unicode scalar

// u32::from(char) always gives a valid scalar value
println!("{}", u32::from(emoji)); // 128512

Scalar Values in Swift

Swift's Unicode.Scalar type mirrors Rust's concept:

let scalar = Unicode.Scalar("A")!
print(scalar.value)   // 65

// Surrogate values are rejected:
let invalid = Unicode.Scalar(0xD800)  // nil — not a scalar value

Scalar Values in Python

Python 3's str type is a sequence of Unicode code points. Technically, Python allows lone surrogates in strings (using \ud800 etc.), but they are not proper scalar values:

# Python allows surrogates in strings but they are problematic
s = "\uD800"  # lone high surrogate
print(s.encode("utf-8"))   # raises UnicodeEncodeError — not a scalar value

# Proper scalar values encode without error
s = "😀"
print(s.encode("utf-8"))   # b'\xf0\x9f\x98\x80' — fine

Scalar Values vs Characters

Unicode scalar values are a subset of code points, but they are still not identical to user-perceived characters (grapheme clusters). A single grapheme cluster like 🏳️‍🌈 (rainbow flag) consists of four scalar values:

U+1F3F3  WHITE FLAG
U+FE0F   VARIATION SELECTOR-16
U+200D   ZERO WIDTH JOINER
U+1F308  RAINBOW

All four are valid scalar values, but together they form one grapheme cluster.

Quick Facts

Property	Value
Definition	All code points except surrogates
Range	U+0000–U+D7FF and U+E000–U+10FFFF
Total count	1,112,064
Excluded	Surrogates U+D800–U+DFFF (2,048 values)
Rust type	`char` (exactly a Unicode scalar value)
Swift type	`Unicode.Scalar`
Encodable in UTF-8?	Yes — all scalar values have valid UTF-8 encodings
Encodable in UTF-32?	Yes
Encodable in UTF-16?	Yes (with surrogate pairs for supplementary scalars)

คำศัพท์ที่เกี่ยวข้อง

จุดรหัส ตัวแทน

เพิ่มเติมใน มาตรฐาน Unicode

CJK (จีน ญี่ปุ่น เกาหลี)

จีน ญี่ปุ่น และเกาหลี คำรวมสำหรับบล็อกอักษรจีน Han ที่รวมกันและอักษรที่เกี่ยวข้องใน Unicode CJK Unified Ideographs มีอักขระมากกว่า 20,992 …

Han Unification

The process of mapping Chinese, Japanese, and Korean ideographs that share a …

Hangul Jamo

The individual consonant and vowel components (jamo) of the Korean Hangul writing …

ISO 10646 / ชุดอักขระสากล

มาตรฐานสากล (ISO/IEC 10646) ที่ซิงโครไนซ์กับ Unicode กำหนดชุดอักขระและจุดรหัสเดียวกัน แต่ไม่มีอัลกอริธึมและคุณสมบัติเพิ่มเติมของ Unicode

Unicode

มาตรฐานการเข้ารหัสอักขระสากลที่กำหนดหมายเลขเฉพาะ (จุดรหัส) ให้กับทุกอักขระในทุกระบบการเขียน เวอร์ชัน 16.0 มีอักขระที่กำหนดแล้ว 154,998 ตัว

Unicode Standard Annex (UAX)

Normative or informative documents that are integral parts of the Unicode Standard. …

Unicode Technical Report (UTR)

Informational documents published by the Unicode Consortium covering specific topics like security …

จุดรหัส

ค่าตัวเลขในพื้นที่รหัส Unicode (U+0000 ถึง U+10FFFF) เขียนในรูปแบบ U+XXXX ไม่ใช่ทุกจุดรหัสที่จะถูกกำหนดให้กับอักขระ

จุดรหัสที่ยังไม่ได้กำหนด

จุดรหัสที่ยังไม่ได้กำหนดอักขระในเวอร์ชัน Unicode ใดๆ จัดอยู่ในหมวด Cn (ยังไม่กำหนด) อาจถูกกำหนดในเวอร์ชันอนาคต

จุดรหัสที่สงวนไว้

จุดรหัสที่กันไว้สำหรับการกำหนดมาตรฐานในอนาคต แตกต่างจาก noncharacter (สงวนไว้ถาวร) และพื้นที่ private use (กำหนดได้โดยผู้ใช้)

← กลับไปยังอภิธานศัพท์