U+ UnicodeFYI
यूनिकोड मानक

यूनिकोड स्केलर मान

कोई भी कोड पॉइंट सिवाय सरोगेट कोड पॉइंट्स (U+D800–U+DFFF) के। वास्तविक वर्णों को निरूपित करने वाले मान्य मानों का सेट, कुल 1,112,064।

· अपडेट किया गया

What is a Unicode Scalar Value?

A Unicode scalar value is any Unicode code point except the 2,048 surrogate code points (U+D800–U+DFFF). In other words, scalar values are all code points in the ranges U+0000–U+D7FF and U+E000–U+10FFFF, totaling 1,112,064 values.

The concept was introduced because UTF-8 and UTF-32 cannot encode surrogate code points — they are only meaningful in UTF-16 as part of surrogate pairs. A value that can validly be encoded in all three UTF forms is called a scalar value.

The Rust programming language uses char to represent exactly a Unicode scalar value, making this concept central to Rust's string semantics.

Why Exclude Surrogates?

In UTF-16, surrogates only have meaning as paired code units. A lone surrogate (not part of a valid pair) represents a malformed sequence. Including surrogates in "valid Unicode scalars" would mean applications must handle them as standalone code points — but there is nothing meaningful they could represent.

By defining scalar values as the set of code points that can be independently encoded, Unicode avoids the ambiguity of "what does a lone surrogate mean outside UTF-16?"

All code points:       U+0000–U+10FFFF   = 1,114,112
Surrogate code points: U+D800–U+DFFF     =     2,048
Unicode scalar values: all minus surrogates = 1,112,064

Scalar Values in Rust

Rust's char type is defined as "a Unicode scalar value" — exactly 4 bytes, guaranteed to be in the range U+0000–U+D7FF or U+E000–U+10FFFF:

// Rust: char = Unicode scalar value
let a: char = 'A';               // U+0041
let emoji: char = '😀';          // U+1F600
let max: char = '\u{10FFFF}';    // Maximum scalar value

// This would be a compile error (surrogates are not scalar values):
// let bad: char = '\u{D800}';   // ERROR: not a valid Unicode scalar

// u32::from(char) always gives a valid scalar value
println!("{}", u32::from(emoji)); // 128512

Scalar Values in Swift

Swift's Unicode.Scalar type mirrors Rust's concept:

let scalar = Unicode.Scalar("A")!
print(scalar.value)   // 65

// Surrogate values are rejected:
let invalid = Unicode.Scalar(0xD800)  // nil — not a scalar value

Scalar Values in Python

Python 3's str type is a sequence of Unicode code points. Technically, Python allows lone surrogates in strings (using \ud800 etc.), but they are not proper scalar values:

# Python allows surrogates in strings but they are problematic
s = "\uD800"  # lone high surrogate
print(s.encode("utf-8"))   # raises UnicodeEncodeError — not a scalar value

# Proper scalar values encode without error
s = "😀"
print(s.encode("utf-8"))   # b'\xf0\x9f\x98\x80' — fine

Scalar Values vs Characters

Unicode scalar values are a subset of code points, but they are still not identical to user-perceived characters (grapheme clusters). A single grapheme cluster like 🏳️‍🌈 (rainbow flag) consists of four scalar values:

U+1F3F3  WHITE FLAG
U+FE0F   VARIATION SELECTOR-16
U+200D   ZERO WIDTH JOINER
U+1F308  RAINBOW

All four are valid scalar values, but together they form one grapheme cluster.

Quick Facts

Property Value
Definition All code points except surrogates
Range U+0000–U+D7FF and U+E000–U+10FFFF
Total count 1,112,064
Excluded Surrogates U+D800–U+DFFF (2,048 values)
Rust type char (exactly a Unicode scalar value)
Swift type Unicode.Scalar
Encodable in UTF-8? Yes — all scalar values have valid UTF-8 encodings
Encodable in UTF-32? Yes
Encodable in UTF-16? Yes (with surrogate pairs for supplementary scalars)

संबंधित शब्द

यूनिकोड मानक में और

CJK (चीनी, जापानी, कोरियाई)

Chinese, Japanese, और Korean — Unicode में unified Han ideograph block और …

Han Unification

The process of mapping Chinese, Japanese, and Korean ideographs that share a …

Hangul Jamo

The individual consonant and vowel components (jamo) of the Korean Hangul writing …

ISO 10646 / सार्वभौमिक वर्ण समूह

अंतर्राष्ट्रीय मानक (ISO/IEC 10646) जो Unicode के साथ समकालिक है, समान अक्षर …

Unicode

सार्वभौमिक अक्षर एन्कोडिंग मानक जो हर लिखित प्रणाली के हर अक्षर को …

Unicode Standard Annex (UAX)

Normative or informative documents that are integral parts of the Unicode Standard. …

Unicode Technical Report (UTR)

Informational documents published by the Unicode Consortium covering specific topics like security …

अनसाइन्ड कोड बिंदु

एक कोड पॉइंट जिसे अभी तक किसी भी Unicode संस्करण में वर्ण …

अमूर्त वर्ण

पाठ्य डेटा को व्यवस्थित करने, नियंत्रित करने या निरूपित करने के लिए …

असाइन किया गया वर्ण

एक code point जिसे Unicode संस्करण में अक्षर पदनाम दिया गया है। …