Code Point
A numeric value in the Unicode code space (U+0000 to U+10FFFF), written as U+XXXX. Not every code point has a character assigned to it.
What is a Code Point?
A code point is the fundamental unit of the Unicode standard: a unique integer assigned to a
single character, symbol, or abstract entity. Code points are written in the format U+XXXX where
the Xs are hexadecimal digits — for example, U+0041 for the Latin capital letter A, or U+1F600
for the grinning face emoji 😀.
The Unicode code space spans from U+0000 to U+10FFFF, providing 1,114,112 possible positions.
Not every position is occupied: as of Unicode 16.0, 154,998 are assigned to characters. The rest
are unassigned (reserved for future use), designated for private use, reserved as surrogates, or permanently set aside as noncharacters.
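The notation and range described above can be sketched in a few lines of Python. This is only an illustration: `format_code_point` and `CODE_SPACE_SIZE` are hypothetical names introduced here, not part of any library.

```python
import sys

# U+0000 through U+10FFFF inclusive
CODE_SPACE_SIZE = 0x10FFFF + 1


def format_code_point(cp: int) -> str:
    """Return the U+XXXX notation for an integer code point (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"{cp:#x} is outside the Unicode code space")
    # At least four uppercase hex digits; supplementary code points need five or six.
    return f"U+{cp:04X}"


print(format_code_point(ord("A")))         # U+0041
print(format_code_point(ord("\U0001F600")))  # U+1F600
print(CODE_SPACE_SIZE)                     # 1114112
print(sys.maxunicode == 0x10FFFF)          # True on Python 3.3 and later
```

`ord` returns the code point of a one-character string, and `chr` performs the reverse mapping.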
Anatomy of a Code Point Notation
U+1F600
│ └─┬─┘
│   └── Hexadecimal value (4 digits for BMP, 5 or 6 for supplementary)
└────── "U+" prefix (Unicode notation)
Decimal equivalent: 128,512
Binary: 1 1111 0110 0000 0000
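To check the decimal and binary equivalents shown above, the notation can be parsed back into an integer. The following is a small sketch; `parse_code_point` is a hypothetical helper name.

```python
def parse_code_point(notation: str) -> int:
    """Parse 'U+XXXX' notation into an integer (hypothetical helper)."""
    if not notation.upper().startswith("U+"):
        raise ValueError("expected a 'U+' prefix")
    # The prefix is only notation; the value is the hexadecimal part.
    return int(notation[2:], 16)


cp = parse_code_point("U+1F600")
print(cp)         # 128512
print(f"{cp:b}")  # 11111011000000000
print(f"{cp:X}")  # 1F600
```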
The "U+" prefix is a notation convention; it is not part of the value itself. When code points appear in source code or data, they are written with language-specific escape sequences instead:
| Language | Escape syntax | Example (U+1F600) |
|---|---|---|
| Python | \U0001F600 | "\U0001F600" |
| JavaScript | \u{1F600} | "\u{1F600}" |
| Java | surrogate pair | "\uD83D\uDE00" |
| CSS | \1F600 | content: "\1F600" |
| HTML | &#x1F600; | &#x1F600; |
| Rust | \u{1F600} | '\u{1F600}' |
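The Java row uses a surrogate pair because `\u` escapes there hold only four hex digits. The arithmetic behind the pair can be sketched as follows; this is the standard UTF-16 calculation, and `to_surrogate_pair` is a hypothetical helper name.

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into UTF-16 high and low surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    offset = cp - 0x10000               # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"\\u{high:04X}\\u{low:04X}")  # \uD83D\uDE00
```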
Code Points vs Characters
A code point is not always identical to what a user perceives as a "character" (called a grapheme cluster). Consider:
é can be represented as:
- U+00E9 (LATIN SMALL LETTER E WITH ACUTE, a single code point)
- U+0065 + U+0301 (e + combining acute accent, two code points)
🏳️‍🌈 (rainbow flag) is:
- U+1F3F3 + U+FE0F + U+200D + U+1F308 (four code points, one visible character)
This distinction matters for string length calculations, cursor movement, and text editing.
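The two spellings of é and the rainbow flag sequence can be inspected with Python's standard `unicodedata` module. This sketch only counts code points and normalizes; it does not attempt grapheme cluster segmentation (UAX #29).

```python
import unicodedata

composed = "\u00E9"     # é as a single code point
decomposed = "e\u0301"  # e followed by a combining acute accent

print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True

# Rainbow flag: white flag + variation selector + zero width joiner + rainbow
rainbow_flag = "\U0001F3F3\uFE0F\u200D\U0001F308"
print(len(rainbow_flag))  # 4
```

Normalizing both sides to the same form (NFC or NFD) before comparing is one common way to treat the two spellings as equal.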
Ranges and Planes
Code points are organized into 17 planes of 65,536 values each:
- Plane 0 (U+0000–U+FFFF): Basic Multilingual Plane — most everyday characters
- Plane 1 (U+10000–U+1FFFF): Supplementary Multilingual Plane — historic scripts, emoji
- Plane 2 (U+20000–U+2FFFF): Supplementary Ideographic Plane — CJK extension ideographs
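- Plane 3 (U+30000–U+3FFFF): Tertiary Ideographic Plane (further CJK extension ideographs)
- Planes 4–13 (U+40000–U+DFFFF): Unassigned
- Plane 14 (U+E0000–U+EFFFF): Supplementary Special-purpose Plane (tag characters, supplementary variation selectors)
- Planes 15–16 (U+F0000–U+10FFFF): Supplementary Private Use Areas A and B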
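Because each plane holds exactly 65,536 (2^16) values, the plane number is simply the code point shifted right by 16 bits. A minimal sketch, with `plane_of` as a hypothetical helper name:

```python
def plane_of(cp: int) -> int:
    """Return the plane number (0 to 16) that contains the code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"{cp:#x} is outside the Unicode code space")
    return cp >> 16  # each plane spans 0x10000 code points


print(plane_of(0x0041))    # 0  (Basic Multilingual Plane)
print(plane_of(0x1F600))   # 1  (Supplementary Multilingual Plane)
print(plane_of(0x20000))   # 2  (Supplementary Ideographic Plane)
print(plane_of(0x10FFFF))  # 16 (last private use plane)
```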
Common Pitfalls
String length confusion: In many languages, length returns the number of code units (bytes
or 16-bit words), not code points, and neither matches the number of visible characters.
# Python 3
s = "😀"
len(s)           # 1 (one code point)
len(s.encode())  # 4 (bytes in UTF-8)

// JavaScript
"😀".length       // 2 (UTF-16 surrogate pair)
[..."😀"].length  // 1 (iterating by code point)
BMP assumption: Legacy code that assumes all characters fit in 16 bits (a BMP-only assumption) breaks on emoji, historic scripts, and rare CJK extensions.
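One way to spot input that would trip a BMP-only assumption is to look for code points above U+FFFF and to compare the code point count with the UTF-16 code unit count. The helpers below are hypothetical names used for illustration only.

```python
def utf16_code_units(s: str) -> int:
    """Number of 16-bit code units the string occupies in UTF-16."""
    # "utf-16-le" emits no byte order mark, so every 2 bytes is one code unit.
    return len(s.encode("utf-16-le")) // 2


def has_non_bmp(s: str) -> bool:
    """True if any code point lies outside the Basic Multilingual Plane."""
    return any(ord(ch) > 0xFFFF for ch in s)


text = "A\U0001F600"           # "A" followed by the grinning face emoji
print(len(text))               # 2 code points
print(utf16_code_units(text))  # 3 code units
print(has_non_bmp(text))       # True
print(has_non_bmp("h\u00E9llo"))  # False
```

When the two counts differ, the string contains at least one surrogate pair in UTF-16, and 16-bit indexing will not line up with code points.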
Quick Facts
| Property | Value |
|---|---|
| Notation | U+XXXX (4–6 hex digits) |
| Minimum | U+0000 (NULL) |
| Maximum | U+10FFFF |
| Total possible | 1,114,112 |
| Assigned (v16.0) | 154,998 |
| Lowest pictographic emoji | U+00A9 © (copyright sign) |
| Highest assigned emoji | U+1FAF8 (rightwards pushing hand, v15.0) |
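Related Terms
More in Unicode Standard
CJK (Chinese, Japanese, Korean): the collective term in Unicode for the unified Han ideograph blocks and related writing systems. The core CJK Unified Ideographs block contains 20,992 characters, with more in the extension blocks.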
The process of mapping Chinese, Japanese, and Korean ideographs that share a …
The individual consonant and vowel components (jamo) of the Korean Hangul writing …
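The international standard synchronized with Unicode (ISO/IEC 10646). It defines the same character set and code points but does not include Unicode's additional algorithms and properties.
The universal character encoding standard that assigns a unique number (code point) to every character in every writing system. Version 16.0 contains 154,998 assigned characters.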
Normative or informative documents that are integral parts of the Unicode Standard. …
Informational documents published by the Unicode Consortium covering specific topics like security …
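The collection of machine-readable data files that define all Unicode character properties, including UnicodeData.txt, Blocks.txt, and Scripts.txt.
All code points except the surrogate code points (U+D800–U+DFFF). This is the set of valid values that can represent actual characters, 1,112,064 in total.
Major releases of the Unicode Standard; each adds characters, scripts, and features. Unicode 16.0 was released in September 2024.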