Code Space
The range of all possible Unicode code points: U+0000 through U+10FFFF (1,114,112 in total), divided into 17 planes of 65,536 code points each.
What is the Unicode Code Space?
The code space is the complete range of integer values available for Unicode code points: U+0000 through U+10FFFF, totaling exactly 1,114,112 positions. Think of it as the full address space of the Unicode standard — every character, symbol, and abstract entity that could ever be assigned a Unicode code point must fit within this range.
The code space is not fully occupied. As of Unicode 16.0, 154,998 of these positions are assigned to characters. The remaining ~959,000 positions are unassigned (available for future characters), set aside for private use, reserved as surrogates, or permanently designated as noncharacters.
The Numbers
U+0000 → U+10FFFF
0 → 1,114,111
Total code points: 1,114,112
= 17 planes × 65,536 points per plane
= 17 × 0x10000
= 0x110000 (hex)
= 2^20 + 2^16 (a sum of two powers of 2, not a single power of 2; see below)
The value 1,114,112 is not a round power of two. It equals 17 × 65,536, which results from the deliberate choice to have 17 planes of 65,536 code points each. The upper limit of U+10FFFF was set to match the highest code point that UTF-16 surrogate pairs can express, so the reach of UTF-16 defines the boundary of the code space.
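The plane arithmetic can be checked with a short Python sketch. The names `PLANE_SIZE`, `PLANE_COUNT`, and `plane_and_offset` are illustrative and not taken from any library; the only assumption is that a plane number is the code point divided by 65,536.

```python
PLANE_SIZE = 0x10000   # 65,536 code points per plane
PLANE_COUNT = 17       # planes 0 through 16
CODE_SPACE_SIZE = PLANE_COUNT * PLANE_SIZE


def plane_and_offset(cp: int) -> tuple:
    """Split a code point into (plane number, offset within that plane)."""
    if not 0 <= cp < CODE_SPACE_SIZE:
        raise ValueError(f"{cp:#x} is outside the Unicode code space")
    # The high bits select the plane; the low 16 bits are the offset.
    return cp >> 16, cp & 0xFFFF


print(CODE_SPACE_SIZE)                    # 1114112
print(CODE_SPACE_SIZE == 2**20 + 2**16)   # True
print(plane_and_offset(0x10FFFF))         # (16, 65535)
```

The last code point, U+10FFFF, lands at the final offset of plane 16, which is why the plane count is 17 rather than 16.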
Why U+10FFFF as the Upper Limit?
UTF-16 uses surrogate pairs to encode supplementary characters. Each surrogate in a pair carries 10 bits of payload, so a pair provides 20 bits of additional addressing: 2^20 = 1,048,576 supplementary code points. Adding the 65,536 BMP positions yields exactly 1,114,112, the size of the code space. The U+10FFFF upper limit was thus chosen to keep UTF-16 and the code space in exact alignment.
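A minimal sketch of the surrogate-pair arithmetic, assuming the standard UTF-16 layout (high surrogates U+D800–U+DBFF, low surrogates U+DC00–U+DFFF). The function names are hypothetical helpers written for this illustration.

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Encode a supplementary code point as a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points use surrogate pairs")
    offset = cp - 0x10000               # 20-bit value: 0 .. 0xFFFFF
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Decode a (high, low) surrogate pair back into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The largest possible pair decodes to the top of the code space.
print(hex(from_surrogate_pair(0xDBFF, 0xDFFF)))   # 0x10ffff
```

Because both halves are limited to 10 bits, no pair can decode to anything above U+10FFFF.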
Code Space Breakdown
| Category | Count (approx.) | Notes |
|---|---|---|
| Assigned characters | 154,998 | Unicode 16.0 |
| Private Use Area | 137,468 | U+E000–U+F8FF, U+F0000–U+FFFFD, U+100000–U+10FFFD |
| Surrogates (reserved) | 2,048 | U+D800–U+DFFF — never characters |
| Noncharacters | 66 | 34 at the ends of the planes (last 2 code points of each of the 17 planes) + 32 in U+FDD0–U+FDEF (Arabic Presentation Forms-A) |
| Unassigned | ~819,000 | Available for future Unicode versions |
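The fixed-size rows of the table can be recomputed from their ranges. This is a sketch only: the assigned-character count is copied from the table rather than derived, and the variable names are illustrative.

```python
def range_size(first: int, last: int) -> int:
    """Number of code points in the inclusive range first..last."""
    return last - first + 1


# Private use: one BMP area plus planes 15 and 16. The last two code points
# of each plane are noncharacters, so the plane ranges stop at xFFFD.
PRIVATE_USE_RANGES = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]

private_use = sum(range_size(a, b) for a, b in PRIVATE_USE_RANGES)
surrogates = range_size(0xD800, 0xDFFF)
noncharacters = 17 * 2 + range_size(0xFDD0, 0xFDEF)

ASSIGNED_V16 = 154_998   # taken from the table, not computed here
remaining = 0x110000 - ASSIGNED_V16 - private_use - surrogates - noncharacters

print(private_use, surrogates, noncharacters)   # 137468 2048 66
print(remaining)                                # 819532
```

The remainder is consistent with the roughly 819,000 unassigned positions listed above.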
Code Space vs Character Repertoire
The code space defines positions; the character repertoire is the subset of those positions that are currently assigned. Unicode's stability policies ensure that once a character is assigned to a position, that assignment is permanent — no code point is ever recycled or reassigned to a different character.
# Python: check if a code point is within the Unicode code space
def is_valid_code_point(cp: int) -> bool:
    return 0x0000 <= cp <= 0x10FFFF

# Check for the surrogate range (code points that never represent characters)
def is_surrogate(cp: int) -> bool:
    return 0xD800 <= cp <= 0xDFFF

# Check for a noncharacter: the last two code points of each plane, plus
# U+FDD0..U+FDEF. Assumes cp has already passed is_valid_code_point().
def is_noncharacter(cp: int) -> bool:
    low_16_bits = cp & 0xFFFF
    return low_16_bits in (0xFFFE, 0xFFFF) or 0xFDD0 <= cp <= 0xFDEF
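To separate positions in the code space from members of the character repertoire, one option is Python's standard `unicodedata` module. The sketch below is an illustration, not a normative test: `classify` is a hypothetical helper, and the "assigned" versus "unassigned" answer depends on the Unicode version bundled with the interpreter (see `unicodedata.unidata_version`).

```python
import unicodedata


def classify(cp: int) -> str:
    """Roughly classify an integer against the Unicode code space."""
    if not 0 <= cp <= 0x10FFFF:
        return "outside code space"
    category = unicodedata.category(chr(cp))   # General_Category, e.g. "Lu"
    if category == "Cs":
        return "surrogate"
    if category == "Co":
        return "private use"
    if category == "Cn":
        # Cn covers both noncharacters and not-yet-assigned code points.
        is_nonchar = (cp & 0xFFFE) == 0xFFFE or 0xFDD0 <= cp <= 0xFDEF
        return "noncharacter" if is_nonchar else "unassigned"
    return "assigned"


for cp in (0x41, 0xD800, 0xE000, 0xFFFE, 0x110000):
    print(f"{cp:#08x}: {classify(cp)}")
```

Only the "unassigned" bucket can shrink over time; the other designations are permanent.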
Historical Context
The original Unicode proposal (1988) envisioned a 16-bit code space of 65,536 characters.
Engineers believed this would be sufficient for all world languages. By Unicode 2.0 (1996) it was
clear the CJK ideograph extensions alone would exceed this limit. The standard was extended to
21 bits (the current code space), but the legacy 16-bit assumption is why surrogate pairs exist
in UTF-16 and why JavaScript's String.length counts UTF-16 code units rather than Unicode
code points.
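The difference between code units and code points can be reproduced in Python, whose `str` type counts code points. The helper `utf16_code_units` is a hypothetical name; it mimics what a UTF-16-based length would report by encoding the text and counting 2-byte units.

```python
def utf16_code_units(text: str) -> int:
    """Count the UTF-16 code units needed to store text."""
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" adds no byte order mark.
    return len(text.encode("utf-16-le")) // 2


sample = "a\U0001F600"   # "a" followed by U+1F600, a supplementary character

print(len(sample))                # 2 code points
print(utf16_code_units(sample))   # 3 code units (1 + a surrogate pair)
```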
Quick Facts
| Property | Value |
|---|---|
| Minimum | U+0000 |
| Maximum | U+10FFFF |
| Total positions | 1,114,112 |
| Assigned (v16.0) | 154,998 (13.9%) |
| Private use | 137,468 |
| Surrogates (permanently reserved) | 2,048 |
| Noncharacters | 66 |
| Bit width required | 21 bits |
| UTF-16 coverage | Exactly matches code space upper bound |
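Two of the derived figures in the table, the bit width and the assigned share, follow directly from the other rows. A brief sketch, with the assigned count taken from the table as given:

```python
MAX_CODE_POINT = 0x10FFFF
TOTAL_POSITIONS = MAX_CODE_POINT + 1
ASSIGNED_V16 = 154_998   # figure quoted in the table above

# Smallest number of bits that can hold the largest code point.
bit_width = MAX_CODE_POINT.bit_length()

# Fraction of the code space assigned to characters.
assigned_share = ASSIGNED_V16 / TOTAL_POSITIONS

print(bit_width)                  # 21
print(f"{assigned_share:.1%}")    # 13.9%
```

Although 21 bits are required, the code space uses only a little over half of the 2,097,152 values that 21 bits can represent.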
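Related Terms
More in Unicode Standard
CJK (Chinese, Japanese, Korean): the collective term in Unicode for the unified Han ideograph blocks and related writing systems; the CJK Unified Ideographs block contains more than 20,992 characters.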
The process of mapping Chinese, Japanese, and Korean ideographs that share a …
The individual consonant and vowel components (jamo) of the Korean Hangul writing …
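The international standard (ISO/IEC 10646) kept synchronized with Unicode; it defines the same character set and code points but does not include Unicode's additional algorithms and properties.
The universal character encoding standard that assigns a unique number (code point) to every character in every writing system; version 16.0 contains 154,998 assigned characters.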
Normative or informative documents that are integral parts of the Unicode Standard. …
Informational documents published by the Unicode Consortium covering specific topics like security …
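The collection of machine-readable data files that define all Unicode character properties, including UnicodeData.txt, Blocks.txt, Scripts.txt, and others.
All code points except the surrogate code points (U+D800–U+DFFF); the set of valid values that can represent actual characters, 1,112,064 in total.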
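A major release of the Unicode Standard; each release adds new characters, scripts, and features. The version referenced here is Unicode 16.0 (released September 2024).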