A code point is a numeric value in the Unicode code space (U+0000 through U+10FFFF), written as U+XXXX. Not every code point is assigned to a character.

A plane is a contiguous block of 65,536 code points. Unicode has 17 planes (0–16): Plane 0 is the BMP, Plane 1 is the SMP (emoji, historic scripts), Plane 2 is the SIP (CJK extensions).

What is the Basic Multilingual Plane (BMP)?

Plane 0 (U+0000–U+FFFF), containing the most commonly used characters, including Latin, Greek, Cyrillic, CJK, Arabic, and most symbols. Characters here fit in a single UTF-16 code unit.

Unicode Standard

Code space

The full range of possible Unicode code points: U+0000 through U+10FFFF (1,114,112 in total), divided into 17 planes of 65,536 code points each.

2021-05-26 · Updated 2024-04-22

What is the Unicode Code Space?

The code space is the complete range of integer values available for Unicode code points: U+0000 through U+10FFFF, totaling exactly 1,114,112 positions. Think of it as the full address space of the Unicode standard — every character, symbol, and abstract entity that could ever be assigned a Unicode code point must fit within this range.

The code space is not fully occupied. As of Unicode 16.0, 154,998 of these positions are assigned to characters. The remaining roughly 959,000 positions are unassigned (available for future characters), reserved as surrogates, or permanently designated as noncharacters or for private use.

The Numbers

U+0000      →  U+10FFFF
0           →  1,114,111

Total code points:  1,114,112
= 17 planes × 65,536 points per plane
= 17 × 0x10000
= 0x110000 (hex)
= 2^20 + 2^16 (supplementary planes + BMP; not a round power of 2, see below)

The value 1,114,112 is not a round power of two. It equals 17 × 65,536, which results from the deliberate choice to have 17 planes of 65,536 code points each. The upper limit of U+10FFFF was set to match the largest value expressible with UTF-16 surrogate pairs, so the code space ends exactly where the reach of UTF-16 ends.
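The arithmetic above can be checked directly. The following is a minimal sketch; the constant names and the `plane_of` helper are illustrative and not part of any standard API.

```python
# Illustrative constants (names are assumptions, not from a library).
PLANE_SIZE = 0x10000        # 65,536 code points per plane
PLANE_COUNT = 17            # planes 0 through 16
MAX_CODE_POINT = 0x10FFFF   # last position in the code space

code_space_size = PLANE_COUNT * PLANE_SIZE


def plane_of(cp: int) -> int:
    """Return the plane number (0-16) that contains code point cp."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError(f"{cp:#x} is outside the Unicode code space")
    return cp >> 16  # each plane spans exactly 2**16 code points


print(code_space_size)               # 1114112
print(hex(code_space_size))          # 0x110000
print(MAX_CODE_POINT.bit_length())   # 21
print(plane_of(0x1F600))             # 1
```

Shifting right by 16 bits works because every plane starts at a multiple of 0x10000, so the plane number is simply the part of the code point above the low 16 bits.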

Why U+10FFFF as the Upper Limit?

UTF-16 uses surrogate pairs to encode supplementary characters. Each surrogate in a pair carries 10 bits of payload, so a pair provides 20 bits of additional addressing: 2^20 = 1,048,576 supplementary code points. Adding the 65,536 BMP positions yields exactly 1,114,112, the size of the code space. The U+10FFFF upper limit was thus chosen to keep UTF-16 and the code space in exact alignment.
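The 10 + 10 bit split described above can be sketched as a pair of conversion functions. The function names are hypothetical; the constants are the standard surrogate ranges (high surrogates U+D800–U+DBFF, low surrogates U+DC00–U+DFFF).

```python
from __future__ import annotations


def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points use a surrogate pair")
    offset = cp - 0x10000             # 20-bit value: 0x00000..0xFFFFF
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))                         # 0xd83d 0xde00
print(hex(from_surrogate_pair(0xDBFF, 0xDFFF)))    # 0x10ffff
```

The largest possible pair, (0xDBFF, 0xDFFF), decodes to U+10FFFF, which is why the code space stops there.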

Code Space Breakdown

Category	Count	Notes
Assigned characters	154,998	Unicode 16.0
Private Use Area	137,468	U+E000–U+F8FF, U+F0000–U+FFFFD, U+100000–U+10FFFD
Surrogates (reserved)	2,048	U+D800–U+DFFF, never characters
Noncharacters	66	Last 2 code points of each of the 17 planes (34) + 32 in U+FDD0–U+FDEF (Arabic Presentation Forms-A)
Unassigned	~819,000	Available for future Unicode versions
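The fixed-size rows of the breakdown can be reproduced from their ranges. This is a minimal sketch with a hypothetical `span` helper; note that the last two code points of planes 15 and 16 are noncharacters, so the private-use ranges in those planes are counted up to U+FFFFD and U+10FFFD.

```python
def span(first: int, last: int) -> int:
    """Number of code points in the inclusive range first..last."""
    return last - first + 1


# Private use: one BMP range plus planes 15 and 16 (minus their noncharacters).
private_use = (
    span(0xE000, 0xF8FF)
    + span(0xF0000, 0xFFFFD)
    + span(0x100000, 0x10FFFD)
)

# Surrogates: 1,024 high + 1,024 low.
surrogates = span(0xD800, 0xDFFF)

# Noncharacters: U+FDD0..U+FDEF plus the last two code points of all 17 planes.
noncharacters = span(0xFDD0, 0xFDEF) + 2 * 17

print(private_use)     # 137468
print(surrogates)      # 2048
print(noncharacters)   # 66
```

The assigned and unassigned rows are not derivable this way, since they depend on the character data of a specific Unicode version.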

Code Space vs Character Repertoire

The code space defines positions; the character repertoire is the subset of those positions that are currently assigned. Unicode's stability policies ensure that once a character is assigned to a position, that assignment is permanent — no code point is ever recycled or reassigned to a different character.

# Python: check whether an integer lies within the Unicode code space
def is_valid_code_point(cp: int) -> bool:
    return 0x0000 <= cp <= 0x10FFFF

# Surrogate range: reserved for UTF-16, never assigned to characters
def is_surrogate(cp: int) -> bool:
    return 0xD800 <= cp <= 0xDFFF

# Noncharacters: the last two code points of every plane, plus U+FDD0–U+FDEF
def is_noncharacter(cp: int) -> bool:
    low16 = cp & 0xFFFF  # offset of cp within its plane
    return low16 in (0xFFFE, 0xFFFF) or 0xFDD0 <= cp <= 0xFDEF
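To tell a position in the code space apart from a member of the character repertoire, one option is to consult the general category from Python's standard `unicodedata` module. The sketch below is an assumption about how such a classifier might look (the `classify` function and its labels are illustrative); its answers reflect whichever Unicode version the running interpreter ships, reported by `unicodedata.unidata_version`.

```python
import unicodedata


def classify(cp: int) -> str:
    """Label an integer by its role in the code space (illustrative labels)."""
    if not 0 <= cp <= 0x10FFFF:
        return "out of range"
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    if (cp & 0xFFFF) >= 0xFFFE or 0xFDD0 <= cp <= 0xFDEF:
        return "noncharacter"
    category = unicodedata.category(chr(cp))
    if category == "Co":    # private use
        return "private use"
    if category == "Cn":    # not assigned in this interpreter's Unicode data
        return "unassigned"
    return "assigned"


print(unicodedata.unidata_version)   # Unicode version of the interpreter's data
for cp in (0x41, 0x0378, 0xD800, 0xE000, 0xFFFE, 0x110000):
    print(f"{cp:#08x}: {classify(cp)}")
```

Because "unassigned" depends on the Unicode version, a code point reported as unassigned by an older interpreter may be assigned in a newer one; the reverse never happens under the stability policy described above.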

Historical Context

The original Unicode proposal (1988) envisioned a 16-bit code space of 65,536 characters. Engineers believed this would be sufficient for all world languages. By Unicode 2.0 (1996) it was clear the CJK ideograph extensions alone would exceed this limit. The standard was extended to 21 bits (the current code space), but the legacy 16-bit assumption is why surrogate pairs exist in UTF-16 and why JavaScript's String.length counts UTF-16 code units rather than Unicode code points.
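The gap between code units and code points is easy to observe. Python's `len` on a `str` counts code points, so the sketch below derives the UTF-16 code unit count by encoding; the `utf16_code_units` helper is a hypothetical name, and its result corresponds to what a UTF-16-based length such as JavaScript's String.length would report for the same text.

```python
def utf16_code_units(text: str) -> int:
    """Count UTF-16 code units; each unit is 2 bytes in UTF-16-LE (no BOM)."""
    return len(text.encode("utf-16-le")) // 2


sample = "A\U0001F600"   # one BMP character plus one supplementary character

print(len(sample))                # 2  (code points)
print(utf16_code_units(sample))   # 3  (the supplementary character takes 2 units)
```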

Quick Facts

Property	Value
Minimum	U+0000
Maximum	U+10FFFF
Total positions	1,114,112
Assigned (v16.0)	154,998 (13.9%)
Private use	137,468
Surrogates (permanently reserved)	2,048
Noncharacters	66
Bit width required	21 bits
UTF-16 coverage	Exactly matches code space upper bound

Related Terms

Code point · Plane · Basic Multilingual Plane (BMP)

More in Unicode Standard

Private Use Area

Reserved ranges in which organizations can define their own characters: BMP …

Basic Multilingual Plane (BMP)

Plane 0 (U+0000–U+FFFF), containing the most commonly used characters, including Latin, Greek, …

Plane

A contiguous block of 65,536 code points. Unicode has 17 planes …

Supplementary planes

Planes 1–16 (U+10000–U+10FFFF), containing emoji, historic scripts, CJK extensions, and musical notation. …

Noncharacter

Code points permanently reserved for internal use (66 in total): U+FDD0–U+FDEF …

CJK

Chinese, Japanese, and Korean: a collective term for the Han ideograph blocks that …

Han Unification

The process of mapping Chinese, Japanese, and Korean ideographs that share a …

Hangul Jamo

The individual consonant and vowel components (jamo) of the Korean Hangul writing …

ISO 10646 / Universal Character Set

An international standard (ISO/IEC 10646) synchronized with Unicode, defining the character repertoire and …

Abstract character

A unit of information used to organize, control, or represent textual data …
