Unicode Planes and the BMP
Unicode is divided into 17 planes of 65,536 code points each, with Plane 0 known as the Basic Multilingual Plane (BMP). This guide explains the structure of Unicode planes, what lives in each one, and why the BMP matters to developers.
Unicode's total code point space — U+0000 through U+10FFFF — is enormous: 1,114,112 possible values. To keep this vast range organized, the standard divides it into 17 contiguous regions called planes, each containing exactly 65,536 (2^16) code points. Most of the characters you encounter every day live in Plane 0, the Basic Multilingual Plane (BMP), but the supplementary planes hold everything from emoji and ancient scripts to rare CJK ideographs and special-purpose control characters. This guide walks through all 17 planes, explains what they contain, and shows how the plane structure affects encoding and programming.
Overview of All 17 Planes
| Plane | Range | Abbreviation | Name |
|---|---|---|---|
| 0 | U+0000–U+FFFF | BMP | Basic Multilingual Plane |
| 1 | U+10000–U+1FFFF | SMP | Supplementary Multilingual Plane |
| 2 | U+20000–U+2FFFF | SIP | Supplementary Ideographic Plane |
| 3 | U+30000–U+3FFFF | TIP | Tertiary Ideographic Plane |
| 4 | U+40000–U+4FFFF | — | Unassigned |
| 5 | U+50000–U+5FFFF | — | Unassigned |
| 6 | U+60000–U+6FFFF | — | Unassigned |
| 7 | U+70000–U+7FFFF | — | Unassigned |
| 8 | U+80000–U+8FFFF | — | Unassigned |
| 9 | U+90000–U+9FFFF | — | Unassigned |
| 10 | U+A0000–U+AFFFF | — | Unassigned |
| 11 | U+B0000–U+BFFFF | — | Unassigned |
| 12 | U+C0000–U+CFFFF | — | Unassigned |
| 13 | U+D0000–U+DFFFF | — | Unassigned |
| 14 | U+E0000–U+EFFFF | SSP | Supplementary Special-purpose Plane |
| 15 | U+F0000–U+FFFFF | SPUA-A | Supplementary Private Use Area A |
| 16 | U+100000–U+10FFFF | SPUA-B | Supplementary Private Use Area B |
Planes 4 through 13 are entirely unassigned: ten empty planes held in reserve for future growth. With 154,998 characters assigned as of Unicode 16.0, out of 1,114,112 total code points, there is no shortage of room.
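The arithmetic behind the table can be checked directly. The following is a minimal sketch, not part of any standard API: `plane_range`, `PLANE_SIZE`, and `PLANE_COUNT` are names chosen here for illustration.

```python
from typing import Tuple

PLANE_SIZE = 0x10000   # 65,536 code points per plane
PLANE_COUNT = 17       # planes 0 through 16


def plane_range(plane: int) -> Tuple[int, int]:
    """Return the (first, last) code point of a plane (hypothetical helper)."""
    if not 0 <= plane < PLANE_COUNT:
        raise ValueError(f"plane must be in 0..16, got {plane}")
    first = plane * PLANE_SIZE
    return first, first + PLANE_SIZE - 1


total = PLANE_SIZE * PLANE_COUNT
print(f"{total:,}")  # 1,114,112

for plane in (0, 1, 16):
    first, last = plane_range(plane)
    print(f"Plane {plane}: U+{first:04X}-U+{last:04X}")
```

Because every plane is exactly 2^16 code points wide, a plane's boundaries follow from its number alone, which is why the ranges in the table step by 0x10000.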
Plane 0: Basic Multilingual Plane (BMP)
The BMP is the most densely populated and most important plane. It covers U+0000 through U+FFFF and holds the characters needed for virtually every modern writing system in active use.
What's in the BMP
| Block Range | Content |
|---|---|
| U+0000–U+007F | Basic Latin (ASCII) — English letters, digits, common punctuation |
| U+0080–U+00FF | Latin-1 Supplement — accented letters for Western European languages |
| U+0100–U+024F | Latin Extended-A/B — additional Latin letters for Eastern European, Vietnamese |
| U+0370–U+03FF | Greek and Coptic |
| U+0400–U+04FF | Cyrillic |
| U+0530–U+058F | Armenian |
| U+0590–U+05FF | Hebrew |
| U+0600–U+06FF | Arabic |
| U+0900–U+097F | Devanagari (Hindi, Sanskrit, Marathi) |
| U+0E00–U+0E7F | Thai |
| U+1100–U+11FF | Hangul Jamo (Korean) |
| U+2000–U+206F | General Punctuation (em dash, ellipsis, zero-width characters) |
| U+2100–U+214F | Letterlike Symbols (℃, ℮, №) |
| U+2190–U+21FF | Arrows (←, →, ↑, ↓) |
| U+2200–U+22FF | Mathematical Operators (∀, ∃, ∞, ≤, ≥) |
| U+2600–U+26FF | Miscellaneous Symbols (☀, ☎, ♠, ♣) |
| U+3000–U+303F | CJK Symbols and Punctuation |
| U+3040–U+309F | Hiragana |
| U+30A0–U+30FF | Katakana |
| U+4E00–U+9FFF | CJK Unified Ideographs — the 20,992 most common Chinese/Japanese/Korean characters |
| U+AC00–U+D7AF | Hangul Syllables — 11,172 precomposed Korean syllables |
| U+D800–U+DFFF | Surrogates (not characters — used by UTF-16) |
| U+E000–U+F8FF | Private Use Area (6,400 slots for application-specific characters) |
| U+F900–U+FAFF | CJK Compatibility Ideographs |
| U+FE00–U+FE0F | Variation Selectors (text vs. emoji presentation) |
| U+FF00–U+FFEF | Halfwidth and Fullwidth Forms |
| U+FFF0–U+FFFF | Specials (includes U+FFFD REPLACEMENT CHARACTER and the noncharacters U+FFFE and U+FFFF) |
The BMP alone covers Latin, Greek, Cyrillic, Arabic, Hebrew, Devanagari, Thai, Chinese, Japanese, Korean, and dozens of other scripts — enough for the vast majority of modern text.
The Surrogate Gap
The range U+D800 through U+DFFF (2,048 code points) is permanently reserved for UTF-16 surrogates. These are not characters and will never be assigned. They exist solely as a mechanism for UTF-16 to encode code points above U+FFFF using pairs of 16-bit code units.
This means the BMP has a theoretical capacity of 65,536 code points but an effective capacity of 63,488 (65,536 minus 2,048 surrogates).
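The subtraction above, and the fact that surrogates are not characters, can be illustrated with a short sketch. The constant and function names are assumptions made for this example; the only library behavior relied on is that Python's strict UTF-8 encoder rejects a lone surrogate.

```python
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF


def is_surrogate(code_point: int) -> bool:
    """True if the code point lies in the reserved surrogate range."""
    return SURROGATE_FIRST <= code_point <= SURROGATE_LAST


surrogate_count = SURROGATE_LAST - SURROGATE_FIRST + 1
print(surrogate_count)            # 2048
print(0x10000 - surrogate_count)  # 63488

# A lone surrogate is not a character, so strict UTF-8 encoding fails.
try:
    chr(0xD800).encode("utf-8")
except UnicodeEncodeError:
    print("lone surrogate rejected")
```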
Plane 1: Supplementary Multilingual Plane (SMP)
The SMP (U+10000–U+1FFFF) is the second most important plane and the one that has grown fastest in recent Unicode versions, largely thanks to emoji.
Key contents
| Block Range | Content |
|---|---|
| U+10000–U+1007F | Linear B Syllabary |
| U+10080–U+100FF | Linear B Ideograms |
| U+10300–U+1032F | Old Italic |
| U+10330–U+1034F | Gothic |
| U+10400–U+1044F | Deseret |
| U+10800–U+1083F | Cypriot Syllabary |
| U+12000–U+1237F | Cuneiform |
| U+13000–U+1342F | Egyptian Hieroglyphs |
| U+16800–U+16A3F | Bamum Supplement |
| U+1D100–U+1D1FF | Musical Symbols (𝄞, 𝄢) |
| U+1D400–U+1D7FF | Mathematical Alphanumeric Symbols (𝐀, 𝑩, 𝒞) |
| U+1F300–U+1F5FF | Miscellaneous Symbols and Pictographs (🌍, 🏠, 🔥) |
| U+1F600–U+1F64F | Emoticons (😀, 😂, 😍) |
| U+1F680–U+1F6FF | Transport and Map Symbols (🚀, 🚗, 🚂) |
| U+1F900–U+1F9FF | Supplemental Symbols and Pictographs (🤖, 🧠, 🦊) |
| U+1FA00–U+1FA6F | Chess Symbols |
| U+1FA70–U+1FAFF | Symbols and Pictographs Extended-A (🪐, 🫠) |
The SMP is where you find emoji, historic and archaic scripts (Cuneiform, Egyptian Hieroglyphs, Old Italic, Linear B), musical notation, and mathematical letter variants (bold, italic, script, fraktur, double-struck alphabets used in formal mathematics).
Plane 2: Supplementary Ideographic Plane (SIP)
The SIP (U+20000–U+2FFFF) exists to extend the CJK Unified Ideographs beyond what the BMP can hold. As of Unicode 16.0, it contains over 60,000 characters across several extension blocks:
| Block | Range | Characters |
|---|---|---|
| CJK Unified Ideographs Extension B | U+20000–U+2A6DF | 42,720 |
| CJK Unified Ideographs Extension C | U+2A700–U+2B73F | 4,154 |
| CJK Unified Ideographs Extension D | U+2B740–U+2B81F | 222 |
| CJK Unified Ideographs Extension E | U+2B820–U+2CEAF | 5,762 |
| CJK Unified Ideographs Extension F | U+2CEB0–U+2EBEF | 7,473 |
| CJK Compatibility Ideographs Supplement | U+2F800–U+2FA1F | 542 |
These are rare or historical CJK characters — personal names, place names, variant forms, and characters from classical texts. Most everyday Chinese, Japanese, or Korean text uses only the BMP characters, but scholarly work and government databases frequently need SIP characters.
Plane 3: Tertiary Ideographic Plane (TIP)
The TIP (U+30000–U+3FFFF) was introduced in Unicode 13.0 (2020) for even rarer CJK ideographs:
| Block | Range | Characters |
|---|---|---|
| CJK Unified Ideographs Extension G | U+30000–U+3134F | 4,939 |
| CJK Unified Ideographs Extension H | U+31350–U+323AF | 4,192 |
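These extensions accommodate characters needed for historical texts, regional variants, and comprehensive dictionaries. Despite its later letter, CJK Unified Ideographs Extension I (U+2EBF0–U+2EE5F, 622 characters, added in Unicode 15.1) is not in the TIP: it sits in Plane 2, filling the gap after Extension F.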
Planes 4–13: Unassigned (Reserved for Future)
Ten full planes — 655,360 code points — are completely empty. They serve as a buffer ensuring that Unicode can accommodate future needs without running out of space. No characters are currently planned for these planes, and any future allocation would go through the standard Unicode proposal and review process.
Plane 14: Supplementary Special-purpose Plane (SSP)
The SSP (U+E0000–U+EFFFF) contains two specialized blocks:
| Block | Range | Purpose |
|---|---|---|
| Tags | U+E0000–U+E007F | Language tags (deprecated in favor of higher-level protocols) |
| Variation Selectors Supplement | U+E0100–U+E01EF | 240 additional variation selectors for CJK ideograph variants |
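The Tags block was originally designed to embed language information in plain text (e.g., tagging a run of text as English or Japanese). This use was deprecated because language tagging is better handled by markup (such as the HTML lang attribute) or other higher-level protocols. However, these tag characters found new life in emoji tag sequences, which form subdivision flags such as those of Scotland or Texas. Each such flag is U+1F3F4 WAVING BLACK FLAG followed by tag characters spelling the region code (gbsct for Scotland) and a closing U+E007F CANCEL TAG. Country flags use a different mechanism: pairs of regional indicator symbols from Plane 1.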
The Variation Selectors Supplement provides standardized variation sequences for CJK ideographs, allowing fonts to select specific glyph variants for characters that have multiple accepted visual forms.
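For the tag characters in this plane, an emoji tag sequence can be assembled by hand, since each tag character mirrors an ASCII character at a fixed offset of 0xE0000. The sketch below is illustrative only: `subdivision_flag` is a hypothetical helper, and whether the result renders as a flag depends on font and platform support.

```python
BLACK_FLAG = 0x1F3F4   # WAVING BLACK FLAG, the base of the sequence
TAG_BASE = 0xE0000     # tag characters mirror ASCII at this offset
CANCEL_TAG = 0xE007F   # terminates the tag sequence


def subdivision_flag(region_code: str) -> str:
    """Build an emoji tag sequence for a code such as 'gbsct' (hypothetical helper)."""
    if not (region_code.isascii() and region_code.isalnum()):
        raise ValueError("expected ASCII letters and digits only")
    tags = "".join(chr(TAG_BASE + ord(ch)) for ch in region_code.lower())
    return chr(BLACK_FLAG) + tags + chr(CANCEL_TAG)


scotland = subdivision_flag("gbsct")
print([f"U+{ord(ch):04X}" for ch in scotland])
# ['U+1F3F4', 'U+E0067', 'U+E0062', 'U+E0073', 'U+E0063', 'U+E0074', 'U+E007F']
```

The base character comes from Plane 1 and everything after it from Plane 14, so a single visible flag spans seven code points across two planes.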
Planes 15–16: Private Use Area Planes
| Plane | Range | Code Points |
|---|---|---|
| 15 (SPUA-A) | U+F0000–U+FFFFD | 65,534 |
| 16 (SPUA-B) | U+100000–U+10FFFD | 65,534 |
These two planes are entirely designated as Private Use Area space, extending the 6,400 BMP PUA code points with an additional 131,068 slots. Organizations, communities, and software vendors can assign any meaning to these code points for internal use — but the assignments are not interoperable without prior agreement between parties.
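The private-use counts can be reproduced from the three ranges. This is a small sketch under the assumption that the last two code points of each plane are noncharacters and therefore excluded; `PRIVATE_USE_RANGES` and `is_private_use` are names invented for the example, while `unicodedata.category` is the standard-library lookup.

```python
import unicodedata

PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),      # BMP Private Use Area
    (0xF0000, 0xFFFFD),    # Plane 15 (SPUA-A)
    (0x100000, 0x10FFFD),  # Plane 16 (SPUA-B)
)


def is_private_use(code_point: int) -> bool:
    """True if the code point falls in any private-use range."""
    return any(first <= code_point <= last for first, last in PRIVATE_USE_RANGES)


sizes = [last - first + 1 for first, last in PRIVATE_USE_RANGES]
print(sizes)       # [6400, 65534, 65534]
print(sum(sizes))  # 137468

# Cross-check against the general category: "Co" is private use,
# while the plane-final noncharacters are "Cn".
print(unicodedata.category(chr(0xF0000)))  # Co
print(unicodedata.category(chr(0xFFFFE)))  # Cn
```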
The ConScript Unicode Registry (CSUR) maintains informal private-use assignments for constructed scripts (Klingon, Tengwar, Cirth) and other community projects. Most of its allocations sit in the BMP Private Use Area, with some in Plane 15.
How Planes Affect Encoding
The plane structure has direct consequences for the three Unicode encoding forms:
UTF-8
| Plane | Bytes per Character |
|---|---|
| BMP (Plane 0) | 1–3 bytes |
| All other planes (1–16) | 4 bytes |
For BMP characters, UTF-8 uses one byte for ASCII (U+0000–U+007F), two bytes for U+0080–U+07FF, and three bytes for U+0800–U+FFFF. Characters outside the BMP always require four bytes.
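The thresholds above translate into a simple lookup. The following sketch assumes its input is a valid, non-surrogate code point and compares the prediction with what Python's own encoder produces; `utf8_length` is a hypothetical helper name.

```python
def utf8_length(code_point: int) -> int:
    """Predict the UTF-8 byte count from the code point value alone."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


samples = (
    "A",           # U+0041, ASCII
    "\u00e9",      # U+00E9, Latin-1 Supplement
    "\u4e2d",      # U+4E2D, CJK Unified Ideographs
    "\U0001F600",  # U+1F600, Plane 1 emoji
)
for ch in samples:
    predicted = utf8_length(ord(ch))
    actual = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X}: predicted {predicted}, actual {actual}")
```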
UTF-16
| Plane | Code Units |
|---|---|
| BMP (Plane 0) | 1 code unit (2 bytes) |
| All other planes (1–16) | 2 code units / surrogate pair (4 bytes) |
UTF-16 grew out of UCS-2, a fixed-width 16-bit encoding that could represent only the BMP. When Unicode expanded beyond 65,536 code points, the surrogate mechanism was added to reach planes 1–16. Each non-BMP character is encoded as a high surrogate (U+D800–U+DBFF) followed by a low surrogate (U+DC00–U+DFFF).
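The surrogate calculation itself is short: subtract 0x10000 to get a 20-bit offset, then split it into two 10-bit halves. A minimal sketch, with `to_surrogate_pair` as a hypothetical helper, checked against Python's built-in UTF-16 encoder:

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points in planes 1-16 need a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00

# The built-in encoder yields the same two code units (big-endian here).
encoded = "\U0001F600".encode("utf-16-be")
print(encoded.hex())  # d83dde00
```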
UTF-32
Every code point takes exactly 4 bytes regardless of plane. This makes random access trivial but wastes space for text that is primarily BMP characters.
Programming Implications
Detecting the Plane
You can determine a code point's plane by examining its numerical value:
def get_plane(code_point: int) -> int:
    # Return the Unicode plane number (0-16) for a code point.
    return code_point >> 16

# Examples
print(get_plane(0x0041))    # 0 (BMP)
print(get_plane(0x1F600))   # 1 (SMP)
print(get_plane(0x20000))   # 2 (SIP)
print(get_plane(0xE0001))   # 14 (SSP)
print(get_plane(0x100000))  # 16 (SPUA-B)
JavaScript String Length Pitfall
Because JavaScript strings use UTF-16 internally, characters outside the BMP (planes 1–16) appear as two code units:
// BMP character — 1 code unit
"A".length // 1
"中".length // 1
// SMP character — 2 code units (surrogate pair)
"😀".length // 2
"𝐀".length // 2 (U+1D400, Mathematical Bold Capital A)
// Correct code-point-aware counting
[..."😀"].length // 1
This is one of the most common bugs in JavaScript string processing. Any code that uses .length, .charAt(), .charCodeAt(), or .slice() on strings containing non-BMP characters can produce incorrect results. Use the spread operator, for...of, or codePointAt() instead.
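Python strings are indexed by code point, so the same pitfall does not arise there, but it can still be useful to predict what a UTF-16 based runtime will report. The sketch below assumes the string contains no lone surrogates; `utf16_length` is a hypothetical helper.

```python
def utf16_length(text: str) -> int:
    """Count UTF-16 code units, the quantity a UTF-16 string length reports."""
    return len(text.encode("utf-16-le")) // 2


samples = (
    "A",           # U+0041, BMP
    "\u4e2d",      # U+4E2D, BMP
    "\U0001F600",  # U+1F600, Plane 1
    "\U0001D400",  # U+1D400, Plane 1
)
for text in samples:
    print(f"U+{ord(text):04X}: {len(text)} code point, {utf16_length(text)} code unit(s)")
```

A check like this is handy when a length limit is enforced by a UTF-16 system (a JavaScript front end, for example) but validated in Python.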
Database Storage
If your database column uses UTF-16 (like SQL Server's NVARCHAR), non-BMP characters consume 4 bytes instead of 2. MySQL's utf8 charset (an alias for utf8mb3) stores at most three bytes per character, so it supports only the BMP; you need utf8mb4 to store characters from planes 1–16 (including most emoji).
-- MySQL: use utf8mb4 for full Unicode support
CREATE TABLE messages (
    id INT PRIMARY KEY,
    content TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
);
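On the application side, text can be screened before it reaches a column that is still limited to the BMP. This is only a sketch of such a pre-check: both function names are hypothetical, and the code inspects the string without talking to any database.

```python
from typing import List


def fits_in_bmp(text: str) -> bool:
    """True if every character is in Plane 0, i.e. storable in a 3-byte UTF-8 column."""
    return all(ord(ch) <= 0xFFFF for ch in text)


def non_bmp_characters(text: str) -> List[str]:
    """List the characters that would need utf8mb4 (planes 1-16)."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


message = "Launch \U0001F680 at \u4e2d\u5348"
print(fits_in_bmp(message))  # False
print([f"U+{ord(ch):04X}" for ch in non_bmp_characters(message)])  # ['U+1F680']
```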
Key Takeaways
- Unicode divides its 1,114,112 code points into 17 planes of 65,536 code points each.
- Plane 0 (BMP) holds the vast majority of commonly used characters across all modern writing systems.
- Plane 1 (SMP) contains emoji, historic scripts, and mathematical letter variants.
- Planes 2–3 (SIP, TIP) hold rare and historical CJK ideographs.
- Plane 14 (SSP) provides variation selectors and (deprecated) language tags.
- Planes 15–16 are Private Use Areas for application-specific characters.
- Planes 4–13 are entirely unassigned, reserved for future growth.
- The plane boundary matters for encoding: non-BMP characters require 4 bytes in UTF-8, surrogate pairs in UTF-16, and can cause .length bugs in JavaScript.
- Use MySQL utf8mb4 (not utf8) to support all Unicode planes.