Unicode Planes and the BMP

Unicode is divided into 17 planes of 65,536 code points each, with Plane 0 known as the Basic Multilingual Plane (BMP). This guide explains the structure of Unicode planes, what lives in each one, and why the BMP matters to developers.

Published 2021-05-10 · Updated 2024-06-03

Unicode's total code point space — U+0000 through U+10FFFF — is enormous: 1,114,112 possible values. To keep this vast range organized, the standard divides it into 17 contiguous regions called planes, each containing exactly 65,536 (2^16) code points. Most of the characters you encounter every day live in Plane 0, the Basic Multilingual Plane (BMP), but the supplementary planes hold everything from emoji and ancient scripts to rare CJK ideographs and special-purpose control characters. This guide walks through all 17 planes, explains what they contain, and shows how the plane structure affects encoding and programming.

Overview of All 17 Planes

Plane	Range	Abbreviation	Name
0	U+0000–U+FFFF	BMP	Basic Multilingual Plane
1	U+10000–U+1FFFF	SMP	Supplementary Multilingual Plane
2	U+20000–U+2FFFF	SIP	Supplementary Ideographic Plane
3	U+30000–U+3FFFF	TIP	Tertiary Ideographic Plane
4	U+40000–U+4FFFF	—	Unassigned
5	U+50000–U+5FFFF	—	Unassigned
6	U+60000–U+6FFFF	—	Unassigned
7	U+70000–U+7FFFF	—	Unassigned
8	U+80000–U+8FFFF	—	Unassigned
9	U+90000–U+9FFFF	—	Unassigned
10	U+A0000–U+AFFFF	—	Unassigned
11	U+B0000–U+BFFFF	—	Unassigned
12	U+C0000–U+CFFFF	—	Unassigned
13	U+D0000–U+DFFFF	—	Unassigned
14	U+E0000–U+EFFFF	SSP	Supplementary Special-purpose Plane
15	U+F0000–U+FFFFF	SPUA-A	Supplementary Private Use Area A
16	U+100000–U+10FFFF	SPUA-B	Supplementary Private Use Area B

Planes 4 through 13 are entirely unassigned: ten empty planes held in reserve for future growth. With 154,998 characters assigned as of Unicode 16.0 out of 1,114,112 total code points, there is no shortage of room.
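The arithmetic behind the table can be checked in a few lines. This is a minimal sketch; plane_range is a hypothetical helper name, and the assigned-character count is simply the Unicode 16.0 figure quoted above.

```python
PLANE_SIZE = 0x10000   # 65,536 code points per plane
PLANE_COUNT = 17


def plane_range(plane: int) -> tuple[int, int]:
    """Return the (first, last) code point of a plane numbered 0-16."""
    if not 0 <= plane < PLANE_COUNT:
        raise ValueError(f"plane must be 0-16, got {plane}")
    first = plane * PLANE_SIZE
    return first, first + PLANE_SIZE - 1


total = PLANE_COUNT * PLANE_SIZE
assigned = 154_998  # Unicode 16.0 figure quoted in the text

print(total)  # 1114112
for p in (0, 1, 16):
    first, last = plane_range(p)
    print(f"Plane {p}: U+{first:04X}-U+{last:04X}")
print(f"{assigned / total:.1%} of the code space is assigned")  # 13.9%
```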

Plane 0: Basic Multilingual Plane (BMP)

The BMP is the most densely populated and most important plane. It covers U+0000 through U+FFFF and holds the characters needed for virtually every modern writing system in active use.

What's in the BMP

Block Range	Content
U+0000–U+007F	Basic Latin (ASCII) — English letters, digits, common punctuation
U+0080–U+00FF	Latin-1 Supplement — accented letters for Western European languages
U+0100–U+024F	Latin Extended-A/B — additional Latin letters for Eastern European, Vietnamese
U+0370–U+03FF	Greek and Coptic
U+0400–U+04FF	Cyrillic
U+0530–U+058F	Armenian
U+0590–U+05FF	Hebrew
U+0600–U+06FF	Arabic
U+0900–U+097F	Devanagari (Hindi, Sanskrit, Marathi)
U+0E00–U+0E7F	Thai
U+1100–U+11FF	Hangul Jamo (Korean)
U+2000–U+206F	General Punctuation (em dash, ellipsis, zero-width characters)
U+2100–U+214F	Letterlike Symbols (℃, ℮, №)
U+2190–U+21FF	Arrows (←, →, ↑, ↓)
U+2200–U+22FF	Mathematical Operators (∀, ∃, ∞, ≤, ≥)
U+2600–U+26FF	Miscellaneous Symbols (☀, ☎, ♠, ♣)
U+3000–U+303F	CJK Symbols and Punctuation
U+3040–U+309F	Hiragana
U+30A0–U+30FF	Katakana
U+4E00–U+9FFF	CJK Unified Ideographs — the 20,992 most common Chinese/Japanese/Korean characters
U+AC00–U+D7AF	Hangul Syllables — 11,172 precomposed Korean syllables
U+D800–U+DFFF	Surrogates (not characters — used by UTF-16)
U+E000–U+F8FF	Private Use Area (6,400 slots for application-specific characters)
U+F900–U+FAFF	CJK Compatibility Ideographs
U+FE00–U+FE0F	Variation Selectors (text vs. emoji presentation)
U+FF00–U+FFEF	Halfwidth and Fullwidth Forms
U+FFF0–U+FFFF	Specials (includes U+FFFD REPLACEMENT CHARACTER; the byte order mark U+FEFF sits just below this block, at the end of Arabic Presentation Forms-B)

The BMP alone covers Latin, Greek, Cyrillic, Arabic, Hebrew, Devanagari, Thai, Chinese, Japanese, Korean, and dozens of other scripts — enough for the vast majority of modern text.

The Surrogate Gap

The range U+D800 through U+DFFF (2,048 code points) is permanently reserved for UTF-16 surrogates. These are not characters and will never be assigned. They exist solely as a mechanism for UTF-16 to encode code points above U+FFFF using pairs of 16-bit code units.

This means the BMP has a theoretical capacity of 65,536 code points but an effective capacity of 63,488 (65,536 minus 2,048 surrogates).
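As a rough illustration of the surrogate gap, the sketch below tests whether a code point falls in the reserved range and reproduces the capacity arithmetic. The helper name is_surrogate is an assumption made for this example. The final lines show that Python's UTF-8 encoder refuses a lone surrogate, which is consistent with surrogates not being characters.

```python
def is_surrogate(code_point: int) -> bool:
    """True for U+D800 through U+DFFF, the range reserved for UTF-16."""
    return 0xD800 <= code_point <= 0xDFFF


surrogate_count = 0xDFFF - 0xD800 + 1
print(surrogate_count)            # 2048
print(0x10000 - surrogate_count)  # 63488

# A lone surrogate is not a valid scalar value, so encoding it fails.
try:
    chr(0xD800).encode("utf-8")
except UnicodeEncodeError as exc:
    print("cannot encode U+D800:", exc.reason)
```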

Plane 1: Supplementary Multilingual Plane (SMP)

The SMP (U+10000–U+1FFFF) is the second most important plane and the one that has grown fastest in recent Unicode versions, largely thanks to emoji.

Key contents

Block Range	Content
U+10000–U+1007F	Linear B Syllabary
U+10080–U+100FF	Linear B Ideograms
U+10300–U+1032F	Old Italic
U+10330–U+1034F	Gothic
U+10400–U+1044F	Deseret
U+10800–U+1083F	Cypriot Syllabary
U+12000–U+123FF	Cuneiform
U+13000–U+1342F	Egyptian Hieroglyphs
U+16800–U+16A3F	Bamum Supplement
U+1D100–U+1D1FF	Musical Symbols (𝄞, 𝄢)
U+1D400–U+1D7FF	Mathematical Alphanumeric Symbols (𝐀, 𝑩, 𝒞)
U+1F300–U+1F5FF	Miscellaneous Symbols and Pictographs (🌍, 🏠, 🔥)
U+1F600–U+1F64F	Emoticons (😀, 😂, 😍)
U+1F680–U+1F6FF	Transport and Map Symbols (🚀, 🚗, 🚂)
U+1F900–U+1F9FF	Supplemental Symbols and Pictographs (🤖, 🧠, 🦊)
U+1FA00–U+1FA6F	Chess Symbols
U+1FA70–U+1FAFF	Symbols and Pictographs Extended-A (🪐, 🫠)

The SMP is where you find emoji, historic and archaic scripts (Cuneiform, Egyptian Hieroglyphs, Old Italic, Linear B), musical notation, and mathematical letter variants (bold, italic, script, fraktur, double-struck alphabets used in formal mathematics).

Plane 2: Supplementary Ideographic Plane (SIP)

The SIP (U+20000–U+2FFFF) exists to extend the CJK Unified Ideographs beyond what the BMP can hold. As of Unicode 16.0, it contains over 60,000 characters across several extension blocks:

Block	Range	Characters
CJK Unified Ideographs Extension B	U+20000–U+2A6DF	42,720
CJK Unified Ideographs Extension C	U+2A700–U+2B73F	4,154
CJK Unified Ideographs Extension D	U+2B740–U+2B81F	222
CJK Unified Ideographs Extension E	U+2B820–U+2CEAF	5,762
CJK Unified Ideographs Extension F	U+2CEB0–U+2EBEF	7,473
CJK Unified Ideographs Extension I	U+2EBF0–U+2EE5F	622
CJK Compatibility Ideographs Supplement	U+2F800–U+2FA1F	542

These are rare or historical CJK characters — personal names, place names, variant forms, and characters from classical texts. Most everyday Chinese, Japanese, or Korean text uses only the BMP characters, but scholarly work and government databases frequently need SIP characters.
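One way to find out whether a body of text actually reaches into the supplementary planes is to count its code points per plane. The following is a small sketch under that assumption; plane_histogram is a hypothetical name, and the sample string is made up for illustration.

```python
from collections import Counter


def plane_histogram(text: str) -> Counter:
    """Count how many code points of the text fall in each plane."""
    return Counter(ord(ch) >> 16 for ch in text)


# Two BMP ideographs (U+6F22, U+5B57), one Extension B ideograph
# (U+20000, Plane 2), and one emoji (U+1F600, Plane 1).
sample = "\u6f22\u5b57" + chr(0x20000) + chr(0x1F600)
hist = plane_histogram(sample)
print(dict(hist))  # {0: 2, 2: 1, 1: 1}
```

A non-zero count for plane 2 or 3 suggests the data contains the rarer ideographs described above, so every layer that stores or renders it needs supplementary-plane support.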

Plane 3: Tertiary Ideographic Plane (TIP)

The TIP (U+30000–U+3FFFF) was introduced in Unicode 13.0 (2020) for even rarer CJK ideographs:

Block	Range	Characters
CJK Unified Ideographs Extension G	U+30000–U+3134F	4,939
CJK Unified Ideographs Extension H	U+31350–U+323AF	4,192

CJK Unified Ideographs Extension I (U+2EBF0–U+2EE5F, 622 characters), added in Unicode 15.1, sits in Plane 2 rather than in the TIP.

These extensions accommodate characters needed for historical texts, regional variants, and comprehensive dictionaries.

Planes 4–13: Unassigned (Reserved for Future)

Ten full planes — 655,360 code points — are completely empty. They serve as a buffer ensuring that Unicode can accommodate future needs without running out of space. No characters are currently planned for these planes, and any future allocation would go through the standard Unicode proposal and review process.

Plane 14: Supplementary Special-purpose Plane (SSP)

The SSP (U+E0000–U+EFFFF) contains two specialized blocks:

Block	Range	Purpose
Tags	U+E0000–U+E007F	Tag characters, originally for language tagging (deprecated in favor of higher-level protocols)
Variation Selectors Supplement	U+E0100–U+E01EF	240 additional variation selectors for CJK ideograph variants

The Tags block was originally designed to embed language information in plain text (e.g., tagging a run of text as English or Japanese). This use was deprecated because language tagging is better handled by markup (such as the HTML lang attribute) or higher-level protocols. However, the tag characters found new life in emoji tag sequences: a subdivision flag is U+1F3F4 WAVING BLACK FLAG followed by tag characters that spell the region code and a closing U+E007F CANCEL TAG (like the flag of Scotland 🏴󠁧󠁢󠁳󠁣󠁴󠁿 or Texas 🏴󠁵󠁳󠁴󠁸󠁿). Country flags, by contrast, are built from pairs of regional indicator symbols in Plane 1.
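To make the role of the tag characters concrete, here is a sketch that assembles a subdivision flag as an emoji tag sequence: U+1F3F4 WAVING BLACK FLAG, then one tag character per letter of the region code (the ASCII value offset by 0xE0000), then U+E007F CANCEL TAG. The function name subdivision_flag is hypothetical, and whether the result renders as a flag depends on font and platform support.

```python
BLACK_FLAG = "\U0001F3F4"   # WAVING BLACK FLAG, the base of the sequence
CANCEL_TAG = "\U000E007F"   # terminates the tag sequence


def subdivision_flag(region_code: str) -> str:
    """Build an emoji tag sequence from a code such as 'gbsct'."""
    if not (region_code.isascii() and region_code.isalnum()):
        raise ValueError("region code must be ASCII letters or digits")
    # Each tag character mirrors an ASCII character at offset 0xE0000.
    tags = "".join(chr(0xE0000 + ord(ch)) for ch in region_code.lower())
    return BLACK_FLAG + tags + CANCEL_TAG


scotland = subdivision_flag("gbsct")
print([f"U+{ord(ch):04X}" for ch in scotland])
# ['U+1F3F4', 'U+E0067', 'U+E0062', 'U+E0073', 'U+E0063', 'U+E0074', 'U+E007F']
```

Every code point after the base flag lies in Plane 14, which is why these flags cost far more bytes than their single-glyph appearance suggests.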

The Variation Selectors Supplement provides the selectors used in ideographic variation sequences, which let fonts select a specific glyph variant for CJK ideographs that have multiple accepted visual forms.

Planes 15–16: Private Use Area Planes

Plane	Range	Code Points
15 (SPUA-A)	U+F0000–U+FFFFD	65,534
16 (SPUA-B)	U+100000–U+10FFFD	65,534

These two planes are designated as Private Use Area space (apart from the final two code points of each plane, which are noncharacters), extending the 6,400 BMP PUA code points with an additional 131,068 slots. Organizations, communities, and software vendors can assign any meaning to these code points for internal use, but the assignments are not interoperable without prior agreement between parties.
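A private-use check only needs three ranges: the BMP area plus Planes 15 and 16, each plane minus its last two code points (which are noncharacters). The sketch below assumes a hypothetical helper named is_private_use and cross-checks one value against Python's unicodedata module, which reports the general category Co for private-use code points.

```python
import unicodedata

PUA_RANGES = (
    (0xE000, 0xF8FF),      # BMP Private Use Area
    (0xF0000, 0xFFFFD),    # Plane 15, Supplementary Private Use Area A
    (0x100000, 0x10FFFD),  # Plane 16, Supplementary Private Use Area B
)


def is_private_use(code_point: int) -> bool:
    """True if the code point lies in any Private Use Area."""
    return any(lo <= code_point <= hi for lo, hi in PUA_RANGES)


pua_total = sum(hi - lo + 1 for lo, hi in PUA_RANGES)
print(pua_total)                            # 137468 = 6,400 + 131,068
print(unicodedata.category(chr(0xF0000)))   # Co
print(is_private_use(0xFFFFE))              # False (noncharacter)
```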

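The ConScript Unicode Registry (CSUR) maintains informal private-use assignments for constructed scripts and other community projects. Its best-known entries (Klingon, Tengwar, Cirth) sit in the BMP Private Use Area, and the registry also places some scripts in Plane 15.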

How Planes Affect Encoding

The plane structure has direct consequences for the three Unicode encoding forms:

UTF-8

Plane	Bytes per Character
BMP (Plane 0)	1–3 bytes
All other planes (1–16)	4 bytes

For BMP characters, UTF-8 uses one byte for ASCII (U+0000–U+007F), two bytes for U+0080–U+07FF, and three bytes for U+0800–U+FFFF. Characters outside the BMP always require four bytes.
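The byte counts above translate directly into a range check. This is a minimal sketch (utf8_length is a hypothetical name) that is cross-checked against Python's built-in encoder for one sample character per length.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates are not encodable in UTF-8")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4  # planes 1-16


for ch in ("A", "\u00e9", "\u4e2d", "\U0001F600"):
    # Cross-check the ranges against Python's own encoder.
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X}: {utf8_length(ord(ch))} byte(s)")
```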

UTF-16

Plane	Code Units
BMP (Plane 0)	1 code unit (2 bytes)
All other planes (1–16)	2 code units / surrogate pair (4 bytes)

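UTF-16 grew out of UCS-2, a fixed-width 16-bit encoding that could only address the BMP. When Unicode expanded beyond 65,536 code points, the surrogate mechanism was added to reach planes 1–16. Each non-BMP character is encoded as a surrogate pair: a high surrogate (U+D800–U+DBFF) followed by a low surrogate (U+DC00–U+DFFF).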
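The surrogate calculation itself is short: subtract 0x10000 from the code point, then split the remaining 20 bits into two 10-bit halves. The sketch below uses a hypothetical function name, to_surrogate_pair, and verifies the result against Python's UTF-16 encoder.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00

# Cross-check against the big-endian UTF-16 bytes Python produces.
encoded = chr(0x1F600).encode("utf-16-be")
assert encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```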

UTF-32

Every code point takes exactly 4 bytes regardless of plane. This makes indexing by code point trivial but wastes space for text that is primarily BMP characters.
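To compare the three encoding forms side by side, the sketch below encodes the same string with each and reports the byte counts. The helper name encoded_sizes is an assumption for this example, and the little-endian codecs are used so that Python does not prepend a byte order mark.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Byte length of the text in each Unicode encoding form (no BOM)."""
    codecs = (("UTF-8", "utf-8"), ("UTF-16", "utf-16-le"), ("UTF-32", "utf-32-le"))
    return {name: len(text.encode(codec)) for name, codec in codecs}


print(encoded_sizes("hello"))       # {'UTF-8': 5, 'UTF-16': 10, 'UTF-32': 20}
print(encoded_sizes("\u4e2d"))      # {'UTF-8': 3, 'UTF-16': 2, 'UTF-32': 4}
print(encoded_sizes("\U0001F600"))  # {'UTF-8': 4, 'UTF-16': 4, 'UTF-32': 4}
```

Only for supplementary-plane characters do all three forms cost the same four bytes.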

Programming Implications

Detecting the Plane

You can determine a code point's plane by examining its numerical value:

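def get_plane(code_point: int) -> int:
    """Return the Unicode plane number (0-16) for a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    return code_point >> 16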

# Examples
print(get_plane(0x0041))    # 0 — BMP
print(get_plane(0x1F600))   # 1 — SMP
print(get_plane(0x20000))   # 2 — SIP
print(get_plane(0xE0001))   # 14 — SSP
print(get_plane(0x100000))  # 16 — SPUA-B

JavaScript String Length Pitfall

Because JavaScript strings use UTF-16 internally, characters outside the BMP (planes 1–16) appear as two code units:

// BMP character — 1 code unit
"A".length         // 1
"中".length        // 1

// SMP character — 2 code units (surrogate pair)
"😀".length        // 2
"𝐀".length         // 2 (U+1D400, Mathematical Bold Capital A)

// Correct code-point-aware counting
[..."😀"].length   // 1

This is one of the most common bugs in JavaScript string processing. Any code that uses .length, .charAt(), .charCodeAt(), or .slice() on strings containing non-BMP characters can produce incorrect results. Use the spread operator, for...of, or codePointAt() instead.
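The same mismatch can surface on the server side, for example when a Python backend enforces a length limit that a JavaScript front end also checks. Python's len() counts code points, so the two numbers differ for non-BMP text. The sketch below, with the hypothetical helper utf16_length, computes the UTF-16 code unit count that JavaScript's .length would report.

```python
def utf16_length(text: str) -> int:
    """Number of UTF-16 code units, matching JavaScript's .length."""
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


s = "A\U0001F600"
print(len(s))           # 2 code points
print(utf16_length(s))  # 3 code units

# Cross-check against the encoded form (2 bytes per code unit).
assert utf16_length(s) == len(s.encode("utf-16-le")) // 2
```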

Database Storage

If your database column uses UTF-16 (like SQL Server's NVARCHAR), non-BMP characters consume 4 bytes instead of 2. MySQL's utf8 charset (an alias for utf8mb3) stores at most three bytes per character and therefore only supports the BMP; you need utf8mb4 to store characters from planes 1–16, which is where most emoji live.

-- MySQL: use utf8mb4 for full Unicode support
CREATE TABLE messages (
    id INT PRIMARY KEY,
    content TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
);
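Before migrating a legacy column, it can help to know which stored values would not fit in a BMP-only charset. The following sketch lists the offending characters in a string; non_bmp_characters is a hypothetical helper, and how the strings are fetched from the database is left out on purpose.

```python
def non_bmp_characters(text: str) -> list[str]:
    """Return the characters that need more than three UTF-8 bytes."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


print(non_bmp_characters("plain \u4e2d\u6587 text"))  # []
offenders = non_bmp_characters("launch \U0001F680 day")
print([f"U+{ord(ch):04X}" for ch in offenders])        # ['U+1F680']
```

An empty result means the value consists of BMP characters only and would survive a three-byte charset unchanged.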

Key Takeaways

Unicode divides its 1,114,112 code points into 17 planes of 65,536 code points each.
Plane 0 (BMP) holds the vast majority of commonly used characters across all modern writing systems.
Plane 1 (SMP) contains emoji, historic scripts, and mathematical letter variants.
Planes 2–3 (SIP, TIP) hold rare and historical CJK ideographs.
Plane 14 (SSP) provides variation selectors and (deprecated) language tags.
Planes 15–16 are Private Use Areas for application-specific characters.
Planes 4–13 are entirely unassigned, reserved for future growth.
The plane boundary matters for encoding: non-BMP characters require 4 bytes in UTF-8, surrogate pairs in UTF-16, and can cause .length bugs in JavaScript.
Use MySQL utf8mb4 (not utf8) to support all Unicode planes.
