Unicode Planes and the BMP

Unicode is divided into 17 planes, each containing up to 65,536 code points, with Plane 0 known as the Basic Multilingual Plane (BMP). This guide explains the structure of Unicode planes, what lives in each one, and why the BMP matters to developers.

Unicode's total code point space — U+0000 through U+10FFFF — is enormous: 1,114,112 possible values. To keep this vast range organized, the standard divides it into 17 contiguous regions called planes, each containing exactly 65,536 (2^16) code points. Most of the characters you encounter every day live in Plane 0, the Basic Multilingual Plane (BMP), but the supplementary planes hold everything from emoji and ancient scripts to rare CJK ideographs and special-purpose control characters. This guide walks through all 17 planes, explains what they contain, and shows how the plane structure affects encoding and programming.

Overview of All 17 Planes

Plane  Range                Abbreviation  Name
0      U+0000–U+FFFF        BMP           Basic Multilingual Plane
1      U+10000–U+1FFFF      SMP           Supplementary Multilingual Plane
2      U+20000–U+2FFFF      SIP           Supplementary Ideographic Plane
3      U+30000–U+3FFFF      TIP           Tertiary Ideographic Plane
4      U+40000–U+4FFFF      -             Unassigned
5      U+50000–U+5FFFF      -             Unassigned
6      U+60000–U+6FFFF      -             Unassigned
7      U+70000–U+7FFFF      -             Unassigned
8      U+80000–U+8FFFF      -             Unassigned
9      U+90000–U+9FFFF      -             Unassigned
10     U+A0000–U+AFFFF      -             Unassigned
11     U+B0000–U+BFFFF      -             Unassigned
12     U+C0000–U+CFFFF      -             Unassigned
13     U+D0000–U+DFFFF      -             Unassigned
14     U+E0000–U+EFFFF      SSP           Supplementary Special-purpose Plane
15     U+F0000–U+FFFFF      SPUA-A        Supplementary Private Use Area A
16     U+100000–U+10FFFF    SPUA-B        Supplementary Private Use Area B

Planes 4 through 13 are entirely unassigned: ten empty planes held in reserve for future growth. With 154,998 characters assigned as of Unicode 16.0, roughly 14% of the 1,114,112 available code points, there is no shortage of room.
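
The plane arithmetic can be sketched in a few lines of Python. This is only an illustration: the names PLANE_SIZE, PLANE_COUNT, and plane_range are made up for this example and are not part of any library.

```python
PLANE_SIZE = 0x10000   # 65,536 code points per plane
PLANE_COUNT = 17       # planes 0 through 16


def plane_range(plane: int) -> tuple[int, int]:
    """Return the (first, last) code point of a plane numbered 0-16."""
    if not 0 <= plane < PLANE_COUNT:
        raise ValueError(f"plane must be 0-16, got {plane}")
    first = plane * PLANE_SIZE
    return first, first + PLANE_SIZE - 1


for plane in (0, 1, 16):
    first, last = plane_range(plane)
    print(f"Plane {plane}: U+{first:04X}-U+{last:04X}")

# Total size of the code space: 17 * 65,536
print(PLANE_COUNT * PLANE_SIZE)  # 1114112
```

Because every plane has the same size, the first code point of plane n is simply n × 0x10000, which is why the ranges in the table all end in FFFF.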

Plane 0: Basic Multilingual Plane (BMP)

The BMP is the most densely populated and most important plane. It covers U+0000 through U+FFFF and holds the characters needed for virtually every modern writing system in active use.

What's in the BMP

Block Range      Content
U+0000–U+007F    Basic Latin (ASCII): English letters, digits, common punctuation
U+0080–U+00FF    Latin-1 Supplement: accented letters for Western European languages
U+0100–U+024F    Latin Extended-A/B: additional Latin letters for Eastern European languages and some Vietnamese letters
U+0370–U+03FF    Greek and Coptic
U+0400–U+04FF    Cyrillic
U+0530–U+058F    Armenian
U+0590–U+05FF    Hebrew
U+0600–U+06FF    Arabic
U+0900–U+097F    Devanagari (Hindi, Sanskrit, Marathi)
U+0E00–U+0E7F    Thai
U+1100–U+11FF    Hangul Jamo (Korean)
U+2000–U+206F    General Punctuation (em dash, ellipsis, zero-width characters)
U+2100–U+214F    Letterlike Symbols (℃, ℮, №)
U+2190–U+21FF    Arrows (←, →, ↑, ↓)
U+2200–U+22FF    Mathematical Operators (∀, ∃, ∞, ≤, ≥)
U+2600–U+26FF    Miscellaneous Symbols (☀, ☎, ♠, ♣)
U+3000–U+303F    CJK Symbols and Punctuation
U+3040–U+309F    Hiragana
U+30A0–U+30FF    Katakana
U+4E00–U+9FFF    CJK Unified Ideographs: the 20,992 most common Chinese/Japanese/Korean characters
U+AC00–U+D7AF    Hangul Syllables: 11,172 precomposed Korean syllables
U+D800–U+DFFF    Surrogates (not characters; used by UTF-16)
U+E000–U+F8FF    Private Use Area (6,400 slots for application-specific characters)
U+F900–U+FAFF    CJK Compatibility Ideographs
U+FE00–U+FE0F    Variation Selectors (text vs. emoji presentation)
U+FF00–U+FFEF    Halfwidth and Fullwidth Forms
U+FFF0–U+FFFF    Specials (includes U+FFFD REPLACEMENT CHARACTER; the BOM, U+FEFF, belongs to Arabic Presentation Forms-B, not this block)

The BMP alone covers Latin, Greek, Cyrillic, Arabic, Hebrew, Devanagari, Thai, Chinese, Japanese, Korean, and dozens of other scripts — enough for the vast majority of modern text.

The Surrogate Gap

The range U+D800 through U+DFFF (2,048 code points) is permanently reserved for UTF-16 surrogates. These are not characters and will never be assigned. They exist solely as a mechanism for UTF-16 to encode code points above U+FFFF using pairs of 16-bit code units.

This means the BMP has a theoretical capacity of 65,536 code points but an effective capacity of 63,488 (65,536 minus 2,048 surrogates).
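
As a small sketch of the numbers above, the following Python snippet checks the size of the surrogate range and shows that a lone surrogate cannot be encoded as UTF-8. The constant and function names are invented for this example.

```python
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF


def is_surrogate(code_point: int) -> bool:
    """True if the code point lies in the reserved UTF-16 surrogate range."""
    return SURROGATE_FIRST <= code_point <= SURROGATE_LAST


surrogate_count = SURROGATE_LAST - SURROGATE_FIRST + 1
print(surrogate_count)            # 2048
print(0x10000 - surrogate_count)  # 63488 non-surrogate BMP code points

# Python lets you build a string holding a lone surrogate,
# but refuses to encode it as UTF-8.
try:
    chr(0xD800).encode("utf-8")
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)
```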

Plane 1: Supplementary Multilingual Plane (SMP)

The SMP (U+10000–U+1FFFF) is the second most important plane and the one that has grown fastest in recent Unicode versions, largely thanks to emoji.

Key contents

Block Range        Content
U+10000–U+1007F    Linear B Syllabary
U+10080–U+100FF    Linear B Ideograms
U+10300–U+1032F    Old Italic
U+10330–U+1034F    Gothic
U+10400–U+1044F    Deseret
U+10800–U+1083F    Cypriot Syllabary
U+12000–U+123FF    Cuneiform
U+13000–U+1342F    Egyptian Hieroglyphs
U+16800–U+16A3F    Bamum Supplement
U+1D100–U+1D1FF    Musical Symbols (𝄞, 𝄢)
U+1D400–U+1D7FF    Mathematical Alphanumeric Symbols (𝐀, 𝑩, 𝒞)
U+1F300–U+1F5FF    Miscellaneous Symbols and Pictographs (🌍, 🏠, 🔥)
U+1F600–U+1F64F    Emoticons (😀, 😂, 😍)
U+1F680–U+1F6FF    Transport and Map Symbols (🚀, 🚗, 🚂)
U+1F900–U+1F9FF    Supplemental Symbols and Pictographs (🤖, 🧠, 🦊)
U+1FA00–U+1FA6F    Chess Symbols
U+1FA70–U+1FAFF    Symbols and Pictographs Extended-A (🪐, 🫠)

The SMP is where you find emoji, historic and archaic scripts (Cuneiform, Egyptian Hieroglyphs, Old Italic, Linear B), musical notation, and mathematical letter variants (bold, italic, script, fraktur, double-struck alphabets used in formal mathematics).

Plane 2: Supplementary Ideographic Plane (SIP)

The SIP (U+20000–U+2FFFF) exists to extend the CJK Unified Ideographs beyond what the BMP can hold. As of Unicode 16.0, it contains over 60,000 characters across several extension blocks:

Block                                      Range              Characters
CJK Unified Ideographs Extension B         U+20000–U+2A6DF    42,720
CJK Unified Ideographs Extension C         U+2A700–U+2B73F    4,154
CJK Unified Ideographs Extension D         U+2B740–U+2B81F    222
CJK Unified Ideographs Extension E         U+2B820–U+2CEAF    5,762
CJK Unified Ideographs Extension F         U+2CEB0–U+2EBEF    7,473
CJK Compatibility Ideographs Supplement    U+2F800–U+2FA1F    542

These are rare or historical CJK characters — personal names, place names, variant forms, and characters from classical texts. Most everyday Chinese, Japanese, or Korean text uses only the BMP characters, but scholarly work and government databases frequently need SIP characters.

Plane 3: Tertiary Ideographic Plane (TIP)

The TIP (U+30000–U+3FFFF) was introduced in Unicode 13.0 (2020) for even rarer CJK ideographs:

Block                                 Range              Characters
CJK Unified Ideographs Extension G    U+30000–U+3134F    4,939
CJK Unified Ideographs Extension H    U+31350–U+323AF    4,192

These extensions accommodate characters needed for historical texts, regional variants, and comprehensive dictionaries. Despite its letter, CJK Unified Ideographs Extension I (622 characters, added in Unicode 15.1) is not part of the TIP: it occupies U+2EBF0–U+2EE5F, which lies in Plane 2.

Planes 4–13: Unassigned (Reserved for Future)

Ten full planes — 655,360 code points — are completely empty. They serve as a buffer ensuring that Unicode can accommodate future needs without running out of space. No characters are currently planned for these planes, and any future allocation would go through the standard Unicode proposal and review process.

Plane 14: Supplementary Special-purpose Plane (SSP)

The SSP (U+E0000–U+EFFFF) contains two specialized blocks:

Block                             Range              Purpose
Tags                              U+E0000–U+E007F    Tag characters (language tagging is deprecated in favor of higher-level protocols)
Variation Selectors Supplement    U+E0100–U+E01EF    240 additional variation selectors for CJK ideograph variants

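The Tags block was originally designed to embed language information in plain text (e.g., tagging a run of text as English or Japanese). This use was deprecated because language tagging is better handled by markup (such as the HTML lang attribute) or other higher-level protocols. However, the tag characters found new life in emoji. An emoji tag sequence starts with U+1F3F4 WAVING BLACK FLAG, spells out a region code in tag characters, and ends with U+E007F CANCEL TAG; this is how subdivision flags are formed (like the flag of Scotland 🏴󠁧󠁢󠁳󠁣󠁴󠁿 or Texas 🏴󠁵󠁳󠁴󠁸󠁿). Regional indicator symbols, which form country flags, are a separate mechanism and live in Plane 1.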
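
To make the structure of a subdivision flag concrete, here is a minimal Python sketch that assembles an emoji tag sequence from a region code. It relies on the fact that the tag characters mirror ASCII at an offset of U+E0000. The helper name subdivision_flag is hypothetical, and whether the result renders as a flag depends entirely on font and platform support.

```python
WAVING_BLACK_FLAG = 0x1F3F4  # base emoji, located in Plane 1
TAG_BASE = 0xE0000           # tag characters mirror ASCII at this offset (Plane 14)
CANCEL_TAG = 0xE007F         # terminates the tag sequence


def subdivision_flag(region_code: str) -> str:
    """Build an emoji tag sequence from a code such as 'gbsct' (hypothetical helper)."""
    code = region_code.lower()
    if not (code.isascii() and code.isalnum()):
        raise ValueError("expected ASCII letters and digits only")
    tags = "".join(chr(TAG_BASE + ord(ch)) for ch in code)
    return chr(WAVING_BLACK_FLAG) + tags + chr(CANCEL_TAG)


flag = subdivision_flag("gbsct")  # Scotland
print([f"U+{ord(ch):04X}" for ch in flag])
# ['U+1F3F4', 'U+E0067', 'U+E0062', 'U+E0073', 'U+E0063', 'U+E0074', 'U+E007F']
```

The sequence is seven code points long, all outside the BMP, even though it displays as a single flag where supported.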

The Variation Selectors Supplement is used in ideographic variation sequences for CJK ideographs: a base ideograph followed by one of these selectors asks the font for a specific registered glyph variant of a character that has multiple accepted visual forms.

Planes 15–16: Private Use Area Planes

Plane          Range                Code Points
15 (SPUA-A)    U+F0000–U+FFFFD      65,534
16 (SPUA-B)    U+100000–U+10FFFD    65,534

These two planes are designated as Private Use Area space, except for the last two code points of each plane, which are noncharacters. Together they extend the 6,400 BMP PUA code points with an additional 131,068 slots. Organizations, communities, and software vendors can assign any meaning to these code points for internal use, but the assignments are not interoperable without prior agreement between parties.
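
A range check for private-use code points might look like the sketch below. The names PRIVATE_USE_RANGES and is_private_use are illustrative; the standard library's unicodedata module reports the same information through the general category "Co".

```python
import unicodedata

# The three private-use ranges: one in the BMP, one each in planes 15 and 16.
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)


def is_private_use(code_point: int) -> bool:
    """True if the code point falls in any private-use range."""
    return any(first <= code_point <= last for first, last in PRIVATE_USE_RANGES)


total = sum(last - first + 1 for first, last in PRIVATE_USE_RANGES)
print(total)  # 137468 = 6,400 + 131,068

# Cross-check against the Unicode database shipped with Python.
print(unicodedata.category(chr(0xF0000)))  # Co (private use)
print(is_private_use(0xFFFFE))             # False: a noncharacter, not private use
```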

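The ConScript Unicode Registry (CSUR) maintains informal Private Use assignments for constructed scripts and other community projects. Its best-known entries (Klingon, Tengwar, Cirth) sit in the BMP Private Use Area, while some other scripts are mapped into Plane 15.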

How Planes Affect Encoding

The plane structure has direct consequences for the three Unicode encoding forms:

UTF-8

Plane                      Bytes per Character
BMP (Plane 0)              1–3 bytes
All other planes (1–16)    4 bytes

For BMP characters, UTF-8 uses one byte for ASCII (U+0000–U+007F), two bytes for U+0080–U+07FF, and three bytes for U+0800–U+FFFF. Characters outside the BMP always require four bytes.
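
These thresholds translate directly into code. The function below is a sketch (utf8_length is a name chosen for this example) that computes the UTF-8 length from the code point value alone and compares it with what Python's own encoder produces.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for a single code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4  # everything outside the BMP


for cp in (0x41, 0xE9, 0x4E2D, 0x1F600):
    # Cross-check against the built-in encoder.
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")
```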

UTF-16

Plane                      Code Units
BMP (Plane 0)              1 code unit (2 bytes)
All other planes (1–16)    2 code units / surrogate pair (4 bytes)

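UTF-16 grew out of UCS-2, a fixed-width 16-bit encoding that could represent only the BMP. When Unicode expanded beyond 65,536 code points, the surrogate mechanism was added to handle planes 1–16. Each non-BMP character is encoded as a pair of 16-bit code units: a high surrogate (U+D800–U+DBFF) followed by a low surrogate (U+DC00–U+DFFF).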
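
The surrogate calculation itself is short: subtract 0x10000 to get a 20-bit offset, then split it into two 10-bit halves. The sketch below uses a made-up function name, to_surrogate_pair, and checks the result against Python's UTF-16 encoder.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a non-BMP code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points in planes 1-16 need surrogates")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00

# The built-in encoder produces the same two code units.
print(chr(0x1F600).encode("utf-16-be").hex())  # d83dde00
```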

UTF-32

Every code point takes exactly 4 bytes regardless of plane. This makes indexing by code point trivial but wastes space for text that is primarily BMP characters.
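
To compare the three encoding forms side by side, the following sketch encodes one short string containing an ASCII letter, a BMP ideograph, and an SMP emoji. The helper name encoded_sizes is invented here; the "-le" codec variants are used so that no byte order mark is added to the counts.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the encoded byte length of text in each Unicode encoding form."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


# 'A' (U+0041), a BMP ideograph (U+4E2D), and an SMP emoji (U+1F600)
sample = "A\u4e2d\U0001F600"
print(encoded_sizes(sample))
# {'utf-8': 8, 'utf-16-le': 8, 'utf-32-le': 12}
```

UTF-8 spends 1 + 3 + 4 bytes, UTF-16 spends 2 + 2 + 4, and UTF-32 spends 4 for each of the three code points.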

Programming Implications

Detecting the Plane

You can determine a code point's plane by examining its numerical value:

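def get_plane(code_point: int) -> int:
    """Return the Unicode plane number (0-16) for a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    # Each plane spans 0x10000 code points, so the plane is everything above the low 16 bits.
    return code_point >> 16

# Examples
print(get_plane(0x0041))    # 0 (BMP)
print(get_plane(0x1F600))   # 1 (SMP)
print(get_plane(0x20000))   # 2 (SIP)
print(get_plane(0xE0001))   # 14 (SSP)
print(get_plane(0x100000))  # 16 (SPUA-B)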
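
The same shift can be applied to every character of a string to see which planes a piece of text actually touches. This is a small sketch with an invented helper name, plane_histogram; it works because Python 3 strings iterate by code point.

```python
from collections import Counter


def plane_histogram(text: str) -> Counter:
    """Count how many code points of text fall into each plane."""
    return Counter(ord(ch) >> 16 for ch in text)


# Two Latin letters, two BMP ideographs, one emoji (Plane 1),
# one Extension B ideograph (Plane 2), and three spaces.
sample = "Hi \u4e2d\u6587 \U0001F600 \U00020000"
for plane, count in sorted(plane_histogram(sample).items()):
    print(f"Plane {plane}: {count}")
# Plane 0: 7
# Plane 1: 1
# Plane 2: 1
```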

JavaScript String Length Pitfall

Because JavaScript strings use UTF-16 internally, characters outside the BMP (planes 1–16) appear as two code units:

// BMP character — 1 code unit
"A".length         // 1
"中".length        // 1

// SMP character — 2 code units (surrogate pair)
"😀".length        // 2
"𝐀".length         // 2 (U+1D400, Mathematical Bold Capital A)

// Correct code-point-aware counting
[..."😀"].length   // 1

This is one of the most common bugs in JavaScript string processing. Any code that uses .length, .charAt(), .charCodeAt(), or .slice() on strings containing non-BMP characters works in UTF-16 code units, so it can miscount characters or cut a surrogate pair in half. Use the spread operator, for...of, or codePointAt() instead.
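
For comparison, Python 3 strings are indexed by code point, so len() does not show this behavior. When a Python service exchanges lengths or offsets with JavaScript, it can help to compute the UTF-16 code unit count explicitly. The sketch below assumes well-formed text without lone surrogates, and utf16_length is a name made up for this example.

```python
def utf16_length(text: str) -> int:
    """Number of UTF-16 code units in text, i.e. what JavaScript's .length reports."""
    # Each UTF-16 code unit is 2 bytes; the -le codec adds no byte order mark.
    return len(text.encode("utf-16-le")) // 2


emoji = "\U0001F600"
print(len(emoji))           # 1 code point
print(utf16_length(emoji))  # 2 code units

mixed = "A\u4e2d\U0001F600"
print(len(mixed), utf16_length(mixed))  # 3 4
```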

Database Storage

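If your database column uses UTF-16 (like SQL Server's NVARCHAR), non-BMP characters consume 4 bytes instead of 2, so each one takes two of the 16-bit units the declared length allows. MySQL's utf8 charset (historically an alias for utf8mb3) stores at most three bytes per character and therefore only supports the BMP. You need utf8mb4 to store characters from planes 1–16 (including all emoji).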

-- MySQL: use utf8mb4 for full Unicode support
CREATE TABLE messages (
    id INT PRIMARY KEY,
    content TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
);
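
On the application side, a quick pre-check can tell you whether a string contains anything outside the BMP and so would not fit a three-byte utf8 column. This is a sketch with a hypothetical helper name, needs_utf8mb4, not part of any MySQL driver.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if text contains a code point outside the BMP (planes 1-16)."""
    return any(ord(ch) > 0xFFFF for ch in text)


print(needs_utf8mb4("caf\u00e9 \u4e2d\u6587"))  # False: all BMP
print(needs_utf8mb4("ok \U0001F600"))           # True: emoji is in Plane 1
```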

Key Takeaways

  • Unicode divides its 1,114,112 code points into 17 planes of 65,536 code points each.
  • Plane 0 (BMP) holds the vast majority of commonly used characters across all modern writing systems.
  • Plane 1 (SMP) contains emoji, historic scripts, and mathematical letter variants.
  • Planes 2–3 (SIP, TIP) hold rare and historical CJK ideographs.
  • Plane 14 (SSP) provides variation selectors and (deprecated) language tags.
  • Planes 15–16 are Private Use Areas for application-specific characters.
  • Planes 4–13 are entirely unassigned, reserved for future growth.
  • The plane boundary matters for encoding: non-BMP characters require 4 bytes in UTF-8, surrogate pairs in UTF-16, and can cause .length bugs in JavaScript.
  • Use MySQL utf8mb4 (not utf8) to support all Unicode planes.
