Understanding Unicode Blocks
Unicode blocks are contiguous ranges of code points grouped by script or character type, making it easier to find and work with related characters. This guide explains the block structure, lists the most important blocks, and shows how to look up block membership.
When you look up a character in the Unicode Standard, one of the first things you see is which block it belongs to. A Unicode block is a contiguous, non-overlapping range of code points with a descriptive name — like "Basic Latin" (U+0000–U+007F) or "CJK Unified Ideographs" (U+4E00–U+9FFF). Blocks are the most intuitive way to navigate the Unicode code point space, yet they are often confused with scripts and categories. This guide explains what blocks really are, how they differ from other classification systems, and how to use block information effectively in code and research.
What Is a Unicode Block?
A Unicode block is a named, contiguous range of code points that was allocated by the Unicode Consortium for a particular purpose or script. Key rules:
- Contiguous: a block is always a continuous range from a start code point to an end code point, with no gaps.
- Non-overlapping: no code point belongs to more than one block.
- Exhaustive partition: every code point from U+0000 to U+10FFFF has exactly one Block value (code points outside any named block get the value "No_Block").
- Stable boundaries: once a block is defined, its start and end points never change. The block may gain new characters in future Unicode versions, but only within its existing boundaries.
- Aligned: every block starts at a multiple of 16 and ends just before one (i.e., blocks start at U+xxx0 and end at U+xxxF).
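The contiguity, non-overlap, and alignment rules are easy to check mechanically. The following is a minimal sketch over a handful of ranges copied from the tables in this guide; `SAMPLE_BLOCKS`, `is_aligned`, `has_overlap`, and `block_size` are illustrative names, not part of any library.

```python
# A few (start, end, name) ranges copied from the tables in this guide.
SAMPLE_BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0180, 0x024F, "Latin Extended-B"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
]


def is_aligned(start: int, end: int) -> bool:
    """True if the range starts at U+xxx0 and ends at U+xxxF."""
    return start % 16 == 0 and end % 16 == 15


def has_overlap(blocks) -> bool:
    """True if any two ranges share at least one code point."""
    ordered = sorted(blocks)
    return any(prev[1] >= cur[0] for prev, cur in zip(ordered, ordered[1:]))


def block_size(start: int, end: int) -> int:
    """Number of code points in the range, whether assigned or not."""
    return end - start + 1


print(all(is_aligned(s, e) for s, e, _ in SAMPLE_BLOCKS))  # True
print(has_overlap(SAMPLE_BLOCKS))                          # False
print(block_size(0x0180, 0x024F))                          # 208
```

Note that `block_size` reports the width of the range, which can be larger than the number of characters actually assigned in it.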
As of Unicode 16.0, there are 338 named blocks covering the Basic Multilingual Plane (BMP) and the supplementary planes.
Major Unicode Blocks
Here are some of the most important and frequently referenced blocks:
Plane 0 — Basic Multilingual Plane (BMP)
| Block Name | Range | Code Points | Purpose |
|---|---|---|---|
| Basic Latin | U+0000–U+007F | 128 | ASCII: English letters, digits, basic punctuation |
| Latin-1 Supplement | U+0080–U+00FF | 128 | Western European accented letters, symbols |
| Latin Extended-A | U+0100–U+017F | 128 | Central/Eastern European Latin letters |
| Latin Extended-B | U+0180–U+024F | 208 | African, Croatian, Romanian, and more |
| Greek and Coptic | U+0370–U+03FF | 144 | Greek alphabet + Coptic letters |
| Cyrillic | U+0400–U+04FF | 256 | Russian, Ukrainian, Bulgarian, Serbian, etc. |
| Arabic | U+0600–U+06FF | 256 | Arabic script characters |
| Devanagari | U+0900–U+097F | 128 | Hindi, Sanskrit, Marathi, Nepali |
| CJK Unified Ideographs | U+4E00–U+9FFF | 20,992 | Chinese, Japanese kanji, Korean hanja |
| Hangul Syllables | U+AC00–U+D7AF | 11,184 | Pre-composed Korean syllable blocks |
| General Punctuation | U+2000–U+206F | 112 | Dashes, quotation marks, invisible formatters |
| Currency Symbols | U+20A0–U+20CF | 48 | ₠ ₡ ₢ ... ₿ and beyond |
| Mathematical Operators | U+2200–U+22FF | 256 | ∀ ∃ ∈ ∑ ∫ ∞ and many more |
| Box Drawing | U+2500–U+257F | 128 | ─ │ ┌ ┐ └ ┘ and friends |
| Miscellaneous Symbols | U+2600–U+26FF | 256 | ☀ ☁ ☂ ☃ ♠ ♥ ♦ ♣ and more |
| Dingbats | U+2700–U+27BF | 192 | ✂ ✈ ✉ ✓ ✗ ❤ and more |
| Private Use Area | U+E000–U+F8FF | 6,400 | Application-specific characters |
Plane 1 — Supplementary Multilingual Plane (SMP)
| Block Name | Range | Purpose |
|---|---|---|
| Linear B Syllabary | U+10000–U+1007F | Ancient Mycenaean Greek writing |
| Mathematical Alphanumeric Symbols | U+1D400–U+1D7FF | Bold, italic, script, fraktur math letters |
| Emoticons | U+1F600–U+1F64F | 😀 😂 😍 and other face emoji |
| Miscellaneous Symbols and Pictographs | U+1F300–U+1F5FF | 🌍 🍎 🏠 and object emoji |
| Transport and Map Symbols | U+1F680–U+1F6FF | 🚀 🚗 🛒 and transport emoji |
| Supplemental Symbols and Pictographs | U+1F900–U+1F9FF | 🤖 🧠 🦊 newer emoji |
Plane 2 — Supplementary Ideographic Plane (SIP)
| Block Name | Range | Purpose |
|---|---|---|
| CJK Unified Ideographs Extension B | U+20000–U+2A6DF | 42,720 rare CJK characters |
| CJK Unified Ideographs Extension C | U+2A700–U+2B73F | Additional rare CJK |
| CJK Unified Ideographs Extension D | U+2B740–U+2B81F | More rare CJK characters |
Blocks vs. Scripts vs. General Categories
This is the most common source of confusion. All three are properties of Unicode characters, but they classify characters along different axes:
| Property | What it tells you | Granularity | Example |
|---|---|---|---|
| Block | Where the code point lives in the number line | Range-based (contiguous) | "Basic Latin" |
| Script | Which writing system the character belongs to | Linguistic | "Latin" |
| General Category | What kind of character it is | Functional | "Lu" (Uppercase Letter) |
Why the Distinction Matters
Consider the dollar sign $ (U+0024):
- Block: Basic Latin (U+0000–U+007F)
- Script: Common (used across many writing systems)
- General Category: Sc (Currency Symbol)
Now consider ₹ (Indian Rupee Sign, U+20B9):
- Block: Currency Symbols (U+20A0–U+20CF)
- Script: Common
- General Category: Sc (Currency Symbol)
Both have the same Script (Common) and same General Category (Sc), but they live in entirely different blocks. The block tells you where in the code point space, while the script and category tell you what and how.
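The comparison above can be reproduced in a few lines of Python. This is a sketch under two assumptions: the standard `unicodedata` module exposes the General Category but not the Script property, so only block and category are shown, and the block lookup uses a hypothetical two-entry table (`_RANGES`) that covers just these two examples.

```python
import unicodedata
from typing import Tuple

# Hypothetical two-entry range table, just enough for this example.
_RANGES = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x20A0, 0x20CF, "Currency Symbols"),
]


def describe(char: str) -> Tuple[str, str]:
    """Return (block name, General Category) for a single character."""
    cp = ord(char)
    block = next(
        (name for lo, hi, name in _RANGES if lo <= cp <= hi),
        "(not in sample table)",
    )
    return block, unicodedata.category(char)


print(describe("$"))       # ('Basic Latin', 'Sc')
print(describe("\u20B9"))  # ('Currency Symbols', 'Sc')
```

The category is identical for both characters while the block differs, which is the point of the distinction.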
A Block Can Contain Multiple Scripts
The "Basic Latin" block contains characters from the Latin script (A–Z, a–z), the Common script (digits, punctuation, symbols), and even a few control characters. A block is a container for code points, not a writing-system classifier.
A Script Can Span Multiple Blocks
The Latin script spans at least 15 blocks: Basic Latin, Latin-1 Supplement, Latin Extended-A, Latin Extended-B, Latin Extended Additional, Latin Extended-C, Latin Extended-D, and more. Each time the Consortium needed more Latin characters, they created a new block rather than expanding an existing one (because block boundaries are immutable).
Querying Block Information in Code
Python
Python's unicodedata module does not have a built-in block lookup function, but you can
parse the Unicode Character Database's Blocks.txt file or use the unicodeblock package.
A simple approach with ranges:
import unicodedata

# Manual block check for Basic Latin
def is_basic_latin(char: str) -> bool:
    cp = ord(char)
    return 0x0000 <= cp <= 0x007F

# Using the unicodedata name as a rough proxy
name = unicodedata.name("A", "")  # 'LATIN CAPITAL LETTER A'
# The word 'LATIN' hints at the script/block family, but it's not the block name itself.
For reliable block lookups, use a library like fontTools:
from fontTools.unicodedata import block
block("A") # 'Basic Latin'
block("\u4E2D") # 'CJK Unified Ideographs'
block("\U0001F600") # 'Emoticons'
JavaScript
JavaScript regular expressions support Unicode property escapes for Script and Script_Extensions
(with the u flag), but these match scripts, not blocks. For blocks, you can use range-based checks:
function getBlock(char) {
  const cp = char.codePointAt(0);
  if (cp >= 0x0000 && cp <= 0x007F) return "Basic Latin";
  if (cp >= 0x0080 && cp <= 0x00FF) return "Latin-1 Supplement";
  if (cp >= 0x4E00 && cp <= 0x9FFF) return "CJK Unified Ideographs";
  if (cp >= 0xAC00 && cp <= 0xD7AF) return "Hangul Syllables";
  // ... add more ranges as needed
  return "Unknown";
}
For comprehensive block data, use a library like unicode-properties or parse the UCD
Blocks.txt file.
Java
Java provides built-in block support through Character.UnicodeBlock:
Character.UnicodeBlock block = Character.UnicodeBlock.of('A');
// block == Character.UnicodeBlock.BASIC_LATIN
Character.UnicodeBlock cjk = Character.UnicodeBlock.of(0x4E2D);
// cjk == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
Regular Expressions
Several regex flavors support block matching. Java and ICU use the \p{InBlockName} syntax, while .NET uses the Is prefix instead (for example, \p{IsBasicLatin}):
\p{InBasicLatin} — matches any character in Basic Latin
\p{InCurrencySymbols} — matches any character in Currency Symbols
\p{InCJKUnifiedIdeographs} — matches CJK ideographs
Note: Python's built-in re module does not support block-based matching. Use the regex
package instead:
import regex
regex.findall(r"\p{InBasicLatin}+", "Hello 世界")
# ['Hello ']
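If adding a third-party dependency is not an option, a block can be approximated with the built-in `re` module by turning its range into a character class. The sketch below assumes you already know the block's start and end code points; `block_pattern` is an illustrative helper, not a standard function.

```python
import re


def block_pattern(start: int, end: int) -> re.Pattern:
    """Compile a pattern matching runs of characters in [start, end].

    The endpoints are written as \\UXXXXXXXX escapes so that control
    characters and regex metacharacters never appear literally.
    """
    return re.compile(f"[\\U{start:08X}-\\U{end:08X}]+")


basic_latin = block_pattern(0x0000, 0x007F)
cjk = block_pattern(0x4E00, 0x9FFF)

print(basic_latin.findall("Hello 世界"))  # ['Hello ']
print(cjk.findall("Hello 世界"))          # ['世界']
```

Unlike a named-block escape, this hard-codes the range, so the block name only lives in your variable names.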
How Blocks Are Created and Named
The Unicode Consortium follows a deliberate process when creating blocks:
- Proposal: a script or character set is proposed for encoding, usually with a formal document submitted to the Unicode Technical Committee (UTC).
- Allocation: the UTC allocates a range of code points large enough for the initial repertoire plus room for future additions.
- Naming: the block receives a descriptive name based on its primary content.
- Publication: the block appears in the next Unicode version.
Block names are intended to be stable, and renames are rare, but they have happened and can leave slightly awkward names behind. For example, the "Greek and Coptic" block was originally named "Greek" and was renamed to acknowledge the Coptic letters it contains; when the separate "Coptic" block (U+2C80–U+2CFF) was created later, the old block kept its compound name.
Practical Uses of Block Information
1. Character Browsing and Lookup
Unicode reference sites (including UnicodeFYI) organize characters by block because blocks provide a natural, ordered structure. If you know a character is a mathematical symbol, you can browse the "Mathematical Operators" block (U+2200–U+22FF) to find it.
2. Font Coverage Analysis
Font designers think in terms of blocks. A font that claims to "support Latin Extended-A"
includes glyphs for all characters in U+0100–U+017F. Tools like fc-query and fontTools
can report which blocks a font covers.
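A coverage check of this kind can be sketched without any font library, assuming you have already extracted the set of code points a font maps (for example from its cmap table). Here `supported` is that set, `block_coverage` is a hypothetical helper, and `fake_cmap` is made-up data. Only code points that are assigned in the Python build's Unicode data are counted, so reserved slots inside a block do not lower the score.

```python
import unicodedata


def block_coverage(supported, start, end):
    """Return (covered, assigned) counts for the range [start, end].

    A code point counts as assigned when its General Category is not
    'Cn' in this Python build's Unicode database.
    """
    assigned = [
        cp for cp in range(start, end + 1)
        if unicodedata.category(chr(cp)) != "Cn"
    ]
    covered = sum(1 for cp in assigned if cp in supported)
    return covered, len(assigned)


# Made-up font: printable ASCII plus the first half of Latin Extended-A.
fake_cmap = set(range(0x0020, 0x007F)) | set(range(0x0100, 0x0140))
covered, assigned = block_coverage(fake_cmap, 0x0100, 0x017F)
print(f"Latin Extended-A: {covered}/{assigned}")  # Latin Extended-A: 64/128
```

The assigned count depends on the Unicode version bundled with the interpreter, so results for blocks that are still growing may differ between Python releases.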
3. Input Method Design
Input methods and character pickers often organize characters by block. When you open the Windows Character Map or macOS Character Viewer, the groupings correspond roughly to Unicode blocks.
4. Data Filtering
If you need to accept only characters from certain ranges — say, only BMP characters for a legacy system — block ranges give you clean boundaries:
def is_bmp(char: str) -> bool:
    return ord(char) <= 0xFFFF

def is_supplementary(char: str) -> bool:
    return ord(char) > 0xFFFF
5. Detecting Script Mixtures (Rough Heuristic)
While the Script property is the proper tool for mixed-script detection, block ranges provide a quick heuristic:
def has_cjk(text: str) -> bool:
    return any(0x4E00 <= ord(c) <= 0x9FFF for c in text)

def has_cyrillic(text: str) -> bool:
    return any(0x0400 <= ord(c) <= 0x04FF for c in text)
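The same heuristic generalizes to a per-block tally, which makes a mixture visible at a glance. This is a rough sketch with a deliberately small, hand-written range table (`RANGES`); anything outside it is lumped under "Other", and `block_histogram` is an illustrative name.

```python
from collections import Counter

# Small hand-written table; extend it with whatever ranges you care about.
RANGES = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
]


def block_histogram(text: str) -> Counter:
    """Count how many characters of `text` fall into each listed range."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        name = next((n for lo, hi, n in RANGES if lo <= cp <= hi), "Other")
        counts[name] += 1
    return counts


hist = block_histogram("Привет, 世界!")
print(hist["Cyrillic"], hist["Basic Latin"], hist["CJK Unified Ideographs"])
# 6 3 2
```

Punctuation and spaces land in Basic Latin here, which is one reason block counts are only a heuristic for script detection.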
Common Misconceptions
"A block equals a script." False. The Basic Latin block contains Latin letters, Common digits, and Common punctuation. The Latin script spans over 15 blocks.
"All characters in a block are assigned." False. Many blocks contain unassigned code points reserved for future use. For example, the "Greek and Coptic" block has 135 assigned characters in a range of 144 code points.
"Block boundaries can change." False. Once published, a block's start and end code points are permanent. If more characters are needed, a new block is created (e.g., "Latin Extended-C" when "Latin Extended-B" filled up).
"Blocks are contiguous across the code space." Not quite. While each block is internally contiguous, there can be gaps between blocks — ranges assigned to "No_block" that have not yet been organized into a named block.
The Plane Structure
Blocks are distributed across Unicode's 17 planes (0–16):
| Plane | Range | Name | Notable Blocks |
|---|---|---|---|
| 0 | U+0000–U+FFFF | BMP | Basic Latin, CJK, Hangul, Arabic, Devanagari |
| 1 | U+10000–U+1FFFF | SMP | Emoji, math symbols, historic scripts |
| 2 | U+20000–U+2FFFF | SIP | CJK Extensions B–F |
| 3 | U+30000–U+3FFFF | TIP | CJK Extension G, H |
| 14 | U+E0000–U+EFFFF | SSP | Tags, variation selectors supplement |
| 15–16 | U+F0000–U+10FFFF | PUA | Supplementary Private Use Areas |
Planes 4–13 are currently unassigned, providing room for future growth.
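Because each plane spans exactly 65,536 (2^16) code points, the plane number is simply the code point shifted right by 16 bits. A minimal sketch, with `plane_of` as an illustrative helper name:

```python
def plane_of(char: str) -> int:
    """Return the plane number (0-16) of a single character."""
    return ord(char) >> 16


print(plane_of("A"))           # 0  (BMP)
print(plane_of("\U0001F600"))  # 1  (SMP)
print(plane_of("\U00020000"))  # 2  (SIP)
```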
Block Data File: Blocks.txt
The authoritative source for block definitions is the Unicode Character Database file
Blocks.txt. Each line defines a block:
# Blocks-16.0.0.txt
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0100..017F; Latin Extended-A
0180..024F; Latin Extended-B
...
1F600..1F64F; Emoticons
1F650..1F67F; Ornamental Dingbats
You can download this file from https://www.unicode.org/Public/UCD/latest/ucd/Blocks.txt.
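The format is simple enough to parse by hand. The sketch below works on an embedded sample (`SAMPLE`) in the same layout as the excerpt above rather than the downloaded file, so lookups outside those few ranges fall back to "No_Block"; `parse_blocks` and `lookup` are illustrative names, and the binary search relies on blocks being sorted and non-overlapping.

```python
import bisect

SAMPLE = """\
# Blocks-16.0.0.txt
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0100..017F; Latin Extended-A
0180..024F; Latin Extended-B
1F600..1F64F; Emoticons
1F650..1F67F; Ornamental Dingbats
"""


def parse_blocks(text: str):
    """Parse Blocks.txt-style lines into a sorted list of (start, end, name)."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        range_part, name = line.split(";", 1)
        start, end = range_part.split("..")
        blocks.append((int(start, 16), int(end, 16), name.strip()))
    blocks.sort()
    return blocks


def lookup(blocks, char: str) -> str:
    """Binary-search the parsed list; 'No_Block' if no range contains char."""
    cp = ord(char)
    starts = [b[0] for b in blocks]
    i = bisect.bisect_right(starts, cp) - 1
    if i >= 0 and blocks[i][0] <= cp <= blocks[i][1]:
        return blocks[i][2]
    return "No_Block"


blocks = parse_blocks(SAMPLE)
print(lookup(blocks, "A"))           # Basic Latin
print(lookup(blocks, "\U0001F600"))  # Emoticons
print(lookup(blocks, "\u4E2D"))      # No_Block (not in the sample)
```

To use the real data, pass the contents of the downloaded file to `parse_blocks` instead of `SAMPLE`.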
Summary
- A Unicode block is a contiguous, non-overlapping range of code points with a stable name.
- Unicode 16.0 defines 338 blocks across 17 planes.
- Blocks differ from scripts (which writing system) and General Categories (what kind of character): a block is purely about location in the code point space.
- Block boundaries are immutable — they never shrink, expand, or move once published.
- Use block information for browsing, font coverage, input method design, and quick range-based filtering.
- For linguistic analysis and security, prefer the Script property over blocks.
- The authoritative source is the UCD Blocks.txt file; block lookups are available through fontTools, Java's Character.UnicodeBlock, or manual range checks.