Understanding Unicode Blocks

Unicode blocks are contiguous ranges of code points grouped by script or character type, making it easier to find and work with related characters. This guide explains the block structure, lists the most important blocks, and shows how to look up block membership.

When you look up a character in the Unicode Standard, one of the first things you see is which block it belongs to. A Unicode block is a contiguous, non-overlapping range of code points with a descriptive name — like "Basic Latin" (U+0000–U+007F) or "CJK Unified Ideographs" (U+4E00–U+9FFF). Blocks are the most intuitive way to navigate the Unicode code point space, yet they are often confused with scripts and categories. This guide explains what blocks really are, how they differ from other classification systems, and how to use block information effectively in code and research.

What Is a Unicode Block?

A Unicode block is a named, contiguous range of code points that was allocated by the Unicode Consortium for a particular purpose or script. Key rules:

  1. Contiguous: a block is always a continuous range from a start code point to an end code point, with no gaps.
  2. Non-overlapping: no code point belongs to more than one block.
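  3. Exhaustive: every code point from U+0000 to U+10FFFF has exactly one Block property value; code points that fall outside every named block get the default value "No_Block".
  4. Stable boundaries: a published block's start and end points are treated as fixed. New characters are added only within the existing range, and extra room is normally provided by creating a new block. This is strong practice rather than a formal guarantee: the Unicode stability policies do not cover the Block property, and a few blocks have been adjusted over the standard's history.
  5. Aligned: every block starts at a multiple of 16 and ends one short of a multiple of 16 (i.e., it runs from U+xxx0 to U+xxxF).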
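
The contiguity, non-overlap, and alignment rules above can be checked mechanically. The following is a minimal sketch, assuming a hand-copied sample of ranges taken from this guide; the names SAMPLE_BLOCKS and check_block_rules are illustrative and not part of any library.

```python
# A handful of ranges copied from this guide (not the full block list).
SAMPLE_BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0180, 0x024F, "Latin Extended-B"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
]

def check_block_rules(blocks):
    """Return a list of rule violations (empty if every check passes)."""
    problems = []
    ordered = sorted(blocks)
    for start, end, name in ordered:
        # Contiguous: a block is a single start..end range.
        if start > end:
            problems.append(f"{name}: start is after end")
        # Aligned: starts at U+xxx0 and ends at U+xxxF.
        if start % 16 != 0 or end % 16 != 15:
            problems.append(f"{name}: not aligned to 16")
    # Non-overlapping: each block must start after the previous one ends.
    for (_, prev_end, prev_name), (start, _, name) in zip(ordered, ordered[1:]):
        if start <= prev_end:
            problems.append(f"{prev_name} overlaps {name}")
    return problems

print(check_block_rules(SAMPLE_BLOCKS))  # []
```

Running the same check over a full parse of Blocks.txt would be a reasonable sanity test before relying on the data.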

As of Unicode 16.0, there are 338 named blocks covering the Basic Multilingual Plane (BMP) and the supplementary planes.

Major Unicode Blocks

Here are some of the most important and frequently referenced blocks:

Plane 0 — Basic Multilingual Plane (BMP)

| Block Name | Range | Size (code points) | Purpose |
|---|---|---|---|
| Basic Latin | U+0000–U+007F | 128 | ASCII: English letters, digits, basic punctuation |
| Latin-1 Supplement | U+0080–U+00FF | 128 | Western European accented letters, symbols |
| Latin Extended-A | U+0100–U+017F | 128 | Central/Eastern European Latin letters |
| Latin Extended-B | U+0180–U+024F | 208 | African, Croatian, Romanian, and more |
| Greek and Coptic | U+0370–U+03FF | 144 (135 assigned) | Greek alphabet + Coptic letters |
| Cyrillic | U+0400–U+04FF | 256 | Russian, Ukrainian, Bulgarian, Serbian, etc. |
| Arabic | U+0600–U+06FF | 256 | Arabic script characters |
| Devanagari | U+0900–U+097F | 128 | Hindi, Sanskrit, Marathi, Nepali |
| CJK Unified Ideographs | U+4E00–U+9FFF | 20,992 | Chinese, Japanese kanji, Korean hanja |
| Hangul Syllables | U+AC00–U+D7AF | 11,184 | Pre-composed Korean syllable blocks |
| General Punctuation | U+2000–U+206F | 112 | Dashes, quotation marks, invisible formatters |
| Currency Symbols | U+20A0–U+20CF | 48 | ₠ ₡ ₢ ... ₿ and beyond |
| Mathematical Operators | U+2200–U+22FF | 256 | ∀ ∃ ∈ ∑ ∫ ∞ and many more |
| Box Drawing | U+2500–U+257F | 128 | ─ │ ┌ ┐ └ ┘ and friends |
| Miscellaneous Symbols | U+2600–U+26FF | 256 | ☀ ☁ ☂ ☃ ♠ ♥ ♦ ♣ and more |
| Dingbats | U+2700–U+27BF | 192 | ✂ ✈ ✉ ✓ ✗ ❤ and more |
| Private Use Area | U+E000–U+F8FF | 6,400 | Application-specific characters |
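
A block's range size and its number of assigned characters are different things: the Greek and Coptic range U+0370–U+03FF spans 144 code points, of which 135 are assigned. As a rough check, the sketch below counts assigned code points with the standard unicodedata module, treating General Category Cn as "unassigned". The helper name count_assigned is illustrative, and the result depends on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

def count_assigned(start, end):
    """Count code points in [start, end] whose General Category is not Cn."""
    return sum(
        1
        for cp in range(start, end + 1)
        if unicodedata.category(chr(cp)) != "Cn"
    )

size = 0x03FF - 0x0370 + 1
assigned = count_assigned(0x0370, 0x03FF)
print(size, assigned)  # 144 135 with the Unicode data in recent Python versions
```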

Plane 1 — Supplementary Multilingual Plane (SMP)

| Block Name | Range | Purpose |
|---|---|---|
| Linear B Syllabary | U+10000–U+1007F | Ancient Mycenaean Greek writing |
| Mathematical Alphanumeric Symbols | U+1D400–U+1D7FF | Bold, italic, script, fraktur math letters |
| Emoticons | U+1F600–U+1F64F | 😀 😂 😍 and other face emoji |
| Miscellaneous Symbols and Pictographs | U+1F300–U+1F5FF | 🌍 🍎 🏠 and object emoji |
| Transport and Map Symbols | U+1F680–U+1F6FF | 🚀 🚗 🛒 and transport emoji |
| Supplemental Symbols and Pictographs | U+1F900–U+1F9FF | 🤖 🧠 🦊 newer emoji |

Plane 2 — Supplementary Ideographic Plane (SIP)

| Block Name | Range | Purpose |
|---|---|---|
| CJK Unified Ideographs Extension B | U+20000–U+2A6DF | 42,720 rare CJK characters |
| CJK Unified Ideographs Extension C | U+2A700–U+2B73F | Additional rare CJK |
| CJK Unified Ideographs Extension D | U+2B740–U+2B81F | More rare CJK characters |

Blocks vs. Scripts vs. General Categories

This is the most common source of confusion. All three are properties of Unicode characters, but they classify characters along different axes:

| Property | What it tells you | Granularity | Example |
|---|---|---|---|
| Block | Where the code point lives in the number line | Range-based (contiguous) | "Basic Latin" |
| Script | Which writing system the character belongs to | Linguistic | "Latin" |
| General Category | What kind of character it is | Functional | "Lu" (Uppercase Letter) |

Why the Distinction Matters

Consider the dollar sign $ (U+0024):

  • Block: Basic Latin (U+0000–U+007F)
  • Script: Common (used across many writing systems)
  • General Category: Sc (Currency Symbol)

Now consider ₹ (Indian Rupee Sign, U+20B9):

  • Block: Currency Symbols (U+20A0–U+20CF)
  • Script: Common
  • General Category: Sc (Currency Symbol)

Both have the same Script (Common) and same General Category (Sc), but they live in entirely different blocks. The block tells you where in the code point space, while the script and category tell you what and how.
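
The comparison can be reproduced in a few lines of Python. This is a minimal sketch: the General Category comes from the standard unicodedata module, while the block comes from a hypothetical two-entry table (BLOCKS) that covers only the two blocks discussed here. The standard library does not expose the Script property, so it is left out.

```python
import unicodedata

# Hypothetical mini-table: only the two blocks discussed in this section.
BLOCKS = {
    "Basic Latin": range(0x0000, 0x0080),
    "Currency Symbols": range(0x20A0, 0x20D0),
}

def block_of(char):
    """Return the block name, or None if the character is not in BLOCKS."""
    cp = ord(char)
    for name, code_points in BLOCKS.items():
        if cp in code_points:
            return name
    return None

def describe(char):
    """Return (code point label, block, General Category) for one character."""
    return (f"U+{ord(char):04X}", block_of(char), unicodedata.category(char))

print(describe("$"))       # ('U+0024', 'Basic Latin', 'Sc')
print(describe("\u20B9"))  # ('U+20B9', 'Currency Symbols', 'Sc')
```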

A Block Can Contain Multiple Scripts

The "Basic Latin" block contains characters from the Latin script (A–Z, a–z) and from the Common script (digits, punctuation, symbols, and the ASCII control characters). A block is a container for code points, not a writing-system classifier.

A Script Can Span Multiple Blocks

The Latin script spans at least 15 blocks: Basic Latin, Latin-1 Supplement, Latin Extended-A, Latin Extended-B, Latin Extended Additional, Latin Extended-C, Latin Extended-D, and more. Each time the Consortium needed more Latin characters than an existing block could hold, it created a new block rather than expanding an old one (because block boundaries are treated as fixed).

Querying Block Information in Code

Python

Python's unicodedata module does not have a built-in block lookup function, but you can parse the Unicode Character Database's Blocks.txt file or use the unicodeblock package. A simple approach with ranges:

import unicodedata

# Manual block check for Basic Latin
def is_basic_latin(char: str) -> bool:
    cp = ord(char)
    return 0x0000 <= cp <= 0x007F

# Using the unicodedata name as a rough proxy
name = unicodedata.name("A", "")  # 'LATIN CAPITAL LETTER A'
# The word 'LATIN' hints at the script/block family, but it's not the block name itself.

For reliable block lookups, use a library like fontTools:

from fontTools.unicodedata import block

block("A")    # 'Basic Latin'
block("\u4E2D")  # 'CJK Unified Ideographs'
block("\U0001F600")  # 'Emoticons'
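
If adding a dependency is not an option, a sorted range table plus binary search gives the same kind of lookup with the standard library alone. This is a minimal sketch, assuming a small hand-copied subset of ranges; BLOCK_TABLE and lookup_block are illustrative names, and the default return value only means "not in this subset".

```python
from bisect import bisect_right

# Sorted (start, end, name) triples; a small hand-copied subset of the blocks.
BLOCK_TABLE = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x1F600, 0x1F64F, "Emoticons"),
]
STARTS = [start for start, _, _ in BLOCK_TABLE]

def lookup_block(char, default="No_Block"):
    """Find the block containing char via binary search on block starts."""
    cp = ord(char)
    # Index of the last block whose start is <= cp (or -1 if there is none).
    i = bisect_right(STARTS, cp) - 1
    if i >= 0:
        _, end, name = BLOCK_TABLE[i]
        if cp <= end:
            return name
    return default

print(lookup_block("A"))           # Basic Latin
print(lookup_block("\u4E2D"))      # CJK Unified Ideographs
print(lookup_block("\U0001F600"))  # Emoticons
```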

JavaScript

JavaScript regular expressions (with the u or v flag) support Unicode property escapes such as \p{Script=Greek} and \p{Script_Extensions=Greek}, but they do not support the Block property. For block membership, use range-based checks:

function getBlock(char) {
    const cp = char.codePointAt(0);
    if (cp >= 0x0000 && cp <= 0x007F) return "Basic Latin";
    if (cp >= 0x0080 && cp <= 0x00FF) return "Latin-1 Supplement";
    if (cp >= 0x4E00 && cp <= 0x9FFF) return "CJK Unified Ideographs";
    if (cp >= 0xAC00 && cp <= 0xD7AF) return "Hangul Syllables";
    // ... add more ranges as needed
    return "Unknown";
}

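For comprehensive block data, parse the UCD Blocks.txt file, or use a library that bundles UCD data after confirming that it actually exposes the Block property (packages such as unicode-properties focus mainly on categories and scripts).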

Java

Java provides built-in block support through Character.UnicodeBlock:

Character.UnicodeBlock block = Character.UnicodeBlock.of('A');
// block == Character.UnicodeBlock.BASIC_LATIN

Character.UnicodeBlock cjk = Character.UnicodeBlock.of(0x4E2D);
// cjk == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS

Regular Expressions

Some regex flavors (Java, ICU) support \p{InBlockName} syntax. .NET also supports block matching but uses an Is prefix instead, as in \p{IsBasicLatin}:

\p{InBasicLatin}            — matches any character in Basic Latin
\p{InCurrencySymbols}       — matches any character in Currency Symbols
\p{InCJKUnifiedIdeographs}  — matches CJK ideographs

Note: Python's built-in re module does not support block-based matching. Use the regex package instead:

import regex
regex.findall(r"\p{InBasicLatin}+", "Hello 世界")
# ['Hello ']
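
When only the standard library is available, an explicit character-class range is a workable substitute for a block escape. The sketch below assumes you copy the range bounds from Blocks.txt yourself; the pattern names are illustrative.

```python
import re

# Explicit ranges stand in for \p{InBasicLatin} and \p{InEmoticons}.
BASIC_LATIN_RUN = re.compile(r"[\u0000-\u007F]+")
EMOTICONS_RUN = re.compile(r"[\U0001F600-\U0001F64F]+")

print(BASIC_LATIN_RUN.findall("Hello 世界"))  # ['Hello ']
print(EMOTICONS_RUN.findall("ok 😀😂!"))      # ['😀😂']
```

The trade-off is that the ranges are hard-coded, so they are only as current as the Blocks.txt version they were copied from.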

How Blocks Are Created and Named

The Unicode Consortium follows a deliberate process when creating blocks:

  1. Proposal: a script or character set is proposed for encoding, usually with a formal document submitted to the Unicode Technical Committee (UTC).
  2. Allocation: the UTC allocates a range of code points large enough for the initial repertoire plus room for future additions.
  3. Naming: the block receives a descriptive name based on its primary content.
  4. Publication: the block appears in the next Unicode version.

Block names are intended to be stable, but they are not strictly immutable, and a name can end up slightly misleading. For example, the block at U+0370–U+03FF was originally named "Greek" and was later renamed "Greek and Coptic" to acknowledge the Coptic letters it contains; when a separate "Coptic" block (U+2C80–U+2CFF) was subsequently created, the older block kept its compound name.

Practical Uses of Block Information

1. Character Browsing and Lookup

Unicode reference sites (including UnicodeFYI) organize characters by block because blocks provide a natural, ordered structure. If you know a character is a mathematical symbol, you can browse the "Mathematical Operators" block (U+2200–U+22FF) to find it.

2. Font Coverage Analysis

Font designers often think in terms of blocks. A font that claims to "support Latin Extended-A" is expected to include glyphs for all characters in U+0100–U+017F. Tools like fc-query and fontTools can report which code points a font covers, and that list can then be summarized by block.
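
Coverage of a block can be expressed as a simple ratio. The following sketch assumes you already have the set of code points a font maps to glyphs (for example, the keys of its cmap table); the function name block_coverage and the sample font data are hypothetical.

```python
def block_coverage(supported, start, end):
    """Fraction of code points in [start, end] that appear in `supported`."""
    size = end - start + 1
    covered = sum(1 for cp in range(start, end + 1) if cp in supported)
    return covered / size

# Hypothetical font: maps U+0020..U+017F except two Latin Extended-A letters.
supported_code_points = set(range(0x0020, 0x0180)) - {0x0149, 0x017F}

print(block_coverage(supported_code_points, 0x0100, 0x017F))  # 0.984375
```

Note that this measures against the whole range, so a block with unassigned code points can never reach 1.0; dividing by the number of assigned characters instead may be more useful for such blocks.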

3. Input Method Design

Input methods and character pickers often organize characters by block. When you open the Windows Character Map or macOS Character Viewer, the groupings correspond roughly to Unicode blocks.

4. Data Filtering

If you need to accept only characters from certain ranges (say, only BMP characters for a legacy system), plane and block ranges give you clean boundaries:

def is_bmp(char: str) -> bool:
    return ord(char) <= 0xFFFF

def is_supplementary(char: str) -> bool:
    return ord(char) > 0xFFFF
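
Building on the same range test, a legacy-safe filter might substitute a placeholder for anything outside the BMP. This is a minimal sketch; replace_non_bmp is an illustrative name, and U+FFFD (REPLACEMENT CHARACTER) is just one reasonable default.

```python
def replace_non_bmp(text, replacement="\uFFFD"):
    """Replace every supplementary-plane character with `replacement`."""
    return "".join(c if ord(c) <= 0xFFFF else replacement for c in text)

cleaned = replace_non_bmp("ok \U0001F600")
print(cleaned == "ok \uFFFD")  # True
```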

5. Detecting Script Mixtures (Rough Heuristic)

While the Script property is the proper tool for mixed-script detection, block ranges provide a quick heuristic:

def has_cjk(text: str) -> bool:
    return any(0x4E00 <= ord(c) <= 0x9FFF for c in text)

def has_cyrillic(text: str) -> bool:
    return any(0x0400 <= ord(c) <= 0x04FF for c in text)
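
One limitation of the single-range check is that CJK ideographs live in several blocks. The sketch below widens the heuristic to a few of them; the range list is an illustrative subset rather than a complete inventory, and has_cjk_ideograph is a hypothetical name.

```python
# A few ideograph blocks; later extensions (C, D, and so on) are omitted here.
CJK_IDEOGRAPH_RANGES = [
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
]

def has_cjk_ideograph(text):
    """True if any character falls in one of the listed ideograph ranges."""
    return any(
        start <= ord(c) <= end
        for c in text
        for start, end in CJK_IDEOGRAPH_RANGES
    )

print(has_cjk_ideograph("Hello 世界"))  # True
print(has_cjk_ideograph("Hello"))       # False
```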

Common Misconceptions

"A block equals a script." False. The Basic Latin block contains Latin letters, Common digits, and Common punctuation. The Latin script spans over 15 blocks.

"All characters in a block are assigned." False. Many blocks contain unassigned code points reserved for future use. For example, the "Greek and Coptic" block has 135 assigned characters in a range of 144 code points.

"Block boundaries can change." In practice, almost never. Once published, a block's start and end code points are treated as fixed, although the Unicode stability policies do not formally guarantee this. If more characters are needed, a new block is normally created (e.g., "Latin Extended-C" when "Latin Extended-B" filled up).

"Blocks are contiguous across the code space." Not quite. While each block is internally contiguous, there can be gaps between blocks: ranges with the Block value "No_Block" that have not yet been organized into a named block.

The Plane Structure

Blocks are distributed across Unicode's 17 planes (0–16):

| Plane | Range | Name | Notable Blocks |
|---|---|---|---|
| 0 | U+0000–U+FFFF | BMP | Basic Latin, CJK, Hangul, Arabic, Devanagari |
| 1 | U+10000–U+1FFFF | SMP | Emoji, math symbols, historic scripts |
| 2 | U+20000–U+2FFFF | SIP | CJK Extensions B–F and I |
| 3 | U+30000–U+3FFFF | TIP | CJK Extensions G, H |
| 14 | U+E0000–U+EFFFF | SSP | Tags, Variation Selectors Supplement |
| 15–16 | U+F0000–U+10FFFF | PUA | Supplementary Private Use Areas |

Planes 4–13 are currently unassigned, providing room for future growth.
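
Because each plane holds exactly 65,536 (0x10000) code points, the plane number is simply the code point shifted right by 16 bits. A minimal sketch, with plane_of as an illustrative name:

```python
def plane_of(char):
    """Return the plane number (0-16) of a single character."""
    return ord(char) >> 16

for ch in ("A", "\u4E2D", "\U0001F600", "\U00020000"):
    print(f"U+{ord(ch):04X} is in plane {plane_of(ch)}")
```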

Block Data File: Blocks.txt

The authoritative source for block definitions is the Unicode Character Database file Blocks.txt. Each line defines a block:

# Blocks-16.0.0.txt
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0100..017F; Latin Extended-A
0180..024F; Latin Extended-B
...
1F600..1F64F; Emoticons
1F650..1F67F; Ornamental Dingbats

You can download this file from https://www.unicode.org/Public/UCD/latest/ucd/Blocks.txt.
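
The line format is simple enough to parse by hand. The following sketch assumes the "start..end; name" layout shown above, with "#" starting a comment; it reads from an inline sample string rather than the downloaded file, and parse_blocks is an illustrative name.

```python
SAMPLE = """\
# Blocks-16.0.0.txt
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement

1F600..1F64F; Emoticons
"""

def parse_blocks(text):
    """Parse Blocks.txt-style text into (start, end, name) tuples."""
    blocks = []
    for line in text.splitlines():
        # Drop comments and skip blank lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        code_range, name = line.split(";", 1)
        start, end = code_range.split("..")
        blocks.append((int(start, 16), int(end, 16), name.strip()))
    return blocks

for start, end, name in parse_blocks(SAMPLE):
    print(f"U+{start:04X}..U+{end:04X}  {name}")
```

To use the real data, pass the contents of the downloaded Blocks.txt to the same function.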

Summary

  • A Unicode block is a contiguous, non-overlapping range of code points with a stable name.
  • Unicode 16.0 defines 338 blocks within the 17-plane code space.
  • Blocks differ from scripts (which writing system) and General Categories (what kind of character): a block is purely about location in the code point space.
  • Block boundaries are effectively fixed once published: in practice, blocks do not shrink, expand, or move, and characters that do not fit go into new blocks.
  • Use block information for browsing, font coverage, input method design, and quick range-based filtering.
  • For linguistic analysis and security, prefer the Script property over blocks.
  • The authoritative source is the UCD Blocks.txt file. The same data is exposed by fontTools and Java's Character.UnicodeBlock, or you can parse the file yourself and use range checks.
