ブロック
名前付きの連続したコードポイント範囲(例:基本ラテン = U+0000〜U+007F)。Unicode 16.0は336個のブロックを定義し、すべてのコードポイントはちょうど1つのブロックに属します。
What Is a Unicode Block?
A Unicode block is a named, contiguous range of code points assigned to a related group of characters. The Unicode Standard divides the entire code point space (U+0000 to U+10FFFF) into 336 blocks, each a multiple of 16 code points in size. Block boundaries are fixed—they never change between Unicode versions—though new blocks may be allocated in previously unassigned ranges.
Each block has a distinctive name that broadly describes its contents: Basic Latin (U+0000–U+007F), Greek and Coptic (U+0370–U+03FF), or CJK Unified Ideographs (U+4E00–U+9FFF). The name is informational rather than prescriptive, so a block may contain characters from multiple scripts, or even unassigned code points.
Blocks vs. Scripts
Blocks are purely positional: a character belongs to exactly one block based on its numeric value. Scripts, by contrast, reflect linguistic or cultural affiliation. The Latin Extended Additional block (U+1E00–U+1EFF) contains Latin characters, but the Letterlike Symbols block (U+2100–U+214F) holds characters from many scripts such as ℂ (DOUBLE-STRUCK CAPITAL C) and ℓ (SCRIPT SMALL L). A single block can span multiple scripts, and one script can span multiple blocks.
import unicodedata
# Look up the block of a character using the Unicode data utilities
# Python's unicodedata module does not expose block directly,
# but you can derive it from the character name prefix or use the
# 'unicode_data' third-party package.
for char in ["A", "α", "中", "😀"]:
name = unicodedata.name(char, "<unnamed>")
cp = ord(char)
print(f"U+{cp:04X} {char} {name}")
# U+0041 A LATIN CAPITAL LETTER A → Basic Latin
# U+03B1 α GREEK SMALL LETTER ALPHA → Greek and Coptic
# U+4E2D 中 CJK UNIFIED IDEOGRAPH-4E2D → CJK Unified Ideographs
# U+1F600 😀 GRINNING FACE → Emoticons
Why Blocks Matter
Blocks appear in Unicode Character Ranges used by CSS (@font-face unicode-range), regular expressions (\p{Block=CJK_Unified_Ideographs} in Perl, PCRE, or Java), and font subsetting tools. Knowing a character's block helps font engineers decide which code-point ranges to include in a subset, reducing file size while preserving coverage.
Block assignments also guide rendering engines. For example, shaping engines like HarfBuzz use block membership as one heuristic when selecting a shaping script when no explicit script tag is available.
Quick Facts
| Property | Value |
|---|---|
| Unicode property name | Block |
| Short property alias | blk |
| Number of defined blocks (Unicode 15.1) | 336 |
| Smallest block | 16 code points (many Supplement blocks) |
| Largest block | CJK Unified Ideographs, 20,902 assigned (U+4E00–U+9FFF, 8,192 range) |
| Python access | No built-in; use unicodedata.name() + range lookup or unicodeblock package |
| Regex syntax | \p{Block=Basic_Latin} (PCRE/Perl/Java) |
| Spec reference | Unicode Standard Annex #44, Blocks.txt |
関連用語
プロパティ のその他の用語
文字が最初に割り当てられたUnicodeバージョン。システムやソフトウェアバージョン間での文字サポートを判断するのに役立ちます。
Unicode property (UAX#11) classifying characters as Narrow, Wide, Fullwidth, Halfwidth, Ambiguous, or …
Unicode property controlling how Arabic and Syriac characters connect to adjacent characters. …
Unicode property listing all scripts that use a character, broader than the …
文字を大文字・小文字・タイトルケースに変換するルール。ロケール依存の場合があり(トルコ語のI問題)、1対多のマッピングもあります(ß → SS)。
文字が属する文字体系(例:ラテン、キリル、漢字)。Unicode 16.0は168個のスクリプトを定義し、Scriptプロパティはセキュリティと混在スクリプト検出に重要です。
サポートしていないプロセスで目に見える効果なく無視できる文字で、異体字セレクター・ゼロ幅文字・言語タグなどが含まれます。
RTLコンテキストでグリフを水平に反転すべき文字。例:( → )、[ → ]、{ → }、« → »。
すべてのコードポイントを30個のカテゴリ(Lu・Ll・Nd・Soなど)の1つに分類する体系で、7つの主要クラス(文字・記号・数字・句読点・記号・区切り・その他)にグループ化されています。
同じ抽象的内容を持つが外観が異なる場合がある2つの文字シーケンス。正規等価より広い概念。例:fi ≈ fi、² ≈ 2。