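What is a block?

A named, contiguous range of code points (for example, Basic Latin = U+0000–U+007F). Unicode 16.0 defines 338 blocks. Blocks never overlap, so a code point belongs to at most one block; code points outside every block have the Block value No_Block.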

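What is a script?

The writing system a character belongs to (for example, Latin, Cyrillic, Han). Unicode 16.0 defines 168 scripts. The Script property matters for security checks such as mixed-script detection.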

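What is a general category?

A classification that assigns every code point to exactly one of 30 categories (Lu, Ll, Nd, So, and so on), grouped into 7 major classes: Letter, Mark, Number, Punctuation, Symbol, Separator, and Other.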
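As a small illustration of the two-level scheme, the sketch below maps the first letter of each two-letter category code to its major class. `unicodedata.category` is the standard-library call; the `MAJOR_CLASSES` table and the `major_class` helper are names made up for this example.

```python
import unicodedata

# First letter of the two-letter General_Category code -> major class.
MAJOR_CLASSES = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "P": "Punctuation",
    "S": "Symbol",
    "Z": "Separator",
    "C": "Other",
}


def major_class(ch: str) -> str:
    """Return the major class of a single character (hypothetical helper)."""
    return MAJOR_CLASSES[unicodedata.category(ch)[0]]


# A letter, a digit, a space, a math symbol, a comma, and a combining accent.
for ch in ["A", "5", " ", "+", ",", "\u0301"]:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {major_class(ch)}")
```

Because the major class is encoded in the first letter, a check such as "is this any kind of letter?" reduces to `unicodedata.category(ch).startswith("L")`.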

Unicode Standard

Unicode Character Database (UCD)

The collection of machine-readable data files that define all Unicode character properties, including UnicodeData.txt, Blocks.txt, and Scripts.txt.

2021-07-12 · Updated 2024-03-25

What is the Unicode Character Database?

The Unicode Character Database (UCD) is the authoritative, machine-readable repository of the properties of every Unicode code point. Where the Unicode Standard describes characters in prose and tables, the UCD provides the same information as structured data files that software can parse automatically. Major Unicode implementations, including ICU, Python's unicodedata, Java's Character class, and .NET's CharUnicodeInfo, derive their data from the UCD.

The UCD is distributed as a collection of plain-text files published on unicode.org for each Unicode version. The file formats are documented in Unicode Standard Annex #44 and range from simple range-to-value mappings to the multi-field UnicodeData.txt. All files are freely available for any use.
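Since each library is tied to one UCD release, it is often worth checking which one is in use. A minimal sketch for Python, using only the standard `unicodedata` module:

```python
import unicodedata

# Version of the Unicode Character Database this Python build was generated from.
print(unicodedata.unidata_version)

# The module also ships a frozen copy of the Unicode 3.2.0 database,
# kept for protocols that are pinned to that version.
print(unicodedata.ucd_3_2_0.unidata_version)
```

The first value depends on the Python release, so results from `unicodedata` can differ from the `latest` files on unicode.org when the interpreter is older than the current Unicode version.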

Core UCD Files

File	Description
`UnicodeData.txt`	One line per code point (large ranges such as CJK ideographs appear as First/Last pairs): name, general category, combining class, bidi class, decomposition, numeric values, case mappings
`PropList.txt`	Boolean properties like `White_Space`, `Dash`, `Diacritic`, `Extender`
`DerivedCoreProperties.txt`	Derived properties like `Alphabetic`, `Math`, `ID_Start`, `ID_Continue`
`Blocks.txt`	Block name → code point range mapping
`Scripts.txt`	Script assignment for each code point (Latin, Arabic, Han, etc.)
`emoji/emoji-data.txt`	Emoji-specific properties: `Emoji`, `Emoji_Presentation`, `Emoji_Modifier`
`NameAliases.txt`	Formal aliases (corrections, abbreviations, alternate names)
`CaseFolding.txt`	Case-insensitive comparison mappings
`NormalizationTest.txt`	Test vectors for NFC/NFD/NFKC/NFKD implementations
`CompositionExclusions.txt`	Code points excluded from canonical composition
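Several of these files, including Blocks.txt and Scripts.txt, share a simple layout: a code point or `START..END` range, a semicolon, the property value, and an optional `#` comment. The sketch below parses that layout and looks up a character by binary search. The sample lines are a small hand-picked excerpt in the style of Blocks.txt, and `parse_ranges` and `lookup` are illustrative names, not part of any library.

```python
from bisect import bisect_right

# A few lines in the style of Blocks.txt, embedded so the sketch runs offline.
SAMPLE_BLOCKS = """\
# Blocks-style sample (not the full file)
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0370..03FF; Greek and Coptic
0400..04FF; Cyrillic
4E00..9FFF; CJK Unified Ideographs
"""


def parse_ranges(text):
    """Parse 'START..END; value' or 'CODE; value' lines into sorted tuples."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        span, value = (part.strip() for part in line.split(";", 1))
        start, _, end = span.partition("..")
        ranges.append((int(start, 16), int(end or start, 16), value))
    return sorted(ranges)


def lookup(ranges, ch, default="No_Block"):
    """Binary-search the sorted ranges for the one containing ch."""
    cp = ord(ch)
    # Index of the last range whose start is <= cp.
    i = bisect_right(ranges, (cp, 0x110000, "")) - 1
    if i >= 0 and ranges[i][0] <= cp <= ranges[i][1]:
        return ranges[i][2]
    return default


blocks = parse_ranges(SAMPLE_BLOCKS)
for ch in ["A", "\u00e9", "\u0416", "\u4e2d"]:
    print(f"U+{ord(ch):04X}: {lookup(blocks, ch)}")
```

With the embedded sample, any character outside the five listed ranges falls back to the default; feeding the full file to `parse_ranges` would remove that limitation.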

UnicodeData.txt Format

The most important UCD file is UnicodeData.txt. Each line has 15 semicolon-delimited fields:

0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
│    │                       │  │ │
│    │                       │  │ └── Bidi class (L = Left-to-right)
│    │                       │  └──── Canonical combining class (0 = not combining)
│    │                       └─────── General category (Lu = Uppercase Letter)
│    └─────────────────────────────── Character name
└──────────────────────────────────── Code point (hex)

The 15 fields in order:

1. Code point (hex)
2. Character name
3. General category
4. Canonical combining class
5. Bidi class
6. Character decomposition
7–9. Numeric values (decimal, digit, numeric)
10. Mirror flag
11. Unicode 1.0 name (legacy)
12. ISO comment (deprecated)
13–15. Uppercase, lowercase, titlecase mappings
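The field order can be captured in a small record type. This is only a sketch: the attribute names below are this example's own labels for the 15 positions, not official property names, and most values are left as raw strings.

```python
from typing import NamedTuple


class UnicodeDataRecord(NamedTuple):
    """The 15 fields of one UnicodeData.txt line (names are illustrative)."""
    code_point: int
    name: str
    general_category: str
    combining_class: int
    bidi_class: str
    decomposition: str
    decimal_value: str
    digit_value: str
    numeric_value: str
    bidi_mirrored: bool
    unicode_1_name: str
    iso_comment: str
    uppercase: str
    lowercase: str
    titlecase: str


def parse_line(line: str) -> UnicodeDataRecord:
    """Split one line on ';' and convert the fields that have an obvious type."""
    f = line.rstrip("\n").split(";")
    if len(f) != 15:
        raise ValueError(f"expected 15 fields, got {len(f)}")
    return UnicodeDataRecord(
        int(f[0], 16), f[1], f[2], int(f[3]), f[4], f[5],
        f[6], f[7], f[8], f[9] == "Y", f[10], f[11], f[12], f[13], f[14],
    )


record = parse_line("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
print(record.name, record.general_category, record.lowercase)
```

Empty fields come through as empty strings, which is how the file expresses "no value" (for example, U+0041 has no uppercase mapping because it is already uppercase).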

Using the UCD in Practice

import unicodedata

# Python's unicodedata module is a UCD interface
char = "A"
print(unicodedata.name(char))           # LATIN CAPITAL LETTER A
print(unicodedata.category(char))       # Lu (Uppercase Letter)
print(unicodedata.bidirectional(char))  # L
print(unicodedata.combining(char))      # 0 (non-combining)
print(unicodedata.normalize("NFD", "\u00e9"))  # "e" followed by U+0301 COMBINING ACUTE ACCENT

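# Reading UnicodeData.txt directly (requires network access)
import urllib.request

url = "https://unicode.org/Public/UCD/latest/ucd/UnicodeData.txt"
with urllib.request.urlopen(url) as f:
    for raw_line in f:
        # Each line holds 15 semicolon-delimited fields; the first three are
        # the code point, the character name, and the general category.
        fields = raw_line.decode("utf-8").strip().split(";")
        cp, name, category = fields[0], fields[1], fields[2]
        if category == "So":  # Other Symbol
            print(f"U+{cp}: {name}")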

Common Pitfalls

Name vs Alias: UnicodeData.txt lists the normative name, which never changes once published, even when it contains a mistake. NameAliases.txt supplies formal aliases that correct such errors. For example, U+FE18 is officially named PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET (note the typo), and its alias in NameAliases.txt is PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET.
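The difference is visible from Python, where `unicodedata.name` returns the formal name and `unicodedata.lookup` also accepts formal aliases (Python 3.3 and later). A short sketch:

```python
import unicodedata

ch = "\uFE18"

# name() returns the immutable formal name, typo included.
print(unicodedata.name(ch))

# lookup() also resolves formal aliases, so the corrected spelling
# finds the same character.
corrected = "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET"
print(unicodedata.lookup(corrected) == ch)
```

Code that displays names to users may therefore want to prefer a correction alias when one exists, while code that matches names should accept both spellings.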

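Ranges in UnicodeData.txt: CJK Unified Ideographs are not listed individually, and the same holds for other large ranges such as Hangul syllables and the private-use areas. Instead, two lines mark the start and end of the range, so a parser that assumes one line per character will miss every code point in between: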

4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
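One way to handle these pairs is to treat every entry as a range and merge each First/Last pair into a single one. The sketch below does that for a three-line sample; `iter_ranges` is an illustrative helper, and it assumes a Last line always directly follows its First line, as in the example above.

```python
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
"""


def iter_ranges(lines):
    """Yield (first, last, label, general_category), merging First/Last pairs."""
    pending = None  # (start code point, label) of an open "<..., First>" line
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        cp, name, category = int(fields[0], 16), fields[1], fields[2]
        if name.endswith(", First>"):
            # Strip the leading "<" and the trailing ", First>" marker.
            pending = (cp, name[1:-len(", First>")])
        elif name.endswith(", Last>") and pending is not None:
            yield pending[0], cp, pending[1], category
            pending = None
        else:
            yield cp, cp, name, category


for first, last, label, category in iter_ranges(SAMPLE.splitlines()):
    count = last - first + 1
    print(f"U+{first:04X}..U+{last:04X} {label} ({category}), {count} code point(s)")
```

The label of a merged range is not a character name. Names for code points inside such ranges are derived by rule, for example CJK UNIFIED IDEOGRAPH-4E2D for U+4E2D.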

Quick Facts

Property	Value
Primary file	`UnicodeData.txt`
Total files in UCD	~60 files
URL	unicode.org/Public/UCD/latest/ucd/
License	Unicode License (free, attribution required)
Update frequency	Each Unicode version
Key consumers	ICU, Python unicodedata, Java Character, .NET CharUnicodeInfo
Fields in UnicodeData.txt	15 per line

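Other terms in Unicode Standard

CJK (Han, Kana, Hangul)

Chinese, Japanese, and Korean: a collective term for the unified Han ideograph blocks and related scripts in Unicode. The basic CJK Unified Ideographs block alone contains 20,992 characters.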

Han Unification

The process of mapping Chinese, Japanese, and Korean ideographs that share a …

Hangul Jamo

The individual consonant and vowel components (jamo) of the Korean Hangul writing …

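ISO 10646 / Universal Character Set

The international standard (ISO/IEC 10646) that is kept synchronized with Unicode. It defines the same character repertoire and code points but does not include Unicode's additional algorithms and properties.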

Unicode

A universal character encoding standard that assigns a unique number (code point) to every character of every writing system. Version 16.0 contains 154,998 assigned characters.

Unicode Standard Annex (UAX)

Normative or informative documents that are integral parts of the Unicode Standard. …

Unicode Technical Report (UTR)

Informational documents published by the Unicode Consortium covering specific topics like security …

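Unicode Consortium

The non-profit organization that develops and maintains the Unicode Standard. Its members include Apple, Google, Microsoft, Meta, and many other companies.

Unicode Scalar Value

Any code point except the surrogate code points (U+D800–U+DFFF). Scalar values are the set of valid values that can represent actual characters, 1,112,064 in total.

Unicode Version

A major release of the Unicode Standard that adds new characters, scripts, and features. Unicode 16.0, the version cited in this glossary, was released in September 2024.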
