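What is the Unicode Character Database (UCD)?

A collection of machine-readable data files that define all Unicode character properties, including UnicodeData.txt, Blocks.txt, and Scripts.txt.

Properties

General Category

A scheme that classifies every code point into exactly one of 30 categories (Lu, Ll, Nd, So, and so on), grouped into seven major classes (Letter, Mark, Number, Punctuation, Symbol, Separator, Other).

2022-01-31 · Updated 2024-10-14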

What Is the General Category?

Every Unicode character is assigned a General Category (GC): a two-letter code that classifies it into a broad type such as uppercase letter, decimal digit, currency symbol, or control character. The category is one of the most fundamental Unicode properties and drives many text-processing decisions, from determining whether a character counts as a letter to how it is handled in identifier parsing.

There are seven top-level categories (Letter, Mark, Number, Punctuation, Symbol, Separator, Other) subdivided into 30 specific categories.
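Because the first letter of each two-letter code names its top-level group, the major class can be read straight off the code. The following is a minimal sketch of that idea; the `MAJOR_CLASSES` mapping and the `major_class` helper are illustrative names chosen here, not part of any library.

```python
import unicodedata

# Illustrative mapping (our own labels) from the first letter of a
# General Category code to the name of its top-level group.
MAJOR_CLASSES = {
    "L": "Letter", "M": "Mark", "N": "Number", "P": "Punctuation",
    "S": "Symbol", "Z": "Separator", "C": "Other",
}

def major_class(char: str) -> str:
    """Return the top-level group of a single character."""
    return MAJOR_CLASSES[unicodedata.category(char)[0]]

# One sample per group: letter, combining acute, digit, punctuation,
# math symbol, space, and a control character.
for ch in ["A", "\u0301", "7", "!", "+", " ", "\n"]:
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {major_class(ch)}")
```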

The 30 Category Codes

Code	Name	Example
Lu	Uppercase Letter	A, Ñ
Ll	Lowercase Letter	a, ñ
Lt	Titlecase Letter	Dž
Lm	Modifier Letter	ʰ (modifier h)
Lo	Other Letter	中, あ, ب
Mn	Non-spacing Mark	combining acute ◌́
Mc	Spacing Mark	◌ा (Devanagari vowel sign)
Me	Enclosing Mark	combining enclosing circle
Nd	Decimal Digit	0–9, ٠–٩ (Arabic-Indic)
Nl	Letter Number	Ⅻ (Roman numeral 12)
No	Other Number	½, ²
Pc	Connector Punctuation	_ (underscore)
Pd	Dash Punctuation	-, –
Ps	Open Punctuation	( [ {
Pe	Close Punctuation	) ] }
Pi	Initial Punctuation	“ «
Pf	Final Punctuation	” »
Po	Other Punctuation	! . ,
Sm	Math Symbol	+ = ∑
Sc	Currency Symbol	$ € ¥
Sk	Modifier Symbol	^ ` ¨
So	Other Symbol	© ♥ 🔶
Zs	Space Separator	U+0020, U+00A0
Zl	Line Separator	U+2028
Zp	Paragraph Separator	U+2029
Cc	Control	U+0000–U+001F, U+007F–U+009F
Cf	Format	U+200B ZERO WIDTH SPACE
Cs	Surrogate	U+D800–U+DFFF
Co	Private Use	U+E000–U+F8FF
Cn	Unassigned	any unassigned code point

import unicodedata

# (character, expected General Category) pairs
samples = [("A", "Lu"), ("a", "Ll"), ("1", "Nd"),
           ("中", "Lo"), (" ", "Zs"), ("©", "So")]

for char, expected in samples:
    gc = unicodedata.category(char)
    status = "ok" if gc == expected else "MISMATCH"
    print(f"{char!r:6} category={gc}  expected={expected}  {status}")

# Output:
# 'A'    category=Lu  expected=Lu  ok
# 'a'    category=Ll  expected=Ll  ok
# '1'    category=Nd  expected=Nd  ok
# '中'    category=Lo  expected=Lo  ok
# ' '    category=Zs  expected=Zs  ok
# '©'    category=So  expected=So  ok
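To see all 30 codes at once, one option is to tally the category of every code point in the Unicode code space. This is only a sketch: the per-category counts depend on the Unicode version bundled with the Python build (`unicodedata.unidata_version`), so no totals are quoted here.

```python
import unicodedata
from collections import Counter

# Tally the General Category of every code point, U+0000 through U+10FFFF.
# Unassigned code points are reported as Cn, surrogates as Cs.
counts = Counter(unicodedata.category(chr(cp)) for cp in range(0x110000))

print("Unicode version:", unicodedata.unidata_version)
print(len(counts), "distinct categories")
for code, n in sorted(counts.items()):
    print(f"{code}  {n:>7}")
```

The scan visits about 1.1 million code points, so it takes a moment to run.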

Common Uses

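Several of Python's string predicates are defined in terms of General Category: str.isalpha() accepts characters in Lu, Ll, Lt, Lm, and Lo, and str.isdecimal() accepts characters in Nd. (str.isdigit() is defined by the Numeric_Type property instead.) The identifier rules checked by str.isidentifier() are based on the XID_Start and XID_Continue properties, which are themselves derived largely from General Category. In regular expressions, \w typically covers letters, digits, and connector punctuation such as the underscore, though the exact set varies by engine; Python's re module defines it as the characters accepted by str.isalnum() plus the underscore. Security-sensitive applications also consult GC when screening for confusable characters, but GC alone is not sufficient: Latin a (U+0061) and Cyrillic а (U+0430) are both Ll, look nearly identical, and come from different scripts, so homograph detection has to combine GC with the Script property.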
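The sketch below illustrates two of these points: how General Category lines up with Python's string predicates, and why GC by itself cannot separate look-alike letters from different scripts. The standard library does not expose the Script property, so `script_hint` falls back on the first word of the character name. That is a rough heuristic assumed for this example, not a real script lookup, and `script_hint` and `looks_mixed` are hypothetical helper names.

```python
import unicodedata

# General Category versus string predicates.
superscript_two = "\u00b2"   # SUPERSCRIPT TWO, category No
arabic_three = "\u0663"      # ARABIC-INDIC DIGIT THREE, category Nd
for ch in (superscript_two, arabic_three):
    print(unicodedata.category(ch), ch.isdecimal(), ch.isdigit())

latin_a = "\u0061"      # LATIN SMALL LETTER A
cyrillic_a = "\u0430"   # CYRILLIC SMALL LETTER A

# Both are lowercase letters, so General Category cannot tell them apart.
print(unicodedata.category(latin_a), unicodedata.category(cyrillic_a))

def script_hint(char: str) -> str:
    """Heuristic only: use the first word of the character name as a script label."""
    return unicodedata.name(char, "UNKNOWN").split()[0]

def looks_mixed(text: str) -> bool:
    """True if the letters in text appear to come from more than one script."""
    hints = {script_hint(c) for c in text
             if unicodedata.category(c).startswith("L")}
    return len(hints) > 1

print(looks_mixed("paypal"))         # all Latin letters
print(looks_mixed("p\u0430ypal"))    # second letter is Cyrillic
```

A production check would rely on the actual Script property and the confusables data published with Unicode rather than on character names.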

Quick Facts

Property	Value
Unicode property name	`General_Category`
Short alias	`gc`
Number of two-letter codes	30
Top-level groups	L, M, N, P, S, Z, C (7)
Python function	`unicodedata.category(char)` → two-letter string
Spec reference	Unicode Standard, Chapter 4 (Section 4.5, General Category)