A named, contiguous range of code points (e.g., Basic Latin = U+0000–U+007F). Unicode 16.0 defines 338 blocks. Blocks never overlap, so a code point belongs to at most one block.

What is a script?

The writing system a character belongs to (e.g., Latin, Cyrillic, Han). Unicode 16.0 defines 168 scripts, and the Script property is central to security checks and mixed-script detection.

What is the Unicode Character Database (UCD)?

A collection of machine-readable data files that define all Unicode character properties, including UnicodeData.txt, Blocks.txt, and Scripts.txt.
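To make "machine-readable" concrete: each record in UnicodeData.txt is a single line of semicolon-separated fields, with the code point in the first field, the character name in the second, and the General Category in the third. The following is a minimal sketch under that assumption. The two sample records are embedded by hand rather than read from a downloaded file, and `parse_record` is a hypothetical helper name introduced only for illustration.

```python
import unicodedata

# Two sample records in the UnicodeData.txt layout (embedded here for
# illustration; a real program would read the file from the UCD).
SAMPLE_LINES = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;",
]


def parse_record(line):
    """Return (code point, name, General Category) from one record.

    Hypothetical helper: only the first three fields are interpreted.
    """
    fields = line.split(";")
    return int(fields[0], 16), fields[1], fields[2]


for line in SAMPLE_LINES:
    cp, name, gc = parse_record(line)
    # Cross-check the parsed category against Python's bundled copy of the UCD.
    agrees = unicodedata.category(chr(cp)) == gc
    print(f"U+{cp:04X}  {gc}  {name}  (matches unicodedata: {agrees})")
```

Python's `unicodedata` module is built from these same data files, which is why the cross-check is expected to agree for long-established characters such as these.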

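Properties

General Category

A scheme that classifies every code point into one of 30 categories (Lu, Ll, Nd, So, etc.), grouped into 7 major classes (Letter, Mark, Number, Punctuation, Symbol, Separator, Other).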

2022-01-31 · Updated 2024-10-14

What Is the General Category?

Every Unicode character is assigned a General Category (GC): a two-letter code that classifies it into a broad type such as uppercase letter, decimal digit, currency symbol, or control character. The category is one of the most fundamental Unicode properties and drives many text-processing decisions, from determining whether a character counts as a letter to how it should be handled in identifier parsing.

There are seven top-level categories (Letter, Mark, Number, Punctuation, Symbol, Separator, Other) subdivided into 30 specific categories.
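The first letter of each two-letter code names its top-level category, so the group can be recovered with a simple lookup. A minimal sketch, assuming Python's standard `unicodedata` module; the `MAJOR_CLASS` table and the `major_class` helper are hypothetical names introduced here for illustration:

```python
import unicodedata

# First letter of the two-letter code -> top-level category name.
MAJOR_CLASS = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "P": "Punctuation",
    "S": "Symbol",
    "Z": "Separator",
    "C": "Other",
}


def major_class(char):
    """Return the top-level category name for a single character."""
    return MAJOR_CLASS[unicodedata.category(char)[0]]


for ch in ("A", "5", "_", "+", " ", "\n", "\u0301"):
    print(f"{ch!r:10} {unicodedata.category(ch)}  {major_class(ch)}")
```

Testing only the first letter is a common shortcut when the distinction between, say, Lu and Ll does not matter.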

The 30 Category Codes

Code	Name	Example
Lu	Uppercase Letter	A, Ñ
Ll	Lowercase Letter	a, ñ
Lt	Titlecase Letter	Dž
Lm	Modifier Letter	ʰ (modifier h)
Lo	Other Letter	中, あ, ب
Mn	Non-spacing Mark	combining acute ◌́
Mc	Spacing Mark	◌ा (Devanagari vowel sign)
Me	Enclosing Mark	combining enclosing circle
Nd	Decimal Digit	0–9, ٠–٩ (Arabic-Indic)
Nl	Letter Number	Ⅻ (Roman numeral 12)
No	Other Number	½, ²
Pc	Connector Punctuation	_ (underscore)
Pd	Dash Punctuation	-, –
Ps	Open Punctuation	( [ {
Pe	Close Punctuation	) ] }
Pi	Initial Punctuation	“ «
Pf	Final Punctuation	” »
Po	Other Punctuation	! . ,
Sm	Math Symbol	+ = ∑
Sc	Currency Symbol	$ € ¥
Sk	Modifier Symbol	^ ` ¨
So	Other Symbol	© ♥ 🔶
Zs	Space Separator	U+0020, U+00A0
Zl	Line Separator	U+2028
Zp	Paragraph Separator	U+2029
Cc	Control	U+0000–U+001F, U+007F–U+009F
Cf	Format	U+200B ZERO WIDTH SPACE
Cs	Surrogate	U+D800–U+DFFF
Co	Private Use	U+E000–U+F8FF (plus planes 15 and 16)
Cn	Unassigned	any unassigned code point

import unicodedata

# Each pair is (character, expected General Category code).
samples = [("A", "Lu"), ("a", "Ll"), ("1", "Nd"),
           ("中", "Lo"), (" ", "Zs"), ("©", "So")]

for char, expected in samples:
    gc = unicodedata.category(char)  # two-letter General Category code
    print(f"  {char!r:6}  category={gc:4}  (expected {expected})")

# Output:
#   'A'     category=Lu    (expected Lu)
#   'a'     category=Ll    (expected Ll)
#   '1'     category=Nd    (expected Nd)
#   '中'     category=Lo    (expected Lo)
#   ' '     category=Zs    (expected Zs)
#   '©'     category=So    (expected So)
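The same lookup scales from single characters to whole strings. As a rough sketch, the categories in a piece of text can be tallied with `collections.Counter`; the `category_histogram` name is a hypothetical helper, not part of any library:

```python
from collections import Counter
import unicodedata


def category_histogram(text):
    """Count how many characters of each General Category appear in text."""
    return Counter(unicodedata.category(ch) for ch in text)


hist = category_histogram("Hello, World 42!")
for code in sorted(hist):
    print(code, hist[code])
```

A tally like this can be a quick way to spot unexpected control (Cc) or format (Cf) characters in input.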

Common Uses

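Several of Python's string predicates are defined in terms of the General Category: str.isalpha() accepts the five letter categories (Lu, Ll, Lt, Lm, Lo), and str.isdecimal() accepts Nd. str.isdigit() is slightly broader, because it is based on the Numeric_Type property and so also accepts characters such as ² (No). str.isidentifier() follows the XID_Start and XID_Continue properties, which are themselves derived mostly from General Category values. Regex \w is typically specified in similar terms: UTS #18 recommends alphabetic characters, marks (M), decimal digits (Nd), connector punctuation (Pc), and join controls, although the exact set varies by engine. Security-sensitive applications combine GC with the Script property to detect confusable characters. GC alone cannot separate two GC=Ll (lowercase letter) characters that look alike, but if they come from different scripts, such as Latin a and Cyrillic а, they could be used in a homograph attack.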
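The relationship between Python's string predicates and the General Category can be checked directly. The sketch below is illustrative only; `describe` is a hypothetical helper that collects the category and a few predicate results for one character. Note that `str.isdigit()` accepts the superscript two, whose category is No, while `str.isdecimal()` accepts only Nd characters.

```python
import unicodedata


def describe(ch):
    """Collect the General Category and some str predicates for one character."""
    return {
        "gc": unicodedata.category(ch),
        "isalpha": ch.isalpha(),
        "isdecimal": ch.isdecimal(),
        "isdigit": ch.isdigit(),
    }


# Latin letter, Han ideograph, ASCII digit, Arabic-Indic digit,
# superscript two, Roman numeral twelve, underscore.
for ch in ("a", "中", "5", "\u0663", "\u00b2", "\u216b", "_"):
    print(repr(ch), describe(ch))
```

When the exact category matters, calling `unicodedata.category()` is more precise than relying on the predicates, since each predicate folds several categories or properties together.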

Quick Facts

Property	Value
Unicode property name	`General_Category`
Short alias	`gc`
Number of two-letter codes	30
Top-level groups	L, M, N, P, S, Z, C (7)
Python function	`unicodedata.category(char)` → two-letter string
Spec reference	Unicode Standard Chapter 4