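General Category

A classification system that assigns every code point to one of 30 categories (Lu, Ll, Nd, So, and so on), grouped into 7 major classes: Letter, Mark, Number, Punctuation, Symbol, Separator, and Other.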

What Is the General Category?

Every Unicode code point is assigned a General Category (GC): a two-letter code that classifies it into a broad type such as uppercase letter, decimal digit, currency symbol, or control character. It is one of the most fundamental Unicode properties and feeds into many text-processing decisions, from deciding whether a character counts as a letter to how identifiers are parsed.

There are seven top-level categories (Letter, Mark, Number, Punctuation, Symbol, Separator, Other) subdivided into 30 specific categories.
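
Because the first letter of each two-letter code names its top-level group, the mapping from a character to its major class can be sketched in a few lines. The names `MAJOR_CLASSES` and `major_class` below are illustrative, not part of any library; only `unicodedata.category` comes from Python's standard library.

```python
import unicodedata

# Top-level group names keyed by the first letter of the two-letter code.
# (Illustrative table; the names follow the seven groups listed above.)
MAJOR_CLASSES = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "P": "Punctuation",
    "S": "Symbol",
    "Z": "Separator",
    "C": "Other",
}

def major_class(char: str) -> str:
    """Return the top-level group name for a single character."""
    code = unicodedata.category(char)  # e.g. "Lu"
    return MAJOR_CLASSES[code[0]]      # "L" -> "Letter"

print(major_class("A"))  # Letter
print(major_class("7"))  # Number
print(major_class("€"))  # Symbol
print(major_class(" "))  # Separator
```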

The 30 Category Codes

Code  Name                   Example
Lu    Uppercase Letter       A, Ñ
Ll    Lowercase Letter       a, ñ
Lt    Titlecase Letter       ǅ (U+01C5)
Lm    Modifier Letter        ʰ (modifier h)
Lo    Other Letter           中, あ, ب
Mn    Non-spacing Mark       combining acute ◌́
Mc    Spacing Mark           ◌ा (Devanagari vowel sign)
Me    Enclosing Mark         combining enclosing circle
Nd    Decimal Digit          0–9, ٠–٩ (Arabic-Indic)
Nl    Letter Number          Ⅻ (Roman numeral 12)
No    Other Number           ½, ²
Pc    Connector Punctuation  _ (underscore)
Pd    Dash Punctuation       -, –
Ps    Open Punctuation       ( [ {
Pe    Close Punctuation      ) ] }
Pi    Initial Punctuation    “ «
Pf    Final Punctuation      ” »
Po    Other Punctuation      ! . ,
Sm    Math Symbol            + = ∑
Sc    Currency Symbol        $ € ¥
Sk    Modifier Symbol        ^ ` ¨
So    Other Symbol           © ♥ 🔶
Zs    Space Separator        U+0020, U+00A0
Zl    Line Separator         U+2028
Zp    Paragraph Separator    U+2029
Cc    Control                U+0000–U+001F, U+007F–U+009F
Cf    Format                 U+200B ZERO WIDTH SPACE
Cs    Surrogate              U+D800–U+DFFF
Co    Private Use            U+E000–U+F8FF
Cn    Unassigned             any unassigned code point
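
In Python, unicodedata.category() returns the code for a single character:

import unicodedata

# Each pair is (character, category we expect to see).
samples = [("A", "Lu?"), ("a", "Ll?"), ("1", "Nd?"),
           ("中", "Lo?"), (" ", "Zs?"), ("©", "So?")]

for char, expected in samples:
    gc = unicodedata.category(char)  # two-letter code, e.g. "Lu"
    # repr() is padded to 6 columns and the code to 4 so the rows line up.
    print(f"  {char!r:6}  category={gc:4}  ({expected})")

# Output:
#   'A'     category=Lu    (Lu?)
#   'a'     category=Ll    (Ll?)
#   '1'     category=Nd    (Nd?)
#   '中'     category=Lo    (Lo?)
#   ' '     category=Zs    (Zs?)
#   '©'     category=So    (So?)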
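
The same call extends naturally from single characters to whole strings. The following is a minimal sketch that tallies the categories in a piece of text; `category_histogram` is a hypothetical helper name, built only on `unicodedata.category` and `collections.Counter` from the standard library.

```python
import unicodedata
from collections import Counter

def category_histogram(text: str) -> Counter:
    """Count how many characters of `text` fall into each General Category."""
    return Counter(unicodedata.category(ch) for ch in text)

hist = category_histogram("Hello, World 42!")
for code, count in sorted(hist.items()):
    print(code, count)

# Ll 8
# Lu 2
# Nd 2
# Po 2
# Zs 2
```

A histogram like this is a quick way to see what kinds of characters a string contains before deciding how to tokenize or validate it.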

Common Uses

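Python's own str.isalpha() and str.isdecimal() are defined directly in terms of General Category (the five letter categories and Nd, respectively), and the identifier rules behind str.isidentifier() build on it. In Unicode-aware regex engines, \w typically covers letters (L), decimal digits (Nd), and connector punctuation (Pc), plus marks and a few other characters depending on the engine. Security-sensitive applications use GC as one input when screening for confusable characters: two characters with GC=Ll (lowercase letter) that look similar but come from different scripts could be used in homograph attacks, so GC alone does not tell them apart.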
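
As a rough illustration of both points, the sketch below re-derives two string predicates from General Category and then shows a Latin/Cyrillic pair that share the same category. The helpers `is_letter`, `is_decimal_digit`, and `script_hint` are hypothetical names; `script_hint` simply takes the first word of the character name, which is an assumption made for illustration and not a real script lookup.

```python
import unicodedata

# The five letter categories (the "L" group).
LETTER_CODES = {"Lu", "Ll", "Lt", "Lm", "Lo"}

def is_letter(char: str) -> bool:
    """True if the character's General Category is one of the L* codes."""
    return unicodedata.category(char) in LETTER_CODES

def is_decimal_digit(char: str) -> bool:
    """True if the character's General Category is Nd."""
    return unicodedata.category(char) == "Nd"

def script_hint(char: str) -> str:
    """Crude script guess: first word of the character name (illustration only)."""
    return unicodedata.name(char).split()[0]

# For these samples, both comparisons with the built-in predicates print True.
for ch in ["A", "中", "٣", "½", "_"]:
    print(repr(ch), unicodedata.category(ch),
          is_letter(ch) == ch.isalpha(),
          is_decimal_digit(ch) == ch.isdecimal())

# A confusable pair: same General Category, different scripts.
latin_a = "a"          # U+0061 LATIN SMALL LETTER A
cyrillic_a = "\u0430"  # U+0430 CYRILLIC SMALL LETTER A
print(unicodedata.category(latin_a), unicodedata.category(cyrillic_a))  # Ll Ll
print(script_hint(latin_a), script_hint(cyrillic_a))                    # LATIN CYRILLIC
```

The name-prefix heuristic is only a stand-in here; the point is that a category check by itself reports both characters as ordinary lowercase letters.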

Quick Facts

Property                    Value
Unicode property name       General_Category
Short alias                 gc
Number of two-letter codes  30
Top-level groups            7 (L, M, N, P, S, Z, C)
Python function             unicodedata.category(char) → two-letter string
Spec reference              Unicode Standard, Chapter 4
