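General Category
A classification that assigns every code point to one of 30 categories (Lu, Ll, Nd, So, and so on), grouped into 7 major classes: Letter, Mark, Number, Punctuation, Symbol, Separator, and Other.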
What Is the General Category?
Every Unicode code point, including unassigned ones, has a General Category (GC): a two-letter code that classifies it into a broad type such as uppercase letter, decimal digit, currency symbol, or control character. The category is one of the most fundamental Unicode properties and informs many text-processing decisions, from determining whether a character counts as a letter to deciding how an identifier parser should treat it.
There are seven top-level categories (Letter, Mark, Number, Punctuation, Symbol, Separator, Other) subdivided into 30 specific categories.
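Because the first letter of each two-letter code names its top-level group, the major class can be read straight off the code. The following is a minimal sketch of that idea; `MAJOR_CLASSES` and `major_class` are hypothetical names introduced here for illustration and are not part of the standard library.

```python
import unicodedata

# Hypothetical lookup table: first letter of the GC code -> major class name.
MAJOR_CLASSES = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "P": "Punctuation",
    "S": "Symbol",
    "Z": "Separator",
    "C": "Other",
}


def major_class(char: str) -> str:
    """Return the top-level General Category group for a single character."""
    code = unicodedata.category(char)  # e.g. "Lu", "Nd", "Zs"
    return MAJOR_CLASSES[code[0]]


print(major_class("A"))       # Letter
print(major_class("7"))       # Number
print(major_class("\u0301"))  # Mark (combining acute accent, Mn)
print(major_class("\t"))      # Other (control character, Cc)
```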
The 30 Category Codes
| Code | Name | Example |
|---|---|---|
| Lu | Uppercase Letter | A, Ñ |
| Ll | Lowercase Letter | a, ñ |
| Lt | Titlecase Letter | Dž |
| Lm | Modifier Letter | ʰ (modifier h) |
| Lo | Other Letter | 中, あ, ب |
| Mn | Non-spacing Mark | combining acute ◌́ |
| Mc | Spacing Mark | ◌ा (Devanagari vowel sign) |
| Me | Enclosing Mark | combining enclosing circle |
| Nd | Decimal Digit | 0–9, ٠–٩ (Arabic-Indic) |
| Nl | Letter Number | Ⅻ (Roman numeral 12) |
| No | Other Number | ½, ² |
| Pc | Connector Punctuation | _ (underscore) |
| Pd | Dash Punctuation | -, – |
| Ps | Open Punctuation | ( [ { |
| Pe | Close Punctuation | ) ] } |
| Pi | Initial Punctuation | “ « |
| Pf | Final Punctuation | ” » |
| Po | Other Punctuation | ! . , |
| Sm | Math Symbol | + = ∑ |
| Sc | Currency Symbol | $ € ¥ |
| Sk | Modifier Symbol | ^ ` ¨ |
| So | Other Symbol | © ♥ 🔶 |
| Zs | Space Separator | U+0020, U+00A0 |
| Zl | Line Separator | U+2028 |
| Zp | Paragraph Separator | U+2029 |
| Cc | Control | U+0000–U+001F |
| Cf | Format | U+200B ZERO WIDTH SPACE |
| Cs | Surrogate | U+D800–U+DFFF |
| Co | Private Use | U+E000–U+F8FF |
| Cn | Unassigned | any unassigned code point |
```python
import unicodedata

# Each sample pairs a character with the category we expect it to have.
samples = [("A", "Lu?"), ("a", "Ll?"), ("1", "Nd?"),
           ("中", "Lo?"), (" ", "Zs?"), ("©", "So?")]

for char, expected in samples:
    gc = unicodedata.category(char)  # two-letter General Category code
    print(f" {char!r:6} category={gc:4} ({expected})")

#  'A'    category=Lu   (Lu?)
#  'a'    category=Ll   (Ll?)
#  '1'    category=Nd   (Nd?)
#  '中'    category=Lo   (Lo?)
#  ' '    category=Zs   (Zs?)
#  '©'    category=So   (So?)
```
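To see all 30 codes at once, one option is to tally the category of every code point from U+0000 to U+10FFFF. This is only a sketch: `category_census` is a hypothetical helper name, and the per-category counts depend on the Unicode version bundled with the Python build (reported by `unicodedata.unidata_version`), so no totals are shown here.

```python
import unicodedata
from collections import Counter


def category_census() -> Counter:
    """Count code points per General Category across the whole code space."""
    # Python strings may hold lone surrogates, so chr() works for U+D800-U+DFFF too.
    return Counter(unicodedata.category(chr(cp)) for cp in range(0x110000))


census = category_census()
print("Unicode data version:", unicodedata.unidata_version)
for code, count in sorted(census.items()):
    print(f"{code}: {count}")
```

Unassigned code points are reported as Cn, which is why the counts add up to the full 1,114,112 code points.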
Common Uses
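Python's str.isalpha() and str.isdecimal() are defined directly in terms of General Category (the letter categories Lu, Ll, Lt, Lm, and Lo, and the category Nd, respectively), and str.isidentifier() relies on the XID_Start and XID_Continue properties, which are derived largely from it. str.isdigit() is related but is defined by Numeric_Type rather than by GC. Regex \w is usually built from the same groups: UTS #18 recommends alphabetic characters, marks (M), decimal digits (Nd), and connector punctuation (Pc), while Python's re module uses str.isalnum() plus the underscore, so the exact set varies by engine. Security-sensitive applications combine GC with other properties such as Script when screening for confusable characters, because GC alone cannot tell them apart: Latin a (U+0061) and Cyrillic а (U+0430) are both GC=Ll and look identical, which is what makes homograph attacks possible.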
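The relationships above can be checked with the standard unicodedata module. The sketch below assumes the Python documentation's definitions, under which isalpha() corresponds to the five letter categories, isdecimal() corresponds to Nd, and isdigit() follows Numeric_Type instead; `LETTER_CATEGORIES` and `is_letter` are hypothetical names used only for this illustration. Since unicodedata does not expose the Script property, the character names stand in for it here.

```python
import unicodedata

# Hypothetical helper: the five letter categories behind str.isalpha().
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}


def is_letter(char: str) -> bool:
    """True if the character's General Category is one of the letter categories."""
    return unicodedata.category(char) in LETTER_CATEGORIES


# str.isalpha() agrees with the category-based check on these samples.
for ch in ["A", "中", "ʰ", "1", "_", "\u0301"]:
    assert is_letter(ch) == ch.isalpha()

# Arabic-Indic digit three is Nd, so it is a decimal character.
print(unicodedata.category("\u0663"), "\u0663".isdecimal())  # Nd True

# Superscript two is No: not decimal, but still a digit by Numeric_Type.
print(unicodedata.category("²"), "²".isdecimal(), "²".isdigit())  # No False True

# Latin and Cyrillic "a" share GC=Ll; only their names (and scripts) differ.
latin_a, cyrillic_a = "a", "\u0430"
print(unicodedata.category(latin_a), unicodedata.category(cyrillic_a))  # Ll Ll
print(unicodedata.name(latin_a))     # LATIN SMALL LETTER A
print(unicodedata.name(cyrillic_a))  # CYRILLIC SMALL LETTER A
```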
Quick Facts
| Property | Value |
|---|---|
| Unicode property name | General_Category |
| Short alias | gc |
| Number of two-letter codes | 30 |
| Top-level groups | L, M, N, P, S, Z, C (7) |
| Python function | unicodedata.category(char) → two-letter string |
| Spec reference | Unicode Standard Chapter 4 |