General Category
Classification of every code point into one of 30 categories (Lu, Ll, Nd, So, etc.) grouped into 7 major classes: Letter, Mark, Number, Punctuation, Symbol, Separator, Other.
What Is the General Category?
Every Unicode code point is assigned a General Category (GC): a two-letter code that classifies it into a broad type such as uppercase letter, decimal digit, currency symbol, or control character. The category is one of the most fundamental Unicode properties and drives many text-processing decisions, from deciding whether a character counts as a letter to how it is handled in identifier parsing.
There are seven top-level categories (Letter, Mark, Number, Punctuation, Symbol, Separator, Other) subdivided into 30 specific categories.
The 30 Category Codes
| Code | Name | Example |
|---|---|---|
| Lu | Uppercase Letter | A, Ñ |
| Ll | Lowercase Letter | a, ñ |
| Lt | Titlecase Letter | ǅ (U+01C5) |
| Lm | Modifier Letter | ʰ (modifier h) |
| Lo | Other Letter | 中, あ, ب |
| Mn | Non-spacing Mark | combining acute ◌́ |
| Mc | Spacing Mark | ◌ा (Devanagari vowel sign) |
| Me | Enclosing Mark | combining enclosing circle |
| Nd | Decimal Digit | 0–9, ٠–٩ (Arabic-Indic) |
| Nl | Letter Number | Ⅻ (Roman numeral 12) |
| No | Other Number | ½, ² |
| Pc | Connector Punctuation | _ (underscore) |
| Pd | Dash Punctuation | -, – |
| Ps | Open Punctuation | ( [ { |
| Pe | Close Punctuation | ) ] } |
| Pi | Initial Punctuation | “ « |
| Pf | Final Punctuation | ” » |
| Po | Other Punctuation | ! . , |
| Sm | Math Symbol | + = ∑ |
| Sc | Currency Symbol | $ € ¥ |
| Sk | Modifier Symbol | ^ ` ¨ |
| So | Other Symbol | © ♥ 🔶 |
| Zs | Space Separator | U+0020, U+00A0 |
| Zl | Line Separator | U+2028 |
| Zp | Paragraph Separator | U+2029 |
| Cc | Control | U+0000–U+001F |
| Cf | Format | U+200B ZERO WIDTH SPACE |
| Cs | Surrogate | U+D800–U+DFFF |
| Co | Private Use | U+E000–U+F8FF |
| Cn | Unassigned | any unassigned code point |
import unicodedata

# (character, expected category) pairs to check
samples = [("A", "Lu?"), ("a", "Ll?"), ("1", "Nd?"),
           ("中", "Lo?"), (" ", "Zs?"), ("©", "So?")]
for char, expected in samples:
    gc = unicodedata.category(char)  # two-letter General Category code
    print(f" {char!r:6} category={gc:4} ({expected})")
#  'A'    category=Lu   (Lu?)
#  'a'    category=Ll   (Ll?)
#  '1'    category=Nd   (Nd?)
#  '中'    category=Lo   (Lo?)
#  ' '    category=Zs   (Zs?)
#  '©'    category=So   (So?)
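Because the first letter of each two-letter code names its major class, the seven top-level groups can be recovered without a separate lookup table of all 30 codes. The following is a minimal sketch of that idea; `MAJOR_CLASSES` and `major_class` are illustrative names, not part of `unicodedata`.

```python
import unicodedata

# Major-class names keyed by the first letter of the two-letter code.
# This mapping is written out by hand for illustration.
MAJOR_CLASSES = {
    "L": "Letter", "M": "Mark", "N": "Number", "P": "Punctuation",
    "S": "Symbol", "Z": "Separator", "C": "Other",
}

def major_class(char: str) -> str:
    """Return the major class name for a single character."""
    return MAJOR_CLASSES[unicodedata.category(char)[0]]

# One sample per major class: letter, combining acute, digit,
# underscore, euro sign, space, and a control character.
for ch in ["A", "\u0301", "5", "_", "€", " ", "\n"]:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {major_class(ch)}")
```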
Common Uses
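Several of Python's string methods are defined in terms of General Category: str.isalpha() accepts the letter categories (Lu, Ll, Lt, Lm, Lo), str.isdecimal() accepts Nd, and the str.isidentifier() rules rest on properties (XID_Start, XID_Continue) that are derived largely from GC. Regex \w is typically built from categories such as L, Nd, and Pc, but the exact set varies by engine; Python's re module matches alphanumeric characters plus the underscore. Security-sensitive applications use GC alongside script information when screening for confusable characters: two characters with GC=Ll (lowercase letter) that look similar but come from different scripts could be used in homograph attacks, and GC alone cannot tell them apart.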
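As a rough illustration of how string predicates relate to General Category, the sketch below reimplements two of them from `unicodedata.category` and compares the results with the built-in methods on a handful of samples. The helper names `is_letter` and `is_decimal_digit` are made up for this example, and the sample set is far from exhaustive. The last part shows why GC by itself does not expose a homograph: Latin and Cyrillic "a" share the same category.

```python
import unicodedata

LETTER_CODES = {"Lu", "Ll", "Lt", "Lm", "Lo"}

def is_letter(char: str) -> bool:
    """True if the character's General Category is one of the letter codes."""
    return unicodedata.category(char) in LETTER_CODES

def is_decimal_digit(char: str) -> bool:
    """True if the character's General Category is Nd."""
    return unicodedata.category(char) == "Nd"

# Compare the category-based helpers with the built-in predicates.
for ch in ["A", "ñ", "中", "ʰ", "٣", "½", "_"]:
    agrees = (is_letter(ch) == ch.isalpha()
              and is_decimal_digit(ch) == ch.isdecimal())
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} agrees={agrees}")

# Two look-alike letters from different scripts have the same category,
# so a script-aware check (here, just the character name) is still needed.
latin_a, cyrillic_a = "a", "\u0430"
same_category = unicodedata.category(latin_a) == unicodedata.category(cyrillic_a)
print(same_category, unicodedata.name(latin_a), "/", unicodedata.name(cyrillic_a))
```

Python's standard library does not expose the Script property, so a production check would need an additional data source; the character names are used here only to make the difference visible.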
Quick Facts
| Property | Value |
|---|---|
| Unicode property name | General_Category |
| Short alias | gc |
| Number of two-letter codes | 30 |
| Top-level groups | L, M, N, P, S, Z, C (7) |
| Python function | unicodedata.category(char) → two-letter string |
| Spec reference | Unicode Standard, Chapter 4 (Section 4.5, General Category); UAX #44 |
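To see that `unicodedata.category` returns a code for every code point, including surrogates, private-use, and unassigned ones, a short census over the whole code space can help. This is a sketch only: the per-category counts depend on the Unicode version bundled with the Python interpreter in use, and the loop visits about 1.1 million code points, so it takes a moment to run.

```python
import sys
import unicodedata
from collections import Counter

# Tally the General Category of every code point from U+0000 to U+10FFFF.
counts = Counter(
    unicodedata.category(chr(cp)) for cp in range(sys.maxunicode + 1)
)

# Roll the two-letter codes up into their one-letter major classes.
by_major = Counter()
for code, n in counts.items():
    by_major[code[0]] += n

print(f"distinct codes: {len(counts)}")
print(f"total code points: {sum(counts.values())}")
for major, n in sorted(by_major.items()):
    print(f"{major}: {n}")
```

Most of the "C" total comes from Cn (unassigned) and Co (private use), which is why the "Other" class dwarfs the rest.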