Unicode General Categories Explained
Every character in Unicode is assigned a General Category: a two-letter code that classifies the character as a letter, mark, number, punctuation mark, symbol, separator, or "other." It is one of the most widely used character properties in text processing, because it tells you what kind of thing a character is before you know anything else about it. Regex engines, word-break algorithms, identifier validation, and security scanners all build on General Category, either directly or through properties derived from it.
This guide walks through the full General Category hierarchy, explains every two-letter value, shows you how to query categories in Python, JavaScript, Java, and regular expressions, and highlights the practical scenarios where categories matter most.
The Two-Level Hierarchy
Unicode organizes General Categories into 7 major classes, each subdivided into specific two-letter values. The first letter of the code identifies the major class; the second letter narrows it down.
| Code | Major Class | Meaning |
|---|---|---|
| L | Letter | Alphabetic and ideographic characters |
| M | Mark | Combining characters (accents, diacritics) |
| N | Number | Numeric characters (digits, fractions, Roman numerals) |
| P | Punctuation | Connectors, dashes, quotes, brackets |
| S | Symbol | Math, currency, modifier, and other symbols |
| Z | Separator | Spaces, line separators, paragraph separators |
| C | Other | Control, format, surrogate, private-use, unassigned |
When the Unicode Standard or a regex engine refers to Lu, the L means "Letter" and the u
means "uppercase." When it refers to just L, it means all Letter subcategories combined.
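As a small illustration of the two-level scheme, the sketch below splits a category code into its major class and its full two-letter value. The `MAJOR_CLASSES` table and the `describe` helper are names made up for this example; only `unicodedata.category()` comes from the standard library.

```python
import unicodedata

# Hypothetical lookup table: first letter of the code -> major class name.
MAJOR_CLASSES = {
    "L": "Letter", "M": "Mark", "N": "Number", "P": "Punctuation",
    "S": "Symbol", "Z": "Separator", "C": "Other",
}

def describe(char: str) -> tuple[str, str]:
    """Return (major class name, two-letter category) for one character."""
    code = unicodedata.category(char)
    return MAJOR_CLASSES[code[0]], code

print(describe("A"))   # ('Letter', 'Lu')
print(describe("_"))   # ('Punctuation', 'Pc')
```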
Complete List of 30 General Categories
Letters (L) — 5 values
| Code | Name | Example | Count (approx.) |
|---|---|---|---|
| Lu | Uppercase Letter | A, Ω, Д | ~1,850 |
| Ll | Lowercase Letter | a, ω, д | ~2,330 |
| Lt | Titlecase Letter | Dž, Lj, Nj | ~31 |
| Lm | Modifier Letter | ʰ, ˈ, ᵃ | ~397 |
| Lo | Other Letter | あ, 中, ก | ~131,000+ |
The Lo category dwarfs the other Letter categories because it contains CJK ideographs,
Hangul syllables, and characters from scripts that do not distinguish case. Lt is the
smallest Letter category, with 31 characters: four Latin digraphs in Latin Extended-B where
only the first component is capitalized (e.g., U+01C5 LATIN CAPITAL LETTER D WITH SMALL
LETTER Z WITH CARON: ǅ) and 27 Greek Extended letters written with an iota subscript
(prosgegrammeni).
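To see the three case forms of one digraph side by side, the sketch below applies Python's built-in case mappings to U+01C6 and prints the category of each result. It assumes nothing beyond the standard library.

```python
import unicodedata

lower = "\u01C6"          # LATIN SMALL LETTER DZ WITH CARON
title = lower.title()     # titlecase mapping gives U+01C5
upper = lower.upper()     # uppercase mapping gives U+01C4

for ch in (lower, title, upper):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch))
# U+01C6 Ll
# U+01C5 Lt
# U+01C4 Lu
```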
Marks (M) — 3 values
| Code | Name | Example | Description |
|---|---|---|---|
| Mn | Non-spacing Mark | ◌̀ (U+0300), ◌̈ (U+0308) | Combines with preceding character, no width |
| Mc | Spacing Combining Mark | ◌া (U+09BE), ◌ி (U+0BBF) | Combines but occupies space |
| Me | Enclosing Mark | ◌⃝ (U+20DD), ◌⃞ (U+20DE) | Encloses the preceding character |
Marks are essential for scripts like Devanagari, Thai, Arabic, and Hebrew where vowels or tone indicators are written as combining characters above, below, or beside the base letter.
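One common use of the Mark class is accent stripping. The sketch below is a minimal version, assuming canonical decomposition (NFD) is acceptable for the input; `strip_marks` is a hypothetical helper name. It is lossy by design and unsuitable for scripts such as Devanagari or Thai, where marks carry vowels or tones.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Drop every Mark (Mn, Mc, Me) after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = (c for c in decomposed if not unicodedata.category(c).startswith("M"))
    return unicodedata.normalize("NFC", "".join(kept))

# U+00E9 decomposes into a base letter plus a non-spacing mark.
print([unicodedata.category(c) for c in unicodedata.normalize("NFD", "\u00E9")])
# ['Ll', 'Mn']
print(strip_marks("caf\u00E9 na\u00EFve"))   # cafe naive
```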
Numbers (N) — 3 values
| Code | Name | Example |
|---|---|---|
| Nd | Decimal Digit | 0–9, ٠–٩, ०–९ |
| Nl | Letter Number | Ⅰ, Ⅱ, Ⅲ, Ⅳ (Roman numerals) |
| No | Other Number | ½, ¼, ³, ①, ㊿ |
Nd includes digits from every script that has a positional numeral system — Arabic-Indic,
Devanagari, Bengali, Thai, and dozens more. All Nd digits have a Numeric_Value property
from 0 to 9, which makes them usable in numeric parsing.
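As a sketch of that numeric parsing, the helper below maps any Nd character to its ASCII digit with `unicodedata.decimal()`. The name `to_ascii_digits` is made up for this example, and the approach assumes you want to normalize digits before handing text to a stricter parser.

```python
import unicodedata

def to_ascii_digits(text: str) -> str:
    """Replace every Nd character with its ASCII equivalent; leave the rest."""
    return "".join(
        str(unicodedata.decimal(c)) if unicodedata.category(c) == "Nd" else c
        for c in text
    )

print(unicodedata.decimal("\u0663"))                  # 3 (ARABIC-INDIC DIGIT THREE)
print(to_ascii_digits("\u0968\u0966\u0968\u096A"))    # 2024 (Devanagari digits)
print(int("\u0969\u096A"))                            # 34: int() accepts Nd digits
```

Characters in Nl and No are deliberately left alone, since they do not form positional numbers.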
Punctuation (P) — 7 values
| Code | Name | Example |
|---|---|---|
| Pc | Connector Punctuation | _ (underscore), ‿ |
| Pd | Dash Punctuation | - , – , — , ― |
| Ps | Open Punctuation | ( , [ , { , 「 |
| Pe | Close Punctuation | ) , ] , } , 」 |
| Pi | Initial Quote Punctuation | « , ‘ , “ |
| Pf | Final Quote Punctuation | » , ’ , ” |
| Po | Other Punctuation | . , ! , ? , @ , # |
The distinction between Pi (initial quote) and Pf (final quote) matters for typographic
algorithms that need to pair opening and closing quotation marks correctly.
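A quote-pairing routine would start by finding the marks and their roles. The sketch below does only that first step; `quote_roles` is a hypothetical helper, and real pairing also depends on language conventions (German, for example, closes a quotation with a character whose category is Pi).

```python
import unicodedata

def quote_roles(text: str) -> list[tuple[str, str]]:
    """List each quotation mark in text with its category (Pi or Pf)."""
    return [
        (c, unicodedata.category(c))
        for c in text
        if unicodedata.category(c) in ("Pi", "Pf")
    ]

print(quote_roles("\u00ABoui\u00BB and \u201Cyes\u201D"))
# [('«', 'Pi'), ('»', 'Pf'), ('“', 'Pi'), ('”', 'Pf')]
```

ASCII `'` and `"` are Po, so they never show up in this list.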
Symbols (S) — 4 values
| Code | Name | Example |
|---|---|---|
| Sm | Math Symbol | + , − , = , ∑ , ∫ |
| Sc | Currency Symbol | $ , € , ¥ , ₿ , ₹ |
| Sk | Modifier Symbol | ^ , ` , ¨ , ˜ |
| So | Other Symbol | © , ® , ☀ , ★ , 🎵 |
So (Other Symbol) is the catch-all for box-drawing characters, dingbats, many arrows,
most emoji, and many more.
Separators (Z) — 3 values
| Code | Name | Code Point(s) |
|---|---|---|
| Zs | Space Separator | U+0020 (space), U+00A0 (NBSP), U+2003 (em space), etc. |
| Zl | Line Separator | U+2028 only |
| Zp | Paragraph Separator | U+2029 only |
Zl and Zp each contain exactly one character. In practice, most software uses \n
(U+000A, category Cc) as a line separator instead of U+2028.
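The short example below shows how these separators look from Python. It uses only standard-library calls; the `samples` dictionary is just a convenience for this illustration.

```python
import unicodedata

samples = {"space": " ", "NBSP": "\u00A0", "LS": "\u2028", "PS": "\u2029", "LF": "\n"}

# Category differs, but str.isspace() is True for all five.
for label, ch in samples.items():
    print(label, unicodedata.category(ch), ch.isspace())

# str.splitlines() treats U+2028 and U+2029 as line boundaries, like \n.
print("one\u2028two\u2029three\nfour".splitlines())
# ['one', 'two', 'three', 'four']
```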
Other (C) — 5 values
| Code | Name | Description |
|---|---|---|
| Cc | Control | 65 C0/C1 control characters (U+0000–U+001F, U+007F–U+009F) |
| Cf | Format | Invisible formatting characters: ZWJ, ZWNJ, BOM, bidi marks |
| Cs | Surrogate | 2,048 surrogate code points (U+D800–U+DFFF) — UTF-16 only |
| Co | Private Use | 137,468 code points (BMP PUA + Planes 15–16) |
| Cn | Unassigned | Code points not yet assigned a character, plus the 66 noncharacters |
Cs code points never appear as isolated characters — they exist only as part of UTF-16
surrogate pairs. Cn is the largest category by code point count, covering the vast
unassigned space of Unicode's 17 planes.
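The figures above can be checked against the Unicode database bundled with your Python build. The sketch below tallies every code point; counts for categories that grow between Unicode versions will vary with `unicodedata.unidata_version`, so only the fixed ones are annotated.

```python
import sys
import unicodedata
from collections import Counter

# Tally the category of every code point, U+0000 through U+10FFFF.
counts = Counter(unicodedata.category(chr(cp)) for cp in range(sys.maxunicode + 1))

print(unicodedata.unidata_version)     # Unicode version of this Python build
print(len(counts))                     # 30 distinct two-letter values
for code in ("Cc", "Cs", "Co", "Zl", "Zp"):
    print(code, counts[code])          # 65, 2048, 137468, 1, 1
print(counts.most_common(1)[0][0])     # Cn
```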
Querying General Category in Code
Python
Python's unicodedata module provides the category() function:
import unicodedata
unicodedata.category("A") # 'Lu' — Uppercase Letter
unicodedata.category("3") # 'Nd' — Decimal Digit
unicodedata.category("$") # 'Sc' — Currency Symbol
unicodedata.category("\u200D") # 'Cf' — Format (Zero Width Joiner)
To test for a major class, check the first character of the result:
def is_letter(char: str) -> bool:
    return unicodedata.category(char).startswith("L")

def is_symbol(char: str) -> bool:
    return unicodedata.category(char).startswith("S")
JavaScript
JavaScript exposes General Category through Unicode property escapes in regular expressions (ES2018+):
// Test for a specific category
/^\p{Lu}$/u.test("A"); // true — Uppercase Letter
/^\p{Nd}$/u.test("٣"); // true — Arabic-Indic digit three
// Test for a major class
/^\p{Letter}$/u.test("あ"); // true
/^\p{Number}$/u.test("½"); // true
/^\p{Symbol}$/u.test("€"); // true
Each category has both a short and a long name: \p{L} and \p{Letter} both match the whole
major class, while \p{Lu} and \p{Uppercase_Letter} both match one specific category.
Java
int cp = "A".codePointAt(0);
int type = Character.getType(cp);
// type == Character.UPPERCASE_LETTER (1)
// Or use regex with Unicode categories
"A".matches("\\p{Lu}"); // true
"3".matches("\\p{Nd}"); // true
Regular Expressions (PCRE, Python regex)
Most modern regex flavors support \p{Category}:
\p{L} — Any letter (Lu + Ll + Lt + Lm + Lo)
\p{Lu} — Uppercase letter only
\p{N} — Any number (Nd + Nl + No)
\p{Nd} — Decimal digit only
\p{P} — Any punctuation
\p{S} — Any symbol
\p{Z} — Any separator
\p{C} — Any "other" (control, format, surrogate, private use, unassigned)
In Python, you need the regex package (not the built-in re) to use \p{} syntax, or
you can use unicodedata.category() directly.
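If installing the regex package is not an option, a rough stand-in for `\p{...}+` can be built on `unicodedata.category()`. The sketch below is one way to do it; `in_category` and `runs` are hypothetical helpers, and they cover only plain category tests, not the rest of regex syntax.

```python
import unicodedata
from itertools import groupby

def in_category(char: str, spec: str) -> bool:
    r"""Rough equivalent of \p{spec}: 'L' matches any letter, 'Lu' only uppercase."""
    return unicodedata.category(char).startswith(spec)

def runs(text: str, spec: str) -> list[str]:
    r"""Rough equivalent of finding every \p{spec}+ match in text."""
    return [
        "".join(group)
        for matched, group in groupby(text, key=lambda c: in_category(c, spec))
        if matched
    ]

print(runs("\u03A9mega-3 fatty acids", "L"))   # ['Ωmega', 'fatty', 'acids']
print(runs("Room 12, \u096A\u0968", "Nd"))     # ['12', '४२']
```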
Why General Categories Matter
1. Identifier Validation
Programming languages define valid identifier characters using General Categories. For example, Python 3's identifier rules follow Unicode Standard Annex #31 (UAX #31):
- Start characters: Lu, Ll, Lt, Lm, Lo, Nl, plus underscore
- Continue characters: all start characters plus Mn, Mc, Nd, Pc
This means café = 42 is valid Python, because é is Ll (lowercase letter), while
price₹ = 100 is a syntax error, because ₹ is Sc (currency symbol) and Sc appears in
neither set. General Category is the gatekeeper.
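The start/continue rule can be sketched directly from the category lists. The version below is a category-only approximation with a made-up name, `looks_like_identifier`; Python itself works from the derived XID_Start and XID_Continue properties and normalizes identifiers, so `str.isidentifier()` remains the authoritative check and is printed alongside for comparison.

```python
import unicodedata

START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONTINUE = START | {"Mn", "Mc", "Nd", "Pc"}

def looks_like_identifier(name: str) -> bool:
    """Category-only approximation of the start/continue rule."""
    if not name:
        return False
    first, rest = name[0], name[1:]
    # Underscore is Pc, so it needs an explicit exception as a start character.
    if first != "_" and unicodedata.category(first) not in START:
        return False
    return all(unicodedata.category(c) in CONTINUE for c in rest)

for candidate in ("caf\u00E9", "price\u20B9", "_tmp1", "1abc"):
    print(candidate, looks_like_identifier(candidate), candidate.isidentifier())
```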
2. Word Segmentation
The Unicode word-break algorithm (UAX #29) works from the Word_Break property, which is
derived in part from General Category. Roughly, letters (L) and combining marks (M) stick
together, while spaces (Zs) and most punctuation (P) create breaks. Editors that follow
these rules use them to decide what Ctrl+Shift+Left selects as a whole word.
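A deliberately naive tokenizer shows the basic idea. The sketch below is not UAX #29: it treats any run of letters, marks, and numbers as a word and everything else as a break, so it splits "can't" at the apostrophe and cannot segment scripts written without spaces. The helper names are made up for the example.

```python
import unicodedata
from itertools import groupby

def is_word_char(char: str) -> bool:
    """Letters, marks, and numbers count as word material in this sketch."""
    return unicodedata.category(char)[0] in "LMN"

def naive_words(text: str) -> list[str]:
    """Split text into runs of word characters; everything else is a break."""
    return ["".join(g) for is_word, g in groupby(text, key=is_word_char) if is_word]

# The combining acute accent (Mn) stays attached to its base letter.
print(naive_words("Hello, \u043C\u0438\u0440! cafe\u0301 2024"))
```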
3. Security — Confusable Detection
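Unicode Technical Standard #39 (Unicode Security Mechanisms) and its companion Technical
Report #36 (Unicode Security Considerations) describe how to flag suspicious text. A string
that mixes letters from different scripts (e.g., Latin "a" and Cyrillic "а", both Ll) is a
confusable candidate. Because the category alone cannot tell the two apart, detection also
needs the Script property; a category check is a cheap first filter that picks out the
letters worth examining.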
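The sketch below shows the shape of such a filter using only the standard library. Python's `unicodedata` does not expose the Script property, so this example assumes the first word of a character's Unicode name ("LATIN", "CYRILLIC", ...) is a good enough stand-in. That is a heuristic chosen for illustration, not what the Unicode security specifications define, and both helper names are hypothetical.

```python
import unicodedata

def script_hints(text: str) -> set[str]:
    """Collect the first word of each letter's Unicode name as a script hint."""
    hints = set()
    for c in text:
        # Category check first: only letters are examined.
        if unicodedata.category(c).startswith("L"):
            hints.add(unicodedata.name(c, "UNKNOWN").split()[0])
    return hints

def looks_mixed_script(text: str) -> bool:
    """Heuristic: more than one script hint among the letters."""
    return len(script_hints(text)) > 1

print(script_hints("paypal"))                  # {'LATIN'}
print(sorted(script_hints("p\u0430ypal")))     # ['CYRILLIC', 'LATIN']
print(looks_mixed_script("p\u0430ypal"))       # True
```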
4. Text Rendering
Font engines use General Category to decide rendering behavior. Mn (non-spacing mark) characters
must be rendered on top of the preceding base character. Zs characters produce whitespace.
Cf characters are invisible but affect shaping (like ZWJ in emoji sequences).
5. Data Cleaning and Validation
When cleaning user input, General Category tells you what to keep and what to strip:
import unicodedata
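def clean_text(text: str) -> str:
    """Remove characters in the Other (C) categories, except newline and tab."""
    # Note: dropping Cf also strips ZWJ and ZWNJ, which breaks emoji
    # sequences and affects scripts that rely on them for shaping.
    return "".join(
        c for c in text
        if unicodedata.category(c) not in ("Cc", "Cf", "Cs", "Co", "Cn")
        or c in ("\n", "\t")  # Keep common whitespace controls
    )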
Edge Cases and Surprises
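Most emoji are So (Other Symbol). Despite being rendered as colorful images, most base
emoji characters have the General Category So; a few are ordinary characters from other
categories, such as the digits (Nd) used in keycap sequences. Emoji modifiers and ZWJ
sequences involve Sk and Cf characters as well.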
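Taking one ZWJ sequence apart makes the mix of categories visible. The example below uses the "woman technologist" sequence with a medium skin tone and prints the category of each code point.

```python
import unicodedata

# WOMAN + medium skin tone modifier + ZERO WIDTH JOINER + PERSONAL COMPUTER
technologist = "\U0001F469\U0001F3FD\u200D\U0001F4BB"

for c in technologist:
    print(f"U+{ord(c):04X}", unicodedata.category(c))
# U+1F469 So
# U+1F3FD Sk
# U+200D Cf
# U+1F4BB So
```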
Digits from other scripts are all Nd — Bengali ৩ (U+09E9), Thai ๓ (U+0E53), and
Devanagari ३ (U+0969) are all Nd, just like ASCII 3. They all have Numeric_Value=3.
However, many parsers accept only ASCII digits for numbers; Python's int() is one exception and accepts any Nd digit.
Underscore is Pc (Connector Punctuation) — not a letter, not a symbol. This is why
it requires special handling in identifier rules: Pc is explicitly added to the "allowed"
set for continue characters.
Surrogates (Cs) are ghosts — they exist in the code point space but should never appear
as characters in well-formed text. They are artifacts of UTF-16 encoding.
The category of an unassigned code point is Cn — and it can change when a future
Unicode version assigns a character to that code point. Code that hard-codes behavior based
on Cn may break when Unicode is updated.
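Because the answer depends on the Unicode version in use, it helps to check which database your runtime ships. The sketch below reports that version and tests for Cn; `is_assigned` is a hypothetical helper name.

```python
import unicodedata

def is_assigned(char: str) -> bool:
    """True if this Python build's Unicode database assigns the code point."""
    return unicodedata.category(char) != "Cn"

print(unicodedata.unidata_version)   # Unicode version of this Python build
print(is_assigned("A"))              # True
print(is_assigned("\uFFFF"))         # False: U+FFFF is a permanent noncharacter
```

A result of False for an ordinary reserved code point is only true of the current database; the same call may return True after an upgrade.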
Quick Reference Card
| Task | Category Test |
|---|---|
| Is it a letter? | L (any of Lu, Ll, Lt, Lm, Lo) |
| Is it a digit? | Nd |
| Is it whitespace? | Zs, Zl, Zp (plus Cc for \t, \n, \r) |
| Is it punctuation? | P (any of Pc, Pd, Ps, Pe, Pi, Pf, Po) |
| Is it a symbol? | S (any of Sm, Sc, Sk, So) |
| Is it invisible? | Cf (format), Cc (control), Cn (unassigned) |
| Is it a combining mark? | M (any of Mn, Mc, Me) |
| Safe for identifiers? | L, Nl (start) + Mn, Mc, Nd, Pc (continue) |
Summary
- Unicode defines 30 General Categories organized into 7 major classes (L, M, N, P, S, Z, C).
- Every code point has exactly one General Category — it is a required, non-optional property.
- Categories drive identifier rules, word segmentation, security analysis, rendering, and data cleaning.
- Query categories with unicodedata.category() in Python, \p{Lu} in regex, or Character.getType() in Java.
- The Lo category contains the majority of assigned characters (CJK, Hangul, scripts without case), while Cn (unassigned) covers most of the total code point space.
- General Category is usually stable for assigned characters, but it is not guaranteed to be immutable: a later Unicode version can reclassify a character, as happened when U+180E MONGOLIAN VOWEL SEPARATOR moved from Zs to Cf in Unicode 6.3.