
Unicode General Categories Explained

Every Unicode character belongs to a general category such as Letter, Number, Punctuation, or Symbol, which determines how it behaves in text processing. This guide walks through the Unicode general category system and shows how to use categories in code.

Published 2021-08-23 · Updated 2024-09-06

Every character in Unicode is assigned a General Category: a two-letter code that classifies the character as a letter, number, symbol, punctuation mark, separator, mark, or "other." This classification is one of the most important properties for code that processes text, because it tells you what kind of character you are dealing with before you know anything else about it. Regex engines, word-break algorithms, identifier validation, and security scanners all build on General Category, either directly or through properties derived from it.

This guide walks through the full General Category hierarchy, explains every two-letter value, shows you how to query categories in Python, JavaScript, Java, and regular expressions, and highlights the practical scenarios where categories matter most.

The Two-Level Hierarchy

Unicode organizes General Categories into 7 major classes, each subdivided into specific two-letter values. The first letter of the code identifies the major class; the second letter narrows it down.

Code	Major Class	Meaning
L	Letter	Alphabetic and ideographic characters
M	Mark	Combining characters (accents, diacritics)
N	Number	Numeric characters (digits, fractions, Roman numerals)
P	Punctuation	Connectors, dashes, quotes, brackets
S	Symbol	Math, currency, modifier, and other symbols
Z	Separator	Spaces, line separators, paragraph separators
C	Other	Control, format, surrogate, private-use, unassigned

When the Unicode Standard or a regex engine refers to Lu, the L means "Letter" and the u means "uppercase." When it refers to just L, it means all Letter subcategories combined.
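
The first-letter rule is easy to apply in code. The following is a minimal sketch in Python; the `MAJOR_CLASSES` table and the `major_class` helper are illustrative names, not part of any library.

```python
import unicodedata

# Names of the seven major classes, keyed by the first letter of the category code.
MAJOR_CLASSES = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "P": "Punctuation",
    "S": "Symbol",
    "Z": "Separator",
    "C": "Other",
}

def major_class(char: str) -> str:
    """Return the major class name for a single character (hypothetical helper)."""
    return MAJOR_CLASSES[unicodedata.category(char)[0]]

print(major_class("A"))   # Letter
print(major_class("9"))   # Number
print(major_class(" "))   # Separator
```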

Complete List of 30 General Categories

Letters (L) — 5 values

Code	Name	Example	Count (approx.)
Lu	Uppercase Letter	A, Ω, Д	~1,850
Ll	Lowercase Letter	a, ω, д	~2,230
Lt	Titlecase Letter	ǅ, ǈ, ǋ	~31
Lm	Modifier Letter	ʰ, ˈ, ᵃ	~397
Lo	Other Letter	あ, 中, ก	~131,000+

The Lo category dwarfs all others because it contains CJK ideographs, Hangul syllables, and characters from scripts that do not distinguish case. Lt is the smallest letter category: it covers four Latin Extended-B digraphs in which only the first component is capitalized (e.g., U+01C5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON: ǅ) plus 27 Greek Extended capitals written with iota adscript (e.g., U+1F88: ᾈ).

Marks (M) — 3 values

Code	Name	Example	Description
Mn	Non-spacing Mark	◌̀ (U+0300), ◌̈ (U+0308)	Combines with preceding character, no width
Mc	Spacing Combining Mark	◌া (U+09BE), ◌ி (U+0BBF)	Combines but occupies space
Me	Enclosing Mark	◌⃝ (U+20DD), ◌⃞ (U+20DE)	Encloses the preceding character

Marks are essential for scripts like Devanagari, Thai, Arabic, and Hebrew where vowels or tone indicators are written as combining characters above, below, or beside the base letter.
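
To see marks in practice, it can help to list the category of each code point in a short string. The sketch below assumes a small helper called `categories` (an illustrative name) and uses NFD normalization to split a precomposed letter into base plus mark.

```python
import unicodedata

def categories(text: str) -> list[str]:
    """Return the General Category of every code point in text (hypothetical helper)."""
    return [unicodedata.category(c) for c in text]

composed = "\u00e9"                                   # é as one precomposed code point
decomposed = unicodedata.normalize("NFD", composed)   # e followed by U+0301

print(categories(composed))          # ['Ll']
print(categories(decomposed))        # ['Ll', 'Mn']
print(categories("\u0915\u093f"))    # Devanagari कि: ['Lo', 'Mc']
```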

Numbers (N) — 3 values

Code	Name	Example
Nd	Decimal Digit	0–9, ٠–٩, ०–९
Nl	Letter Number	Ⅰ, Ⅱ, Ⅲ, Ⅳ (Roman numerals)
No	Other Number	½, ¼, ³, ①, ㊿

Nd includes digits from every script that has a positional numeral system — Arabic-Indic, Devanagari, Bengali, Thai, and dozens more. All Nd digits have a Numeric_Value property from 0 to 9, which makes them usable in numeric parsing.
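
As a rough illustration of numeric parsing with Nd digits, the sketch below builds an integer from digits of any script using `unicodedata.decimal()`. The function name `parse_decimal` is an assumption made for this example.

```python
import unicodedata

def parse_decimal(text: str) -> int:
    """Parse a string of Nd digits from any script into an int (hypothetical helper)."""
    if not text:
        raise ValueError("empty string")
    value = 0
    for char in text:
        if unicodedata.category(char) != "Nd":
            raise ValueError(f"not a decimal digit: {char!r}")
        # unicodedata.decimal() returns the digit's value, 0 through 9.
        value = value * 10 + unicodedata.decimal(char)
    return value

print(parse_decimal("2024"))                        # 2024
print(parse_decimal("\u0968\u0966\u0968\u096a"))    # Devanagari २०२४ -> 2024
```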

Punctuation (P) — 7 values

Code	Name	Example
Pc	Connector Punctuation	_ (underscore), ‿
Pd	Dash Punctuation	- , – , — , ―
Ps	Open Punctuation	( , [ , { , 「
Pe	Close Punctuation	) , ] , } , 」
Pi	Initial Quote Punctuation	« , ‘ , “
Pf	Final Quote Punctuation	» , ’ , ”
Po	Other Punctuation	. , ! , ? , @ , #

The distinction between Pi (initial quote) and Pf (final quote) matters for typographic algorithms that need to pair opening and closing quotation marks correctly.
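
A deliberately naive sketch of quote pairing with Pi and Pf follows. The `quotes_balanced` function is hypothetical, and it ignores real-world complications: U+2019 doubles as an apostrophe, and some languages use these characters in the opposite role.

```python
import unicodedata

def quotes_balanced(text: str) -> bool:
    """Rough check that every Pi character is later closed by a Pf character."""
    depth = 0
    for char in text:
        category = unicodedata.category(char)
        if category == "Pi":
            depth += 1
        elif category == "Pf":
            depth -= 1
            if depth < 0:   # closing quote with nothing open
                return False
    return depth == 0

print(quotes_balanced("\u00abbonjour\u00bb"))   # «bonjour» -> True
print(quotes_balanced("\u201chello"))           # opening quote never closed -> False
```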

Symbols (S) — 4 values

Code	Name	Example
Sm	Math Symbol	+ , − , = , ∑ , ∫
Sc	Currency Symbol	$ , € , ¥ , ₿ , ₹
Sk	Modifier Symbol	^ , ` , ¨ , ˜
So	Other Symbol	© , ® , ☀ , ★ , 🎵

So (Other Symbol) is the catch-all for many arrows, box-drawing characters, dingbats, most emoji, and much more. A variation selector that follows an emoji does not change the emoji's own category.

Separators (Z) — 3 values

Code	Name	Code Point(s)
Zs	Space Separator	U+0020 (space), U+00A0 (NBSP), U+2003 (em space), etc.
Zl	Line Separator	U+2028 only
Zp	Paragraph Separator	U+2029 only

Zl and Zp each contain exactly one character. In practice, most software uses \n (U+000A, category Cc) as a line separator instead of U+2028.
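
Python is one environment where U+2028 and U+2029 do act as line boundaries. The short example below shows `str.splitlines()` honoring them and prints the category of a few separator-like characters.

```python
import unicodedata

text = "first\u2028second\u2029third\nfourth"

# str.splitlines() treats U+2028 and U+2029 as line boundaries, alongside \n.
print(text.splitlines())   # ['first', 'second', 'third', 'fourth']

# Prints Zl, Zp, Cc, Zs, Zs for the five code points in turn.
for char in ("\u2028", "\u2029", "\n", " ", "\u00a0"):
    print(f"U+{ord(char):04X}", unicodedata.category(char))
```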

Other (C) — 5 values

Code	Name	Description
Cc	Control	65 C0/C1 control characters (U+0000–U+001F, U+007F–U+009F)
Cf	Format	Invisible formatting characters: ZWJ, ZWNJ, BOM, bidi marks
Cs	Surrogate	2,048 surrogate code points (U+D800–U+DFFF) — UTF-16 only
Co	Private Use	137,468 code points (BMP PUA + Planes 15–16)
Cn	Unassigned	All code points not yet assigned a character

Cs code points never appear as isolated characters — they exist only as part of UTF-16 surrogate pairs. Cn is the largest category by code point count, covering the vast unassigned space of Unicode's 17 planes.
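
The fixed-size categories above can be checked by tallying every code point. This is a sketch rather than a benchmark: counts for letter and symbol categories depend on the Unicode version bundled with your Python build (see `unicodedata.unidata_version`), while the values printed here do not change between versions.

```python
import sys
import unicodedata
from collections import Counter

# Tally the General Category of every code point (about 1.1 million lookups).
counts = Counter(unicodedata.category(chr(cp)) for cp in range(sys.maxunicode + 1))

for category in ("Cc", "Cs", "Co", "Zl", "Zp"):
    print(category, counts[category])   # 65, 2048, 137468, 1, 1

print("largest:", counts.most_common(1)[0][0])   # Cn
```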

Querying General Category in Code

Python

Python's unicodedata module provides the category() function:

import unicodedata

unicodedata.category("A")   # 'Lu' — Uppercase Letter
unicodedata.category("3")   # 'Nd' — Decimal Digit
unicodedata.category("$")   # 'Sc' — Currency Symbol
unicodedata.category("\u200D")  # 'Cf' — Format (Zero Width Joiner)

To test for a major class, check the first character of the result:

def is_letter(char: str) -> bool:
    return unicodedata.category(char).startswith("L")

def is_symbol(char: str) -> bool:
    return unicodedata.category(char).startswith("S")

JavaScript

JavaScript exposes General Category through Unicode property escapes in regular expressions (ES2018+):

// Test for a specific category
/^\p{Lu}$/u.test("A");      // true — Uppercase Letter
/^\p{Nd}$/u.test("٣");      // true — Arabic-Indic digit three

// Test for a major class
/^\p{Letter}$/u.test("あ");  // true
/^\p{Number}$/u.test("½");   // true
/^\p{Symbol}$/u.test("€");   // true

Every category has both a short and a long name, and either form works: \p{L} and \p{Letter} match the whole major class, while \p{Lu} and \p{Uppercase_Letter} match only that specific two-letter category.

Java

int cp = "A".codePointAt(0);
int type = Character.getType(cp);
// type == Character.UPPERCASE_LETTER (1)

// Or use regex with Unicode categories
"A".matches("\\p{Lu}");     // true
"3".matches("\\p{Nd}");     // true

Regular Expressions (PCRE, Python `regex`)

Most modern regex flavors support \p{Category}:

\p{L}     — Any letter (Lu + Ll + Lt + Lm + Lo)
\p{Lu}    — Uppercase letter only
\p{N}     — Any number (Nd + Nl + No)
\p{Nd}    — Decimal digit only
\p{P}     — Any punctuation
\p{S}     — Any symbol
\p{Z}     — Any separator
\p{C}     — Any "other" (control, format, surrogate, private use, unassigned)

In Python, you need the regex package (not the built-in re) to use \p{} syntax, or you can use unicodedata.category() directly.
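
If installing a third-party package is not an option, a standard-library approximation is possible. The sketch below emulates a per-character property test with `unicodedata.category()`; `has_category` and `find_all` are hypothetical helper names, and this is not a substitute for a real regex engine.

```python
import unicodedata

def has_category(char: str, spec: str) -> bool:
    """Per-character property test: 'L' matches any letter, 'Lu' only uppercase."""
    return unicodedata.category(char).startswith(spec)

def find_all(text: str, spec: str) -> list[str]:
    """Return every character of text whose category matches spec."""
    return [c for c in text if has_category(c, spec)]

sample = "Tokyo 東京 2024!"
print(find_all(sample, "Lu"))   # ['T']
print(find_all(sample, "Lo"))   # ['東', '京']
print(find_all(sample, "Nd"))   # ['2', '0', '2', '4']
print(find_all(sample, "P"))    # ['!']
```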

Why General Categories Matter

1. Identifier Validation

Programming languages define valid identifier characters using General Categories. For example, Python 3's identifier rules follow Unicode Standard Annex #31 (UAX #31):

Start characters: Lu, Ll, Lt, Lm, Lo, Nl, plus underscore
Continue characters: all start characters plus Mn, Mc, Nd, Pc

This means café = 42 is valid Python, because é is Ll (lowercase letter), while price₹ = 100 is a syntax error, because ₹ is Sc (currency symbol) and Sc appears in neither list. General Category is the gatekeeper.
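
The start and continue sets listed above can be approximated in a few lines and compared against Python's own `str.isidentifier()`. The function `looks_like_identifier` is an illustrative name, and it is only a sketch: it skips the normalization and extra properties that a full UAX #31 implementation handles.

```python
import unicodedata

START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONTINUE = START | {"Mn", "Mc", "Nd", "Pc"}

def looks_like_identifier(name: str) -> bool:
    """Approximate the category rules above; not a full UAX #31 implementation."""
    if not name:
        return False
    first, rest = name[0], name[1:]
    if first != "_" and unicodedata.category(first) not in START:
        return False
    return all(unicodedata.category(c) in CONTINUE for c in rest)

# Compare the sketch with the interpreter's built-in check.
for candidate in ("café", "_tmp1", "price₹", "1abc"):
    print(candidate, looks_like_identifier(candidate), candidate.isidentifier())
```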

2. Word Segmentation

The Unicode word-break algorithm (UAX #29) decides where words begin and end using the Word_Break property, whose values are derived largely from General Category. Roughly speaking, letters (L) and combining marks (M) stick together, while spaces (Zs) and most punctuation (P) create breaks. Editors rely on this kind of logic when Ctrl+Shift+Left selects a whole word.
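
A very rough approximation of that idea groups consecutive letters and marks into words. This is a sketch only: it is not UAX #29, and it ignores apostrophes, digits, and scripts written without spaces. The names `is_word_char` and `rough_words` are assumptions for the example.

```python
import unicodedata
from itertools import groupby

def is_word_char(char: str) -> bool:
    # Letters (L*) and marks (M*) are treated as part of a word.
    return unicodedata.category(char)[0] in ("L", "M")

def rough_words(text: str) -> list[str]:
    """Split text into runs of letters and marks, dropping everything else."""
    return ["".join(group) for is_word, group in groupby(text, key=is_word_char) if is_word]

print(rough_words("Déjà vu, encore!"))   # ['Déjà', 'vu', 'encore']
```

Because marks count as word characters, the result is the same whether the accents are precomposed or written as separate combining marks.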

3. Security — Confusable Detection

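Unicode Technical Report #36 (Unicode Security Considerations) and Unicode Technical Standard #39 (Unicode Security Mechanisms) describe how to flag suspicious text. A string that mixes letters from different scripts (e.g., Latin "a" and Cyrillic "а", both Ll) is a confusable candidate. Script membership comes from the separate Script property, but a General Category check is a cheap first filter: it picks out the letters and marks worth comparing and exposes invisible Cf characters.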
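
Python's standard library does not expose the Script property, so the sketch below makes a loud assumption: it treats the first word of each letter's character name (LATIN, CYRILLIC, GREEK, and so on) as a stand-in for its script. That heuristic is not what the Unicode security specifications prescribe; it only illustrates using General Category to select the letters first. Both function names are hypothetical.

```python
import unicodedata

def rough_scripts(text: str) -> set[str]:
    """Collect the first word of each letter's name as an approximate script label."""
    scripts = set()
    for char in text:
        # General Category filter: only letters are compared.
        if unicodedata.category(char).startswith("L"):
            name = unicodedata.name(char, "")
            if name:
                scripts.add(name.split()[0])
    return scripts

def is_mixed_script(text: str) -> bool:
    return len(rough_scripts(text)) > 1

print(rough_scripts("paypal"))          # {'LATIN'}
print(is_mixed_script("p\u0430ypal"))   # True: U+0430 is CYRILLIC SMALL LETTER A
```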

4. Text Rendering

Font engines use General Category to decide rendering behavior. Mn (non-spacing mark) characters are positioned relative to the preceding base character (above, below, or through it) without advancing the cursor. Zs characters produce whitespace. Cf characters are invisible but affect shaping (like ZWJ in emoji sequences).

5. Data Cleaning and Validation

When cleaning user input, General Category tells you what to keep and what to strip:

import unicodedata

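def clean_text(text: str) -> str:
    """Remove every "Other" (C*) character except newline and tab.

    Note: dropping Cf also removes ZWJ and ZWNJ, which breaks emoji
    sequences and some scripts; narrow the tuple below if that matters.
    """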
    return "".join(
        c for c in text
        if unicodedata.category(c) not in ("Cc", "Cf", "Cs", "Co", "Cn")
        or c in ("\n", "\t")  # Keep common whitespace controls
    )

Edge Cases and Surprises

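Most emoji are So (Other Symbol) — despite being rendered as colorful images, most base emoji characters have the General Category So, though a few do not (the keycap bases # and 0–9 are Po and Nd). Emoji skin-tone modifiers are Sk, the ZWJ that joins emoji sequences is Cf, and the variation selector U+FE0F is Mn.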
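
Listing the code points of an emoji sequence makes this visible. The helper name `describe` is an assumption for this sketch.

```python
import unicodedata

def describe(text: str) -> list[tuple[str, str]]:
    """Pair each code point (as U+XXXX) with its General Category."""
    return [(f"U+{ord(c):04X}", unicodedata.category(c)) for c in text]

thumbs_up_medium = "\U0001F44D\U0001F3FD"       # thumbs up + skin-tone modifier
technologist = "\U0001F469\u200D\U0001F4BB"     # woman + ZWJ + personal computer

print(describe(thumbs_up_medium))   # [('U+1F44D', 'So'), ('U+1F3FD', 'Sk')]
print(describe(technologist))       # [('U+1F469', 'So'), ('U+200D', 'Cf'), ('U+1F4BB', 'So')]
```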

Digits from other scripts are all Nd — Bengali ৩ (U+09E9), Thai ๓ (U+0E53), and Devanagari ३ (U+0969) are all Nd, just like ASCII 3. They all have Numeric_Value=3. However, many parsers only accept ASCII digits for numbers; Python's int() is a notable exception and accepts any Nd digit.

Underscore is Pc (Connector Punctuation) — not a letter, not a symbol. This is why it requires special handling in identifier rules: Pc is explicitly added to the "allowed" set for continue characters.

Surrogates (Cs) are ghosts — they exist in the code point space but should never appear as characters in well-formed text. They are artifacts of UTF-16 encoding.

The category of an unassigned code point is Cn — and it can change when a future Unicode version assigns a character to that code point. Code that hard-codes behavior based on Cn may break when Unicode is updated.
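
Because the answer depends on which Unicode version a runtime ships, it is worth checking that version before relying on Cn. The sketch below uses a hypothetical `is_assigned` helper; note that permanently reserved noncharacters such as U+FFFF also report Cn.

```python
import unicodedata

def is_assigned(char: str) -> bool:
    """True if this Python build's Unicode data assigns a character to the code point."""
    return unicodedata.category(char) != "Cn"

print(unicodedata.unidata_version)   # Unicode version bundled with this interpreter
print(is_assigned("A"))              # True
print(is_assigned("\uFFFF"))         # False: a noncharacter, reported as Cn
```

A validator that rejects every Cn code point will start accepting new characters only after the runtime's Unicode data is upgraded, so such a rule is tied to that version.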

Quick Reference Card

Task	Category Test
Is it a letter?	`L` (any of Lu, Ll, Lt, Lm, Lo)
Is it a digit?	`Nd`
Is it whitespace?	`Zs`, `Zl`, `Zp` (plus Cc for `\t`, `\n`, `\r`)
Is it punctuation?	`P` (any of Pc, Pd, Ps, Pe, Pi, Pf, Po)
Is it a symbol?	`S` (any of Sm, Sc, Sk, So)
Is it invisible?	`Cf` (format), `Cc` (control), `Cn` (unassigned)
Is it a combining mark?	`M` (any of Mn, Mc, Me)
Safe for identifiers?	`L`, `Nl` (start) + `Mn`, `Mc`, `Nd`, `Pc` (continue)

Summary

Unicode defines 30 General Categories organized into 7 major classes (L, M, N, P, S, Z, C).
Every code point has exactly one General Category — it is a required, non-optional property.
Categories drive identifier rules, word segmentation, security analysis, rendering, and data cleaning.
Query categories with unicodedata.category() in Python, \p{Lu} in regex, or Character.getType() in Java.
The Lo category contains the majority of assigned characters (CJK, Hangul, scripts without case), while Cn (unassigned) covers most of the total code point space.
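General Category is mostly stable for assigned characters, but it is not guaranteed to be immutable: a small number of characters have been reclassified between Unicode versions (for example, U+200B ZERO WIDTH SPACE moved from Zs to Cf in Unicode 4.0.1).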