Unicode General Categories Explained

Every Unicode character belongs to a general category such as Letter, Number, Punctuation, or Symbol, which determines how it behaves in text processing. This guide walks through the Unicode general category system and shows how to use categories in code.

Every character in Unicode is assigned a General Category — a two-letter code that classifies the character as a letter, number, symbol, punctuation mark, separator, mark, or "other." This classification is the single most important property for any code that processes text, because it tells you what a character is before you know anything else about it. Regex engines, word-break algorithms, identifier validation, and security scanners all rely on General Category as their first line of decision-making.

This guide walks through the full General Category hierarchy, explains every two-letter value, shows you how to query categories in Python, JavaScript, Java, and regular expressions, and highlights the practical scenarios where categories matter most.

The Two-Level Hierarchy

Unicode organizes General Categories into 7 major classes, each subdivided into specific two-letter values. The first letter of the code identifies the major class; the second letter narrows it down.

Code | Major Class | Meaning
L | Letter | Alphabetic and ideographic characters
M | Mark | Combining characters (accents, diacritics)
N | Number | Numeric characters (digits, fractions, Roman numerals)
P | Punctuation | Connectors, dashes, quotes, brackets
S | Symbol | Math, currency, modifier, and other symbols
Z | Separator | Spaces, line separators, paragraph separators
C | Other | Control, format, surrogate, private-use, unassigned

When the Unicode Standard or a regex engine refers to Lu, the L means "Letter" and the u means "uppercase." When it refers to just L, it means all Letter subcategories combined.
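
As a small illustration of the two-level scheme, the sketch below splits a category code into its major class and its full two-letter value. The `MAJOR_CLASSES` table and the `describe` helper are names made up for this example; only `unicodedata.category()` comes from the standard library.

```python
import unicodedata

# Hypothetical lookup table: first letter of the code -> major class name.
MAJOR_CLASSES = {
    "L": "Letter", "M": "Mark", "N": "Number", "P": "Punctuation",
    "S": "Symbol", "Z": "Separator", "C": "Other",
}

def describe(char: str) -> tuple[str, str]:
    """Return (major class name, two-letter category) for one character."""
    code = unicodedata.category(char)
    return MAJOR_CLASSES[code[0]], code

print(describe("A"))   # ('Letter', 'Lu')
print(describe("_"))   # ('Punctuation', 'Pc')
print(describe(" "))   # ('Separator', 'Zs')
```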

Complete List of 30 General Categories

Letters (L) — 5 values

Code | Name | Example | Count (approx.)
Lu | Uppercase Letter | A, Ω, Д | ~1,850
Ll | Lowercase Letter | a, ω, д | ~2,330
Lt | Titlecase Letter | ǅ, ǈ, ǋ | ~31
Lm | Modifier Letter | ʰ, ˈ, ᵃ | ~397
Lo | Other Letter | あ, 中, ก | ~131,000+

Lo is by far the largest Letter category because it contains CJK ideographs, Hangul syllables, and characters from scripts that do not distinguish case. Lt is the smallest Letter category: it covers four Latin Extended-B digraphs in which only the first component is capitalized (e.g., U+01C5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON: ǅ) plus 27 Greek Extended letters written with an iota adscript (prosgegrammeni), such as U+1F88 ᾈ.
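
To see the three case forms of a digraph side by side, the following sketch starts from the lowercase character and asks Python for its titlecase and uppercase mappings. The variable names are illustrative only.

```python
import unicodedata

dz_lower = "\u01C6"            # LATIN SMALL LETTER DZ WITH CARON
dz_title = dz_lower.title()    # titlecase mapping
dz_upper = dz_lower.upper()    # uppercase mapping

for label, ch in [("lower", dz_lower), ("title", dz_title), ("upper", dz_upper)]:
    print(label, f"U+{ord(ch):04X}", unicodedata.category(ch))
# lower U+01C6 Ll
# title U+01C5 Lt
# upper U+01C4 Lu
```

Titlecase is therefore a third case form with its own code point, not just "uppercase the first letter".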

Marks (M) — 3 values

Code | Name | Example | Description
Mn | Non-spacing Mark | ◌̀ (U+0300), ◌̈ (U+0308) | Combines with preceding character, no width
Mc | Spacing Combining Mark | ◌া (U+09BE), ◌ி (U+0BBF) | Combines but occupies space
Me | Enclosing Mark | ◌⃝ (U+20DD), ◌⃞ (U+20DE) | Encloses the preceding character

Marks are essential for scripts like Devanagari, Thai, Arabic, and Hebrew where vowels or tone indicators are written as combining characters above, below, or beside the base letter.
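
A short sketch of how marks show up in practice: decomposing "é" with NFD yields a base letter followed by an Mn character. The `strip_marks` helper is a hypothetical name, and removing marks is only reasonable for rough matching in scripts where they are optional accents; in Devanagari or Thai it would destroy the text.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Decompose to NFD, then drop every character in major class M."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed
                   if not unicodedata.category(c).startswith("M"))

decomposed = unicodedata.normalize("NFD", "\u00E9")   # precomposed é
print([unicodedata.category(c) for c in decomposed])  # ['Ll', 'Mn']
print(strip_marks("caf\u00E9"))                       # cafe
```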

Numbers (N) — 3 values

Code | Name | Example
Nd | Decimal Digit | 0–9, ٠–٩, ०–९
Nl | Letter Number | Ⅰ, Ⅱ, Ⅲ, Ⅳ (Roman numerals)
No | Other Number | ½, ¼, ³, ①, ㊿

Nd includes digits from every script that has a positional numeral system — Arabic-Indic, Devanagari, Bengali, Thai, and dozens more. All Nd digits have a Numeric_Value property from 0 to 9, which makes them usable in numeric parsing.
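
The following is a minimal sketch of script-independent numeric parsing built on that guarantee. `parse_decimal` is a hypothetical helper; `unicodedata.decimal()` is the standard-library call that returns the 0 to 9 value of an Nd character.

```python
import unicodedata

def parse_decimal(text: str) -> int:
    """Parse a non-empty run of Nd digits from any script into an int."""
    if not text:
        raise ValueError("empty string")
    value = 0
    for ch in text:
        if unicodedata.category(ch) != "Nd":
            raise ValueError(f"not a decimal digit: {ch!r}")
        value = value * 10 + unicodedata.decimal(ch)
    return value

print(parse_decimal("\u0663\u0664"))  # 34 (Arabic-Indic digits three, four)
print(parse_decimal("\u096A\u0968"))  # 42 (Devanagari digits four, two)
print(int("\u0663\u0664"))            # 34; the built-in int() accepts Nd digits too
```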

Punctuation (P) — 7 values

Code | Name | Example
Pc | Connector Punctuation | _ (underscore), ‿
Pd | Dash Punctuation | - , – , — , ―
Ps | Open Punctuation | ( , [ , { , 「
Pe | Close Punctuation | ) , ] , } , 」
Pi | Initial Quote Punctuation | « , ‘ , “
Pf | Final Quote Punctuation | » , ’ , ”
Po | Other Punctuation | . , ! , ? , @ , # (also the straight quotes ' and ")

The distinction between Pi (initial quote) and Pf (final quote) matters for typographic algorithms that need to pair opening and closing quotation marks correctly.
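
As a rough sketch of such a pairing check, the function below treats Pi as "open" and Pf as "close" and verifies that they nest like brackets. `quotes_balanced` is a hypothetical helper and deliberately naive: it ignores language conventions (German, for instance, closes with U+201C), and it will misread U+2019 used as an apostrophe as a closing quote.

```python
import unicodedata

def quotes_balanced(text: str) -> bool:
    """Return True if Pi and Pf quotes nest like opening and closing brackets."""
    depth = 0
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Pi":
            depth += 1
        elif category == "Pf":
            depth -= 1
            if depth < 0:      # a closing quote with nothing open
                return False
    return depth == 0

print(quotes_balanced("\u201Chello\u201D"))  # True
print(quotes_balanced("\u201Chello"))        # False
print(quotes_balanced('"hello'))             # True: the straight quote is Po, so it is ignored
```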

Symbols (S) — 4 values

Code | Name | Example
Sm | Math Symbol | + , − , = , ∑ , ∫
Sc | Currency Symbol | $ , € , ¥ , ₿ , ₹
Sk | Modifier Symbol | ^ , ` , ¨ , ˜
So | Other Symbol | © , ® , ☀ , ★ , 🎵

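So (Other Symbol) is the catch-all for box-drawing characters, dingbats, many arrows, most emoji, and much more. Variation selectors and emoji modifiers are separate code points with their own categories; they never change the category of the base character.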

Separators (Z) — 3 values

Code | Name | Code Point(s)
Zs | Space Separator | U+0020 (space), U+00A0 (NBSP), U+2003 (em space), etc.
Zl | Line Separator | U+2028 only
Zp | Paragraph Separator | U+2029 only

Zl and Zp each contain exactly one character. In practice, most software uses \n (U+000A, category Cc) as a line separator instead of U+2028.
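
A quick way to confirm this in Python: `str.splitlines()` recognizes U+2028 and U+2029 as line boundaries alongside `\n`, even though the three characters sit in different categories. The sample string is made up for illustration.

```python
import unicodedata

text = "first\u2028second\u2029third\nfourth"
lines = text.splitlines()
print(lines)  # ['first', 'second', 'third', 'fourth']

for ch in ("\u2028", "\u2029", "\n", " "):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch))
# U+2028 Zl
# U+2029 Zp
# U+000A Cc
# U+0020 Zs
```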

Other (C) — 5 values

Code | Name | Description
Cc | Control | 65 C0/C1 control characters (U+0000–U+001F, U+007F–U+009F)
Cf | Format | Invisible formatting characters: ZWJ, ZWNJ, BOM, bidi marks
Cs | Surrogate | 2,048 surrogate code points (U+D800–U+DFFF), used only by UTF-16
Co | Private Use | 137,468 code points (BMP PUA + Planes 15–16)
Cn | Unassigned | All code points not yet assigned a character

Cs code points never appear as isolated characters — they exist only as part of UTF-16 surrogate pairs. Cn is the largest category by code point count, covering the vast unassigned space of Unicode's 17 planes.
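
These figures can be checked with a brute-force tally over the whole code space. This is a sketch, not an efficient approach: it calls `unicodedata.category()` about 1.1 million times, and the counts for growing categories depend on the Unicode version bundled with your Python build. The Cc, Cs, and Co totals are fixed ranges and should not vary.

```python
import unicodedata
from collections import Counter

# Tally the General Category of every code point, U+0000 through U+10FFFF.
counts = Counter(unicodedata.category(chr(cp)) for cp in range(0x110000))

print(len(counts))                                # 30 distinct categories
print(counts.most_common(3))                      # largest categories first
print(counts["Cc"], counts["Cs"], counts["Co"])   # 65 2048 137468
```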

Querying General Category in Code

Python

Python's unicodedata module provides the category() function:

import unicodedata

unicodedata.category("A")   # 'Lu' — Uppercase Letter
unicodedata.category("3")   # 'Nd' — Decimal Digit
unicodedata.category("$")   # 'Sc' — Currency Symbol
unicodedata.category("\u200D")  # 'Cf' — Format (Zero Width Joiner)

To test for a major class, check the first character of the result:

def is_letter(char: str) -> bool:
    return unicodedata.category(char).startswith("L")

def is_symbol(char: str) -> bool:
    return unicodedata.category(char).startswith("S")

JavaScript

JavaScript exposes General Category through Unicode property escapes in regular expressions (ES2018+):

// Test for a specific category
/^\p{Lu}$/u.test("A");      // true — Uppercase Letter
/^\p{Nd}$/u.test("٣");      // true — Arabic-Indic digit three

// Test for a major class
/^\p{Letter}$/u.test("あ");  // true
/^\p{Number}$/u.test("½");   // true
/^\p{Symbol}$/u.test("€");   // true

Each category has a short and a long alias. Major classes can be written as L or Letter, N or Number, S or Symbol; two-letter categories can be written as Lu or Uppercase_Letter, Nd or Decimal_Number, Sc or Currency_Symbol.

Java

int cp = "A".codePointAt(0);
int type = Character.getType(cp);
// type == Character.UPPERCASE_LETTER (1)

// Or use regex with Unicode categories
"A".matches("\\p{Lu}");     // true
"3".matches("\\p{Nd}");     // true

Regular Expressions (PCRE, Python regex)

Most modern regex flavors support \p{Category}:

\p{L}     — Any letter (Lu + Ll + Lt + Lm + Lo)
\p{Lu}    — Uppercase letter only
\p{N}     — Any number (Nd + Nl + No)
\p{Nd}    — Decimal digit only
\p{P}     — Any punctuation
\p{S}     — Any symbol
\p{Z}     — Any separator
\p{C}     — Any "other" (control, format, surrogate, private use, unassigned)

In Python, you need the regex package (not the built-in re) to use \p{} syntax, or you can use unicodedata.category() directly.
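
If adding a third-party dependency is not an option, a category test can be approximated with the standard library alone. The sketch below first shows that `\d` in the built-in `re` already matches any Nd character for `str` patterns, then defines `all_in_category`, a hypothetical stand-in for an anchored `\p{...}+` match.

```python
import re
import unicodedata

# In str patterns, \d matches any Nd character unless re.ASCII is set.
print(bool(re.fullmatch(r"\d+", "\u0663\u0664")))                  # True
print(bool(re.fullmatch(r"\d+", "\u0663\u0664", flags=re.ASCII)))  # False

def all_in_category(text: str, prefix: str) -> bool:
    """Rough stand-in for an anchored \\p{prefix}+ match using unicodedata."""
    return bool(text) and all(
        unicodedata.category(c).startswith(prefix) for c in text
    )

print(all_in_category("\u03A9\u0414", "Lu"))  # True: Greek Omega, Cyrillic De
print(all_in_category("\u03A9\u0434", "Lu"))  # False: second letter is lowercase
print(all_in_category("\u03A9\u0434", "L"))   # True: both are letters
```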

Why General Categories Matter

1. Identifier Validation

Programming languages define valid identifier characters using General Categories. For example, Python 3's identifier rules follow Unicode Standard Annex #31 (UAX #31):

  • Start characters: Lu, Ll, Lt, Lm, Lo, Nl, plus underscore
  • Continue characters: all start characters plus Mn, Mc, Nd, Pc

This means café = 42 is valid Python, because é is Ll (lowercase letter), while price₹ = 100 is a syntax error, because ₹ is Sc (currency symbol) and Sc appears in neither set. General Category is the gatekeeper.
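
The rules above can be sketched in a few lines and compared against Python's own `str.isidentifier()`. `looks_like_identifier`, `ID_START`, and `ID_CONTINUE` are hypothetical names, and the sketch is only an approximation: it leaves out the Other_ID_Start and Other_ID_Continue additions and the NFKC normalization that the real rules also apply.

```python
import unicodedata

ID_START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE = ID_START | {"Mn", "Mc", "Nd", "Pc"}

def looks_like_identifier(name: str) -> bool:
    """Approximate the category-based start/continue rules for identifiers."""
    if not name:
        return False
    first, rest = name[0], name[1:]
    if first != "_" and unicodedata.category(first) not in ID_START:
        return False
    return all(unicodedata.category(c) in ID_CONTINUE for c in rest)

for name in ("caf\u00E9", "price\u20B9", "_tmp1", "1st"):
    print(repr(name), looks_like_identifier(name), name.isidentifier())
```

For these four samples the sketch and the built-in agree: the first and third are accepted, the second and fourth are rejected.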

2. Word Segmentation

The Unicode word-break algorithm (UAX #29) works from the Word_Break property, whose values are largely derived from General Category. Letters (L) and the combining marks (M) that follow them stick together; spaces (Zs) and most punctuation (P) create breaks. Editors that follow these rules use them to decide what Ctrl+Shift+Left selects as a whole word.
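
To make the idea concrete, here is a deliberately naive tokenizer that uses only General Category. It is not an implementation of UAX #29, and `naive_words` is a hypothetical name. The apostrophe in the sample shows where a category-only approach falls short of the real algorithm, which keeps "It's" together.

```python
import unicodedata
from itertools import groupby

def naive_words(text: str) -> list[str]:
    """Group consecutive L/M/N characters into words; everything else breaks."""
    def is_word_char(ch: str) -> bool:
        return unicodedata.category(ch)[0] in "LMN"
    return ["".join(run) for keep, run in groupby(text, key=is_word_char) if keep]

print(naive_words("Hello, w\u00F6rld! It's 42."))
# ['Hello', 'wörld', 'It', 's', '42']
```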

3. Security — Confusable Detection

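Unicode Technical Standard #39 (Unicode Security Mechanisms), together with Technical Report #36 (Unicode Security Considerations), describes how to flag suspicious text. A string that mixes letters from different scripts (e.g., Latin "a" and Cyrillic "а") is a confusable candidate. Both of those letters are Ll, so General Category alone cannot tell them apart; mixed-script detection pairs it with the Script property. Category checks are still a useful early filter, for example to reject unexpected Cf or Cn code points before the script analysis runs.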
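
The sketch below shows why category alone is not enough for the Latin/Cyrillic example: both letters report Ll. Python's `unicodedata` does not expose the Script property, so `script_guess` falls back on the first word of the character name. That is a crude assumption made for this illustration, not the method the Unicode security documents specify.

```python
import unicodedata

latin_a, cyrillic_a = "a", "\u0430"   # U+0061 vs U+0430
print(unicodedata.category(latin_a), unicodedata.category(cyrillic_a))  # Ll Ll

def script_guess(ch: str) -> str:
    """Crude script label: first word of the character name (an assumption)."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def mixes_scripts(text: str) -> bool:
    """True if the letters in text come from more than one guessed script."""
    scripts = {script_guess(c) for c in text
               if unicodedata.category(c).startswith("L")}
    return len(scripts) > 1

print(mixes_scripts("paypal"))        # False
print(mixes_scripts("p\u0430ypal"))   # True: second letter is Cyrillic
```

A production check would use real Script data, for example from a library that ships the Unicode Character Database.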

4. Text Rendering

Font engines use General Category to decide rendering behavior. Mn (non-spacing mark) characters must be rendered on top of the preceding base character. Zs characters produce whitespace. Cf characters are invisible but affect shaping (like ZWJ in emoji sequences).
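
A rough sketch of how these categories affect layout: counting only the characters that normally advance the cursor. `approx_columns` is a hypothetical helper and a coarse approximation; it ignores East Asian wide characters, emoji sequences, and everything else a real shaping engine handles.

```python
import unicodedata

def approx_columns(text: str) -> int:
    """Count characters that normally advance the cursor (skips Mn, Me, Cf)."""
    return sum(1 for c in text
               if unicodedata.category(c) not in ("Mn", "Me", "Cf"))

print(len("e\u0301"), approx_columns("e\u0301"))    # 2 1  (e + combining acute)
print(len("a\u200Db"), approx_columns("a\u200Db"))  # 3 2  (ZWJ is invisible)
```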

5. Data Cleaning and Validation

When cleaning user input, General Category tells you what to keep and what to strip:

import unicodedata

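def clean_text(text: str) -> str:
    """Remove every "Other" (C*) code point except newline and tab.

    Note: this also strips Cf characters such as ZWJ and ZWNJ, which
    breaks emoji sequences and some Persian or Indic spellings.
    """
    return "".join(
        c for c in text
        if unicodedata.category(c) not in ("Cc", "Cf", "Cs", "Co", "Cn")
        or c in ("\n", "\t")  # Keep common whitespace controls
    )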

Edge Cases and Surprises

Most emoji are So (Other Symbol) — despite being rendered as colorful images, pictographic emoji such as 🎵 have the General Category So. Emoji sequences also involve other categories: the zero width joiner is Cf, skin-tone modifiers are Sk, and variation selectors are Mn.
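
Printing the category of each code point in a few emoji sequences makes this visible. The sample labels are descriptive names chosen for this sketch.

```python
import unicodedata

samples = {
    "woman technologist": "\U0001F469\u200D\U0001F4BB",      # woman + ZWJ + laptop
    "thumbs up, medium skin tone": "\U0001F44D\U0001F3FD",   # base + modifier
    "red heart": "\u2764\uFE0F",                             # heart + variation selector
}
for label, seq in samples.items():
    print(label, [unicodedata.category(c) for c in seq])
# woman technologist ['So', 'Cf', 'So']
# thumbs up, medium skin tone ['So', 'Sk']
# red heart ['So', 'Mn']
```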

Digits from other scripts are all Nd — Bengali ৩ (U+09E9), Thai ๓ (U+0E53), and Devanagari ३ (U+0969) are all Nd, just like ASCII 3, and all have Numeric_Value=3. Many parsers nevertheless accept only ASCII digits; Python's int() is an exception and accepts any Nd digit.

Underscore is Pc (Connector Punctuation) — not a letter, not a symbol. This is why it requires special handling in identifier rules: Pc is explicitly added to the "allowed" set for continue characters.

Surrogates (Cs) are ghosts — they exist in the code point space but should never appear as characters in well-formed text. They are artifacts of UTF-16 encoding.

The category of an unassigned code point is Cn — and it can change when a future Unicode version assigns a character to that code point. Code that hard-codes behavior based on Cn may break when Unicode is updated.
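
The result of a category lookup therefore depends on which Unicode version your runtime bundles. The sketch below prints that version and probes one code point that was assigned in Unicode 15.0; `is_assigned` is a hypothetical helper.

```python
import unicodedata

print(unicodedata.unidata_version)   # Unicode version bundled with this Python

# U+1FAE8 SHAKING FACE was added in Unicode 15.0; older databases report Cn.
shaking_face = unicodedata.category("\U0001FAE8")
print(shaking_face)                  # 'So' on 15.0 or later, 'Cn' before

def is_assigned(ch: str) -> bool:
    """True unless this interpreter's Unicode database reports Cn."""
    return unicodedata.category(ch) != "Cn"

# U+FFFF is a noncharacter, so it stays Cn in every version.
print(is_assigned("\uFFFF"))         # False
```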

Quick Reference Card

Task | Category Test
Is it a letter? | L (any of Lu, Ll, Lt, Lm, Lo)
Is it a digit? | Nd
Is it whitespace? | Zs, Zl, Zp (plus Cc for \t, \n, \r)
Is it punctuation? | P (any of Pc, Pd, Ps, Pe, Pi, Pf, Po)
Is it a symbol? | S (any of Sm, Sc, Sk, So)
Is it invisible? | Cf (format), Cc (control), Cn (unassigned)
Is it a combining mark? | M (any of Mn, Mc, Me)
Safe for identifiers? | L, Nl (start) + Mn, Mc, Nd, Pc (continue)

Summary

  • Unicode defines 30 General Categories organized into 7 major classes (L, M, N, P, S, Z, C).
  • Every code point has exactly one General Category — it is a required, non-optional property.
  • Categories drive identifier rules, word segmentation, security analysis, rendering, and data cleaning.
  • Query categories with unicodedata.category() in Python, \p{Lu} in regex, or Character.getType() in Java.
  • The Lo category contains the majority of assigned characters (CJK, Hangul, scripts without case), while Cn (unassigned) covers most of the total code point space.
  • General Category is mostly stable but not immutable: the Unicode stability policy does not freeze it, and a few assigned characters have been recategorized (U+180E MONGOLIAN VOWEL SEPARATOR moved from Zs to Cf in Unicode 6.3).
