Unicode in Regular Expressions
Unicode-aware regular expressions let you match characters by script, category, or property rather than just explicit character ranges, making patterns far more robust for international text. This guide covers Unicode regex features in Python and JavaScript, including \p{} properties and the u flag.
Regular expressions were designed when ASCII was king. As Unicode became universal, regex engines evolved to handle the full range of scripts, categories, and properties. Today, both Python and JavaScript offer robust Unicode-aware regex — but only if you use the right flags and syntax. This guide covers Unicode property escapes, categories, case-insensitive matching, and practical patterns for real-world text processing.
Why Unicode Matters in Regex
A naïve character class like [a-z] matches only the 26 lowercase basic Latin letters.
It will not match é, ñ, ü, 日, or any other non-ASCII letter. Similarly,
\w in an ASCII-mode regex matches only [a-zA-Z0-9_]. A Unicode-aware
\w matches well over a hundred thousand word characters across all scripts.
import re
# ASCII-only \w (forced with the re.ASCII flag)
re.findall(r'\w+', 'café', re.ASCII) # ['caf'] ← misses 'é'
# Python 3 str patterns are Unicode-aware by default
re.findall(r'\w+', 'café') # ['café'] ← correct in Python 3
Unicode Categories
The Unicode standard assigns every code point to one of 30 general categories, grouped into 7 major types. These categories are the foundation of Unicode regex property escapes.
Major Categories
| Category | Code | Description | Example |
|---|---|---|---|
| Letter | L | Any letter | A, é, 日, α |
| Mark | M | Combining marks | Diacritics, vowel marks |
| Number | N | Numeric characters | 0, ½, ٣ |
| Punctuation | P | Punctuation marks | ., !, ? |
| Symbol | S | Symbols (math, currency) | ©, €, ∞ |
| Separator | Z | Space and separators | SPACE, NBSP |
| Other | C | Control, private-use | NUL, U+E000 |
Sub-categories
| Sub-category | Code | Examples |
|---|---|---|
| Uppercase letter | Lu | A, Α, А |
| Lowercase letter | Ll | a, α, а |
| Decimal digit | Nd | 0–9, Arabic-Indic digits |
| Space separator | Zs | Regular space, no-break space |
| Math symbol | Sm | +, =, ∞, ∑ |
| Currency symbol | Sc | $, €, ¥, £ |
| Open punctuation | Ps | (, [, { |
| Close punctuation | Pe | ), ], } |
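To check which general category a given character falls into, Python's standard unicodedata module reports the two-letter code directly. The following is a small illustrative sketch; the helper name describe is hypothetical and not part of any library.

```python
import unicodedata

def describe(ch: str) -> tuple[str, str]:
    """Return (two-letter general category, major category letter) for one character."""
    cat = unicodedata.category(ch)
    return cat, cat[0]

# A few characters from the tables above
for ch in ["A", "\u00e9", "\u0663", "\u20ac", "(", "\u00a0"]:
    print(repr(ch), describe(ch))
```

The first letter of the code is the major category, so a check such as describe(ch)[1] == "L" corresponds to what \p{L} matches in engines that support property escapes.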
Python: re and regex Modules
Built-in Unicode Behaviour (Python 3)
Python 3's re module operates in Unicode mode by default for str patterns. The shorthands
\w, \d, \s, and their uppercase inverses match Unicode characters:
import re
re.match(r'\w+', 'héllo') # matches 'héllo': é is a word char
re.match(r'\d+', '٣٤٥') # matches Arabic-Indic digits
re.match(r'\s+', '\u00A0') # matches non-breaking space
To restrict to ASCII behaviour, use re.ASCII (or re.A):
re.match(r'\w+', 'héllo', re.ASCII) # matches only 'h', stops at é
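One practical consequence is that a \d pattern intended for ASCII digits also accepts digits from other scripts. The sketch below contrasts the two modes; the names UNICODE_DIGITS, ASCII_DIGITS, and is_ascii_number are illustrative, not part of the re module.

```python
import re

UNICODE_DIGITS = re.compile(r'\d+')          # Unicode mode (default for str patterns)
ASCII_DIGITS = re.compile(r'\d+', re.ASCII)  # restricts \d to [0-9]

def is_ascii_number(text: str) -> bool:
    """True only if text consists entirely of ASCII digits 0-9."""
    return ASCII_DIGITS.fullmatch(text) is not None

arabic_indic = '\u0663\u0664\u0665'  # the Arabic-Indic digits three, four, five

print(UNICODE_DIGITS.fullmatch(arabic_indic) is not None)  # True
print(is_ascii_number(arabic_indic))                       # False
print(int(arabic_indic))                                   # 345, int() accepts Unicode decimal digits
```

Whether to accept non-ASCII digits depends on the application: user-facing input may reasonably allow them, while identifiers or protocol fields usually should not.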
The regex Module
The third-party regex module (install with pip install regex) adds full
\p{} Unicode property escape support, which the standard re module lacks:
import regex
# \p{Letter}: any Unicode letter
regex.findall(r'\p{Letter}+', 'café résumé 日本語')
# ['café', 'résumé', '日本語']
# \p{N}: any Unicode number (includes fractions, Roman numerals)
regex.findall(r'\p{N}+', 'Score: 42 and Ⅷ and ½')
# ['42', 'Ⅷ', '½']
# \p{Sc}: currency symbols
regex.findall(r'\p{Sc}', '$42 and €15 and ¥100')
# ['$', '€', '¥']
# Negation with \P{}
regex.sub(r'\P{Letter}', '', 'Hello, World! 123')
# 'HelloWorld'
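When installing a third-party package is not an option, the effect of \p{L}+ can be approximated with the standard library alone by grouping characters on their general category. This is a sketch under that assumption, not a replacement for the regex module; find_letter_runs is a hypothetical helper name.

```python
import unicodedata
from itertools import groupby

def find_letter_runs(text: str) -> list[str]:
    """Return maximal runs of characters whose general category starts with 'L'."""
    def is_letter(ch: str) -> bool:
        return unicodedata.category(ch).startswith('L')
    # groupby yields consecutive runs sharing the same is_letter value
    return [''.join(run) for keep, run in groupby(text, key=is_letter) if keep]

print(find_letter_runs('caf\u00e9 r\u00e9sum\u00e9'))  # ['café', 'résumé']
print(find_letter_runs('Hello, World! 123'))           # ['Hello', 'World']
```

This only covers general categories; script and block properties have no equivalent in unicodedata.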
Script Properties
Script properties let you match characters from a specific writing system:
import regex
regex.findall(r'\p{Script=Greek}+', 'α + β = γ in ελληνικά')
# ['α', 'β', 'γ', 'ελληνικά']
regex.findall(r'\p{Script=Arabic}', 'مرحبا 123 hello')
# ['م', 'ر', 'ح', 'ب', 'ا']
regex.findall(r'\p{Script=Han}+', 'Unicode 统一码 和 中文')
# ['统一码', '和', '中文']
# Short form aliases work too
regex.findall(r'\p{sc=Latn}+', 'Hello Привет')
# ['Hello']
Block Properties
Block properties match characters in a Unicode block range:
import regex
regex.findall(r'\p{Block=Arrows}', 'Go → left ← and ↑ up')
# ['→', '←', '↑']
regex.findall(r'\p{Block=Mathematical_Operators}+', '∑ ∞ ∫ ≤ ≥')
# ['∑', '∞', '∫', '≤', '≥']
JavaScript: /u and /v Flags
JavaScript's built-in regex engine gained the /u flag in ES2015 and Unicode
property escapes, which require /u, in ES2018. The newer /v flag (ES2024) adds set notation.
The /u Flag
Without /u, the dot . matches a single UTF-16 code unit (anything except a
line terminator), so a supplementary character above U+FFFF counts as two
matches. With /u, it correctly treats supplementary characters as single units:
// Without /u
/^.$/.test("🐍"); // false — emoji has .length === 2 in JS strings
// With /u
/^.$/u.test("🐍"); // true
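The difference comes from JavaScript strings being sequences of UTF-16 code units, whereas Python strings are sequences of code points. As a rough illustration in Python, the sketch below counts both; utf16_units is a hypothetical helper name.

```python
def utf16_units(text: str) -> int:
    """Number of UTF-16 code units needed for text (what JavaScript's .length reports)."""
    return len(text.encode('utf-16-le')) // 2

snake = '\U0001F40D'  # the snake emoji, a supplementary-plane character

print(len(snake))          # 1 code point
print(utf16_units(snake))  # 2 code units (a surrogate pair)
print(utf16_units('A'))    # 1
```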
\p{} Property Escapes (ES2018+, requires /u)
// Any Unicode letter
/\p{Letter}/u.test("é"); // true
/\p{L}/u.test("日"); // true (L is short for Letter)
// Decimal digits in any script
/\p{Nd}/u.test("٣"); // true (Arabic-Indic digit THREE)
// Currency symbols
/\p{Sc}/u.test("€"); // true
// Emoji
/\p{Emoji}/u.test("🐍"); // true
// Script matching
/\p{Script=Greek}/u.test("α"); // true
/\p{sc=Cyrl}/u.test("А"); // true (sc= is short for Script=)
// Negation with \P{}
/\P{ASCII}/u.test("é"); // true: é is not ASCII
Common Unicode Regex Patterns in JavaScript
// Match only letters in any script
const letters = /\p{L}+/gu;
"café résumé".match(letters); // ["café", "résumé"]
// Match words including non-Latin letters
const word = /[\p{L}\p{N}_]+/gu;
"hello_42 café 日本語".match(word); // ["hello_42", "café", "日本語"]
// Strip non-letter, non-digit characters
const clean = str => str.replace(/[^\p{L}\p{N}]+/gu, ' ').trim();
clean("hello!!! café... 123"); // "hello café 123"
// Match any Unicode separator (\p{Z} does not cover tab or newline; use \s for those)
const space = /\p{Z}/u;
space.test("\u00A0"); // true: non-breaking space
space.test("\u3000"); // true: ideographic space
The /v Flag (ES2024)
The v flag supersedes u and adds set notation:
// Intersection: letters that are also ASCII
/[\p{Letter}&&\p{ASCII}]/v.test("A"); // true
/[\p{Letter}&&\p{ASCII}]/v.test("é"); // false: é is not in ASCII
// Difference: letters that are NOT ASCII (this includes accented Latin letters such as é)
/[\p{Letter}--\p{ASCII}]/v.test("日"); // true
/[\p{Letter}--\p{ASCII}]/v.test("A"); // false
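Python's standard re module has no set notation, but a lookahead can approximate intersection and difference for a single character. The sketch below assumes \w as a stand-in for \p{Letter}, since re has no \p{} support; note that \w also admits digits and the underscore. The constant names are illustrative.

```python
import re

# Difference: word characters minus ASCII (the negative lookahead rejects ASCII first)
NON_ASCII_WORD = re.compile(r'(?![\x00-\x7F])\w')
# Intersection: word characters that are also ASCII (positive lookahead)
ASCII_WORD = re.compile(r'(?=[\x00-\x7F])\w')

sample = 'A\u00e9\u65e51_'  # 'A', e-acute, the Han character for "day", '1', '_'

print(NON_ASCII_WORD.findall(sample))  # ['é', '日']
print(ASCII_WORD.findall(sample))      # ['A', '1', '_']
```

The lookahead runs at each position before \w consumes the character, so both conditions are tested against the same character.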
Case-Insensitive Matching
Unicode defines case folding rules for many scripts. Both Python and JavaScript support Unicode case-insensitive matching, but results differ:
import re, regex
# re: simple (one-to-one) Unicode case folding
re.match(r'café', 'CAFÉ', re.IGNORECASE) # match
# regex: full case folding needs the FULLCASE flag (it is on by default only in VERSION1 mode)
regex.match(r'ß', 'SS', regex.IGNORECASE | regex.FULLCASE) # match (ß folds to 'ss')
regex.match('\ufb01', 'fi', regex.IGNORECASE | regex.FULLCASE) # match (U+FB01, the ﬁ ligature, folds to 'fi')
// JavaScript /u with /i: simple Unicode case folding only
/café/ui.test("CAFÉ") // true
/ß/ui.test("SS") // false: there is no one-to-many folding, so ß never matches "SS"
/ß/ui.test("ẞ") // true: capital sharp s (U+1E9E) folds to ß
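When the goal is a caseless comparison of whole strings rather than a pattern match, Python's built-in str.casefold() applies full Unicode case folding without any third-party module. A minimal sketch, with caseless_equal as a hypothetical helper name:

```python
import re

def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings using full Unicode case folding."""
    return a.casefold() == b.casefold()

print(caseless_equal('Stra\u00dfe', 'STRASSE'))      # True: sharp s folds to 'ss'
print(caseless_equal('\ufb01', 'FI'))                # True: the fi ligature folds to 'fi'
print(re.fullmatch('\u00df', 'SS', re.IGNORECASE))   # None: re does not apply full case folding
```

casefold() does not normalize, so strings that differ only in composed versus decomposed accents still compare unequal unless they are normalized first.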
Practical Patterns
import regex
import unicodedata
# Validate a username that allows letters, digits, and underscore from any script
USERNAME = regex.compile(r'^[\p{L}\p{N}_]{3,30}$')
USERNAME.match('user_42') # match
USERNAME.match('用户名') # match: CJK username
# Find all prices in any currency
PRICE = regex.compile(r'\p{Sc}\d[\d,]*(?:\.\d+)?')
PRICE.findall('Pay $42.00 or €39 or ¥4500')
# ['$42.00', '€39', '¥4500']
# Strip accents (NFD decompose, remove combining marks)
def strip_accents(text: str) -> str:
    nfd = unicodedata.normalize('NFD', text)
    return regex.sub(r'\p{M}', '', nfd)
strip_accents('café résumé') # 'cafe resume'
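Normalization also matters in the other direction: text that arrives in decomposed form can split unexpectedly, because the standard re module does not treat combining marks as word characters. The sketch below normalizes to NFC before matching; find_words is a hypothetical helper name.

```python
import re
import unicodedata

def find_words(text: str) -> list[str]:
    """Normalize to NFC, then extract runs of word characters."""
    return re.findall(r'\w+', unicodedata.normalize('NFC', text))

decomposed = 'cafe\u0301'  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(re.findall(r'\w+', decomposed))  # ['cafe']: the combining mark is dropped from the match
print(find_words(decomposed))          # ['café'], with a single precomposed é
```

NFC only helps where a precomposed form exists; scripts that rely on standalone combining marks may still need a pattern that includes \p{M} via the regex module.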
Quick Reference
| Pattern | Python (regex) | JavaScript (/u) | Matches |
|---|---|---|---|
| Any letter | \p{L} | \p{L} | A, é, 日, α |
| Any digit | \p{Nd} | \p{Nd} | 0–9, ٣, ३ |
| Any space | \p{Z} | \p{Z} | SPACE, NBSP, ideographic |
| Latin script | \p{sc=Latn} | \p{sc=Latin} | a–z, A–Z, é, ñ |
| Greek script | \p{sc=Grek} | \p{sc=Greek} | α, β, Ω |
| Emoji | \p{Emoji} | \p{Emoji} | 🐍, ©, ⭐ (also the digits 0–9, # and *) |
| Currency | \p{Sc} | \p{Sc} | $, €, ¥, £ |
| Not a letter | \P{L} | \P{L} | anything non-letter |
Unicode regex transforms complex text processing tasks — detecting language scripts, validating international usernames, parsing multilingual documents — from fragile ASCII hacks into robust, standards-based patterns.