
Unicode in Regular Expressions

Unicode-aware regular expressions let you match characters by script, category, or property rather than by explicit character ranges, which makes patterns far more robust for international text. This guide covers Unicode regex features in Python and JavaScript, including \p{} property escapes and the u and v flags.

Regular expressions were designed when ASCII was king. As Unicode became universal, regex engines evolved to handle the full range of scripts, categories, and properties. Today, both Python and JavaScript offer robust Unicode-aware regex — but only if you use the right flags and syntax. This guide covers Unicode property escapes, categories, case-insensitive matching, and practical patterns for real-world text processing.

Why Unicode Matters in Regex

A naïve character class like [a-z] matches only the 26 basic lowercase Latin letters. It will not match é, ñ, ü, or any other non-ASCII letter. Similarly, \w in an ASCII-mode regex matches only [a-zA-Z0-9_], whereas a Unicode-aware \w matches well over a hundred thousand word characters across all scripts.

import re

# ASCII-only \w (forced with re.ASCII)
re.findall(r'\w+', 'café', re.ASCII)   # ['caf']   ← misses 'é'
# Python 3 str patterns are Unicode-aware by default
re.findall(r'\w+', 'café')             # ['café']  ← correct

Unicode Categories

The Unicode standard assigns every code point to one of 30 general categories, grouped into 7 major types. These categories are the foundation of Unicode regex property escapes.

Major Categories

| Category | Code | Description | Examples |
|---|---|---|---|
| Letter | L | Any letter | A, é, 日, α |
| Mark | M | Combining marks | Diacritics, vowel marks |
| Number | N | Numeric characters | 0, ½, ٣ |
| Punctuation | P | Punctuation marks | ., !, ? |
| Symbol | S | Symbols (math, currency) | ©, €, ∞ |
| Separator | Z | Space and separators | SPACE, NBSP |
| Other | C | Control, private-use | NUL, U+E000 |

Sub-categories

| Sub-category | Code | Examples |
|---|---|---|
| Uppercase letter | Lu | A, Α, А |
| Lowercase letter | Ll | a, α, а |
| Decimal digit | Nd | 0–9, Arabic-Indic digits |
| Space separator | Zs | Regular space, no-break space |
| Math symbol | Sm | +, =, ∞, ∑ |
| Currency symbol | Sc | $, €, ¥, £ |
| Open punctuation | Ps | (, [, { |
| Close punctuation | Pe | ), ], } |
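
The category codes in these tables can be inspected directly from Python's standard library. The sketch below is only an illustration: the helper names general_category and major_category are made up for this example, and the sample characters are arbitrary.

```python
import unicodedata

def general_category(ch: str) -> str:
    """Return the two-letter general category code, e.g. 'Lu'."""
    return unicodedata.category(ch)

def major_category(ch: str) -> str:
    """Return the one-letter major category, e.g. 'L'."""
    return unicodedata.category(ch)[0]

# Print code point, sub-category, and major category for a few samples.
for ch in ["A", "é", "日", "٣", "½", "€", "(", "\u00A0", "\u0301"]:
    print(f"U+{ord(ch):04X}  {general_category(ch)}  {major_category(ch)}")
```

The first letter of each two-letter code is the major category, which is why \p{L} covers Lu, Ll, and the other letter sub-categories.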

Python: re and regex Modules

Built-in Unicode Behaviour (Python 3)

Python 3's re module operates in Unicode mode by default for str patterns. The shorthands \w, \d, \s, and their uppercase inverses match Unicode characters:

import re

re.match(r'\w+', 'héllo')    # matches 'héllo' (é is a word character)
re.match(r'\d+', '٣٤٥')      # matches Arabic-Indic digits
re.match(r'\s+', '\u00A0')   # matches non-breaking space

To restrict to ASCII behaviour, use re.ASCII (or re.A):

re.match(r'\w+', 'héllo', re.ASCII)   # matches only 'h' (stops at é)
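
One practical consequence of the default is that \d accepts decimal digits from every script, which may or may not be what a validator wants. A minimal sketch of the difference follows; the constant names UNICODE_DIGITS and ASCII_DIGITS are assumptions made for this example.

```python
import re

UNICODE_DIGITS = re.compile(r"\d+")          # any Nd digit, any script
ASCII_DIGITS = re.compile(r"\d+", re.ASCII)  # only 0-9

arabic_indic = "٣٤٥"

# The default pattern accepts the Arabic-Indic digits; the ASCII one does not.
print(UNICODE_DIGITS.fullmatch(arabic_indic) is not None)
print(ASCII_DIGITS.fullmatch(arabic_indic) is not None)

# int() also understands Unicode decimal digits.
print(int(arabic_indic))
```

If downstream code expects only 0-9 (for example, a legacy database column), re.ASCII or an explicit [0-9] class is the safer choice.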

The regex Module

The third-party regex module (install with pip install regex) adds full \p{} Unicode property escape support, which the standard re module lacks:

import regex

# \p{Letter}: any Unicode letter
regex.findall(r'\p{Letter}+', 'café résumé 日本語')
# ['café', 'résumé', '日本語']

# \p{N}: any Unicode number (includes fractions, Roman numerals)
regex.findall(r'\p{N}+', 'Score: 42 and Ⅷ and ½')
# ['42', 'Ⅷ', '½']

# \p{Sc}: currency symbols
regex.findall(r'\p{Sc}', '$42 and €15 and ¥100')
# ['$', '€', '¥']

# Negation with \P{}
regex.sub(r'\P{Letter}', '', 'Hello, World! 123')
# 'HelloWorld'
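
When installing a third-party package is not an option, the standard re module can approximate \p{L} with a double-negated class: take the word characters and subtract digits and the underscore. This is a rough sketch rather than an exact equivalent, since the class may also admit some non-decimal numeric characters; the name LETTERS is chosen only for this example.

```python
import re

# "Not (non-word, digit, or underscore)" leaves roughly the letters.
LETTERS = re.compile(r"[^\W\d_]+")

print(LETTERS.findall("café résumé 日本語 42"))
```

For strict category matching, the regex module's \p{L} remains the more precise tool.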

Script Properties

Script properties let you match characters from a specific writing system:

import regex

regex.findall(r'\p{Script=Greek}+', 'α + β = γ in ελληνικά')
# ['α', 'β', 'γ', 'ελληνικά']

regex.findall(r'\p{Script=Arabic}', 'مرحبا 123 hello')
# ['م', 'ر', 'ح', 'ب', 'ا']

regex.findall(r'\p{Script=Han}+', 'Unicode 统一码 和 中文')
# ['统一码', '和', '中文']

# Short form aliases work too
regex.findall(r'\p{sc=Latn}+', 'Hello Привет')
# ['Hello']
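
The standard library exposes no Script property, but a rough guess can be made from the first word of each character's Unicode name. The sketch below is a heuristic only, not a substitute for \p{Script=...}: the function names rough_script and scripts_used are hypothetical, and labels such as 'CJK' are name prefixes rather than official script names.

```python
import unicodedata

def rough_script(ch: str) -> str:
    """Guess a script label from the first word of the character name."""
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"

def scripts_used(text: str) -> set:
    """Collect the rough script labels of all alphabetic characters."""
    return {rough_script(ch) for ch in text if ch.isalpha()}

print(scripts_used("Hello Привет"))
print(rough_script("α"), rough_script("中"))
```

A check like this can flag mixed-script strings (for example, Latin and Cyrillic letters in one identifier) before a more precise tool is applied.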

Block Properties

Block properties match characters in a Unicode block range:

import regex

regex.findall(r'\p{Block=Arrows}', 'Go → left ← and ↑ up')
# ['→', '←', '↑']

regex.findall(r'\p{Block=Mathematical_Operators}+', '∑ ∞ ∫ ≤ ≥')
# ['∑', '∞', '∫', '≤', '≥']
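
Because a block is simply a contiguous range of code points, the same match can be written with the standard re module as an explicit range. The sketch below uses the Arrows block, U+2190 to U+21FF; the name ARROWS_BLOCK is chosen only for this example.

```python
import re

# The Arrows block spans U+2190..U+21FF.
ARROWS_BLOCK = re.compile("[\u2190-\u21FF]")

print(ARROWS_BLOCK.findall("Go → left ← and ↑ up"))
```

The trade-off is readability: the named property documents intent, while the raw range needs a comment to explain which block it covers.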

JavaScript: /u and /v Flags

JavaScript's built-in regex engine gained the /u flag in ES2015 and Unicode property escapes, which require /u, in ES2018. The newer /v flag (ES2024) adds set notation.

The /u Flag

Without /u, a pattern operates on UTF-16 code units, so the dot . sees a character outside the Basic Multilingual Plane (above U+FFFF) as two separate units. With /u, the pattern operates on code points and treats each supplementary character as a single unit:

// Without /u
/^.$/.test("🐍");   // false — emoji has .length === 2 in JS strings

// With /u
/^.$/u.test("🐍");  // true
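
The two-unit behaviour comes from UTF-16 surrogate pairs. The following Python sketch (Python is used here only to inspect the encoding; the variable names are illustrative) shows that the snake emoji is one code point but two UTF-16 code units.

```python
import struct

snake = "\U0001F40D"  # 🐍, a supplementary-plane character

# Python strings count code points; JavaScript's .length counts UTF-16 units.
code_points = len(snake)
utf16_bytes = snake.encode("utf-16-le")
utf16_units = len(utf16_bytes) // 2

# Unpack the two 16-bit units: a high surrogate followed by a low surrogate.
high, low = struct.unpack("<2H", utf16_bytes)

print(code_points, utf16_units)
print(f"{high:04X} {low:04X}")
```

Without /u, a JavaScript pattern sees those two surrogate units individually, which is why ^.$ fails to match the single emoji.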

\p{} Property Escapes (ES2018+, requires /u)

// Any Unicode letter
/\p{Letter}/u.test("é");       // true
/\p{L}/u.test("日");           // true  (L is short for Letter)

// Decimal digits in any script
/\p{Nd}/u.test("٣");           // true  (Arabic-Indic digit THREE)

// Currency symbols
/\p{Sc}/u.test("€");           // true

// Emoji
/\p{Emoji}/u.test("🐍");       // true

// Script matching
/\p{Script=Greek}/u.test("α"); // true
/\p{sc=Cyrl}/u.test("А");      // true  (sc= is short for Script=)

// Negation with \P{}
/\P{ASCII}/u.test("é");        // true (é is not ASCII)

Common Unicode Regex Patterns in JavaScript

// Match only letters in any script
const letters = /\p{L}+/gu;
"café résumé".match(letters);    // ["café", "résumé"]

// Match words including non-Latin letters
const word = /[\p{L}\p{N}_]+/gu;
"hello_42 café 日本語".match(word);  // ["hello_42", "café", "日本語"]

// Strip non-letter, non-digit characters
const clean = str => str.replace(/[^\p{L}\p{N}]+/gu, ' ').trim();
clean("hello!!! café... 123");   // "hello café 123"

// Match any type of space
const space = /\p{Z}/u;
space.test("\u00A0");   // true (non-breaking space)
space.test("\u3000");   // true (ideographic space)

The /v Flag (ES2024)

The v flag is an upgraded alternative to u (the two cannot be combined) and adds set notation for character classes:

// Intersection: letters that are also ASCII
/[\p{Letter}&&\p{ASCII}]/v.test("A");   // true
/[\p{Letter}&&\p{ASCII}]/v.test("é");   // false (é is not in ASCII)

// Difference: letters that are NOT ASCII (é, 日, α, ...)
/[\p{Letter}--\p{ASCII}]/v.test("日");  // true
/[\p{Letter}--\p{ASCII}]/v.test("A");   // false
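
Engines without set notation can often emulate intersection and difference with lookaheads. The Python sketch below does this with the standard re module; it is an approximation that uses [^\W\d_] as a stand-in for "letter", and the pattern names are assumptions made for this example.

```python
import re

# Difference: a letter that is NOT in the ASCII range.
NON_ASCII_LETTER = re.compile(r"(?![\x00-\x7F])[^\W\d_]")

# Intersection: a letter that IS in the ASCII range.
ASCII_LETTER = re.compile(r"(?=[\x00-\x7F])[^\W\d_]")

print(NON_ASCII_LETTER.findall("Aé日1"))
print(ASCII_LETTER.findall("Aé日1_"))
```

The lookahead tests one set without consuming the character, and the class that follows tests the other, so both conditions apply to the same position.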

Case-Insensitive Matching

Unicode defines case folding rules for many scripts. Both Python and JavaScript support Unicode case-insensitive matching, but results differ because engines apply either simple (one-to-one) or full (one-to-many) case folding:

import re, regex

# re: simple one-to-one case matching
re.match(r'café', 'CAFÉ', re.IGNORECASE)    # match

# regex: full case folding needs FULLCASE (on by default only under VERSION1)
regex.match(r'ß', 'SS', regex.IGNORECASE | regex.FULLCASE)   # match (ß folds to ss)
regex.match(r'fi', 'ﬁ', regex.IGNORECASE | regex.FULLCASE)   # match (the U+FB01 ligature folds to fi)

// JavaScript /u with /i: simple Unicode case folding
/café/ui.test("CAFÉ")   // true
/ß/ui.test("SS")        // false (simple folding never maps one character to two)
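
Outside of regex, Python's built-in str.casefold() applies full case folding, which makes it a handy way to see what a full-folding engine compares. A minimal sketch follows; the helper name same_ignoring_case is hypothetical.

```python
def same_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings using full Unicode case folding."""
    return a.casefold() == b.casefold()

print(same_ignoring_case("straße", "STRASSE"))   # ß folds to ss
print(same_ignoring_case("\uFB01le", "FILE"))    # the fi ligature folds to fi
print("straße".lower() == "STRASSE".lower())     # lower() is not enough
```

Comparing with lower() keeps ß as a single character, so only casefold() treats the two spellings as equal.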

Practical Patterns

import unicodedata

import regex

# Validate a username that allows letters, digits, and underscore from any script
USERNAME = regex.compile(r'^[\p{L}\p{N}_]{3,30}$')
USERNAME.match('user_42')    # match
USERNAME.match('用户名')       # match (CJK username)

# Find all prices in any currency
PRICE = regex.compile(r'\p{Sc}\d[\d,]*(?:\.\d+)?')
PRICE.findall('Pay $42.00 or €39 or ¥4500')
# ['$42.00', '€39', '¥4500']

# Strip accents (NFD decompose, remove combining marks)
def strip_accents(text: str) -> str:
    nfd = unicodedata.normalize('NFD', text)
    return regex.sub(r'\p{M}', '', nfd)
strip_accents('café résumé')   # 'cafe resume'
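
Normalization also matters in the other direction: text that arrives in decomposed form can split words, because a combining mark is not a word character for \w. The sketch below shows one way to guard against that by normalizing to NFC before matching; the function name words is an assumption made for this example.

```python
import re
import unicodedata

def words(text: str) -> list:
    """Return runs of word characters after NFC normalization."""
    return re.findall(r"\w+", unicodedata.normalize("NFC", text))

decomposed = "cafe\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

print(re.findall(r"\w+", decomposed))  # the accent is left out of the match
print(words(decomposed))               # the composed é stays in the word
```

Normalizing input once at the boundary of a program tends to be simpler than making every pattern tolerant of combining marks.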

Quick Reference

| Pattern | Python (regex) | JavaScript (/u) | Matches |
|---|---|---|---|
| Any letter | \p{L} | \p{L} | A, é, 日, α |
| Any digit | \p{Nd} | \p{Nd} | 0–9, ٣, ३ |
| Any space | \p{Z} | \p{Z} | SPACE, NBSP, ideographic |
| Latin script | \p{sc=Latn} | \p{sc=Latin} | a–z, A–Z, é, ñ |
| Greek script | \p{sc=Grek} | \p{sc=Greek} | α, β, Ω |
| Emoji | \p{Emoji} | \p{Emoji} | 🐍, ©, ⭐ |
| Currency | \p{Sc} | \p{Sc} | $, €, ¥, £ |
| Not a letter | \P{L} | \P{L} | anything non-letter |

Unicode regex transforms complex text processing tasks — detecting language scripts, validating international usernames, parsing multilingual documents — from fragile ASCII hacks into robust, standards-based patterns.
