Programmation et développement

Expression régulière Unicode

Modèles de regex utilisant les propriétés Unicode : \p{L} (toute lettre), \p{Script=Greek} (écriture grecque), \p{Emoji}. Le support varie selon le langage et le moteur de regex.

2024-04-15 · Updated 2025-06-09

What Are Unicode Regular Expressions?

Unicode regular expressions extend the classical regular expression model to handle Unicode text correctly. The key feature is Unicode property escapes — assertions that match characters based on their Unicode properties rather than specific character ranges.

The most important syntax is \p{Property} (matches characters with the given property) and \P{Property} (matches characters without it). A complementary syntax \p{Property=Value} matches specific property values.

JavaScript Unicode Property Escapes (ES2018)

JavaScript's u flag enables Unicode mode; the v flag (ES2024) adds set notation:

// Script property
/\p{Script=Latin}/u.test("a")       // true
/\p{Script=Han}/u.test("中")         // true
/\p{Script=Hiragana}/u.test("あ")   // true

// General Category
/\p{Letter}/u.test("a")             // true
/\p{Decimal_Number}/u.test("5")     // true
/\p{Emoji}/u.test("😀")             // true

// Derived properties
/\p{Lowercase_Letter}/u.test("a")   // true
/\p{Uppercase_Letter}/u.test("A")   // true

// Matching all CJK characters
const cjkPattern = /\p{Script=Han}+/u;
cjkPattern.exec("Hello 世界!")       // ["世界"]

// Negation
/\P{ASCII}/u.test("café")           // true (é is non-ASCII)

// Named Unicode blocks (v flag)
/[\p{Script=Greek}&&\p{Lowercase_Letter}]/v.test("α")  // true

Python Unicode Regex (`regex` module)

Python's built-in re module has limited Unicode support. The third-party regex module provides full Unicode property support:

import regex  # pip install regex

# Match Unicode letters (any script)
regex.findall(r"\p{L}+", "Hello, 世界, مرحبا")
# ["Hello", "世界", "مرحبا"]

# Match Unicode digits (includes Arabic-Indic, etc.)
regex.findall(r"\p{Nd}+", "Price: ١٢٣ or 123")
# ["١٢٣", "123"]

# Script-specific
regex.findall(r"\p{Script=Arabic}", "مرحبا Hello")
# ["م", "ر", "ح", "ب", "ا"]

# Category: punctuation
regex.findall(r"\p{P}+", "Hello, world! How's it?")
# [",", "!", "'", "?"]

# Standard re: only basic Unicode support
import re
re.findall(r"\w+", "Hello, 世界", flags=re.UNICODE)
# ["Hello", "世界"]  — \w matches Unicode letters with re.UNICODE

Key Unicode Properties for Regex

Property	Example	Matches
`\p{L}`	Letters	Any letter in any script
`\p{Lu}`	Uppercase	A–Z, À, Ω, etc.
`\p{Ll}`	Lowercase	a–z, à, ω, etc.
`\p{N}`	Numbers	Digits, numerals
`\p{Nd}`	Decimal digit	0–9 and script-native digits
`\p{P}`	Punctuation	.,!? and Unicode punct
`\p{S}`	Symbols	Currency, math, emoji
`\p{Z}`	Separators	Space, line sep, para sep
`\p{Script=Han}`	CJK characters	Chinese/Japanese/Korean ideographs
`\p{Emoji}`	Emoji	All emoji characters

Practical Validation Examples

// Validate that a username contains only letters, digits, underscores
// Works for all scripts, not just ASCII
function isValidUsername(name) {
  return /^[\p{L}\p{N}_]{3,30}$/u.test(name);
}

isValidUsername("user_123")    // true
isValidUsername("用户名")       // true (Chinese letters)
isValidUsername("<script>")    // false

// Extract hashtags including non-Latin
function extractHashtags(text) {
  return [...text.matchAll(/#\p{L}[\p{L}\p{N}]*/gu)].map(m => m[0]);
}

extractHashtags("Check #Unicode and #유니코드!")
// ["#Unicode", "#유니코드"]

Grapheme Cluster Matching

The \X pattern in the regex module matches a full grapheme cluster:

import regex

# \X matches one grapheme (not one code point)
regex.findall(r"\X", "café")          # ["c", "a", "f", "é"]  (4 graphemes)
regex.findall(r"\X", "👨‍👩‍👧")      # ["👨‍👩‍👧"]  (1 grapheme)
regex.findall(r"\X", "🇺🇸 flag")    # ["🇺🇸", " ", "f", "l", "a", "g"]

Quick Facts

Property	Value
JS syntax	`\p{Property}` with `/u` or `/v` flag
JS introduced	ES2018 for `\p{}`, ES2024 for set notation (`/v`)
Python built-in `re`	Limited: `\w`, `\d` respect Unicode with `re.UNICODE`
Python full support	`regex` module (PyPI): `\p{L}`, `\X`, script properties
Grapheme matching	`\X` in `regex` module
Most useful properties	`L` (letter), `N` (number), `Script=X`, `Emoji`

Plus dans Programmation et développement

Ambiguïté de longueur de chaîne

La « longueur » d'une chaîne Unicode dépend de l'unité : unités …

Caractère de remplacement

U+FFFD (�). Affiché lorsqu'un décodeur rencontre des séquences d'octets invalides — le …

Caractère invisible

Tout caractère sans glyphe visible : espaces blancs, caractères de largeur nulle, …

Caractère nul

U+0000 (NUL). Le premier caractère Unicode/ASCII, utilisé comme terminateur de chaîne en …

Chaîne de caractères

Une séquence de caractères dans un langage de programmation. La représentation interne …

Encodage / Décodage

L'encodage convertit les caractères en octets (str.encode('utf-8')) ; le décodage convertit les …

Java Unicode

Java strings use UTF-16 internally. char is 16-bit (only BMP). For supplementary …

Mojibake

Texte illisible résultant du décodage d'octets avec le mauvais encodage. Terme japonais …

Paire de substitution

Deux unités de code de 16 bits (un substitut haut U+D800–U+DBFF + …

Python Unicode

Python 3 uses Unicode strings by default (str = UTF-8 internally via …

← Retour au glossaire