Expression régulière Unicode
Modèles de regex utilisant les propriétés Unicode : \p{L} (toute lettre), \p{Script=Greek} (écriture grecque), \p{Emoji}. Le support varie selon le langage et le moteur de regex.
What Are Unicode Regular Expressions?
Unicode regular expressions extend the classical regular expression model to handle Unicode text correctly. The key feature is Unicode property escapes — assertions that match characters based on their Unicode properties rather than specific character ranges.
The most important syntax is \p{Property} (matches characters with the given property) and \P{Property} (matches characters without it). A complementary syntax \p{Property=Value} matches specific property values.
JavaScript Unicode Property Escapes (ES2018)
JavaScript's u flag enables Unicode mode; the v flag (ES2024) adds set notation:
// Script property
/\p{Script=Latin}/u.test("a") // true
/\p{Script=Han}/u.test("中") // true
/\p{Script=Hiragana}/u.test("あ") // true
// General Category
/\p{Letter}/u.test("a") // true
/\p{Decimal_Number}/u.test("5") // true
/\p{Emoji}/u.test("😀") // true
// Derived properties
/\p{Lowercase_Letter}/u.test("a") // true
/\p{Uppercase_Letter}/u.test("A") // true
// Matching all CJK characters
const cjkPattern = /\p{Script=Han}+/u;
cjkPattern.exec("Hello 世界!") // ["世界"]
// Negation
/\P{ASCII}/u.test("café") // true (é is non-ASCII)
// Named Unicode blocks (v flag)
/[\p{Script=Greek}&&\p{Lowercase_Letter}]/v.test("α") // true
Python Unicode Regex (regex module)
Python's built-in re module has limited Unicode support. The third-party regex module provides full Unicode property support:
import regex # pip install regex
# Match Unicode letters (any script)
regex.findall(r"\p{L}+", "Hello, 世界, مرحبا")
# ["Hello", "世界", "مرحبا"]
# Match Unicode digits (includes Arabic-Indic, etc.)
regex.findall(r"\p{Nd}+", "Price: ١٢٣ or 123")
# ["١٢٣", "123"]
# Script-specific
regex.findall(r"\p{Script=Arabic}", "مرحبا Hello")
# ["م", "ر", "ح", "ب", "ا"]
# Category: punctuation
regex.findall(r"\p{P}+", "Hello, world! How's it?")
# [",", "!", "'", "?"]
# Standard re: only basic Unicode support
import re
re.findall(r"\w+", "Hello, 世界", flags=re.UNICODE)
# ["Hello", "世界"] — \w matches Unicode letters with re.UNICODE
Key Unicode Properties for Regex
| Property | Example | Matches |
|---|---|---|
\p{L} |
Letters | Any letter in any script |
\p{Lu} |
Uppercase | A–Z, À, Ω, etc. |
\p{Ll} |
Lowercase | a–z, à, ω, etc. |
\p{N} |
Numbers | Digits, numerals |
\p{Nd} |
Decimal digit | 0–9 and script-native digits |
\p{P} |
Punctuation | .,!? and Unicode punct |
\p{S} |
Symbols | Currency, math, emoji |
\p{Z} |
Separators | Space, line sep, para sep |
\p{Script=Han} |
CJK characters | Chinese/Japanese/Korean ideographs |
\p{Emoji} |
Emoji | All emoji characters |
Practical Validation Examples
// Validate that a username contains only letters, digits, underscores
// Works for all scripts, not just ASCII
function isValidUsername(name) {
return /^[\p{L}\p{N}_]{3,30}$/u.test(name);
}
isValidUsername("user_123") // true
isValidUsername("用户名") // true (Chinese letters)
isValidUsername("<script>") // false
// Extract hashtags including non-Latin
function extractHashtags(text) {
return [...text.matchAll(/#\p{L}[\p{L}\p{N}]*/gu)].map(m => m[0]);
}
extractHashtags("Check #Unicode and #유니코드!")
// ["#Unicode", "#유니코드"]
Grapheme Cluster Matching
The \X pattern in the regex module matches a full grapheme cluster:
import regex
# \X matches one grapheme (not one code point)
regex.findall(r"\X", "café") # ["c", "a", "f", "é"] (4 graphemes)
regex.findall(r"\X", "👨👩👧") # ["👨👩👧"] (1 grapheme)
regex.findall(r"\X", "🇺🇸 flag") # ["🇺🇸", " ", "f", "l", "a", "g"]
Quick Facts
| Property | Value |
|---|---|
| JS syntax | \p{Property} with /u or /v flag |
| JS introduced | ES2018 for \p{}, ES2024 for set notation (/v) |
Python built-in re |
Limited: \w, \d respect Unicode with re.UNICODE |
| Python full support | regex module (PyPI): \p{L}, \X, script properties |
| Grapheme matching | \X in regex module |
| Most useful properties | L (letter), N (number), Script=X, Emoji |
Plus dans Programmation et développement
La « longueur » d'une chaîne Unicode dépend de l'unité : unités …
U+FFFD (�). Affiché lorsqu'un décodeur rencontre des séquences d'octets invalides — le …
Tout caractère sans glyphe visible : espaces blancs, caractères de largeur nulle, …
U+0000 (NUL). Le premier caractère Unicode/ASCII, utilisé comme terminateur de chaîne en …
Une séquence de caractères dans un langage de programmation. La représentation interne …
L'encodage convertit les caractères en octets (str.encode('utf-8')) ; le décodage convertit les …
Java strings use UTF-16 internally. char is 16-bit (only BMP). For supplementary …
Texte illisible résultant du décodage d'octets avec le mauvais encodage. Terme japonais …
Deux unités de code de 16 bits (un substitut haut U+D800–U+DBFF + …
Python 3 uses Unicode strings by default (str = UTF-8 internally via …