التعبير النمطي ليونيكود
أنماط Regex باستخدام خصائص Unicode: \p{L} (أي حرف)، \p{Script=Greek} (نص يوناني)، \p{Emoji}. الدعم يختلف حسب اللغة ومحرك Regex.
What Are Unicode Regular Expressions?
Unicode regular expressions extend the classical regular expression model to handle Unicode text correctly. The key feature is Unicode property escapes — assertions that match characters based on their Unicode properties rather than specific character ranges.
The most important syntax is \p{Property} (matches characters with the given property) and \P{Property} (matches characters without it). A complementary syntax \p{Property=Value} matches specific property values.
JavaScript Unicode Property Escapes (ES2018)
JavaScript's u flag enables Unicode mode; the v flag (ES2024) adds set notation:
// Script property
/\p{Script=Latin}/u.test("a") // true
/\p{Script=Han}/u.test("中") // true
/\p{Script=Hiragana}/u.test("あ") // true
// General Category
/\p{Letter}/u.test("a") // true
/\p{Decimal_Number}/u.test("5") // true
/\p{Emoji}/u.test("😀") // true
// Derived properties
/\p{Lowercase_Letter}/u.test("a") // true
/\p{Uppercase_Letter}/u.test("A") // true
// Matching all CJK characters
const cjkPattern = /\p{Script=Han}+/u;
cjkPattern.exec("Hello 世界!") // ["世界"]
// Negation
/\P{ASCII}/u.test("café") // true (é is non-ASCII)
// Named Unicode blocks (v flag)
/[\p{Script=Greek}&&\p{Lowercase_Letter}]/v.test("α") // true
Python Unicode Regex (regex module)
Python's built-in re module has limited Unicode support. The third-party regex module provides full Unicode property support:
import regex # pip install regex
# Match Unicode letters (any script)
regex.findall(r"\p{L}+", "Hello, 世界, مرحبا")
# ["Hello", "世界", "مرحبا"]
# Match Unicode digits (includes Arabic-Indic, etc.)
regex.findall(r"\p{Nd}+", "Price: ١٢٣ or 123")
# ["١٢٣", "123"]
# Script-specific
regex.findall(r"\p{Script=Arabic}", "مرحبا Hello")
# ["م", "ر", "ح", "ب", "ا"]
# Category: punctuation
regex.findall(r"\p{P}+", "Hello, world! How's it?")
# [",", "!", "'", "?"]
# Standard re: only basic Unicode support
import re
re.findall(r"\w+", "Hello, 世界", flags=re.UNICODE)
# ["Hello", "世界"] — \w matches Unicode letters with re.UNICODE
Key Unicode Properties for Regex
| Property | Example | Matches |
|---|---|---|
\p{L} |
Letters | Any letter in any script |
\p{Lu} |
Uppercase | A–Z, À, Ω, etc. |
\p{Ll} |
Lowercase | a–z, à, ω, etc. |
\p{N} |
Numbers | Digits, numerals |
\p{Nd} |
Decimal digit | 0–9 and script-native digits |
\p{P} |
Punctuation | .,!? and Unicode punct |
\p{S} |
Symbols | Currency, math, emoji |
\p{Z} |
Separators | Space, line sep, para sep |
\p{Script=Han} |
CJK characters | Chinese/Japanese/Korean ideographs |
\p{Emoji} |
Emoji | All emoji characters |
Practical Validation Examples
// Validate that a username contains only letters, digits, underscores
// Works for all scripts, not just ASCII
function isValidUsername(name) {
return /^[\p{L}\p{N}_]{3,30}$/u.test(name);
}
isValidUsername("user_123") // true
isValidUsername("用户名") // true (Chinese letters)
isValidUsername("<script>") // false
// Extract hashtags including non-Latin
function extractHashtags(text) {
return [...text.matchAll(/#\p{L}[\p{L}\p{N}]*/gu)].map(m => m[0]);
}
extractHashtags("Check #Unicode and #유니코드!")
// ["#Unicode", "#유니코드"]
Grapheme Cluster Matching
The \X pattern in the regex module matches a full grapheme cluster:
import regex
# \X matches one grapheme (not one code point)
regex.findall(r"\X", "café") # ["c", "a", "f", "é"] (4 graphemes)
regex.findall(r"\X", "👨👩👧") # ["👨👩👧"] (1 grapheme)
regex.findall(r"\X", "🇺🇸 flag") # ["🇺🇸", " ", "f", "l", "a", "g"]
Quick Facts
| Property | Value |
|---|---|
| JS syntax | \p{Property} with /u or /v flag |
| JS introduced | ES2018 for \p{}, ES2024 for set notation (/v) |
Python built-in re |
Limited: \w, \d respect Unicode with re.UNICODE |
| Python full support | regex module (PyPI): \p{L}, \X, script properties |
| Grapheme matching | \X in regex module |
| Most useful properties | L (letter), N (number), Script=X, Emoji |
المزيد في البرمجة والتطوير
Java strings use UTF-16 internally. char is 16-bit (only BMP). For supplementary …
نص مشوّه ناتج عن فك تشفير البايتات بترميز خاطئ. مصطلح ياباني (文字化け). …
Python 3 uses Unicode strings by default (str = UTF-8 internally via …
Rust strings (str/String) are guaranteed valid UTF-8. char type represents a Unicode …
الترميز يحوّل الأحرف إلى بايتات (str.encode('utf-8'))؛ فك الترميز يحوّل البايتات إلى أحرف …
صيغة لتمثيل أحرف Unicode في الكود المصدري. تختلف حسب اللغة: \u2713 (Python/Java/JS)، …
U+FFFD (�). يُعرض عندما يواجه فاك التشفير تسلسلات بايتات غير صالحة — …
U+0000 (NUL). أول حرف في Unicode/ASCII، يُستخدم كمُنهٍ للنصوص في C/C++. خطر …
أي حرف بدون شكل رسومي مرئي: مسافات بيضاء، أحرف بعرض صفري، أحرف …
وحدتا ترميز 16-بت (بديل علوي U+D800–U+DBFF + بديل سفلي U+DC00–U+DFFF) يُمثلان معاً …