Unicode düzenli ifade
Unicode özelliklerini kullanan regex desenleri: \p{L} (herhangi bir harf), \p{Script=Greek} (Yunan alfabesi), \p{Emoji}. Destek dile ve regex motoruna göre değişir.
What Are Unicode Regular Expressions?
Unicode regular expressions extend the classical regular expression model to handle Unicode text correctly. The key feature is Unicode property escapes — assertions that match characters based on their Unicode properties rather than specific character ranges.
The most important syntax is \p{Property} (matches characters with the given property) and \P{Property} (matches characters without it). A complementary syntax \p{Property=Value} matches specific property values.
JavaScript Unicode Property Escapes (ES2018)
JavaScript's u flag enables Unicode mode; the v flag (ES2024) adds set notation:
// Script property
/\p{Script=Latin}/u.test("a") // true
/\p{Script=Han}/u.test("中") // true
/\p{Script=Hiragana}/u.test("あ") // true
// General Category
/\p{Letter}/u.test("a") // true
/\p{Decimal_Number}/u.test("5") // true
/\p{Emoji}/u.test("😀") // true
// Derived properties
/\p{Lowercase_Letter}/u.test("a") // true
/\p{Uppercase_Letter}/u.test("A") // true
// Matching all CJK characters
const cjkPattern = /\p{Script=Han}+/u;
cjkPattern.exec("Hello 世界!") // ["世界"]
// Negation
/\P{ASCII}/u.test("café") // true (é is non-ASCII)
// Named Unicode blocks (v flag)
/[\p{Script=Greek}&&\p{Lowercase_Letter}]/v.test("α") // true
Python Unicode Regex (regex module)
Python's built-in re module has limited Unicode support. The third-party regex module provides full Unicode property support:
import regex # pip install regex
# Match Unicode letters (any script)
regex.findall(r"\p{L}+", "Hello, 世界, مرحبا")
# ["Hello", "世界", "مرحبا"]
# Match Unicode digits (includes Arabic-Indic, etc.)
regex.findall(r"\p{Nd}+", "Price: ١٢٣ or 123")
# ["١٢٣", "123"]
# Script-specific
regex.findall(r"\p{Script=Arabic}", "مرحبا Hello")
# ["م", "ر", "ح", "ب", "ا"]
# Category: punctuation
regex.findall(r"\p{P}+", "Hello, world! How's it?")
# [",", "!", "'", "?"]
# Standard re: only basic Unicode support
import re
re.findall(r"\w+", "Hello, 世界", flags=re.UNICODE)
# ["Hello", "世界"] — \w matches Unicode letters with re.UNICODE
Key Unicode Properties for Regex
| Property | Example | Matches |
|---|---|---|
\p{L} |
Letters | Any letter in any script |
\p{Lu} |
Uppercase | A–Z, À, Ω, etc. |
\p{Ll} |
Lowercase | a–z, à, ω, etc. |
\p{N} |
Numbers | Digits, numerals |
\p{Nd} |
Decimal digit | 0–9 and script-native digits |
\p{P} |
Punctuation | .,!? and Unicode punct |
\p{S} |
Symbols | Currency, math, emoji |
\p{Z} |
Separators | Space, line sep, para sep |
\p{Script=Han} |
CJK characters | Chinese/Japanese/Korean ideographs |
\p{Emoji} |
Emoji | All emoji characters |
Practical Validation Examples
// Validate that a username contains only letters, digits, underscores
// Works for all scripts, not just ASCII
function isValidUsername(name) {
return /^[\p{L}\p{N}_]{3,30}$/u.test(name);
}
isValidUsername("user_123") // true
isValidUsername("用户名") // true (Chinese letters)
isValidUsername("<script>") // false
// Extract hashtags including non-Latin
function extractHashtags(text) {
return [...text.matchAll(/#\p{L}[\p{L}\p{N}]*/gu)].map(m => m[0]);
}
extractHashtags("Check #Unicode and #유니코드!")
// ["#Unicode", "#유니코드"]
Grapheme Cluster Matching
The \X pattern in the regex module matches a full grapheme cluster:
import regex
# \X matches one grapheme (not one code point)
regex.findall(r"\X", "café") # ["c", "a", "f", "é"] (4 graphemes)
regex.findall(r"\X", "👨👩👧") # ["👨👩👧"] (1 grapheme)
regex.findall(r"\X", "🇺🇸 flag") # ["🇺🇸", " ", "f", "l", "a", "g"]
Quick Facts
| Property | Value |
|---|---|
| JS syntax | \p{Property} with /u or /v flag |
| JS introduced | ES2018 for \p{}, ES2024 for set notation (/v) |
Python built-in re |
Limited: \w, \d respect Unicode with re.UNICODE |
| Python full support | regex module (PyPI): \p{L}, \X, script properties |
| Grapheme matching | \X in regex module |
| Most useful properties | L (letter), N (number), Script=X, Emoji |
Programlama ve Geliştirme içinde daha fazlası
U+0000 (NUL). İlk Unicode/ASCII karakteri, C/C++'da dizi sonlandırıcı olarak kullanılır. Güvenlik riski: …
Bir Unicode dizisinin 'uzunluğu' birime bağlıdır: kod birimleri (JavaScript .length), kod noktaları …
Görünür glifi olmayan herhangi bir karakter: boşluk, sıfır genişlikli karakterler, kontrol karakterleri …
Java strings use UTF-16 internally. char is 16-bit (only BMP). For supplementary …
Bir programlama dilinde karakter dizisi. Dahili gösterim değişir: UTF-8 (Go, Rust, yeni …
Kodlama karakterleri baytlara dönüştürür (str.encode('utf-8')); kod çözme baytları karakterlere dönüştürür (bytes.decode('utf-8')). Bunu …
Baytların yanlış kodlama ile çözülmesinden kaynaklanan bozuk metin. Japonca terim (文字化け). Örnek: …
Python 3 uses Unicode strings by default (str = UTF-8 internally via …
Rust strings (str/String) are guaranteed valid UTF-8. char type represents a Unicode …
Kaynak kodda Unicode karakterlerini temsil etme sözdizimi. Dile göre değişir: \u2713 (Python/Java/JS), …