
Unicode Regular Expressions

Regex patterns that use Unicode properties: \p{L} (any letter), \p{Script=Greek} (Greek script), \p{Emoji}. Support varies by language and regex engine.


What Are Unicode Regular Expressions?

Unicode regular expressions extend the classical regular expression model to handle Unicode text correctly. The key feature is Unicode property escapes: character class escapes that match a character based on its Unicode properties rather than on hand-written character ranges.

The most important syntax is \p{Property} (matches characters with the given property) and \P{Property} (matches characters without it). A complementary syntax \p{Property=Value} matches specific property values.
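As a rough illustration of the syntax described above, the JavaScript sketch below shows that \P{...} is the complement of \p{...}, and that one General Category value can be written in short, long, or explicit Property=Value form. The helper name matchesAll is made up for this example, and the code assumes an engine with ES2018 property escapes.

```javascript
// Hypothetical helper: true if every pattern matches the single character ch.
function matchesAll(ch, patterns) {
  return patterns.every((re) => re.test(ch));
}

// Four spellings of the same General Category value (uppercase letter).
const uppercaseForms = [
  /^\p{Lu}$/u,
  /^\p{Uppercase_Letter}$/u,
  /^\p{gc=Lu}$/u,
  /^\p{General_Category=Uppercase_Letter}$/u,
];

console.log(matchesAll("\u03A9", uppercaseForms)); // true  (Ω, Greek capital omega)
console.log(matchesAll("\u03C9", uppercaseForms)); // false (ω, Greek small omega)

// \P{...} matches exactly the characters that \p{...} rejects.
console.log(/^\P{Lu}$/u.test("\u03C9")); // true
console.log(/^\P{Lu}$/u.test("\u03A9")); // false

// Script values need the Property=Value form in JavaScript.
console.log(/^\p{Script=Greek}$/u.test("\u03A9")); // true
console.log(/^\p{Script=Greek}$/u.test("A"));      // false
```

The anchors (^ and $) keep each test to a single code point, which makes the comparison between forms easier to read.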

JavaScript Unicode Property Escapes (ES2018)

Property escapes work only when the regex has the u flag (Unicode mode) or the v flag (ES2024), which also adds set notation:

// Script property
/\p{Script=Latin}/u.test("a")       // true
/\p{Script=Han}/u.test("中")         // true
/\p{Script=Hiragana}/u.test("あ")   // true

// General Category
/\p{Letter}/u.test("a")             // true
/\p{Decimal_Number}/u.test("5")     // true

// Binary property
/\p{Emoji}/u.test("😀")             // true

// General Category subcategories
/\p{Lowercase_Letter}/u.test("a")   // true
/\p{Uppercase_Letter}/u.test("A")   // true

// Matching runs of Han characters (CJK ideographs; kana and hangul are separate scripts)
const cjkPattern = /\p{Script=Han}+/u;
cjkPattern.exec("Hello 世界!")       // ["世界"]

// Negation
/\P{ASCII}/u.test("café")           // true (é is non-ASCII)

// Set intersection (v flag): characters that are both Greek and lowercase
/[\p{Script=Greek}&&\p{Lowercase_Letter}]/v.test("α")  // true
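Because the v flag is newer than the u flag, code that must run on older engines may need a fallback. The sketch below is one possible approach, not a standard recipe: it tries the v-flag intersection first and, if the engine rejects the flag, emulates the intersection with a lookahead under the u flag. The function name greekLowercasePattern is hypothetical.

```javascript
// Build a global pattern for lowercase Greek letters.
// Prefers v-flag set intersection; falls back to a lookahead under the u flag.
function greekLowercasePattern() {
  try {
    return new RegExp("[\\p{Script=Greek}&&\\p{Lowercase_Letter}]", "gv");
  } catch {
    // Engines without v-flag support throw a SyntaxError for the unknown flag.
    return new RegExp("(?=\\p{Script=Greek})\\p{Lowercase_Letter}", "gu");
  }
}

const greekLower = "αβΓ abc".match(greekLowercasePattern());
console.log(greekLower); // ["α", "β"]
```

Using the RegExp constructor rather than a literal matters here: an unsupported flag on a literal is a parse-time error for the whole script, while the constructor throws at run time where it can be caught.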

Python Unicode Regex (regex module)

Python's built-in re module has limited Unicode support: \w, \d, and \s are Unicode-aware, but \p{...} property escapes are not available. The third-party regex module provides Unicode property support:

import regex  # pip install regex

# Match Unicode letters (any script)
regex.findall(r"\p{L}+", "Hello, 世界, مرحبا")
# ["Hello", "世界", "مرحبا"]

# Match Unicode digits (includes Arabic-Indic, etc.)
regex.findall(r"\p{Nd}+", "Price: ١٢٣ or 123")
# ["١٢٣", "123"]

# Script-specific
regex.findall(r"\p{Script=Arabic}", "مرحبا Hello")
# ["م", "ر", "ح", "ب", "ا"]

# Category: punctuation
regex.findall(r"\p{P}+", "Hello, world! How's it?")
# [",", "!", "'", "?"]

# Standard re: only basic Unicode support (no \p{...})
import re
re.findall(r"\w+", "Hello, 世界", flags=re.UNICODE)
# ["Hello", "世界"]  (re.UNICODE is already the default for str patterns in Python 3)

Key Unicode Properties for Regex

Property | Meaning | Matches
\p{L} | Letter | Any letter in any script
\p{Lu} | Uppercase letter | A–Z, À, Ω, etc.
\p{Ll} | Lowercase letter | a–z, à, ω, etc.
\p{N} | Number | Digits and other numerals
\p{Nd} | Decimal digit | 0–9 and script-native digits
\p{P} | Punctuation | .,!? and other Unicode punctuation
\p{S} | Symbol | Currency, math, and many emoji
\p{Z} | Separator | Space, line separator, paragraph separator
\p{Script=Han} | Han script | Ideographs used in Chinese, Japanese, and Korean
\p{Emoji} | Emoji | Emoji characters, including the ASCII digits, # and *
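One caveat with \p{Emoji} is that the property also holds for the ASCII digits, # and *, because they can start keycap emoji sequences. The JavaScript sketch below compares it with two narrower properties that are available in the same engines; the helper name hasPictographicEmoji is an assumption made for this example.

```javascript
const samples = ["5", "#", "😀"];

for (const ch of samples) {
  console.log(ch, {
    emoji: /^\p{Emoji}$/u.test(ch),
    emojiPresentation: /^\p{Emoji_Presentation}$/u.test(ch),
    extendedPictographic: /^\p{Extended_Pictographic}$/u.test(ch),
  });
}
// "5" and "#" report emoji: true but false for the other two properties.
// "😀" reports true for all three.

// Hypothetical helper: true if text contains at least one pictographic character.
function hasPictographicEmoji(text) {
  return /\p{Extended_Pictographic}/u.test(text);
}

console.log(/\p{Emoji}/u.test("Room 5"));    // true, because of the digit
console.log(hasPictographicEmoji("Room 5")); // false
console.log(hasPictographicEmoji("Hi 😀"));  // true
```

Extended_Pictographic is not a perfect filter either, since it still covers a few text-style symbols such as ©, so the right property depends on what the application counts as an emoji.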

Practical Validation Examples

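// Validate that a username contains only letters, digits, and underscores.
// Accepts letters and digits from any script, not just ASCII. Combining marks
// (\p{M}) are not in the class, so names that need them are rejected.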
function isValidUsername(name) {
  return /^[\p{L}\p{N}_]{3,30}$/u.test(name);
}

isValidUsername("user_123")    // true
isValidUsername("用户名")       // true (Han characters)
isValidUsername("<script>")    // false

// Extract hashtags including non-Latin
function extractHashtags(text) {
  return [...text.matchAll(/#\p{L}[\p{L}\p{N}]*/gu)].map(m => m[0]);
}

extractHashtags("Check #Unicode and #유니코드!")
// ["#Unicode", "#유니코드"]
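A class built only from \p{L} and \p{N} rejects text in scripts that rely on combining marks, such as Devanagari vowel signs. The sketch below shows one possible adjustment, adding \p{M} to the class; the function name isValidUsernameWithMarks is hypothetical and the 3 to 30 length limit is carried over from the example above.

```javascript
// Hypothetical variant of the username check that also accepts combining marks (\p{M}).
function isValidUsernameWithMarks(name) {
  return /^[\p{L}\p{M}\p{N}_]{3,30}$/u.test(name);
}

const hindi = "हिन्दी"; // letters plus vowel signs and a virama, which are marks

console.log(/^[\p{L}\p{N}_]{3,30}$/u.test(hindi)); // false: the marks are not letters
console.log(isValidUsernameWithMarks(hindi));       // true
console.log(isValidUsernameWithMarks("user_123"));  // true
console.log(isValidUsernameWithMarks("<script>"));  // false
```

Note that the {3,30} quantifier counts code points, not user-perceived characters, and that this class would also accept a name that starts with a mark; a stricter check could require \p{L} in the first position.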

Grapheme Cluster Matching

The \X pattern in Python's third-party regex module matches a full grapheme cluster (one user-perceived character), which may span several code points:

import regex

# \X matches one grapheme (not one code point)
regex.findall(r"\X", "café")          # ["c", "a", "f", "é"]  (4 graphemes)
regex.findall(r"\X", "👨‍👩‍👧")      # ["👨‍👩‍👧"]  (1 grapheme)
regex.findall(r"\X", "🇺🇸 flag")    # ["🇺🇸", " ", "f", "l", "a", "g"]
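JavaScript regular expressions have no \X escape. A comparable result can be obtained outside the regex engine with the built-in Intl.Segmenter, assuming a runtime that ships it (recent browsers and Node.js releases do). The sketch below uses a made-up helper name, graphemes, and writes the emoji as escapes so the code points are explicit.

```javascript
// Split text into grapheme clusters with the built-in Intl.Segmenter.
function graphemes(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"; // man + ZWJ + woman + ZWJ + girl
const flag = "\u{1F1FA}\u{1F1F8}";                        // regional indicators U + S

console.log(graphemes("cafe\u0301").length); // 4 (e + combining acute is one grapheme)
console.log(graphemes(family).length);       // 1
console.log(graphemes(flag).length);         // 1
console.log([...family].length);             // 5 code points
```

The last line shows why this matters: spreading a string iterates code points, so the family emoji counts as five items there but as one grapheme.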

Quick Facts

Property | Value
JS syntax | \p{Property} with the u or v flag
JS introduced | ES2018 for \p{}, ES2024 for set notation (v flag)
Python built-in re | Limited: \w and \d are Unicode-aware by default for str patterns; no \p{}
Python full support | regex module (PyPI): \p{L}, \X, script properties
Grapheme matching | \X in the regex module
Most useful properties | L (letter), N (number), Script=X, Emoji
