Unicode Regular Expression

Regex patterns that use Unicode properties: \p{L} (any letter), \p{Script=Greek} (Greek script), \p{Emoji}. Support varies by language and regex engine.

2024-04-15 · Updated 2025-06-09

What Are Unicode Regular Expressions?

Unicode regular expressions extend the classical regular expression model to handle Unicode text correctly. The key feature is Unicode property escapes: character class escapes that match a character based on its Unicode properties (such as general category or script) rather than on hard-coded character ranges.

The most important syntax is \p{Property} (matches characters with the given property) and \P{Property} (matches characters without it). A complementary syntax \p{Property=Value} matches specific property values.
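To make the difference from classical ranges concrete, the following sketch contrasts an ASCII range with the three forms described above. It is written in JavaScript and assumes an engine with ES2018 support; the sample string and variable names are illustrative only.

```javascript
// Illustrative input: "naïve Ωmega 世界" (Latin with a diacritic, Greek, Han).
const sample = "na\u00EFve \u03A9mega \u4E16\u754C";

// A classical ASCII range breaks words at every non-ASCII letter.
const asciiWords = sample.match(/[A-Za-z]+/g);

// \p{L} matches any letter, so each word stays whole.
const unicodeWords = sample.match(/\p{L}+/gu);

// \P{L} is the complement: runs of non-letters (here, the two spaces).
const nonLetters = sample.match(/\P{L}+/gu);

// \p{Property=Value} narrows the match to a single script.
const greekOnly = sample.match(/\p{Script=Greek}+/gu);

console.log(asciiWords);   // [ 'na', 've', 'mega' ]
console.log(unicodeWords); // [ 'naïve', 'Ωmega', '世界' ]
console.log(nonLetters);   // [ ' ', ' ' ]
console.log(greekOnly);    // [ 'Ω' ]
```

The ASCII range silently drops the ï and the Ω, which is the kind of bug property escapes are meant to prevent.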

JavaScript Unicode Property Escapes (ES2018)

Property escapes require the u flag (Unicode mode, ES2018); the v flag (ES2024) additionally enables set notation such as intersection and subtraction:

// Script property
/\p{Script=Latin}/u.test("a")       // true
/\p{Script=Han}/u.test("中")         // true
/\p{Script=Hiragana}/u.test("あ")   // true

// General Category
/\p{Letter}/u.test("a")             // true
/\p{Decimal_Number}/u.test("5")     // true
/\p{Lowercase_Letter}/u.test("a")   // true
/\p{Uppercase_Letter}/u.test("A")   // true

// Binary properties
/\p{Emoji}/u.test("😀")             // true

// Matching runs of Han ideographs (kana and Hangul belong to other scripts)
const cjkPattern = /\p{Script=Han}+/u;
cjkPattern.exec("Hello 世界!")       // ["世界"]

// Negation
/\P{ASCII}/u.test("café")           // true (é is non-ASCII)

// Set intersection (v flag): Greek AND lowercase
/[\p{Script=Greek}&&\p{Lowercase_Letter}]/v.test("α")  // true
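One pitfall worth checking is what happens when the flag is forgotten. The sketch below (variable names are illustrative) shows that a pattern without u or v silently matches literal text instead of a property, and that long and short property names are interchangeable.

```javascript
// Without the u (or v) flag, \p is read as a plain "p", so the pattern
// matches the literal text "p{L}" rather than any letter.
const withoutFlag = /\p{L}/;
const withFlag = /\p{L}/u;

console.log(withoutFlag.test("α"));    // false
console.log(withoutFlag.test("p{L}")); // true
console.log(withFlag.test("α"));       // true

// Long names, short aliases, and the explicit General_Category form
// all describe the same set of characters.
const letterForms = [
  /\p{L}/u,
  /\p{Letter}/u,
  /\p{General_Category=Letter}/u,
  /\p{gc=L}/u,
];
const greekForms = [/\p{Script=Greek}/u, /\p{sc=Grek}/u];

const allLetterFormsMatch = letterForms.every((re) => re.test("α"));
const allGreekFormsMatch = greekForms.every((re) => re.test("α"));

console.log(allLetterFormsMatch); // true
console.log(allGreekFormsMatch);  // true
```

Because the flagless pattern raises no error, a lint rule or a unit test is usually the only thing that catches this mistake.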

Python Unicode Regex (`regex` module)

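Python's built-in re module does not support \p{...} property escapes; its Unicode awareness is limited to shorthand classes such as \w and \d. The third-party regex module adds property escapes, script matching, and grapheme matching: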

import regex  # pip install regex

# Match Unicode letters (any script)
regex.findall(r"\p{L}+", "Hello, 世界, مرحبا")
# ["Hello", "世界", "مرحبا"]

# Match Unicode digits (includes Arabic-Indic, etc.)
regex.findall(r"\p{Nd}+", "Price: ١٢٣ or 123")
# ["١٢٣", "123"]

# Script-specific
regex.findall(r"\p{Script=Arabic}", "مرحبا Hello")
# ["م", "ر", "ح", "ب", "ا"]

# Category: punctuation
regex.findall(r"\p{P}+", "Hello, world! How's it?")
# [",", "!", "'", "?"]

# Standard re: no \p{...}, but \w, \d, and \s are Unicode-aware for str patterns
import re
re.findall(r"\w+", "Hello, 世界", flags=re.UNICODE)
# ["Hello", "世界"]  (re.UNICODE is already the default in Python 3, so the flag is optional)
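By way of contrast, JavaScript's shorthand classes do not widen in Unicode mode: \w and \d stay ASCII-only even with the u flag, so property escapes are the usual route to the behaviour shown above. A minimal JavaScript sketch, with illustrative sample strings:

```javascript
const text = "Hello, \u4E16\u754C"; // "Hello, 世界"

// \w is [A-Za-z0-9_] in JavaScript, with or without the u flag.
const asciiWordRuns = text.match(/\w+/gu);

// A Unicode-aware approximation of a "word character" class.
// It is a sketch, not an exact equivalent of Python's \w.
const unicodeWordRuns = text.match(/[\p{L}\p{N}_]+/gu);

// \d is [0-9] only; \p{Nd} covers decimal digits from every script.
const arabicIndic = "\u0661\u0662\u0663"; // ١٢٣
const matchesBackslashD = /\d/u.test(arabicIndic);
const matchesNd = /^\p{Nd}+$/u.test(arabicIndic);

console.log(asciiWordRuns);     // [ 'Hello' ]
console.log(unicodeWordRuns);   // [ 'Hello', '世界' ]
console.log(matchesBackslashD); // false
console.log(matchesNd);         // true
```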

Key Unicode Properties for Regex

Property	Name	Matches
`\p{L}`	Letter	Any letter in any script
`\p{Lu}`	Uppercase letter	A–Z, À, Ω, etc.
`\p{Ll}`	Lowercase letter	a–z, à, ω, etc.
`\p{N}`	Number	Decimal digits, letter numerals (Ⅷ), other numerics (½)
`\p{Nd}`	Decimal digit	0–9 and script-native digits
`\p{P}`	Punctuation	.,!? and other Unicode punctuation
`\p{S}`	Symbol	Currency, math, and many emoji
`\p{Z}`	Separator	Space, line separator, paragraph separator
`\p{Script=Han}`	Han script	Han ideographs used in Chinese, Japanese, and Korean
`\p{Emoji}`	Emoji (binary property)	Emoji characters, plus 0–9, #, and * (keycap bases)
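One caveat for the \p{Emoji} row: the Emoji property also covers ASCII digits, # and *, because those characters can start keycap sequences. The JavaScript sketch below compares it with two narrower binary properties, Emoji_Presentation and Extended_Pictographic; which one fits depends on the use case, so treat this as an illustration rather than a recommendation.

```javascript
const face = "\u{1F600}"; // 😀

// \p{Emoji} is broader than "pictographs": plain digits match too.
const emojiMatchesDigit = /\p{Emoji}/u.test("5");
const emojiMatchesFace = /\p{Emoji}/u.test(face);

// Emoji_Presentation: characters rendered as emoji by default.
const presentationMatchesDigit = /\p{Emoji_Presentation}/u.test("5");
const presentationMatchesFace = /\p{Emoji_Presentation}/u.test(face);

// Extended_Pictographic: pictographic characters, excluding digits.
const pictographicMatchesDigit = /\p{Extended_Pictographic}/u.test("5");
const pictographicMatchesFace = /\p{Extended_Pictographic}/u.test(face);

console.log(emojiMatchesDigit, emojiMatchesFace);               // true true
console.log(presentationMatchesDigit, presentationMatchesFace); // false true
console.log(pictographicMatchesDigit, pictographicMatchesFace); // false true
```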

Practical Validation Examples

// Validate that a username contains only letters, digits, underscores
// Works for all scripts, not just ASCII
function isValidUsername(name) {
  return /^[\p{L}\p{N}_]{3,30}$/u.test(name);
}

isValidUsername("user_123")    // true
isValidUsername("用户名")       // true (Han characters are letters)
isValidUsername("<script>")    // false

// Extract hashtags including non-Latin
function extractHashtags(text) {
  return [...text.matchAll(/#\p{L}[\p{L}\p{N}]*/gu)].map(m => m[0]);
}

extractHashtags("Check #Unicode and #유니코드!")
// ["#Unicode", "#유니코드"]
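A limitation of the username check above is that \p{L} does not match combining marks, so decomposed accents and scripts that rely on vowel signs are rejected. The sketch below shows one possible adjustment: normalize to NFC and also accept \p{M}. The helper name isValidUsernameWithMarks is hypothetical, and accepting marks is a policy choice (it also allows a name that begins with a mark), not a rule from the original validator.

```javascript
// Same shape as the original check: letters, numbers, underscore.
const lettersOnly = /^[\p{L}\p{N}_]{3,30}$/u;
// Variant that additionally accepts combining marks (\p{M}).
const lettersAndMarks = /^[\p{L}\p{M}\p{N}_]{3,30}$/u;

// Hypothetical helper: normalize to NFC first so "e" + U+0301 becomes "é".
function isValidUsernameWithMarks(name) {
  return lettersAndMarks.test(name.normalize("NFC"));
}

const decomposed = "cafe\u0301"; // "café" written with a combining acute accent
const hindi = "\u0939\u093F\u0928\u094D\u0926\u0940"; // हिन्दी (letters plus marks)

console.log(lettersOnly.test(decomposed));                  // false
console.log(lettersOnly.test(decomposed.normalize("NFC"))); // true
console.log(lettersOnly.test(hindi));                       // false
console.log(isValidUsernameWithMarks(hindi));               // true
```

Note that the {3,30} quantifier counts code points in Unicode mode, not user-perceived characters, so the length limit is only approximate for scripts that use many marks.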

Grapheme Cluster Matching

The \X pattern in Python's regex module matches a full grapheme cluster, that is, one user-perceived character, which may span several code points:

import regex

# \X matches one grapheme (not one code point)
regex.findall(r"\X", "café")          # ["c", "a", "f", "é"]  (4 graphemes)
regex.findall(r"\X", "👨‍👩‍👧")      # ["👨‍👩‍👧"]  (1 grapheme)
regex.findall(r"\X", "🇺🇸 flag")    # ["🇺🇸", " ", "f", "l", "a", "g"]
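JavaScript regular expressions have no \X escape. A rough counterpart, sketched below, is Intl.Segmenter from the ECMAScript Internationalization API, which is a different mechanism from regex matching. The helper name graphemes is hypothetical, and the exact cluster boundaries depend on the Unicode data shipped with the runtime.

```javascript
// Intl.Segmenter splits a string into grapheme clusters; it is not a regex.
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

// Hypothetical helper name, used only for this example.
function graphemes(text) {
  return Array.from(segmenter.segment(text), (part) => part.segment);
}

const decomposedCafe = "cafe\u0301";                      // "café", 5 code points
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"; // 👨‍👩‍👧, 5 code points
const flag = "\u{1F1FA}\u{1F1F8}";                        // 🇺🇸, 2 code points

console.log([...decomposedCafe].length);       // 5 (code points)
console.log(graphemes(decomposedCafe).length); // 4 (graphemes)
console.log(graphemes(family).length);         // 1
console.log(graphemes(flag).length);           // 1
```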

Quick Facts

Property	Value
JS syntax	`\p{Property}` with `/u` or `/v` flag
JS introduced	ES2018 for `\p{}`, ES2024 for set notation (`/v`)
Python built-in `re`	Limited: no `\p{}`; `\w`, `\d`, `\s` are Unicode-aware by default for `str` patterns in Python 3
Python full support	`regex` module (PyPI): `\p{L}`, `\X`, script properties
Grapheme matching	`\X` in `regex` module
Most useful properties	`L` (letter), `N` (number), `Script=X`, `Emoji`
