The Developer's Unicode Handbook · Chapter 5
Input Validation Done Right
Validating international text is harder than it looks. Unicode categories, identifier rules, email address validation, and IDNA2008 — this chapter provides the definitive guide to Unicode-aware input validation.
Input validation is your first line of defense against malformed data, security vulnerabilities, and confused users. Unicode makes input validation dramatically harder than it looks. A regex that rejects "invalid" email addresses may reject perfectly valid internationalized addresses. A username field that looks clean may contain invisible characters that create account confusion. A phone number field may accept numerals in scripts your downstream system can't handle. This chapter covers the right way to validate Unicode input.
The Email Regex That Rejects Valid Emails
Most email validation regexes are simply wrong for internationalized email addresses. IDNA (RFC 5890 to 5894) defines internationalized domain names, and the email address internationalization (EAI) standards, RFC 6531 and RFC 6532, allow Unicode in the local part too.
import re

# The classic email regex: WRONG for international addresses
NAIVE_EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

# These are all valid email addresses; the naive regex rejects most of them:
valid_emails = [
    "用户@例子.广告",  # Chinese IDN
    "адрес@пример.рф",  # Russian IDN
    "tëst@ëxample.com",  # Accented characters in local part (EAI)
    "user@münchen.de",  # Accented domain
    "user+tag@example.com",  # Subaddressing (ASCII only, so the naive regex accepts it)
]
for email in valid_emails:
    print(f"{email}: {bool(NAIVE_EMAIL_RE.match(email))}")
# Most will print False, yet all are valid!

# Better approach: use a library that handles IDN and EAI
# pip install email-validator
from email_validator import validate_email, EmailNotValidError

for email in valid_emails:
    try:
        info = validate_email(email, check_deliverability=False)
        print(f"{email}: VALID (normalized: {info.normalized})")
    except EmailNotValidError as e:
        print(f"{email}: INVALID ({e})")
For most applications, the right approach is:
1. Use a proper library (email-validator in Python, validator.js in JS).
2. Send a confirmation email. Deliverability is the real validation.
3. Store the normalized form returned by the library.
URL Validation with IDN
Internationalized Domain Names (IDN) encode Unicode domain names using Punycode for DNS compatibility. münchen.de becomes xn--mnchen-3ya.de in DNS. User input can come in either form.
from urllib.parse import urlparse

import idna  # pip install idna


def validate_url(url: str) -> str | None:
    """Validate a URL, handling IDN hostnames. Return None if invalid."""
    try:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            return None
        hostname = parsed.hostname
        if hostname is None:
            return None
        # Encode the IDN hostname to ASCII; raises idna.IDNAError if invalid.
        # uts46=True applies the UTS #46 mapping (such as case folding) first.
        ascii_hostname = idna.encode(hostname, uts46=True).decode("ascii")
        # Use ascii_hostname for DNS resolution, but keep the original
        # Unicode hostname for display.
        return url
    except (idna.IDNAError, ValueError):
        return None


# Test
print(validate_url("https://münchen.de/path"))  # Valid IDN
print(validate_url("https://例子.com/path"))  # Valid IDN
print(validate_url("https://invalid..com"))  # Invalid (double dot)
print(validate_url("http://xn--nxasmq6b.com"))  # Valid Punycode form
Security warning: Homograph attacks use IDN domains that look like legitimate domains. pаypal.com (with Cyrillic а) looks identical to paypal.com but is a different domain. When displaying URLs to users, consider using the Punycode form for non-Latin scripts to make the difference visible.
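To make the Punycode mapping and the display advice above concrete, here is a minimal sketch using only the standard library. The helper names to_ascii and display_hostname are hypothetical, and Python's built-in "idna" codec implements the older IDNA 2003 rules, so treat this as an illustration rather than a replacement for the idna package. The display policy is deliberately blunt: it shows Punycode for any non-ASCII hostname, whereas real browsers apply finer-grained rules.

```python
def to_ascii(hostname: str) -> str:
    """Convert a hostname to its ASCII (Punycode) form.

    Uses the standard library "idna" codec (IDNA 2003), an assumption made
    to keep the sketch dependency-free.
    """
    return hostname.encode("idna").decode("ascii")


def display_hostname(hostname: str) -> str:
    """Hypothetical display policy: show pure-ASCII hostnames as typed and
    the Punycode form for anything else, so look-alikes stand out."""
    return hostname if hostname.isascii() else to_ascii(hostname)


print(to_ascii("münchen.de"))              # xn--mnchen-3ya.de
print(display_hostname("paypal.com"))      # paypal.com
print(display_hostname("p\u0430ypal.com")) # an xn-- form (U+0430 is Cyrillic "а")
```

A stricter product might instead reject such hostnames outright in security-sensitive contexts; which policy fits depends on how many of your users legitimately have non-Latin domains.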
Phone Numbers with Unicode Digits
Unicode defines digit characters in many scripts. Arabic-Indic digits (٠١٢٣٤٥٦٧٨٩), Devanagari digits (०१२३४५), and others are valid Unicode numbers. A naive phone number validator that accepts only ASCII digits will reject valid input from Arab and South Asian users.
import unicodedata


def normalize_digits(text: str) -> str:
    """Convert Unicode decimal digits to ASCII digits."""
    result = []
    for char in text:
        if unicodedata.category(char) == "Nd":  # Decimal digit
            # Get the numeric value and convert to ASCII
            digit_value = unicodedata.digit(char)
            result.append(str(digit_value))
        else:
            result.append(char)
    return "".join(result)


# Test with Arabic-Indic digits
arabic_phone = "٠١٢-٣٤٥٦-٧٨٩٠"
normalized = normalize_digits(arabic_phone)
print(normalized)  # 012-3456-7890

# For production phone validation, use the phonenumbers library
# pip install phonenumbers
import phonenumbers


def validate_phone(phone_input: str, default_region: str = "US") -> str | None:
    """Validate a phone number, accepting Unicode digits. Return E.164 or None."""
    normalized_input = normalize_digits(phone_input)
    try:
        parsed = phonenumbers.parse(normalized_input, default_region)
        if phonenumbers.is_valid_number(parsed):
            return phonenumbers.format_number(
                parsed, phonenumbers.PhoneNumberFormat.E164
            )
    except phonenumbers.phonenumberutil.NumberParseException:
        pass
    return None


# 555-123-4567 style numbers fail is_valid_number (a US exchange cannot start
# with 1), so the examples use a number that passes validation.
print(validate_phone("+1 (650) 253-0000"))  # +16502530000
print(validate_phone("٦٥٠٢٥٣٠٠٠٠", "US"))  # Same number in Arabic-Indic digits
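One reason non-ASCII digits reach downstream systems unnoticed is that Python's own digit checks are Unicode-aware by default. The short sketch below shows that behavior using only the standard library; the sample strings are illustrative.

```python
import re

arabic = "\u0660\u0661\u0662"  # Arabic-Indic digits for 012

# In str patterns, \d matches any decimal digit (category Nd), not just 0-9.
print(bool(re.fullmatch(r"\d+", arabic)))            # True
# The re.ASCII flag restricts \d to [0-9].
print(bool(re.fullmatch(r"\d+", arabic, re.ASCII)))  # False

# int() also accepts Unicode decimal digits.
print(int(arabic))  # 12

# str.isdigit() is broader still: it accepts superscripts, which int() rejects.
print("\u00b2".isdigit(), "\u00b2".isdecimal())  # True False
```

If a field must end up ASCII-only, either normalize the digits as shown earlier or validate with re.ASCII so the restriction is explicit.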
Usernames with Confusables
Confusable characters are the username security nightmare: аdmin with a Cyrillic а looks identical to admin with Latin a, but they're different usernames. An attacker can register аdmin and confuse users into thinking they're talking to the legitimate admin account.
import re
import unicodedata


def detect_mixed_scripts(text: str) -> bool:
    """Return True if text mixes multiple Unicode scripts (potential confusable)."""
    scripts: set[str] = set()
    for char in text:
        # The standard library does not expose the Script property; PyICU or
        # the third-party regex module do. Simplified here: classify letters
        # by codepoint range.
        if unicodedata.category(char).startswith("L"):  # Letter
            cp = ord(char)
            if 0x0041 <= cp <= 0x007A:  # Basic Latin letters
                scripts.add("Latin")
            elif 0x0400 <= cp <= 0x04FF:  # Cyrillic
                scripts.add("Cyrillic")
            elif 0x0370 <= cp <= 0x03FF:  # Greek
                scripts.add("Greek")
            # etc.
    return len(scripts) > 1


# More complete solution: use the `confusable_homoglyphs` library
# pip install confusable-homoglyphs
from confusable_homoglyphs import confusables


def is_potentially_confusable(username: str) -> bool:
    """Check if username contains characters that could be confused with others."""
    # Almost every Latin letter has a look-alike in some other script, so
    # exempt the expected script. "latin" is an assumption about your user base.
    return bool(confusables.is_confusable(username, preferred_aliases=["latin"]))


# Username normalization strategy:
def normalize_username_for_uniqueness(username: str) -> str:
    """Return a canonical form for uniqueness checking.

    Store the original for display; use this for dedup checks.
    """
    # NFKD + casefold + collapse whitespace + strip
    text = unicodedata.normalize("NFKD", username)
    text = text.casefold()
    text = re.sub(r"\s+", " ", text).strip()
    return text


# Two "different" usernames that should conflict:
u1 = "Admin"
u2 = "аdmin"  # Cyrillic а
print(normalize_username_for_uniqueness(u1))  # admin
print(normalize_username_for_uniqueness(u2))  # аdmin ← still different after NFKD!
# Note: confusables are NOT normalized by Unicode; explicit confusable detection is needed
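Since normalization leaves confusables untouched, uniqueness checks need an extra folding step. A minimal sketch of that idea follows. CONFUSABLE_MAP and skeleton are hypothetical names, and the five entries are an illustrative subset written by hand, not the Unicode confusables data that a production system (or the library above) would rely on.

```python
import unicodedata

# Illustrative subset only: a few Cyrillic letters that resemble Latin ones.
CONFUSABLE_MAP = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
}


def skeleton(username: str) -> str:
    """Fold a username to a comparison key: NFKD, casefold, then map
    known look-alikes to a single representative character."""
    text = unicodedata.normalize("NFKD", username).casefold()
    return "".join(CONFUSABLE_MAP.get(ch, ch) for ch in text)


latin = "Admin"
spoof = "\u0430dmin"  # starts with Cyrillic "а"

print(latin.casefold() == spoof.casefold())  # False: different code points
print(skeleton(latin) == skeleton(spoof))    # True: same skeleton, so treat as a conflict
```

The skeleton is only a comparison key. As with the normalized form above, keep the original username for display and use the key for the uniqueness check.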
Zero-Width Characters in Form Input
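Zero-width characters (ZWC) are invisible Unicode characters that can appear in user input without the user knowing. They can bypass keyword filters, defeat exact-match and hash-based duplicate checks (two strings that look identical compare and hash differently), and create phantom account names: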
| Codepoint | Abbreviation | Name |
|---|---|---|
| U+200B | ZWSP | Zero Width Space |
| U+200C | ZWNJ | Zero Width Non-Joiner |
| U+200D | ZWJ | Zero Width Joiner |
| U+FEFF | BOM | Zero Width No-Break Space |
| U+2060 | WJ | Word Joiner |
import re

ZERO_WIDTH_CHARS = [
    "\u200B",  # Zero Width Space
    "\u200C",  # Zero Width Non-Joiner
    "\u200D",  # Zero Width Joiner
    "\uFEFF",  # Zero Width No-Break Space / BOM
    "\u2060",  # Word Joiner
    "\u180E",  # Mongolian Vowel Separator
    "\u00AD",  # Soft Hyphen (invisible in most fonts)
]
ZERO_WIDTH_RE = re.compile("[" + "".join(ZERO_WIDTH_CHARS) + "]")


def strip_zero_width(text: str) -> str:
    """Remove zero-width characters from user input."""
    return ZERO_WIDTH_RE.sub("", text)


def has_invisible_characters(text: str) -> bool:
    """Detect presence of invisible/zero-width characters."""
    return bool(ZERO_WIDTH_RE.search(text))


# Test
username = "admin\u200B"  # "admin" + zero-width space
print(username == "admin")  # False: looks the same, is different!
print(has_invisible_characters(username))  # True
print(strip_zero_width(username) == "admin")  # True
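A fixed list of code points is easy to audit but also easy to leave incomplete. As a complementary sketch, the general category Cf (format) covers every character in the list above, so the standard library can report such characters by name. The helper name find_format_chars is hypothetical. Reporting instead of silently stripping is a design choice worth considering, because U+200C and U+200D are meaningful in some scripts (Persian, for example) and in emoji sequences.

```python
import unicodedata


def find_format_chars(text: str) -> list[tuple[int, str, str]]:
    """Return (index, codepoint, name) for each format (Cf) character."""
    found = []
    for index, char in enumerate(text):
        if unicodedata.category(char) == "Cf":
            name = unicodedata.name(char, "UNNAMED")
            found.append((index, f"U+{ord(char):04X}", name))
    return found


print(find_format_chars("admin\u200b"))
# [(5, 'U+200B', 'ZERO WIDTH SPACE')]
print(find_format_chars("admin"))
# []
```

The returned positions and names can feed an error message, which avoids rejecting input without telling the user why.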
Bidi Override Characters: The Security Threat
Bidirectional override characters can reverse text direction, making filenames and code appear different from what they actually are. This is a real attack vector (CVE-2021-42574, "Trojan Source"):
# Bidi control characters to detect and strip in user input
BIDI_CONTROL_CHARS = [
    "\u200E",  # Left-to-Right Mark
    "\u200F",  # Right-to-Left Mark
    "\u202A",  # Left-to-Right Embedding
    "\u202B",  # Right-to-Left Embedding
    "\u202C",  # Pop Directional Formatting
    "\u202D",  # Left-to-Right Override
    "\u202E",  # Right-to-Left Override ← the dangerous one
    "\u2066",  # Left-to-Right Isolate
    "\u2067",  # Right-to-Left Isolate
    "\u2068",  # First Strong Isolate
    "\u2069",  # Pop Directional Isolate
]
BIDI_RE = re.compile("[" + "".join(BIDI_CONTROL_CHARS) + "]")


def sanitize_input(text: str) -> str:
    """Sanitize user input: remove invisible chars and bidi overrides.

    Apply to all user-provided text before storing.
    """
    # Remove bidi overrides
    text = BIDI_RE.sub("", text)
    # Remove zero-width characters
    text = ZERO_WIDTH_RE.sub("", text)
    # Normalize unicode
    text = unicodedata.normalize("NFC", text)
    # Strip leading/trailing whitespace
    text = text.strip()
    return text
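To see why the override is dangerous, it helps to make the controls visible instead of removing them, for instance in logs or a moderation view. A minimal sketch follows, with reveal_bidi as a hypothetical helper and an invented filename as the sample.

```python
import re

# The same bidi controls as above: marks, embeddings, overrides, isolates.
BIDI_CONTROLS_RE = re.compile("[\u200e\u200f\u202a-\u202e\u2066-\u2069]")


def reveal_bidi(text: str) -> str:
    """Replace each bidi control with a visible <U+XXXX> marker."""
    return BIDI_CONTROLS_RE.sub(lambda m: f"<U+{ord(m.group()):04X}>", text)


# Stored as "invoice" + RLO + "fdp.exe". A bidi-aware renderer shows the
# part after the override reversed, so the name appears to end in ".pdf".
filename = "invoice\u202efdp.exe"

print(filename.endswith(".exe"))  # True: the real extension
print(reveal_bidi(filename))      # invoice<U+202E>fdp.exe
```

Checks such as file-type allowlists should always run on the stored code points, never on what a renderer displays.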
A Practical Validation Checklist
For any user-facing input field, run through this checklist:
- Normalize first: Apply NFC (or NFKC if you need compatibility folding) before any validation.
- Strip invisible characters: Remove zero-width chars and bidi overrides.
- Check length in grapheme clusters: Don't use .length or len() for "max 50 characters" limits.
- Use purpose-built validators: Email → email-validator, URLs → idna + urlparse, phone → phonenumbers.
- Confusable usernames: Check for mixed scripts and confusable homoglyphs in security-sensitive identifiers.
- Store normalized form: What you store is what you compare later. Normalize once at write time.
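The gap between code points and user-perceived characters in the length item is easy to demonstrate. The standard library has no grapheme segmentation, so the counter below is a rough approximation that skips combining marks and characters joined by ZWJ; it is not an implementation of the Unicode segmentation rules. The name approx_grapheme_len is hypothetical, and a real limit check should use a library that implements UAX #29.

```python
import unicodedata

ZWJ = "\u200d"


def approx_grapheme_len(text: str) -> int:
    """Roughly count user-perceived characters (NOT full UAX #29)."""
    count = 0
    joined = False  # True when the previous character was a ZWJ
    for char in text:
        if char == ZWJ:
            joined = True
            continue
        is_mark = unicodedata.category(char).startswith("M")
        if not (is_mark or joined):
            count += 1
        joined = False
    return count


decomposed = "e\u0301"  # "é" as e + combining acute accent
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"  # family emoji, ZWJ-joined

print(len(decomposed), approx_grapheme_len(decomposed))  # 2 1
print(len(unicodedata.normalize("NFC", decomposed)))     # 1
print(len(family), approx_grapheme_len(family))          # 5 1
```

The second line also illustrates the first checklist item: normalizing to NFC before measuring removes one source of disagreement, though not the emoji case.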
The most dangerous validation mistake is the silent rejection: a form that rejects perfectly valid input without telling the user why. This is both a security failure (you didn't think through the attack surface) and a UX failure (you've excluded legitimate users whose names aren't ASCII).