The Developer's Unicode Handbook · Chapter 5

Input Validation Done Right

Validating international text is harder than it looks. Unicode categories, identifier rules, email address validation, and IDNA2008 — this chapter provides the definitive guide to Unicode-aware input validation.

~4,000 words · ~16 min read

Input validation is your first line of defense against malformed data, security vulnerabilities, and confused users. Unicode makes input validation dramatically harder than it looks. A regex that rejects "invalid" email addresses may reject perfectly valid internationalized addresses. A username field that looks clean may contain invisible characters that create account confusion. A phone number field may accept numerals in scripts your downstream system can't handle. This chapter covers the right way to validate Unicode input.

The Email Regex That Rejects Valid Emails

Most email validation regexes are simply wrong for internationalized email addresses. IDNA (RFC 5890 and RFC 5891) allows internationalized domain names, and the EAI specifications (RFC 6530 through RFC 6532) allow Unicode in the local part too.

import re

# The classic email regex: WRONG for international addresses
NAIVE_EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

# These are all valid email addresses; the naive regex rejects most of them:
valid_emails = [
    "用户@例子.广告",            # Chinese IDN
    "адрес@пример.рф",         # Russian IDN
    "tëst@ëxample.com",        # Accented characters in local part (EAI)
    "user@münchen.de",         # Accented domain
    "[email protected]",  # Subaddressing
]

for email in valid_emails:
    print(f"{email}: {bool(NAIVE_EMAIL_RE.match(email))}")
    # Most will print False — all are valid!

# Better approach: use a library that handles IDN and EAI
# pip install email-validator
from email_validator import validate_email, EmailNotValidError

for email in valid_emails:
    try:
        info = validate_email(email, check_deliverability=False)
        print(f"{email}: VALID (normalized: {info.normalized})")
    except EmailNotValidError as e:
        print(f"{email}: INVALID — {e}")

For most applications, the right approach is:

  1. Use a proper library (email-validator in Python, validator.js in JS).
  2. Send a confirmation email. Deliverability is the real validation.
  3. Store the normalized form returned by the library.

URL Validation with IDN

Internationalized Domain Names (IDN) encode Unicode domain names using Punycode for DNS compatibility. münchen.de becomes xn--mnchen-3ya.de in DNS. User input can come in either form.
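
As a quick check of the two forms, the sketch below round-trips a hostname with Python's built-in "idna" codec. That choice is an assumption made for brevity: the built-in codec implements the older IDNA2003 rules, so production code is better served by the third-party idna package. The helper names to_ascii and to_unicode are illustrative.

```python
# Round-trip between the Unicode and Punycode forms of a hostname.
# Uses the standard library's "idna" codec (IDNA2003), chosen only because
# it needs no installation.

def to_ascii(hostname: str) -> str:
    """Return the ASCII (Punycode) form of a hostname."""
    return hostname.encode("idna").decode("ascii")

def to_unicode(hostname: str) -> str:
    """Return the Unicode form of an ASCII hostname."""
    return hostname.encode("ascii").decode("idna")

print(to_ascii("m\u00fcnchen.de"))       # xn--mnchen-3ya.de
print(to_unicode("xn--mnchen-3ya.de"))   # münchen.de
```

Whichever form the user typed, converting to one form before storing or comparing avoids treating the same domain as two different values.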

from urllib.parse import urlparse
import idna  # pip install idna

def validate_url(url: str) -> str | None:
    # Validate and normalize a URL, handling IDN domains.
    try:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            return None

        # Encode IDN hostname to ASCII (raises if invalid)
        hostname = parsed.hostname
        if hostname is None:
            return None

        # Try IDN encoding. uts46=True applies UTS #46 mapping (such as
        # lowercasing) first; raises idna.IDNAError for invalid hostnames.
        ascii_hostname = idna.encode(hostname, uts46=True).decode("ascii")

        # Use ascii_hostname for DNS resolution, but keep the original
        # Unicode hostname for display.
        return url  # Return original for display; use ascii_hostname for DNS

    except (idna.IDNAError, ValueError):
        return None

# Test
print(validate_url("https://münchen.de/path"))       # Valid IDN
print(validate_url("https://例子.com/path"))          # Valid IDN
print(validate_url("https://invalid..com"))           # Invalid (double dot)
print(validate_url("http://xn--nxasmq6b.com"))        # Valid Punycode form

Security warning: Homograph attacks use IDN domains that look like legitimate domains. pаypal.com (with Cyrillic а) looks identical to paypal.com but is a different domain. When displaying URLs to users, consider using the Punycode form for non-Latin scripts to make the difference visible.
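
A minimal sketch of that display policy follows. It assumes a simple heuristic: a letter counts as Latin if its Unicode character name starts with "LATIN". The function names is_latin_only and display_hostname are hypothetical, and the built-in "idna" codec (IDNA2003) is used only to keep the example dependency-free.

```python
import unicodedata

def is_latin_only(hostname: str) -> bool:
    """True if every letter in the hostname is a Latin-script letter.

    Heuristic: relies on Unicode character names starting with "LATIN".
    """
    return all(
        unicodedata.name(ch, "").startswith("LATIN")
        for ch in hostname
        if ch.isalpha()
    )

def display_hostname(hostname: str) -> str:
    """Return the hostname unchanged if Latin-only, else its Punycode form."""
    if is_latin_only(hostname):
        return hostname
    return hostname.encode("idna").decode("ascii")

print(display_hostname("paypal.com"))         # paypal.com
print(display_hostname("p\u0430ypal.com"))    # an xn-- form, visibly different
print(display_hostname("m\u00fcnchen.de"))    # münchen.de (Latin letters only)
```

This is stricter than most browsers, which weigh the user's languages and per-label script mixing, so treat it as a starting point rather than a complete policy.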

Phone Numbers with Unicode Digits

Unicode defines digit characters in many scripts. Arabic-Indic digits (٠١٢٣٤٥٦٧٨٩), Devanagari digits (०१२३४५६७८९), and others are all decimal digits (general category Nd). A naive phone number validator that accepts only ASCII digits will reject valid input from Arab and South Asian users.

import unicodedata
import re

# Normalize digits: convert any Unicode digit to ASCII
def normalize_digits(text: str) -> str:
    # Convert Unicode digits to ASCII digits.
    result = []
    for char in text:
        if unicodedata.category(char) == "Nd":  # Decimal digit
            # Get the numeric value and convert to ASCII
            digit_value = unicodedata.digit(char)
            result.append(str(digit_value))
        else:
            result.append(char)
    return "".join(result)

# Test with Arabic-Indic digits
arabic_phone = "٠١٢-٣٤٥٦-٧٨٩٠"  # ٠١٢-٣٤٥٦-٧٨٩٠
normalized = normalize_digits(arabic_phone)
print(normalized)  # 012-3456-7890

# For production phone validation, use phonenumbers library
# pip install phonenumbers
import phonenumbers

def validate_phone(phone_input: str, default_region: str = "US") -> str | None:
    # Validate phone number, accepting Unicode digits.
    normalized_input = normalize_digits(phone_input)
    try:
        parsed = phonenumbers.parse(normalized_input, default_region)
        if phonenumbers.is_valid_number(parsed):
            return phonenumbers.format_number(
                parsed, phonenumbers.PhoneNumberFormat.E164
            )
    except phonenumbers.phonenumberutil.NumberParseException:
        pass
    return None

print(validate_phone("+1 (650) 253-0000"))  # +16502530000 (fictional 555 numbers fail is_valid_number)
print(validate_phone("٠٥٥٥١٢٣٤٥٦٧", "US"))  # Handles Arabic digits
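
The same issue appears in reverse when a downstream system expects ASCII digits only: several Python built-ins quietly accept any Unicode decimal digit. A short sketch, assuming the field must end up as ASCII; is_ascii_digits is a hypothetical helper name.

```python
import re

arabic = "\u0660\u0661\u0662"  # Arabic-Indic digits for 0, 1, 2

# In str patterns, \d matches any Unicode decimal digit by default.
print(bool(re.fullmatch(r"\d+", arabic)))            # True
# The re.ASCII flag restricts \d to 0-9.
print(bool(re.fullmatch(r"\d+", arabic, re.ASCII)))  # False
# str.isdigit() and int() also accept non-ASCII digits.
print(arabic.isdigit())                              # True
print(int(arabic))                                   # 12

def is_ascii_digits(text: str) -> bool:
    """True only for non-empty strings made of the ASCII digits 0-9."""
    return text.isascii() and text.isdigit()

print(is_ascii_digits(arabic))  # False
print(is_ascii_digits("012"))   # True
```

Checking with is_ascii_digits after normalization confirms that nothing unexpected reaches the downstream system.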

Usernames with Confusables

Confusable characters are the username security nightmare: аdmin with a Cyrillic а looks identical to admin with Latin a, but they're different usernames. An attacker can register аdmin and confuse users into thinking they're talking to the legitimate admin account.

import unicodedata
import re

def detect_mixed_scripts(text: str) -> bool:
    # Return True if text mixes multiple Unicode scripts (potential confusable).
    scripts: set[str] = set()
    for char in text:
        # The stdlib unicodedata module has no Script property; PyICU or the
        # third-party `regex` module (\p{Script=...}) give full support.
        # Simplified here: approximate the script from code point ranges.
        cat = unicodedata.category(char)
        if cat.startswith("L"):  # Letter
            # Basic script detection by codepoint range
            cp = ord(char)
            if 0x0041 <= cp <= 0x007A:   # Basic Latin letters
                scripts.add("Latin")
            elif 0x0400 <= cp <= 0x04FF:  # Cyrillic
                scripts.add("Cyrillic")
            elif 0x0370 <= cp <= 0x03FF:  # Greek
                scripts.add("Greek")
            # etc.
    return len(scripts) > 1

# More complete solution: use the `confusable_homoglyphs` library
# pip install confusable-homoglyphs
from confusable_homoglyphs import confusables, categories

def is_potentially_confusable(username: str) -> bool:
    # Check if username contains characters that could be confused with others.
    for char in username:
        if confusables.is_confusable(char):
            return True
    return False

# Username normalization strategy:
def normalize_username_for_uniqueness(username: str) -> str:
    """Return a canonical form for uniqueness checking.

    Store the original for display; use this for dedup checks.
    """
    # NFKD + casefold + collapse whitespace + strip
    text = unicodedata.normalize("NFKD", username)
    text = text.casefold()
    text = re.sub(r"\s+", " ", text).strip()
    return text

# Two "different" usernames that should conflict:
u1 = "Admin"
u2 = "аdmin"  # Cyrillic а

print(normalize_username_for_uniqueness(u1))  # admin
print(normalize_username_for_uniqueness(u2))  # аdmin  ← still different after NFKD!
# Note: confusables are NOT normalized by Unicode — need explicit confusable detection
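
One way to close this gap is to map look-alike characters to a common "skeleton" before comparing, loosely in the spirit of Unicode Technical Standard #39. The sketch below uses a deliberately tiny, hand-written mapping (TINY_CONFUSABLES, a hypothetical name covering six Cyrillic letters); a real system would load the full confusables data or rely on a library.

```python
import unicodedata

# Hypothetical, deliberately incomplete map: Cyrillic letter -> Latin look-alike.
TINY_CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
}

def skeleton(username: str) -> str:
    """Fold case and compatibility forms, then replace known look-alikes."""
    folded = unicodedata.normalize("NFKD", username).casefold()
    return "".join(TINY_CONFUSABLES.get(ch, ch) for ch in folded)

latin = "Admin"
spoofed = "\u0430dmin"  # starts with Cyrillic а

print(latin.casefold() == spoofed)           # False
print(skeleton(latin) == skeleton(spoofed))  # True
```

Use the skeleton only for the uniqueness check; the stored display name stays as the user typed it.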

Zero-Width Characters in Form Input

Zero-width characters (ZWC) are invisible Unicode characters that can appear in user input without the user knowing. They can bypass filters, defeat duplicate checks (two strings that look identical compare and hash differently), and create phantom account names:

Codepoint  Abbreviation  Name
U+200B     ZWSP          Zero Width Space
U+200C     ZWNJ          Zero Width Non-Joiner
U+200D     ZWJ           Zero Width Joiner
U+FEFF     BOM           Zero Width No-Break Space
U+2060     WJ            Word Joiner

ZERO_WIDTH_CHARS = [
    "\u200B",  # Zero Width Space
    "\u200C",  # Zero Width Non-Joiner
    "\u200D",  # Zero Width Joiner
    "\uFEFF",  # Zero Width No-Break Space / BOM
    "\u2060",  # Word Joiner
    "\u180E",  # Mongolian Vowel Separator
    "\u00AD",  # Soft Hyphen (invisible in most fonts)
]

ZERO_WIDTH_RE = re.compile("[" + "".join(ZERO_WIDTH_CHARS) + "]")

def strip_zero_width(text: str) -> str:
    # Remove zero-width characters from user input.
    return ZERO_WIDTH_RE.sub("", text)

def has_invisible_characters(text: str) -> bool:
    # Detect presence of invisible/zero-width characters.
    return bool(ZERO_WIDTH_RE.search(text))

# Test
username = "admin\u200B"  # "admin" + zero-width space
print(username == "admin")              # False — looks same, is different!
print(has_invisible_characters(username))  # True
print(strip_zero_width(username) == "admin")  # True
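
A fixed list can miss characters. As a complementary sketch: every character in ZERO_WIDTH_CHARS has general category Cf (Format), so a category check can report such characters by name instead of stripping them silently. The function name find_format_chars is illustrative, and treating every Cf character as suspicious is an assumption that may be too strict for text that legitimately needs joiners.

```python
import unicodedata

def find_format_chars(text: str) -> list[tuple[int, str, str]]:
    """Return (index, code point, name) for each Cf (Format) character."""
    found = []
    for index, ch in enumerate(text):
        if unicodedata.category(ch) == "Cf":
            name = unicodedata.name(ch, "<unnamed>")
            found.append((index, f"U+{ord(ch):04X}", name))
    return found

print(find_format_chars("admin\u200B"))
# [(5, 'U+200B', 'ZERO WIDTH SPACE')]
print(find_format_chars("admin"))
# []
```

The returned positions and names are enough to build an error message that tells the user what was found and where.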

Bidi Override Characters: The Security Threat

Bidirectional override characters can reverse text direction, making filenames and code appear different from what they actually are. This is a real attack vector (CVE-2021-42574, "Trojan Source"):

# Bidi control characters to detect and strip in user input
BIDI_CONTROL_CHARS = [
    "\u200E",  # Left-to-Right Mark
    "\u200F",  # Right-to-Left Mark
    "\u202A",  # Left-to-Right Embedding
    "\u202B",  # Right-to-Left Embedding
    "\u202C",  # Pop Directional Formatting
    "\u202D",  # Left-to-Right Override
    "\u202E",  # Right-to-Left Override  ← the dangerous one
    "\u2066",  # Left-to-Right Isolate
    "\u2067",  # Right-to-Left Isolate
    "\u2068",  # First Strong Isolate
    "\u2069",  # Pop Directional Isolate
]

BIDI_RE = re.compile("[" + "".join(BIDI_CONTROL_CHARS) + "]")

def sanitize_input(text: str) -> str:
    """Sanitize user input: remove invisible chars and bidi overrides.

    Apply to all user-provided text before storing.
    """
    # Remove bidi overrides
    text = BIDI_RE.sub("", text)
    # Remove zero-width characters
    text = ZERO_WIDTH_RE.sub("", text)
    # Normalize unicode
    text = unicodedata.normalize("NFC", text)
    # Strip leading/trailing whitespace
    text = text.strip()
    return text
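
To see why U+202E is singled out, consider a file name. The sketch below builds a hypothetical name whose real extension is .exe but which a bidi-aware renderer typically displays as ending in .jpg. The helper has_bidi_controls is illustrative, not part of any library.

```python
import os

# Same code points as BIDI_CONTROL_CHARS, repeated so the block is self-contained.
BIDI_CONTROLS = set(
    "\u200E\u200F"                    # marks: LRM, RLM
    "\u202A\u202B\u202C\u202D\u202E"  # embeddings, overrides, pop formatting
    "\u2066\u2067\u2068\u2069"        # isolates
)

def has_bidi_controls(text: str) -> bool:
    """True if text contains any bidi control character."""
    return any(ch in BIDI_CONTROLS for ch in text)

# Stored order: "invoice_", Right-to-Left Override, "gpj.exe".
# Typically rendered as "invoice_exe.jpg", because the override reverses
# the display order of everything after it.
filename = "invoice_\u202Egpj.exe"

print(os.path.splitext(filename)[1])     # .exe
print(has_bidi_controls(filename))       # True
print(has_bidi_controls("invoice.jpg"))  # False
```

For file names and identifiers, rejecting input that contains these characters is usually safer than stripping them, since a stripped name is no longer what the user submitted.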

A Practical Validation Checklist

For any user-facing input field, run through this checklist:

  1. Normalize first: Apply NFC (or NFKC if you need compatibility folding) before any validation.
  2. Strip invisible characters: Remove zero-width chars and bidi overrides.
  3. Check length in grapheme clusters: Don't use .length or len() for "max 50 characters" limits.
  4. Use purpose-built validators: Email → email-validator, URLs → idna + urlparse, phone → phonenumbers.
  5. Confusable usernames: Check for mixed scripts and confusable homoglyphs in security-sensitive identifiers.
  6. Store normalized form: What you store is what you compare later. Normalize once at write time.
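
Item 3 is easy to underestimate. The sketch below only shows why len() misleads. It does not count grapheme clusters, which requires a UAX #29 segmenter (for example the \X pattern in the third-party regex module, not used here).

```python
import unicodedata

decomposed = "e\u0301"  # "e" + COMBINING ACUTE ACCENT: one visible character
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # one emoji: man, ZWJ, woman, ZWJ, girl

print(len(decomposed))                                # 2 code points
print(len(unicodedata.normalize("NFC", decomposed)))  # 1 after composition
print(len(family))                                    # 5 code points
print(len(family.encode("utf-8")))                    # 18 bytes
```

A "max 50 characters" rule enforced with len() therefore allows some users far fewer visible characters than others.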

The most dangerous validation mistake is the silent rejection: a form that rejects perfectly valid input without telling the user why. This is both a security failure (you didn't think through the attack surface) and a UX failure (you've excluded legitimate users whose names aren't ASCII).