Unicode in Passwords and Authentication

Unicode passwords introduce normalization ambiguity that can cause authentication failures or allow password bypasses when different normalization forms produce different byte sequences for the same visible password. This guide covers the security implications of Unicode in authentication systems and best practices for normalizing and hashing Unicode passwords.

Passwords and usernames are the gatekeepers of digital identity, and Unicode adds layers of complexity that most authentication systems handle poorly. Can a user whose password contains an accented character log in from a device that composes that character differently? Should "John" (Latin) and "Јohn" (Cyrillic J + Latin) be treated as different usernames? What happens when a user's name contains characters that the system's database, hashing algorithm, or comparison logic mishandles? This guide covers the intersection of Unicode and authentication, focusing on the PRECIS framework (RFC 8264), normalization requirements, and practical implementation guidance.

The Problem

Consider a user who sets their password to "café". Unicode offers two ways to represent the accented "é":

| Representation | Code Points | Bytes (UTF-8) |
|---|---|---|
| Precomposed | U+0063 U+0061 U+0066 U+00E9 | 63 61 66 C3 A9 |
| Decomposed | U+0063 U+0061 U+0066 U+0065 U+0301 | 63 61 66 65 CC 81 |

Both display as "café" but produce different byte sequences. If the system stores the password hash from one form and the user's device submits the other, authentication fails even though the user typed the correct password.
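The difference is easy to reproduce. The following is a minimal sketch using only Python's standard library; the helper name `describe` is made up for illustration.

```python
import unicodedata

precomposed = "caf\u00e9"   # ends with U+00E9 (single precomposed code point)
decomposed = "cafe\u0301"   # ends with "e" + U+0301 (combining acute accent)

def describe(text: str) -> tuple[str, str]:
    """Return the code points and the UTF-8 bytes of text as readable strings."""
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    utf8_bytes = text.encode("utf-8").hex(" ").upper()
    return code_points, utf8_bytes

print(describe(precomposed))  # ('U+0063 U+0061 U+0066 U+00E9', '63 61 66 C3 A9')
print(describe(decomposed))   # ('U+0063 U+0061 U+0066 U+0065 U+0301', '63 61 66 65 CC 81')

# The two strings compare unequal until they are brought to a common form.
print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```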

This is not a theoretical concern. macOS (HFS+) historically stored filenames in a decomposed form (a variant of NFD), while Windows keyboards and input methods typically produce precomposed (NFC) text. A password manager syncing between platforms could silently change the byte representation of a password.

Additional Complications

| Issue | Example |
|---|---|
| Width variants | Fullwidth "Ａ" (U+FF21) vs. "A" (U+0041) |
| Case folding | In Turkish, "I" lowercases to "ı" (U+0131), not "i" (U+0069) |
| Invisible characters | Password with trailing ZWSP (U+200B) |
| Bidirectional | Password mixing LTR and RTL characters |
| Confusables | Username "admin" (Latin) vs. "аdmin" (Cyrillic а) |
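Several of these issues can be observed directly from Python. The snippet below is an illustrative sketch built on the standard `unicodedata` module; the variable names are arbitrary, and it does not attempt real confusable detection, which needs script data that the standard library does not expose.

```python
import unicodedata

# Width variants: fullwidth "Ａ" is a different code point from ASCII "A",
# and only a compatibility normalization (NFKC) folds them together.
fullwidth_a = "\uff21"
width_differs = fullwidth_a != "A"
width_folds = unicodedata.normalize("NFKC", fullwidth_a) == "A"

# Case folding: str.lower() is locale-independent, so "I" never becomes
# the Turkish dotless i (U+0131).
lower_i = "I".lower()

# Invisible characters: ZERO WIDTH SPACE has general category Cf (format).
zwsp_category = unicodedata.category("\u200b")

# Confusables: Cyrillic "а" (U+0430) renders like Latin "a" (U+0061).
latin = "admin"
mixed = "\u0430dmin"
first_char_names = (unicodedata.name(latin[0]), unicodedata.name(mixed[0]))

print(width_differs, width_folds)  # True True
print(lower_i)                     # i
print(zwsp_category)               # Cf
print(latin == mixed)              # False
print(first_char_names)            # ('LATIN SMALL LETTER A', 'CYRILLIC SMALL LETTER A')
```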

The PRECIS Framework (RFC 8264)

PRECIS (Preparation, Enforcement, and Comparison of Internationalized Strings) is the IETF standard for handling Unicode strings in security-critical contexts. It supersedes the older Stringprep (RFC 3454) and defines profiles for different use cases.

PRECIS String Classes

| Class | Purpose | Base Rule |
|---|---|---|
| FreeformClass | Passwords, display names | Broad character allowance |
| IdentifierClass | Usernames and similar identifiers | Restricted to "letter-like" characters |

Key PRECIS Profiles

| Profile | RFC | Use Case | Class |
|---|---|---|---|
| UsernameCaseMapped | RFC 8265 | Case-insensitive usernames | IdentifierClass |
| UsernameCasePreserved | RFC 8265 | Case-sensitive usernames | IdentifierClass |
| OpaqueString | RFC 8265 | Passwords | FreeformClass |
| Nickname | RFC 8266 | Display names | FreeformClass |

Password Processing (OpaqueString Profile)

The OpaqueString profile for passwords (RFC 8265, Section 4.2) applies these rules:

  1. No Width Mapping: Fullwidth/halfwidth variants are NOT mapped to their normal forms
  2. Additional Mapping: Map every non-ASCII space (category Zs) to U+0020 SPACE
  3. No Case Mapping: Passwords are case-sensitive; do NOT apply case folding
  4. Normalization: Apply NFC normalization
  5. No BiDi Rule: The Bidi rule (RFC 5893) is not applied to passwords
  6. Prohibited Check: Reject empty strings and code points outside FreeformClass, such as control characters and surrogates

# Simplified OpaqueString (password) processing
import unicodedata

def prepare_password(password):
    # Step 1: Map non-ASCII spaces (category Zs) to U+0020 SPACE.
    # Fullwidth/halfwidth forms are deliberately left untouched.
    mapped = "".join(
        " " if unicodedata.category(ch) == "Zs" else ch for ch in password
    )

    # Step 2: Apply NFC normalization (not NFKC, and no case mapping),
    # which preserves the distinctions the user typed
    normalized = unicodedata.normalize("NFC", mapped)

    # Step 3: Check for prohibited characters
    for ch in normalized:
        cat = unicodedata.category(ch)
        # Reject control characters (Cc). PRECIS disallows all of them;
        # this simplified version makes an exception for tab.
        if cat == "Cc" and ch not in ("\t",):
            raise ValueError(f"Prohibited control character: U+{ord(ch):04X}")
        # Reject surrogates
        if cat == "Cs":
            raise ValueError("Surrogate code point")

    # Step 4: Check for empty result. PRECIS only requires a non-empty
    # string; rejecting whitespace-only passwords is an extra policy choice.
    if not normalized.strip():
        raise ValueError("Password cannot be empty or whitespace-only")

    return normalized

Username Processing (UsernameCaseMapped Profile)

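import unicodedata

def prepare_username(username):
    # Step 1: Width mapping (fullwidth -> normal). NFKC is used as a simple
    # stand-in; PRECIS itself maps only fullwidth and halfwidth characters.
    mapped = unicodedata.normalize("NFKC", username)

    # Step 2: Case mapping. RFC 8265 specifies lowercasing; casefold() is
    # slightly more aggressive (for example, it maps "ß" to "ss").
    mapped = mapped.casefold()

    # Step 3: Normalize to NFC
    mapped = unicodedata.normalize("NFC", mapped)

    # Step 4: Check prohibited characters
    for ch in mapped:
        cat = unicodedata.category(ch)
        # Reject spaces (Zs), controls (Cc), format characters (Cf), and
        # surrogates (Cs). Full PRECIS also excludes most non-ASCII symbols
        # and applies the Bidi rule, which this sketch does not check.
        if cat in ("Zs", "Cc", "Cf", "Cs"):
            raise ValueError(f"Prohibited character category: {cat}")

    # Step 5: Check for empty result
    if not mapped:
        raise ValueError("Username cannot be empty")

    return mapped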
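The function above relies on NFKC for width mapping, which also folds many characters that are not width variants. As a hedged alternative, the sketch below maps only code points whose decomposition is tagged `<wide>` or `<narrow>` in the Unicode Character Database, which is closer to what the PRECIS width mapping rule describes. The function name `width_map` is hypothetical, not part of any library.

```python
import unicodedata

def width_map(text: str) -> str:
    """Map fullwidth and halfwidth characters to their decomposition
    mappings, leaving all other compatibility characters untouched."""
    out = []
    for ch in text:
        decomp = unicodedata.decomposition(ch)  # e.g. "<wide> 0041" or ""
        if decomp.startswith(("<wide>", "<narrow>")):
            # Fields after the tag are the hex code points of the mapping.
            out.append("".join(chr(int(cp, 16)) for cp in decomp.split()[1:]))
        else:
            out.append(ch)
    return "".join(out)

print(width_map("\uff21\uff22\uff23"))            # ABC
print(width_map("x\u00b2") == "x\u00b2")          # True: superscript two is kept
print(unicodedata.normalize("NFKC", "x\u00b2"))   # x2: NFKC folds it
```

Whether to use this narrower mapping or plain NFKC is a design choice: NFKC is simpler and more forgiving, while the narrower mapping keeps more of what the user typed.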

Normalization in Authentication

Which Normalization Form?

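| Context | Form | Reason |
|---|---|---|
| Passwords | NFC | Preserves user intent, consistent representation |
| Usernames | NFKC + casefold | Maximum compatibility, case-insensitive matching |
| Display names | NFC | Preserves visual appearance |
| Domain names | NFC | Required for IDNA2008 labels (RFC 5891); UTS #46 applies compatibility mappings before lookup |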

The Critical Rule

Normalize BEFORE hashing. A hash cannot be normalized after the fact: once two different byte representations have been hashed, their digests differ and nothing downstream can reconcile them.

import hashlib
import unicodedata

def hash_password(password, salt):
    # ALWAYS normalize first
    normalized = unicodedata.normalize("NFC", password)
    # Then hash
    return hashlib.pbkdf2_hmac(
        "sha256",
        normalized.encode("utf-8"),
        salt,
        iterations=600_000,
    )
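To see the rule in action, the sketch below derives a key from the NFC and NFD forms of the same password, with and without normalization. The helper `derive_key` and the low iteration count are assumptions made so the example runs quickly; a real system would keep a production-strength work factor.

```python
import hashlib
import os
import unicodedata

DEMO_ITERATIONS = 1_000  # deliberately low for a fast demo, not for production

def derive_key(password: str, salt: bytes, normalize: bool) -> bytes:
    """Derive a PBKDF2 key, optionally normalizing the password to NFC first."""
    if normalize:
        password = unicodedata.normalize("NFC", password)
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, DEMO_ITERATIONS
    )

salt = os.urandom(16)
nfc_input = "caf\u00e9"    # precomposed
nfd_input = "cafe\u0301"   # decomposed

raw_match = derive_key(nfc_input, salt, False) == derive_key(nfd_input, salt, False)
normalized_match = derive_key(nfc_input, salt, True) == derive_key(nfd_input, salt, True)
print(raw_match, normalized_match)  # False True
```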

What Can Go Wrong

| Scenario | Problem | Result |
|---|---|---|
| No normalization | NFC and NFD forms hash differently | Login failure |
| Normalize only at registration | Form changes between registration and login | Login failure |
| NFKC for passwords | Destroys distinctions user intended | Security reduction |
| casefold() for passwords | "Pa$$WORD" becomes "pa$$word" | Security reduction |
| Normalize after hashing | Hash of un-normalized bytes | No benefit |
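The two "security reduction" rows can be made concrete. This is a small illustration with sample strings chosen for the purpose: NFC leaves them alone, while NFKC and `casefold()` collapse characters a user may have picked deliberately.

```python
import unicodedata

samples = {
    "superscript": "x\u00b2",    # x²
    "ligature": "\ufb01sh",      # ﬁsh
    "circled digit": "\u2460",   # ①
}

nfc_forms = {k: unicodedata.normalize("NFC", v) for k, v in samples.items()}
nfkc_forms = {k: unicodedata.normalize("NFKC", v) for k, v in samples.items()}

for label in samples:
    print(f"{label}: NFC keeps {nfc_forms[label]!r}, NFKC gives {nfkc_forms[label]!r}")

# Case folding removes the upper/lower distinction from the password space.
folded = "Pa$$WORD".casefold()
print(folded)  # pa$$word
```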

Practical Implementation

Complete Authentication Flow

import hashlib
import hmac
import os
import unicodedata

class UnicodeAuthenticator:
    @staticmethod
    def prepare_username(username):
        # 1. NFKC normalization (width mapping + compatibility)
        result = unicodedata.normalize("NFKC", username)
        # 2. Case fold
        result = result.casefold()
        # 3. NFC again, because case folding can produce non-NFC output
        result = unicodedata.normalize("NFC", result)
        # 4. Strip leading/trailing whitespace
        result = result.strip()
        # 5. Reject empty
        if not result:
            raise ValueError("Username cannot be empty")
        # 6. Reject control (Cc), format (Cf), and surrogate (Cs) characters
        for ch in result:
            if unicodedata.category(ch) in ("Cc", "Cf", "Cs"):
                raise ValueError(f"Invalid character: U+{ord(ch):04X}")
        return result

    @staticmethod
    def prepare_password(password):
        # 1. NFC normalization only (preserve case, preserve intent)
        result = unicodedata.normalize("NFC", password)
        # 2. Reject empty
        if not result:
            raise ValueError("Password cannot be empty")
        # 3. Reject control characters except tab
        for ch in result:
            if unicodedata.category(ch) == "Cc" and ch != "\t":
                raise ValueError(f"Invalid control character: U+{ord(ch):04X}")
        return result

    @staticmethod
    def hash_password(password, salt=None):
        # Generate a fresh random salt unless the caller supplies one
        if salt is None:
            salt = os.urandom(32)
        prepared = UnicodeAuthenticator.prepare_password(password)
        hashed = hashlib.pbkdf2_hmac(
            "sha256",
            prepared.encode("utf-8"),
            salt,
            iterations=600_000,
        )
        return salt, hashed

    @staticmethod
    def verify_password(password, salt, expected_hash):
        # Prepare the candidate exactly as at registration, then re-derive
        prepared = UnicodeAuthenticator.prepare_password(password)
        computed = hashlib.pbkdf2_hmac(
            "sha256",
            prepared.encode("utf-8"),
            salt,
            iterations=600_000,
        )
        # Constant-time comparison avoids leaking match position via timing
        return hmac.compare_digest(computed, expected_hash)

Database Considerations

| Setting | Recommendation |
|---|---|
| Column encoding | UTF-8 (utf8mb4 in MySQL) |
| Collation | Binary or C collation for password hashes; language-aware for usernames, provided it does not equate names that normalization keeps distinct |
| Max length | Measure in bytes, not characters (multi-byte chars expand length) |
| Unique constraints | Apply on normalized form, not raw input |
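Two of these recommendations lend themselves to a quick demonstration. The sketch below uses an in-memory SQLite database from the standard library; the table and column names (`users`, `canonical`, `display`) and the helpers `canonical_username` and `register` are hypothetical, chosen only for illustration.

```python
import sqlite3
import unicodedata

def canonical_username(raw: str) -> str:
    """NFKC + casefold + NFC, the username form used in this guide."""
    folded = unicodedata.normalize("NFKC", raw).casefold()
    return unicodedata.normalize("NFC", folded)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (canonical TEXT PRIMARY KEY, display TEXT NOT NULL)"
)

def register(raw: str) -> bool:
    """Insert a user; return False if the canonical form is already taken."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO users (canonical, display) VALUES (?, ?)",
                (canonical_username(raw), raw),
            )
        return True
    except sqlite3.IntegrityError:
        return False

first = register("Jos\u00e9")      # precomposed, mixed case
second = register("JOSE\u0301")    # decomposed, upper case: same canonical form
print(first, second)               # True False

# Length limits: code point count and UTF-8 byte count diverge quickly.
sample = "p\u00e4ssw\u00f6rd\U0001f512"
char_count = len(sample)
byte_count = len(sample.encode("utf-8"))
print(char_count, byte_count)      # 9 14
```

The uniqueness constraint sits on the canonical column, so the raw input can still be stored separately for display.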

Testing Unicode Authentication

# Test cases every Unicode-aware auth system should pass
test_cases = [
    # NFC vs NFD equivalence
    ("caf\u00E9", "cafe\u0301", True),  # café (NFC) == café (NFD)

    # Case sensitivity in passwords
    ("Password", "password", False),  # should NOT match for passwords

    # Width variants
    ("\uFF21\uFF22\uFF23", "ABC", True),  # fullwidth ABC == ABC (for usernames)

    # Invisible characters
    ("admin\u200B", "admin", False),  # trailing ZWSP (should be stripped or rejected)

    # Turkish I problem
    ("Istanbul", "istanbul", True),  # casefold() maps I -> i
    # Note: casefold() is locale-independent, so it never applies the
    # Turkish mapping I -> ı (U+0131)
]
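One way to run cases like these is to tag each with the credential type it applies to, since passwords and usernames are canonicalized differently. The runner below is a self-contained sketch under that assumption; `password_key`, `username_key`, and the four-field case layout are illustrative choices, not part of the table above.

```python
import unicodedata

def password_key(text: str) -> str:
    """Password canonical form: NFC only."""
    return unicodedata.normalize("NFC", text)

def username_key(text: str) -> str:
    """Username canonical form: NFKC, casefold, then NFC."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFC", folded)

KEY_FUNCS = {"password": password_key, "username": username_key}

# (kind, left, right, expected_equal)
cases = [
    ("password", "caf\u00e9", "cafe\u0301", True),
    ("password", "Password", "password", False),
    ("username", "\uff21\uff22\uff23", "ABC", True),
    ("username", "Istanbul", "istanbul", True),
    # ZWSP survives NFKC and casefold(), so the names stay distinct here;
    # a stricter system would reject the first one outright.
    ("username", "admin\u200b", "admin", False),
]

failures = [
    case for case in cases
    if (KEY_FUNCS[case[0]](case[1]) == KEY_FUNCS[case[0]](case[2])) != case[3]
]
print(f"{len(cases) - len(failures)} of {len(cases)} cases passed")
```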

Key Takeaways

  • Unicode creates authentication pitfalls through multiple representations of the same visual text (NFC vs. NFD), invisible characters in credentials, case folding variations across locales, and confusable characters in usernames.
  • The PRECIS framework (RFC 8264/8265) defines standard profiles for processing passwords (OpaqueString: NFC, no case mapping, no width mapping) and usernames (UsernameCaseMapped: width mapping, lowercasing, NFC; approximated in this guide with NFKC + casefold).
  • Normalize BEFORE hashing. This is the single most critical rule: different normalization forms produce different byte sequences and therefore different hashes.
  • Use NFC for passwords (preserves user intent) and NFKC + casefold for case-insensitive usernames (maximum compatibility).
  • Databases must use UTF-8 encoding with binary collation for password hashes and apply unique constraints on normalized usernames, not raw input.
  • Every authentication system should include test cases for NFC/NFD equivalence, width variants, invisible character handling, and locale-specific case folding.
