Unicode in Passwords and Authentication

Passwords and usernames are the gatekeepers of digital identity, and Unicode adds layers of complexity that most authentication systems handle poorly. Can a user whose password contains an accented character log in from a device that composes that character differently? Should "John" (Latin) and "Јohn" (Cyrillic J + Latin) be treated as different usernames? What happens when a user's name contains characters that the system's database, hashing algorithm, or comparison logic mishandles? This guide covers the intersection of Unicode and authentication, focusing on the PRECIS framework (RFC 8264), normalization requirements, and practical implementation guidance.

The Problem

Consider a user who sets their password to "café". The accented "é" can be represented in two ways in Unicode:

Representation	Code Points	Bytes (UTF-8)
Precomposed	U+0063 U+0061 U+0066 U+00E9	63 61 66 C3 A9
Decomposed	U+0063 U+0061 U+0066 U+0065 U+0301	63 61 66 65 CC 81

Both display as "café" but produce different byte sequences. If the system stores the password hash from one form and the user's device submits the other, authentication fails even though the user typed the correct password.
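
The two encodings in the table can be reproduced directly in Python. This is a minimal sketch using only the standard unicodedata module; the variable names are illustrative.

```python
import unicodedata

precomposed = "caf\u00e9"   # ends with U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # ends with "e" + U+0301 COMBINING ACUTE ACCENT

# The strings render identically but are different code point sequences.
print(precomposed == decomposed)               # False
print(precomposed.encode("utf-8").hex(" "))    # 63 61 66 c3 a9
print(decomposed.encode("utf-8").hex(" "))     # 63 61 66 65 cc 81

# Normalizing both to the same form makes them compare equal.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)                          # True
```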

This is not a theoretical concern. macOS (HFS+) historically stored filenames in a decomposed form close to NFD, while Windows input methods generally produce NFC (precomposed) text. A password manager syncing between platforms, a clipboard, or a different keyboard layout could silently change the byte representation of a password.

Additional Complications

Issue	Example
Width variants	Fullwidth "Ａ" (U+FF21) vs. "A" (U+0041)
Case folding	Turkish "I" lowercases to "ı" (U+0131), not "i" (U+0069)
Invisible characters	Password with trailing ZWSP (U+200B)
Bidirectional	Password mixing LTR and RTL characters
Confusables	Username "admin" (Latin) vs. "аdmin" (Cyrillic а)
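
Several of these complications can be observed with the standard library alone. The following sketch only illustrates the behaviour of Python's built-in string methods and unicodedata; it is not a detection routine.

```python
import unicodedata

# Width variants: NFKC folds fullwidth "Ａ" to ASCII "A"; NFC leaves it alone.
fullwidth_a = "\uff21"
print(unicodedata.normalize("NFKC", fullwidth_a) == "A")   # True
print(unicodedata.normalize("NFC", fullwidth_a) == "A")    # False

# Case folding: str.lower() is locale-independent, so "I" always becomes
# "i" and never the Turkish dotless "ı" (U+0131).
print("I".lower() == "i")                                  # True

# Invisible characters: ZWSP is a format character (Cf), not whitespace,
# so str.strip() does not remove it.
zwsp = "\u200b"
print(unicodedata.category(zwsp))                          # Cf
print(("admin" + zwsp).strip() == "admin")                 # False

# Confusables: Cyrillic "а" looks like Latin "a" but is a different code point.
print(unicodedata.name("a"))         # LATIN SMALL LETTER A
print(unicodedata.name("\u0430"))    # CYRILLIC SMALL LETTER A
```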

The PRECIS Framework (RFC 8264)

PRECIS (Preparation, Enforcement, and Comparison of Internationalized Strings) is the IETF standard for handling Unicode strings in security-critical contexts. It supersedes the older Stringprep (RFC 3454) and defines profiles for different use cases.

PRECIS String Classes

Class	Purpose	Base Rule
FreeformClass	Passwords, display names	Broad character allowance
IdentifierClass	Usernames and similar identifiers	Restricted to letters, digits, and printable ASCII symbols

Key PRECIS Profiles

Profile	RFC	Use Case	Class
UsernameCaseMapped	RFC 8265	Case-insensitive usernames	IdentifierClass
UsernameCasePreserved	RFC 8265	Case-sensitive usernames	IdentifierClass
OpaqueString	RFC 8265	Passwords	FreeformClass
Nickname	RFC 8266	Display names	FreeformClass

Password Processing (OpaqueString Profile)

The OpaqueString profile for passwords performs these steps:

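Width Mapping: None. Fullwidth and halfwidth characters are NOT mapped to their decomposition mappings
Additional Mapping: Map every non-ASCII space (general category Zs) to U+0020 SPACE
Case Mapping: None. Passwords are case-sensitive; do NOT apply case folding
Normalization: Apply NFC normalization
Prohibited Check: Reject characters outside FreeformClass, such as control characters and surrogates, and reject zero-length passwords
Directionality: None. The Bidi Rule (RFC 5893) is not applied to passwords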

# Simplified OpaqueString (password) processing
import unicodedata

def prepare_password(password):
    # Step 1: Map non-ASCII spaces (category Zs) to U+0020 SPACE.
    # No width mapping and no case mapping are applied to passwords.
    mapped = "".join(
        " " if unicodedata.category(ch) == "Zs" else ch for ch in password
    )

    # Step 2: Apply NFC normalization
    normalized = unicodedata.normalize("NFC", mapped)

    # Step 3: Check for prohibited characters
    for ch in normalized:
        cat = unicodedata.category(ch)
        # Reject control characters (Cc). Strict PRECIS rejects all of them;
        # this simplified version tolerates tab.
        if cat == "Cc" and ch != "\t":
            raise ValueError(f"Prohibited control character: U+{ord(ch):04X}")
        # Reject lone surrogates
        if cat == "Cs":
            raise ValueError("Surrogate code point")

    # Step 4: Reject empty passwords. PRECIS only forbids zero-length strings;
    # rejecting whitespace-only passwords is an extra policy choice.
    if not normalized or normalized.strip() == "":
        raise ValueError("Password cannot be empty or whitespace-only")

    return normalized

Username Processing (UsernameCaseMapped Profile)

import unicodedata

def prepare_username(username):
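    # Step 1: Width mapping (fullwidth -> normal).
    # NFKC is a superset of PRECIS width mapping: it also rewrites other
    # compatibility characters such as ligatures and superscripts.
    mapped = unicodedata.normalize("NFKC", username)

    # Step 2: Case mapping. RFC 8265 specifies Unicode toLowerCase();
    # casefold() is a more aggressive approximation (e.g. "ß" -> "ss").
    mapped = mapped.casefold()

    # Step 3: Normalize to NFC
    mapped = unicodedata.normalize("NFC", mapped)

    # Step 4: Check prohibited characters
    for ch in mapped:
        cat = unicodedata.category(ch)
        # Reject spaces (Zs), controls (Cc), format characters (Cf), and
        # surrogates (Cs). The full IdentifierClass also excludes most
        # non-ASCII symbols and punctuation, which this sketch does not check.
        if cat in ("Zs", "Cc", "Cf", "Cs"):
            raise ValueError(f"Prohibited character category: {cat}")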

    # Step 5: Check for empty result
    if not mapped:
        raise ValueError("Username cannot be empty")

    return mapped
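
PRECIS width mapping is narrower than NFKC: it replaces only fullwidth and halfwidth characters with their decomposition mappings, whereas NFKC also rewrites ligatures, superscripts, and other compatibility characters. The sketch below performs width-only mapping, assuming that the `<wide>` and `<narrow>` tags returned by unicodedata.decomposition identify exactly those characters. The helper name width_map is hypothetical.

```python
import unicodedata

def width_map(text):
    """Map fullwidth/halfwidth characters to their decomposition mappings."""
    out = []
    for ch in text:
        decomp = unicodedata.decomposition(ch)  # e.g. "<wide> 0041" for U+FF21
        if decomp.startswith(("<wide>", "<narrow>")):
            # Drop the tag and rebuild the mapped character(s) from hex code points.
            code_points = decomp.split()[1:]
            out.append("".join(chr(int(cp, 16)) for cp in code_points))
        else:
            out.append(ch)
    return "".join(out)

print(width_map("\uff21\uff22\uff23"))            # ABC
print(width_map("\ufb01") == "\ufb01")            # True: the "ﬁ" ligature is left alone
print(unicodedata.normalize("NFKC", "\ufb01"))    # fi
```

Whether to use width-only mapping or full NFKC is a design choice: NFKC is simpler to call but merges more strings into the same username.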

Normalization in Authentication

Which Normalization Form?

Context	Form	Reason
Passwords	NFC	Preserves user intent, consistent representation
Usernames	NFKC + casefold	Maximum compatibility, case-insensitive matching
Display names	NFC	Preserves visual appearance
Domain names	NFC	Required for U-labels by IDNA2008 (the older IDNA2003 Nameprep used NFKC)

The Critical Rule

Normalize BEFORE hashing. A hash cannot be repaired after the fact: once two different byte representations have been hashed, the hashes differ and no later normalization can reconcile them.

import hashlib
import unicodedata

def hash_password(password, salt):
    # ALWAYS normalize first
    normalized = unicodedata.normalize("NFC", password)
    # Then hash
    return hashlib.pbkdf2_hmac(
        "sha256",
        normalized.encode("utf-8"),
        salt,
        iterations=600_000,
    )

What Can Go Wrong

Scenario	Problem	Result
No normalization	NFC and NFD forms hash differently	Login failure
Normalize only at registration	Form changes between reg and login	Login failure
NFKC for passwords	Destroys distinctions user intended	Security reduction
casefold() for passwords	"Pa$$WORD" becomes "pa$$word"	Security reduction
Normalize after hashing	Hash of un-normalized bytes	No benefit
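
The first, third, and fourth rows of this table can be reproduced in a few lines. This is a demonstration sketch only: the helper demo_hash, the fixed salt, and the low iteration count are illustrative assumptions and are not suitable for production use.

```python
import hashlib
import unicodedata

def demo_hash(password, form=None):
    """Hash a password, optionally normalizing it to `form` first."""
    if form is not None:
        password = unicodedata.normalize(form, password)
    # Fixed salt and low iteration count: for demonstration only.
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), b"demo-salt", 1_000)

nfc_input = "caf\u00e9"
nfd_input = "cafe\u0301"

# No normalization: the same visible password yields two different hashes.
print(demo_hash(nfc_input) == demo_hash(nfd_input))                # False
# NFC at both registration and login: the hashes agree.
print(demo_hash(nfc_input, "NFC") == demo_hash(nfd_input, "NFC"))  # True
# NFKC for passwords: "x²" and "x2" collapse into the same password.
print(demo_hash("x\u00b2", "NFKC") == demo_hash("x2", "NFKC"))     # True
print(demo_hash("x\u00b2", "NFC") == demo_hash("x2", "NFC"))       # False
```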

Practical Implementation

Complete Authentication Flow

import hashlib
import hmac
import os
import unicodedata

class UnicodeAuthenticator:
    @staticmethod
    def prepare_username(username):
        # 1. NFKC normalization (width mapping + compatibility)
        result = unicodedata.normalize("NFKC", username)
        # 2. Case fold
        result = result.casefold()
        # 3. NFC again, because case folding can produce a non-NFC string
        result = unicodedata.normalize("NFC", result)
        # 4. Strip leading/trailing whitespace
        result = result.strip()
        # 5. Reject empty
        if not result:
            raise ValueError("Username cannot be empty")
        # 6. Reject control (Cc), format (Cf), and surrogate (Cs) characters
        for ch in result:
            if unicodedata.category(ch) in ("Cc", "Cf", "Cs"):
                raise ValueError(f"Invalid character: U+{ord(ch):04X}")
        return result

    @staticmethod
    def prepare_password(password):
        # 1. NFC normalization only (preserve case, preserve intent)
        result = unicodedata.normalize("NFC", password)
        # 2. Reject empty
        if not result:
            raise ValueError("Password cannot be empty")
        # 3. Reject control characters except tab
        for ch in result:
            if unicodedata.category(ch) == "Cc" and ch != "\t":
                raise ValueError(f"Invalid control character: U+{ord(ch):04X}")
        return result

    @staticmethod
    def hash_password(password, salt=None):
        if salt is None:
            salt = os.urandom(32)
        prepared = UnicodeAuthenticator.prepare_password(password)
        hashed = hashlib.pbkdf2_hmac(
            "sha256",
            prepared.encode("utf-8"),
            salt,
            iterations=600_000,
        )
        return salt, hashed

    @staticmethod
    def verify_password(password, salt, expected_hash):
        # Prepare the login attempt exactly as at registration
        prepared = UnicodeAuthenticator.prepare_password(password)
        computed = hashlib.pbkdf2_hmac(
            "sha256",
            prepared.encode("utf-8"),
            salt,
            iterations=600_000,
        )
        # Constant-time comparison avoids leaking how many leading bytes match
        return hmac.compare_digest(computed, expected_hash)

Database Considerations

Setting	Recommendation
Column encoding	UTF-8 (utf8mb4 in MySQL)
Collation	Binary or C collation for password hashes; for usernames, binary on the normalized column, with language-aware collation reserved for sorting and display
Max length	Check whether limits count bytes or characters; in UTF-8 a non-ASCII character takes 2 to 4 bytes
Unique constraints	Apply on normalized form, not raw input
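
The last row can be sketched with an in-memory SQLite database. The table and column names (users, username_raw, username_norm) and the helper normalize_username are hypothetical, and the normalization is the simplified NFKC + casefold mapping used elsewhere in this guide.

```python
import sqlite3
import unicodedata

def normalize_username(raw):
    # Simplified username mapping: NFKC, casefold, then NFC.
    folded = unicodedata.normalize("NFKC", raw).casefold()
    return unicodedata.normalize("NFC", folded)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users ("
    " username_raw TEXT NOT NULL,"          # what the user typed, kept for display
    " username_norm TEXT NOT NULL UNIQUE"   # what uniqueness is enforced on
    ")"
)

def register(raw):
    conn.execute(
        "INSERT INTO users (username_raw, username_norm) VALUES (?, ?)",
        (raw, normalize_username(raw)),
    )

register("Ren\u00e9e")           # precomposed é
try:
    register("RENE\u0301E")      # uppercase with decomposed É: same normalized name
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)  # True
```

Keeping the raw input in a separate column preserves the spelling the user chose, while lookups and the unique constraint use only the normalized column.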

Testing Unicode Authentication

# Test cases every Unicode-aware auth system should pass.
# Each entry: (credential kind, input A, input B, should A and B match?)
test_cases = [
    # NFC vs NFD equivalence
    ("password", "caf\u00E9", "cafe\u0301", True),  # café (NFC) == café (NFD)

    # Case sensitivity in passwords
    ("password", "Password", "password", False),  # must NOT match for passwords

    # Width variants
    ("username", "\uFF21\uFF22\uFF23", "ABC", True),  # fullwidth ABC == ABC

    # Invisible characters: a trailing ZWSP must never silently match
    # (reject it during preparation, or treat it as a different string)
    ("username", "admin\u200B", "admin", False),

    # Turkish I problem
    ("username", "Istanbul", "istanbul", True),
    # Note: casefold() maps I -> i regardless of locale (not Turkish-aware)
]
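
One way to make such a table executable is to tag each case with the credential type and compare the prepared forms. The sketch below assumes NFC-only preparation for passwords and NFKC + casefold for usernames; prep_password, prep_username, and run_cases are simplified, hypothetical stand-ins for a full implementation.

```python
import unicodedata

def prep_password(s):
    # Passwords: NFC only, case preserved.
    return unicodedata.normalize("NFC", s)

def prep_username(s):
    # Usernames: NFKC, casefold, then NFC.
    folded = unicodedata.normalize("NFKC", s).casefold()
    return unicodedata.normalize("NFC", folded)

# (credential kind, input A, input B, should the prepared forms match?)
cases = [
    ("password", "caf\u00e9", "cafe\u0301", True),
    ("password", "Password", "password", False),
    ("username", "\uff21\uff22\uff23", "ABC", True),
    ("username", "Istanbul", "istanbul", True),
]

prepare = {"password": prep_password, "username": prep_username}

def run_cases(cases):
    """Return the cases whose actual match result differs from the expectation."""
    failures = []
    for kind, a, b, expected in cases:
        matched = prepare[kind](a) == prepare[kind](b)
        if matched != expected:
            failures.append((kind, a, b))
    return failures

print(run_cases(cases))  # []
```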

Key Takeaways

Unicode creates authentication pitfalls through multiple representations of the same visual text (NFC vs. NFD), invisible characters in credentials, case folding variations across locales, and confusable characters in usernames.
The PRECIS framework (RFC 8264/8265) defines standard profiles for processing passwords (OpaqueString: NFC, no case mapping) and usernames (UsernameCaseMapped: width mapping, lowercasing, and NFC, which this guide approximates with NFKC + casefold).
Normalize BEFORE hashing. This is the single most critical rule: different normalization forms produce different byte sequences and therefore different hashes.
Use NFC for passwords (preserves user intent) and NFKC + casefold for case-insensitive usernames (maximum compatibility).
Databases must use UTF-8 encoding with binary collation for password hashes and apply unique constraints on normalized usernames, not raw input.
Every authentication system should include test cases for NFC/NFD equivalence, width variants, invisible character handling, and locale-specific case folding.