Unicode in Passwords and Authentication
Unicode passwords introduce normalization ambiguity that can cause authentication failures or allow password bypasses when different normalization forms produce different byte sequences for the same visible password. This guide covers the security implications of Unicode in authentication systems and best practices for normalizing and hashing Unicode passwords.
Passwords and usernames are the gatekeepers of digital identity, and Unicode adds layers of complexity that most authentication systems handle poorly. Can a user whose password contains an accented character log in from a device that composes that character differently? Should "John" (Latin) and "Јohn" (Cyrillic J + Latin) be treated as different usernames? What happens when a user's name contains characters that the system's database, hashing algorithm, or comparison logic mishandles? This guide covers the intersection of Unicode and authentication, focusing on the PRECIS framework (RFC 8264), normalization requirements, and practical implementation guidance.
The Problem
Consider a user who sets their password to "cafe" with an accented e. There are two ways to represent this in Unicode:
| Representation | Code Points | Bytes (UTF-8) |
|---|---|---|
| Precomposed | U+0063 U+0061 U+0066 U+00E9 | 63 61 66 C3 A9 |
| Decomposed | U+0063 U+0061 U+0066 U+0065 U+0301 | 63 61 66 65 CC 81 |
Both display as "cafe" but produce different byte sequences. If the system stores the password hash from one form and the user's device submits the other, authentication fails even though the user typed the correct password.
This is not a theoretical concern. macOS (HFS+) historically stored filenames in NFD (decomposed) form, while Windows uses NFC (precomposed). A password manager syncing between platforms could silently change the byte representation of a password.
Additional Complications
| Issue | Example |
|---|---|
| Width variants | Fullwidth "A" (U+FF21) vs. "A" (U+0041) |
| Case folding | Turkish "I" lowercases to "ı" (U+0131), not "i" (U+0069) |
| Invisible characters | Password with trailing ZWSP (U+200B) |
| Bidirectional | Password mixing LTR and RTL characters |
| Confusables | Username "admin" (Latin) vs. "аdmin" (Cyrillic а) |
The PRECIS Framework (RFC 8264)
PRECIS (Preparation, Enforcement, and Comparison of Internationalized Strings) is the IETF standard for handling Unicode strings in security-critical contexts. It supersedes the older Stringprep (RFC 3454) and defines profiles for different use cases.
PRECIS String Classes
| Class | Purpose | Base Rule |
|---|---|---|
| FreeformClass | Passwords, display names | Broad character allowance |
| IdentifierClass | Usernames, domain labels | Restricted to "letter-like" characters |
Key PRECIS Profiles
| Profile | RFC | Use Case | Class |
|---|---|---|---|
| UsernameCaseMapped | RFC 8265 | Case-insensitive usernames | IdentifierClass |
| UsernameCasePreserved | RFC 8265 | Case-sensitive usernames | IdentifierClass |
| OpaqueString | RFC 8265 | Passwords | FreeformClass |
| Nickname | RFC 8266 | Display names | FreeformClass |
Password Processing (OpaqueString Profile)
The OpaqueString profile for passwords performs these steps:
- Width Mapping: Map fullwidth/halfwidth variants to their normal forms
- No Case Mapping: Passwords are case-sensitive; do NOT apply case folding
- Normalization: Apply NFC normalization
- Prohibited Check: Reject old-style C0/C1 control characters, surrogates
- BiDi Rule: Check Bidi rule compliance (RFC 5893)
# Simplified OpaqueString (password) processing
import unicodedata
def prepare_password(password):
# Step 1: Check for prohibited characters
for ch in password:
cat = unicodedata.category(ch)
# Reject most control characters (Cc) except some
if cat == "Cc" and ch not in ("\t",):
raise ValueError(f"Prohibited control character: U+{ord(ch):04X}")
# Reject surrogates
if cat == "Cs":
raise ValueError("Surrogate code point")
# Step 2: Apply NFC normalization
normalized = unicodedata.normalize("NFC", password)
# Step 3: Width mapping (NFKC would do this, but passwords use NFC)
# For passwords, NFC is preferred to preserve user intent
# Step 4: Check for empty result
if not normalized or normalized.strip() == "":
raise ValueError("Password cannot be empty or whitespace-only")
return normalized
Username Processing (UsernameCaseMapped Profile)
import unicodedata
def prepare_username(username):
# Step 1: Width mapping (fullwidth -> normal)
mapped = unicodedata.normalize("NFKC", username)
# Step 2: Case mapping (lowercase)
mapped = mapped.casefold()
# Step 3: Normalize to NFC
mapped = unicodedata.normalize("NFC", mapped)
# Step 4: Check prohibited characters
for ch in mapped:
cat = unicodedata.category(ch)
# Reject spaces (Zs), controls (Cc/Cf), symbols (Sk/So), etc.
if cat in ("Zs", "Cc", "Cs"):
raise ValueError(f"Prohibited character category: {cat}")
# Step 5: Check for empty result
if not mapped:
raise ValueError("Username cannot be empty")
return mapped
Normalization in Authentication
Which Normalization Form?
| Context | Form | Reason |
|---|---|---|
| Passwords | NFC | Preserves user intent, consistent representation |
| Usernames | NFKC + casefold | Maximum compatibility, case-insensitive matching |
| Display names | NFC | Preserves visual appearance |
| Domain names | NFKC | Required by IDNA2008 |
The Critical Rule
Normalize BEFORE hashing. If you normalize after hashing, it is too late — different byte representations will produce different hashes regardless.
import hashlib
import unicodedata
def hash_password(password, salt):
# ALWAYS normalize first
normalized = unicodedata.normalize("NFC", password)
# Then hash
return hashlib.pbkdf2_hmac(
"sha256",
normalized.encode("utf-8"),
salt,
iterations=600_000,
)
What Can Go Wrong
| Scenario | Problem | Result |
|---|---|---|
| No normalization | NFC and NFD forms hash differently | Login failure |
| Normalize only at registration | Form changes between reg and login | Login failure |
| NFKC for passwords | Destroys distinctions user intended | Security reduction |
| casefold() for passwords | "Pa$$WORD" becomes "pa$$word" | Security reduction |
| Normalize after hashing | Hash of un-normalized bytes | No benefit |
Practical Implementation
Complete Authentication Flow
import unicodedata
import hashlib
import os
class UnicodeAuthenticator:
@staticmethod
def prepare_username(username):
# 1. NFKC normalization (width mapping + compatibility)
result = unicodedata.normalize("NFKC", username)
# 2. Case fold
result = result.casefold()
# 3. NFC (after casefold may produce non-NFC)
result = unicodedata.normalize("NFC", result)
# 4. Strip leading/trailing whitespace
result = result.strip()
# 5. Reject empty
if not result:
raise ValueError("Username cannot be empty")
# 6. Reject control characters
for ch in result:
if unicodedata.category(ch) in ("Cc", "Cf", "Cs"):
raise ValueError(f"Invalid character: U+{ord(ch):04X}")
return result
@staticmethod
def prepare_password(password):
# 1. NFC normalization only (preserve case, preserve intent)
result = unicodedata.normalize("NFC", password)
# 2. Reject empty
if not result:
raise ValueError("Password cannot be empty")
# 3. Reject control characters except tab
for ch in result:
if unicodedata.category(ch) == "Cc" and ch != "\t":
raise ValueError(f"Invalid control character")
return result
@staticmethod
def hash_password(password, salt=None):
if salt is None:
salt = os.urandom(32)
prepared = UnicodeAuthenticator.prepare_password(password)
hashed = hashlib.pbkdf2_hmac(
"sha256",
prepared.encode("utf-8"),
salt,
iterations=600_000,
)
return salt, hashed
@staticmethod
def verify_password(password, salt, expected_hash):
prepared = UnicodeAuthenticator.prepare_password(password)
computed = hashlib.pbkdf2_hmac(
"sha256",
prepared.encode("utf-8"),
salt,
iterations=600_000,
)
return computed == expected_hash
Database Considerations
| Setting | Recommendation |
|---|---|
| Column encoding | UTF-8 (utf8mb4 in MySQL) |
| Collation | Binary or C collation for passwords; language-aware for usernames |
| Max length | Measure in bytes, not characters (multi-byte chars expand length) |
| Unique constraints | Apply on normalized form, not raw input |
Testing Unicode Authentication
# Test cases every Unicode-aware auth system should pass
test_cases = [
# NFC vs NFD equivalence
("caf\u00E9", "cafe\u0301", True), # cafe == cafe
# Case sensitivity in passwords
("Password", "password", False), # should NOT match for passwords
# Width variants
("\uFF21\uFF22\uFF23", "ABC", True), # fullwidth ABC == ABC (for usernames)
# Invisible characters
("admin\u200B", "admin", False), # trailing ZWSP (should be stripped or rejected)
# Turkish I problem
("Istanbul", "istanbul", True), # casefold handles this
# Note: casefold() maps I -> i (not Turkish-aware by default)
]
Key Takeaways
- Unicode creates authentication pitfalls through multiple representations of the same visual text (NFC vs. NFD), invisible characters in credentials, case folding variations across locales, and confusable characters in usernames.
- The PRECIS framework (RFC 8264/8265) defines standard profiles for processing passwords (OpaqueString: NFC, no case mapping) and usernames (UsernameCaseMapped: NFKC + casefold).
- Normalize BEFORE hashing — this is the single most critical rule. Different normalization forms produce different byte sequences and therefore different hashes.
- Use NFC for passwords (preserves user intent) and NFKC + casefold for case-insensitive usernames (maximum compatibility).
- Databases must use UTF-8 encoding with binary collation for password hashes and apply unique constraints on normalized usernames, not raw input.
- Every authentication system should include test cases for NFC/NFD equivalence, width variants, invisible character handling, and locale-specific case folding.
Mais em Unicode Security
Unicode's vast character set introduces a range of security vulnerabilities including homograph …
IDN homograph attacks use look-alike Unicode characters to register domain names that …
Zero-width and other invisible Unicode characters can be used to fingerprint text …
Phishing attacks increasingly exploit Unicode confusables, bidirectional overrides, and invisible characters to …