Unicode in Passwords: Security Implications

Allowing Unicode characters in passwords increases the keyspace and can improve security, but it also introduces normalization ambiguity, where the same visible password maps to different byte sequences. This guide explores the security and usability implications of Unicode passwords, covering normalization, SASLprep, and how major platforms handle Unicode credentials.

Passwords are the most common authentication mechanism on the internet, yet the question of whether they should (or can) contain Unicode characters is surprisingly complex. Using Unicode in passwords can dramatically increase the key space for brute-force resistance, but it also introduces normalization, encoding, and interoperability challenges that can lock users out of their own accounts. This guide explores the technical standards for Unicode passwords (PRECIS framework, RFC 8265), the security implications, and practical guidance for both users and developers.

Why Unicode Passwords Matter

The mathematical argument for Unicode passwords is compelling:

| Character Set | Pool Size | Entropy per Character |
|---|---|---|
| Digits only (0-9) | 10 | 3.32 bits |
| Lowercase ASCII (a-z) | 26 | 4.70 bits |
| ASCII printable (32-126) | 95 | 6.57 bits |
| ASCII + Latin-1 Supplement | 191 | 7.58 bits |
| Unicode BMP (U+0000-U+FFFF) | ~65,000 | 15.99 bits |
| All Unicode (v16.0) | 154,998 | 17.24 bits |

A password drawn from the full Unicode range carries far more entropy per character than an ASCII-only password, provided each character is chosen at random. Four random characters from the full BMP give about 64 bits (4 × 16), which is more than the roughly 52.6 bits (8 × 6.57) of an 8-character printable-ASCII password.
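The per-character figures in the table are simply the base-2 logarithm of the pool size, and total entropy scales linearly with length. A minimal sketch, assuming every character is drawn uniformly at random from the pool (the function names are illustrative, not from any library):

```python
import math


def entropy_per_char(pool_size: int) -> float:
    """Bits of entropy contributed by one uniformly random character."""
    return math.log2(pool_size)


def total_entropy(pool_size: int, length: int) -> float:
    """Upper bound on the entropy of a random password of the given length."""
    return length * entropy_per_char(pool_size)


if __name__ == "__main__":
    print(f"8 printable-ASCII chars: {total_entropy(95, 8):.1f} bits")
    print(f"4 BMP chars:             {total_entropy(65536, 4):.1f} bits")
```

Human-chosen passwords fall well below these upper bounds, because people do not pick characters uniformly at random.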

Real-world example

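| Password | Characters | Entropy (approx.) |
|---|---|---|
| password | 8 ASCII lowercase | 37.6 bits |
| P@ssw0rd! | 9 ASCII printable | 59.1 bits |
| (4 CJK chars) | 4 from CJK Unified (~21K) | 57.4 bits |
| (3 mixed-script) | 3 from full BMP | 48.0 bits |

These figures are upper bounds that assume every character is chosen uniformly at random. A dictionary word such as "password" has far less real entropy than its length suggests.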

PRECIS Framework (RFC 8264)

The PRECIS (Preparation, Enforcement, and Comparison of Internationalized Strings) framework, defined in RFC 8264, provides rules for handling Unicode strings in internet protocols. It replaces the older Stringprep framework (RFC 3454), and the username and password profiles built on it in RFC 8265 replace SASLprep (RFC 4013).

PRECIS defines two base string classes:

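| Class | Allowed characters | Purpose |
|---|---|---|
| IdentifierClass | Restricted: letters, digits, and a limited set of symbols | Usernames, identifiers |
| FreeformClass | Broad: letters, digits, symbols, punctuation, and spaces | Passwords, display names |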

OpaqueString Profile (RFC 8265)

RFC 8265 defines the OpaqueString profile for passwords. "Opaque" means the password is treated as an opaque blob — the system should not interpret its content (no case mapping, minimal normalization).

The OpaqueString profile specifies:

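| Rule | Action |
|---|---|
| Width mapping | None: fullwidth and halfwidth characters are not mapped to their decompositions |
| Additional mapping | Non-ASCII spaces (general category Zs) are mapped to U+0020 |
| Case mapping | None: case is preserved |
| Unicode normalization | Apply NFC normalization |
| Directionality | None: the Bidi Rule is not applied to passwords |
| Prohibited characters | Reject anything outside FreeformClass, such as control characters and old Hangul jamo; empty passwords are also rejected |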

What OpaqueString normalizes

Input:                  Output after OpaqueString:
Fullwidth A -> Fullwidth A (no width mapping)
e + combining accent -> precomposed e with accent (NFC)
Tab character -> REJECTED (control character)
Empty string -> REJECTED (zero-length password)
"pass word" -> "pass word" (internal space preserved; a non-ASCII space would be mapped to U+0020)

What OpaqueString does NOT do

  • No case folding: "Password" and "password" remain different
  • No width or symbol mapping: fullwidth and halfwidth characters stay as-is
  • No confusable mapping: Cyrillic "a" and Latin "a" remain distinct
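The profile can be approximated in a few lines of Python. This is a minimal sketch, not a conforming PRECIS implementation: the name opaque_string_sketch is illustrative, and the prohibited-character check only rejects control, unassigned, surrogate, and private-use code points instead of performing the full FreeformClass derivation from RFC 8264. It follows RFC 8265 Section 4.2.2 in leaving letter case and fullwidth characters untouched.

```python
import unicodedata

# General categories rejected by this sketch. The real FreeformClass
# check in RFC 8264 is more detailed (for example, old Hangul jamo).
_REJECTED_CATEGORIES = {"Cc", "Cn", "Cs", "Co"}


def opaque_string_sketch(password: str) -> str:
    """Approximate OpaqueString preparation for a password string."""
    if not password:
        raise ValueError("password must not be empty")

    # Additional mapping rule: non-ASCII spaces (category Zs) become U+0020.
    mapped = "".join(
        " " if unicodedata.category(ch) == "Zs" else ch for ch in password
    )

    # Normalization rule: NFC. No case folding and no width mapping.
    normalized = unicodedata.normalize("NFC", mapped)

    for ch in normalized:
        if unicodedata.category(ch) in _REJECTED_CATEGORIES:
            raise ValueError(f"prohibited code point U+{ord(ch):04X}")
    return normalized
```

For production use, a maintained PRECIS library is a safer choice than a hand-rolled check like this one.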

The Normalization Problem

The fundamental challenge with Unicode passwords is that the same visual password can have different binary representations:

| Visual Appearance | Representation | Bytes (UTF-8) |
|---|---|---|
| é | U+00E9 (precomposed) | C3 A9 |
| é | U+0065 + U+0301 (decomposed) | 65 CC 81 |

If a user creates a password with the precomposed form but their keyboard later sends the decomposed form, the passwords will not match at the byte level — even though they look identical.

NFC normalization as the solution

PRECIS mandates NFC (Canonical Decomposition followed by Canonical Composition) normalization for passwords. NFC converts decomposed sequences to their precomposed equivalents:

| Input | NFC Output | Match? |
|---|---|---|
| U+0065 U+0301 | U+00E9 | Yes (after NFC on both) |
| U+00E9 | U+00E9 | Yes |
| U+304C (precomposed ga) | U+304C | Yes |
| U+304B U+3099 (ka + voiced mark) | U+304C | Yes |

Normalization must happen before hashing, both when the password is first set and on every login attempt. A hash cannot be normalized after the fact, so if either side skips the step, authentication can fail for legitimate users.
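The mismatch and its fix are easy to reproduce with Python's standard unicodedata module. The helper name nfc_bytes is only for illustration:

```python
import unicodedata

precomposed = "\u00e9"   # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def nfc_bytes(text: str) -> bytes:
    """Normalize to NFC, then encode as UTF-8."""
    return unicodedata.normalize("NFC", text).encode("utf-8")


if __name__ == "__main__":
    # The raw encodings differ even though both strings render as "é".
    print(precomposed.encode("utf-8").hex(" "))   # c3 a9
    print(decomposed.encode("utf-8").hex(" "))    # 65 cc 81
    # After NFC, both inputs produce identical bytes.
    print(nfc_bytes(precomposed) == nfc_bytes(decomposed))  # True
```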

Encoding Issues

Passwords are transmitted and stored as byte sequences. The encoding (UTF-8, UTF-16, or others) must be consistent:

| Scenario | Risk |
|---|---|
| Form submitted as UTF-8, server expects Latin-1 | Password silently corrupted |
| Mobile app uses UTF-8, Windows client uses UTF-16 | Different byte sequences |
| Database collation change | Existing hashes may not match |
| Password manager normalizes differently | User locked out |
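The first two rows can be demonstrated directly. The sketch below uses "café" as an arbitrary sample password and shows how one string yields three different byte sequences, and what a server sees if it decodes UTF-8 bytes as Latin-1:

```python
password = "caf\u00e9"  # "café", an arbitrary sample

utf8 = password.encode("utf-8")        # 63 61 66 c3 a9
utf16 = password.encode("utf-16-le")   # 63 00 61 00 66 00 e9 00
latin1 = password.encode("latin-1")    # 63 61 66 e9

# A server that wrongly decodes the UTF-8 bytes as Latin-1 sees mojibake
# ("cafÃ©"), so any hash computed from it will not match.
misread = utf8.decode("latin-1")

if __name__ == "__main__":
    for label, data in [("UTF-8", utf8), ("UTF-16LE", utf16), ("Latin-1", latin1)]:
        print(f"{label:9} {data.hex(' ')}")
    print(repr(misread))
```

Because a hash function sees only bytes, each of these encodings produces an unrelated hash for the same visible password.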

The safe approach

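  1. Decode the submitted password into a Unicode string, rejecting invalid input
  2. Apply PRECIS OpaqueString preparation (NFC normalization, non-ASCII spaces mapped to U+0020, prohibited characters rejected)
  3. Encode the result as UTF-8
  4. Hash the resulting bytes and store only the hash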

All clients and servers must agree on this pipeline. If any step differs, authentication breaks.

Implementation Concerns

Hashing Unicode passwords

Password hashing algorithms (bcrypt, scrypt, Argon2) operate on bytes. The input must be a deterministic byte sequence:

import unicodedata
import bcrypt

def prepare_password(password: str) -> bytes:
    # Step 1: PRECIS-like normalization
    # (Full PRECIS implementation would check for prohibited chars)
    normalized = unicodedata.normalize("NFC", password)

    # Step 2: Encode as UTF-8
    return normalized.encode("utf-8")

def hash_password(password: str) -> bytes:
    prepared = prepare_password(password)
    return bcrypt.hashpw(prepared, bcrypt.gensalt())

def verify_password(password: str, hashed: bytes) -> bool:
    prepared = prepare_password(password)
    return bcrypt.checkpw(prepared, hashed)

bcrypt's 72-byte limit

bcrypt truncates input to 72 bytes. For ASCII passwords, this allows 72 characters. For Unicode passwords, the limit is lower because characters take more bytes:

| Character Type | UTF-8 Bytes | Max Characters (bcrypt) |
|---|---|---|
| ASCII | 1 | 72 |
| Latin Extended | 2 | 36 |
| CJK | 3 | 24 |
| Emoji | 4 | 18 |

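If you use bcrypt, consider pre-hashing with SHA-256 to avoid the truncation issue. Encode the 32-byte digest as base64 or hex before passing it to bcrypt, because a raw digest can contain NUL bytes, which many bcrypt implementations treat as the end of the input. Alternatively, use Argon2, which has no such limit.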
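A minimal sketch of the pre-hashing approach, using only the standard library. The names utf8_length and prehash_for_bcrypt are hypothetical, and base64 is one reasonable encoding choice among several, picked here so the bcrypt input contains no NUL bytes:

```python
import base64
import hashlib
import unicodedata

BCRYPT_MAX_BYTES = 72


def utf8_length(password: str) -> int:
    """Number of bytes bcrypt would receive for this password."""
    return len(unicodedata.normalize("NFC", password).encode("utf-8"))


def prehash_for_bcrypt(password: str) -> bytes:
    """Return a fixed-length, NUL-free input for bcrypt (44 ASCII bytes)."""
    prepared = unicodedata.normalize("NFC", password).encode("utf-8")
    digest = hashlib.sha256(prepared).digest()
    return base64.b64encode(digest)


if __name__ == "__main__":
    long_password = "\u5bc6" * 30  # 30 CJK characters, 3 bytes each
    print(utf8_length(long_password) > BCRYPT_MAX_BYTES)
    print(len(prehash_for_bcrypt(long_password)))
    # The result can be passed to bcrypt.hashpw() and bcrypt.checkpw()
    # in place of the raw UTF-8 bytes.
```

Pre-hashing must be applied at both registration and login, and it cannot be added retroactively to hashes that were stored without it.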

Input method variability

Users may type the same Unicode character differently depending on their input method:

| Input Method | Output for "na" in Japanese |
|---|---|
| Romaji IME (before conversion) | "na" (Latin) |
| Romaji IME (after conversion) | な (hiragana "na") |
| Kana keyboard | な (hiragana "na", entered directly) |
| Copy-paste from web | Could be any form |

This variability means that a password containing Japanese characters might be entered differently on different devices, even by the same user.

Security Implications

Advantages of Unicode passwords

| Advantage | Details |
|---|---|
| Larger key space | Brute-force attacks must search a vastly larger space |
| Dictionary resistance | Common password dictionaries are dominated by ASCII entries |
| Cultural familiarity | Users can use passwords in their native language |
| Memorable phrases | Native-language phrases are easier to remember |

Risks of Unicode passwords

| Risk | Details |
|---|---|
| Normalization inconsistency | Different systems normalize differently |
| Encoding mismatch | UTF-8 vs UTF-16 vs Latin-1 |
| Input method variation | Same character typed differently on different devices |
| Password recovery difficulty | Hard to communicate Unicode passwords verbally |
| Limited support | Some systems reject or silently strip non-ASCII |
| Confusable characters | Cyrillic "а" vs Latin "a" look identical |

Confusable characters in passwords

Unicode contains thousands of characters that look similar or identical:

| Latin | Look-alike | Script |
|---|---|---|
| a (U+0061) | а (U+0430) | Cyrillic |
| o (U+006F) | ο (U+03BF) | Greek |
| p (U+0070) | р (U+0440) | Cyrillic |
| H (U+0048) | Н (U+041D) | Cyrillic |

For passwords, confusables are a user experience risk (users may accidentally type the wrong character) but not a security risk (passwords are compared as byte sequences, so confusables are distinct). The danger is that a user creates a password with one character and cannot reproduce it later because they unknowingly typed the confusable.
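A registration form could warn users when a password mixes scripts. Python's standard library does not expose the Unicode Script property, so the sketch below uses a crude heuristic that treats the first word of each letter's character name as its script. The function name scripts_used is illustrative, and the heuristic will mislabel some characters (fullwidth Latin letters, for example):

```python
import unicodedata


def scripts_used(password: str) -> set[str]:
    """Rough guess at the scripts of the letters in a password."""
    scripts = set()
    for ch in password:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
                scripts.add(name.split()[0])
    return scripts


if __name__ == "__main__":
    latin_only = "password"
    mixed = "p\u0430ssword"  # second letter is Cyrillic U+0430
    print(scripts_used(latin_only))
    print(scripts_used(mixed))
    print(latin_only == mixed)  # the two strings are not equal
```

A warning at registration time is gentler than rejecting the password, since mixing scripts is legitimate for many multilingual users.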

What Major Platforms Do

| Platform | Unicode Password Support | Notes |
|---|---|---|
| Google | Yes | Accepts wide range of Unicode |
| Apple | Yes | NFC normalization applied |
| Microsoft | Yes (with caveats) | Active Directory has legacy restrictions |
| GitHub | Yes | Accepts Unicode |
| AWS | Limited | Some services restrict to ASCII printable |
| Many banks | No | Often restrict to ASCII subset |
| Wi-Fi (WPA) | Yes (with caveats) | WPA2 specifies printable ASCII; WPA3-SAE uses PRECIS OpaqueString |

WPA2/WPA3 and Unicode

Wi-Fi passwords (WPA2-Personal pre-shared keys) are processed through a key-derivation algorithm that operates on bytes (PBKDF2-HMAC-SHA1 with the SSID as salt). The standard specifies:

  • WPA2: Password is 8-63 printable ASCII characters (IEEE 802.11-2020)
  • WPA3-SAE: Supports full Unicode via PRECIS OpaqueString (RFC 8265)

Many routers accept Unicode Wi-Fi passwords, but interoperability issues are common because client devices may encode the password differently.
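The interoperability problem follows from how the WPA2-Personal key is derived: PBKDF2-HMAC-SHA1 over the passphrase bytes with the SSID as salt, 4096 iterations, and a 256-bit output. The sketch below uses a made-up SSID and a deliberately non-ASCII passphrase (outside what the standard specifies) to show that two devices encoding the same text differently derive different keys:

```python
import hashlib


def wpa2_psk(passphrase: bytes, ssid: bytes) -> bytes:
    """Derive the 256-bit WPA2-Personal pre-shared key."""
    return hashlib.pbkdf2_hmac("sha1", passphrase, ssid, 4096, dklen=32)


ssid = b"HomeNetwork"            # hypothetical SSID
passphrase = "caf\u00e9-wifi"    # "café-wifi", non-ASCII on purpose

# Same visible passphrase, two client encodings, two different keys.
psk_utf8 = wpa2_psk(passphrase.encode("utf-8"), ssid)
psk_latin1 = wpa2_psk(passphrase.encode("latin-1"), ssid)

if __name__ == "__main__":
    print(psk_utf8.hex())
    print(psk_latin1.hex())
    print(psk_utf8 == psk_latin1)
```

A client whose derived key differs from the router's simply fails to connect, with no indication that encoding was the cause.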

Best Practices for Developers

  1. Apply PRECIS OpaqueString (RFC 8265) preparation before hashing. At minimum, apply NFC normalization and map non-ASCII spaces to U+0020.
  2. Encode as UTF-8 consistently on all clients and servers.
  3. Use Argon2 (or scrypt) instead of bcrypt to avoid the 72-byte truncation issue.
  4. Test with diverse scripts: Include Latin accented characters, CJK, Arabic, and emoji in your test suite.
  5. Validate early, reject clearly: If your system cannot handle Unicode passwords, tell users at registration time — not when they try to log in.
  6. Never silently strip or transform characters. If you normalize, do it consistently and document it.
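Several of these practices can be combined into a small round-trip check. The sketch below is one possible shape, not a complete solution: it uses scrypt from Python's standard library, the cost parameters are illustrative assumptions that should be tuned for real deployments, and prepare() applies only NFC rather than full OpaqueString preparation. Each sample is hashed in NFC form and then verified in NFD form, simulating a user who retypes the password on a different device:

```python
import hashlib
import hmac
import os
import unicodedata


def prepare(password: str) -> bytes:
    """NFC-normalize, then encode as UTF-8."""
    return unicodedata.normalize("NFC", password).encode("utf-8")


def hash_password(password: str, salt: bytes) -> bytes:
    # Illustrative cost parameters; tune them for real deployments.
    return hashlib.scrypt(prepare(password), salt=salt, n=2**14, r=8, p=1,
                          maxmem=64 * 1024 * 1024, dklen=32)


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    return hmac.compare_digest(hash_password(password, salt), expected)


# Accented Latin, Japanese katakana, Arabic, and an emoji.
SAMPLES = [
    "caf\u00e9",
    "\u30d1\u30b9\u30ef\u30fc\u30c9",
    "\u0643\u0644\u0645\u0629 \u0627\u0644\u0633\u0631",
    "\U0001F511 key",
]


def run_roundtrip_checks() -> bool:
    """Hash each sample, then verify its NFD form against the stored hash."""
    for sample in SAMPLES:
        salt = os.urandom(16)
        stored = hash_password(sample, salt)
        retyped = unicodedata.normalize("NFD", sample)
        if not verify_password(retyped, salt, stored):
            return False
    return True
```

Running run_roundtrip_checks() in a test suite catches regressions where a client or server path forgets to normalize.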

Best Practices for Users

  1. Test your password immediately after creating it. Log out and log back in to verify it works.
  2. Use a password manager that stores and replays the exact character sequence you saved (most modern managers do this correctly).
  3. Avoid mixing confusable characters (Latin and Cyrillic) in the same password.
  4. Prefer NFC precomposed characters if typing manually.
  5. Be cautious with system-specific characters — a password that works on your phone might not work on a shared computer with a different keyboard layout.

Key Takeaways

  • Unicode passwords offer dramatically higher entropy per character than ASCII-only passwords, making them more resistant to brute-force attacks.
  • The PRECIS framework (RFC 8264/8265) defines the standard for preparing Unicode passwords: NFC normalization, mapping of non-ASCII spaces to U+0020, and prohibited character checks.
  • NFC normalization is essential: without it, visually identical passwords may fail to match because they have different binary representations.
  • The biggest risks of Unicode passwords are normalization inconsistency, encoding mismatches, and input method variability — not cryptographic weaknesses.
  • Developers should apply PRECIS OpaqueString preparation, encode as UTF-8, and use hash algorithms without byte-length limits (Argon2 over bcrypt).
  • Users should test Unicode passwords immediately after creation and rely on password managers for consistent storage and replay.
