Unicode in Passwords: Security Implications

Allowing Unicode characters in passwords increases the keyspace and can improve security, but it also introduces normalization ambiguity, where the same visible password maps to different byte sequences. This guide explores the security and usability implications of Unicode passwords, covering normalization, SASLprep, and how major platforms handle Unicode credentials.

Published 2024-09-10 · Updated 2025-10-20

Passwords are the most common authentication mechanism on the internet, yet the question of whether they should (or can) contain Unicode characters is surprisingly complex. Using Unicode in passwords can dramatically increase the keyspace and therefore brute-force resistance, but it also introduces normalization, encoding, and interoperability challenges that can lock users out of their own accounts. This guide explores the technical standards for Unicode passwords (the PRECIS framework in RFC 8264 and its password profile in RFC 8265), the security implications, and practical guidance for both users and developers.

Why Unicode Passwords Matter

The mathematical argument for Unicode passwords is compelling:

Character Set	Pool Size	Entropy per Character
Digits only (0-9)	10	3.32 bits
Lowercase ASCII (a-z)	26	4.70 bits
ASCII printable (32-126)	95	6.57 bits
ASCII + Latin-1 Supplement	191	7.58 bits
Unicode BMP (U+0000-U+FFFF)	~65,000	15.99 bits
All Unicode (v16.0)	154,998	17.24 bits

A password whose characters are drawn uniformly at random from the full Unicode range carries far more entropy per character than an ASCII-only password. For example, a 4-character password drawn from the full BMP has about 64 bits of entropy, roughly the same as a 10-character password drawn from printable ASCII (about 65.7 bits).

Real-world example

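Password	Characters	Maximum entropy (approx.)
`password`	8 ASCII lowercase	37.6 bits
`P@ssw0rd!`	9 ASCII printable	59.1 bits
(4 CJK chars)	4 from CJK Unified (~21K)	57.4 bits
(3 mixed-script)	3 from full BMP	48.0 bits

These figures are upper bounds that assume every character is chosen uniformly at random. A common word such as `password` is far weaker in practice, because attackers try known passwords before searching the whole space.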
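The figures in both tables follow from one formula: entropy is the password length multiplied by log2 of the pool size. The sketch below reproduces them, assuming every character is chosen uniformly at random. The function name `entropy_bits` and the `examples` dictionary are illustrative, and the pool sizes are the approximations used in the tables.

```python
import math

def entropy_bits(length: int, pool_size: int) -> float:
    """Upper-bound entropy of a password whose characters are picked
    uniformly at random from a pool of `pool_size` symbols."""
    return length * math.log2(pool_size)

# (length, pool size) pairs taken from the tables above.
examples = {
    "8 lowercase ASCII": (8, 26),
    "9 printable ASCII": (9, 95),
    "4 CJK Unified (~21,000)": (4, 21_000),
    "3 from the full BMP": (3, 65_536),
}

for label, (length, pool) in examples.items():
    print(f"{label}: {entropy_bits(length, pool):.1f} bits")
```

The same function shows why length still matters: adding one printable-ASCII character contributes about 6.57 bits, while adding one BMP character contributes about 16.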

PRECIS Framework (RFC 8264)

The PRECIS (Preparation, Enforcement, and Comparison of Internationalized Strings) framework, defined in RFC 8264, provides rules for handling Unicode strings in internet protocols. It replaces the older Stringprep framework (RFC 3454), and its password profile (RFC 8265) replaces the Stringprep-based SASLprep profile (RFC 4013).

PRECIS defines two base string classes:

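Class	Allowed characters	Purpose
IdentifierClass	Restricted set: letters, digits, and a limited number of symbols	Usernames, identifiers
FreeformClass	Broad set: letters, digits, symbols, punctuation, and spaces	Passwords, display names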

OpaqueString Profile (RFC 8265)

RFC 8265 defines the OpaqueString profile for passwords. "Opaque" means the password is treated as an opaque blob — the system should not interpret its content (no case mapping, minimal normalization).

The OpaqueString profile specifies:

Rule	Action
Width mapping	None: fullwidth and halfwidth characters are preserved as typed
Additional mapping	Non-ASCII spaces (general category Zs) are mapped to U+0020
Case mapping	None: case is preserved
Unicode normalization	Apply NFC normalization
Directionality	None: no BiDi rule is applied
Prohibited characters	Reject characters outside the FreeformClass (such as control characters, old Hangul jamo, and unassigned code points) and zero-length passwords

What OpaqueString normalizes

Input:                  Output after OpaqueString:
e + combining accent -> precomposed e with accent (NFC)
Non-ASCII space (e.g. U+3000) -> U+0020
Tab character -> REJECTED (control character)
Empty string -> REJECTED (zero-length)
"pass word" -> "pass word" (internal space preserved)

What OpaqueString does NOT do

No case folding: "Password" and "password" remain different
No width mapping: Fullwidth A and normal A remain different
No confusable mapping: Cyrillic "a" and Latin "a" remain distinct
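A rough approximation of OpaqueString enforcement can be sketched with Python's standard `unicodedata` module. This is an illustrative sketch, not a conforming PRECIS implementation: the function name `opaque_string_sketch` is hypothetical, and the prohibited-character check covers only control, unassigned, and surrogate code points rather than the full FreeformClass derivation (old Hangul jamo, for example, are not detected). It follows the rules as RFC 8265 states them: non-ASCII spaces are mapped to U+0020, NFC is applied, and case and width are left alone.

```python
import unicodedata

def opaque_string_sketch(password: str) -> str:
    """Approximate RFC 8265 OpaqueString enforcement (illustrative only)."""
    # Additional mapping rule: any space separator (category Zs) becomes U+0020.
    mapped = "".join(
        " " if unicodedata.category(ch) == "Zs" else ch for ch in password
    )
    # Normalization rule: NFC. There is no case mapping and no width mapping.
    normalized = unicodedata.normalize("NFC", mapped)
    if not normalized:
        raise ValueError("password must not be empty")
    for ch in normalized:
        # Cc = control, Cn = unassigned, Cs = surrogate.
        # A full PRECIS implementation disallows more than this.
        if unicodedata.category(ch) in {"Cc", "Cn", "Cs"}:
            raise ValueError(f"disallowed code point U+{ord(ch):04X}")
    return normalized

print(opaque_string_sketch("e\u0301") == "\u00e9")      # decomposed input is composed
print(opaque_string_sketch("foo\u1680bar") == "foo bar")  # OGHAM SPACE MARK is mapped
print(opaque_string_sketch("\uff21") == "\uff21")       # fullwidth A is preserved
```

For production use, a maintained PRECIS library is a safer choice than a hand-rolled check, because the set of allowed code points depends on Unicode property tables that change between versions.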

The Normalization Problem

The fundamental challenge with Unicode passwords is that the same visual password can have different binary representations:

Visual Appearance	Representation	Bytes (UTF-8)
e with accent	U+00E9 (precomposed)	C3 A9
e with accent	U+0065 + U+0301 (decomposed)	65 CC 81

If a user creates a password with the precomposed form but their keyboard later sends the decomposed form, the passwords will not match at the byte level — even though they look identical.

NFC normalization as the solution

PRECIS mandates NFC (Canonical Decomposition followed by Canonical Composition) normalization for passwords. NFC converts decomposed sequences to their precomposed equivalents:

Input	NFC Output	Match?
U+0065 U+0301	U+00E9	Yes (after NFC on both)
U+00E9	U+00E9	Yes
U+304C (precomposed ga)	U+304C	Yes
U+304B U+3099 (ka + voiced mark)	U+304C	Yes

Both the password supplied at registration (before it is hashed and stored) and every later login attempt must be NFC-normalized in the same way. If either side skips normalization, authentication can fail for legitimate users.
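The mismatch and its fix can be reproduced in a few lines with the standard `unicodedata` module. The helper name `same_password` is illustrative, and it compares plain strings only to show the effect of normalization. A real system would normalize and then compare hashes.

```python
import unicodedata

precomposed = "\u00e9"   # e with acute accent as one code point
decomposed = "e\u0301"   # e followed by a combining acute accent

print(precomposed.encode("utf-8").hex(" "))  # c3 a9
print(decomposed.encode("utf-8").hex(" "))   # 65 cc 81
print(precomposed == decomposed)             # False: different code points

def same_password(a: str, b: str) -> bool:
    """Compare two inputs after applying NFC to both sides."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(same_password(precomposed, decomposed))   # True
print(same_password("\u304b\u3099", "\u304c"))  # True: ka + voiced mark vs. ga
```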

Encoding Issues

Passwords are transmitted and stored as byte sequences. The encoding (UTF-8, UTF-16, or others) must be consistent:

Scenario	Risk
Form submitted as UTF-8, server expects Latin-1	Password silently corrupted
Mobile app uses UTF-8, Windows client uses UTF-16	Different byte sequences
Database collation change	Existing hashes may not match
Password manager normalizes differently	User locked out
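The first two rows of the table are easy to demonstrate. The sketch below encodes one password three ways and prints the resulting bytes alongside a truncated SHA-256 digest. SHA-256 is used here only to show that different bytes give different digests; it is not a suitable password hash on its own. The sample password is an arbitrary choice.

```python
import hashlib

password = "caf\u00e9"  # "cafe" with a precomposed e-acute as the last letter

# The same text yields different bytes under each encoding.
encoded = {enc: password.encode(enc) for enc in ("utf-8", "utf-16-le", "latin-1")}

for enc, raw in encoded.items():
    digest = hashlib.sha256(raw).hexdigest()[:16]
    print(f"{enc:10} {raw.hex(' '):24} sha256={digest}...")

# A UTF-8 form submission that the server decodes as Latin-1 is silently
# corrupted: the two UTF-8 bytes of e-acute become two separate characters.
misread = password.encode("utf-8").decode("latin-1")
print(len(password), len(misread))  # 4 5
```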

The safe approach

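Decode the submitted password as UTF-8 into Unicode text
Apply PRECIS OpaqueString preparation (NFC normalization, non-ASCII spaces mapped to U+0020, prohibited-character checks)
Encode the prepared text as UTF-8
Hash the resulting bytes
Store only the hash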

All clients and servers must agree on this pipeline. If any step differs, authentication breaks.

Implementation Concerns

Hashing Unicode passwords

Password hashing algorithms (bcrypt, scrypt, Argon2) operate on bytes. The input must be a deterministic byte sequence:

import unicodedata
import bcrypt

def prepare_password(password: str) -> bytes:
    # Step 1: PRECIS-like normalization
    # (Full PRECIS implementation would check for prohibited chars)
    normalized = unicodedata.normalize("NFC", password)

    # Step 2: Encode as UTF-8
    return normalized.encode("utf-8")

def hash_password(password: str) -> bytes:
    prepared = prepare_password(password)
    return bcrypt.hashpw(prepared, bcrypt.gensalt())

def verify_password(password: str, hashed: bytes) -> bool:
    prepared = prepare_password(password)
    return bcrypt.checkpw(prepared, hashed)

bcrypt's 72-byte limit

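bcrypt uses only the first 72 bytes of its input. Most implementations silently truncate anything longer, and some reject it with an error. For ASCII passwords, this allows 72 characters. For Unicode passwords, the limit is lower because characters take more bytes: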

Character Type	UTF-8 Bytes	Max Characters (bcrypt)
ASCII	1	72
Latin Extended	2	36
CJK	3	24
Emoji	4	18

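If you use bcrypt, consider pre-hashing with SHA-256 to avoid the truncation issue. Base64-encode the 32-byte digest (giving 44 ASCII bytes) rather than passing it raw, because a raw digest can contain NUL bytes, which some bcrypt implementations treat as the end of the input. Alternatively, use Argon2, which has no practical length limit.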
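The byte counts in the table, and a pre-hashing step that stays under the limit, can be sketched with the standard library alone. The names `utf8_length` and `bcrypt_input` are illustrative. The result of `bcrypt_input` is what would be passed to a bcrypt function in place of the raw password bytes. It is Base64-encoded here on the assumption that the bcrypt implementation may stop at a NUL byte, which a raw digest can contain.

```python
import base64
import hashlib
import unicodedata

BCRYPT_MAX_BYTES = 72

def utf8_length(password: str) -> int:
    """Number of bytes bcrypt would see for the NFC-normalized password."""
    return len(unicodedata.normalize("NFC", password).encode("utf-8"))

def bcrypt_input(password: str) -> bytes:
    """Pre-hash with SHA-256 and Base64-encode: always 44 ASCII bytes."""
    prepared = unicodedata.normalize("NFC", password).encode("utf-8")
    return base64.b64encode(hashlib.sha256(prepared).digest())

for sample in ("a", "\u00e9", "\u6f22", "\U0001F600"):
    print(f"U+{ord(sample):04X}: {utf8_length(sample)} byte(s)")

long_cjk = "\u6f22" * 25  # 25 CJK characters
print(utf8_length(long_cjk) > BCRYPT_MAX_BYTES)  # True: 75 bytes, over the limit
print(len(bcrypt_input(long_cjk)))               # 44
```

Pre-hashing must be applied at both registration and login, so it is a decision to make before the first password is stored.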

Input method variability

Users may type the same Unicode character differently depending on their input method:

Input Method	Output for "na" in Japanese
Romaji IME (before conversion)	"na" (Latin)
Romaji IME (after conversion)	hiragana "na"
Kana keyboard	hiragana "na" (directly)
Copy-paste from web	Could be any form

This variability means that a password containing Japanese characters might be entered differently on different devices, even by the same user.

Security Implications

Advantages of Unicode passwords

Advantage	Details
Larger key space	Brute-force attacks must search a vastly larger space
Dictionary resistance	Common password dictionaries are predominantly ASCII
Cultural familiarity	Users can use passwords in their native language
Memorable phrases	Native-language phrases are easier to remember

Risks of Unicode passwords

Risk	Details
Normalization inconsistency	Different systems normalize differently
Encoding mismatch	UTF-8 vs UTF-16 vs Latin-1
Input method variation	Same character typed differently on different devices
Password recovery difficulty	Hard to communicate Unicode passwords verbally
Limited support	Some systems reject or silently strip non-ASCII
Confusable characters	Cyrillic "a" vs Latin "a" look identical

Confusable characters in passwords

Unicode contains thousands of characters that look similar or identical:

Latin	Look-alike	Script
a (U+0061)	a (U+0430)	Cyrillic
o (U+006F)	o (U+03BF)	Greek
p (U+0070)	p (U+0440)	Cyrillic
H (U+0048)	H (U+041D)	Cyrillic

For passwords, confusables are a user experience risk (users may accidentally type the wrong character) but not a security risk (passwords are compared as byte sequences, so confusables are distinct). The danger is that a user creates a password with one character and cannot reproduce it later because they unknowingly typed the confusable.
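The distinction is visible at the code point level, and a registration form can warn users when a password mixes scripts. The sketch below uses a simple heuristic: the first word of each letter's Unicode character name is treated as its script. That is an assumption made for illustration, not the Unicode Script property, and the helper name `scripts_used` is hypothetical.

```python
import unicodedata

def scripts_used(password: str) -> set[str]:
    """Heuristic: treat the first word of each letter's Unicode name as its script."""
    scripts = set()
    for ch in password:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts

latin = "password"
mixed = "p\u0430ssword"  # second letter is CYRILLIC SMALL LETTER A (U+0430)

print(latin == mixed)                 # False: the strings differ
print(latin.encode("utf-8").hex())    # 70617373776f7264
print(mixed.encode("utf-8").hex())    # 70d0b07373776f7264
print(sorted(scripts_used(mixed)))    # ['CYRILLIC', 'LATIN']
```

A warning is usually a better response than a rejection here, since some users mix scripts deliberately.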

What Major Platforms Do

Platform	Unicode Password Support	Notes
Google	Yes	Accepts wide range of Unicode
Apple	Yes	NFC normalization applied
Microsoft	Yes (with caveats)	Active Directory has legacy restrictions
GitHub	Yes	Accepts Unicode
AWS	Limited	Some services restrict to ASCII printable
Many banks	No	Often restrict to ASCII subset
Wi-Fi (WPA)	Yes (with caveats)	WPA2 specifies printable ASCII; WPA3-SAE supports Unicode via PRECIS OpaqueString

WPA2/WPA3 and Unicode

Wi-Fi passwords (WPA2-Personal pre-shared keys) are processed through an algorithm that operates on bytes. The standard specifies:

WPA2: Password is 8-63 printable ASCII characters (IEEE 802.11-2020)
WPA3-SAE: Supports full Unicode via PRECIS OpaqueString (RFC 8265)

Many routers accept Unicode Wi-Fi passwords, but interoperability issues are common because client devices may encode the password differently.
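The interoperability problem follows from how WPA2-Personal derives its key: the passphrase bytes go through PBKDF2 with HMAC-SHA1, the SSID as salt, 4096 iterations, and a 256-bit output. The sketch below shows one non-ASCII passphrase producing three different keys depending on how a device normalizes and encodes it. The SSID and passphrase are made-up examples, and a non-ASCII passphrase is outside what WPA2 specifies, which is exactly why devices disagree.

```python
import hashlib
import unicodedata

def wpa2_psk(passphrase_bytes: bytes, ssid: str) -> bytes:
    """Derive the 256-bit WPA2-Personal pre-shared key from passphrase bytes."""
    return hashlib.pbkdf2_hmac(
        "sha1", passphrase_bytes, ssid.encode("utf-8"), 4096, 32
    )

ssid = "ExampleNet"                 # hypothetical network name
passphrase = "caf\u00e9-wifi-2024"  # contains a precomposed e-acute

# Three plausible byte sequences a client device might send.
variants = {
    "UTF-8, NFC": unicodedata.normalize("NFC", passphrase).encode("utf-8"),
    "UTF-8, NFD": unicodedata.normalize("NFD", passphrase).encode("utf-8"),
    "Latin-1": passphrase.encode("latin-1"),
}

keys = {label: wpa2_psk(raw, ssid) for label, raw in variants.items()}
for label, key in keys.items():
    print(f"{label:11} {key.hex()[:16]}...")
```

A client whose key differs from the access point's simply fails to connect, with no indication that encoding was the cause.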

Best Practices for Developers

Apply PRECIS OpaqueString (RFC 8265) preparation before hashing. At minimum, apply NFC normalization and map non-ASCII spaces to U+0020.
Encode as UTF-8 consistently on all clients and servers.
Use Argon2 (or scrypt) instead of bcrypt to avoid the 72-byte truncation issue.
Test with diverse scripts: Include Latin accented characters, CJK, Arabic, and emoji in your test suite.
Validate early, reject clearly: If your system cannot handle Unicode passwords, tell users at registration time — not when they try to log in.
Never silently strip or transform characters. If you normalize, do it consistently and document it.

Best Practices for Users

Test your password immediately after creating it. Log out and log back in to verify it works.
Use a password manager that stores and replays the exact characters you entered, so the same byte sequence is submitted on every device (most modern managers do this correctly).
Avoid mixing confusable characters (Latin and Cyrillic) in the same password.
Prefer NFC precomposed characters if typing manually.
Be cautious with system-specific characters — a password that works on your phone might not work on a shared computer with a different keyboard layout.

Key Takeaways

Unicode passwords offer dramatically higher entropy per character than ASCII-only passwords, making them more resistant to brute-force attacks.
The PRECIS framework (RFC 8264/8265) defines the standard for preparing Unicode passwords: NFC normalization, mapping of non-ASCII spaces to U+0020, and prohibited character checks.
NFC normalization is essential: without it, visually identical passwords may fail to match because they have different binary representations.
The biggest risks of Unicode passwords are normalization inconsistency, encoding mismatches, and input method variability — not cryptographic weaknesses.
Developers should apply PRECIS OpaqueString preparation, encode as UTF-8, and use hash algorithms without byte-length limits (Argon2 over bcrypt).
Users should test Unicode passwords immediately after creation and rely on password managers for consistent storage and replay.
