Unicode in Python
Python 3 uses Unicode strings by default, but correctly handling encoding, decoding, normalization, and grapheme clusters still requires careful attention.
Python 3 made a decisive choice: every string (str) is a sequence of Unicode
code points. There is no longer a separate unicode type as there was in
Python 2. This design means that working with Unicode in Python 3 is mostly
transparent — until you need to cross the boundary between text and bytes, or
until a subtle normalization or encoding bug surfaces. This guide covers
everything you need to handle Unicode confidently in Python 3.
Python 3 Strings Are Unicode by Default
In Python 3, string literals are Unicode. You can embed any character directly in source code (assuming your editor saves as UTF-8, which is the default Python source encoding since PEP 3120):
```python
greeting = "こんにちは"  # Japanese: "Hello"
arrow = "→"              # U+2192 RIGHTWARDS ARROW
emoji = "🐍"             # U+1F40D SNAKE
price = "€42.00"         # U+20AC EURO SIGN

print(len(greeting))  # 5 — five code points, not bytes
print(ord("→"))       # 8594
print(hex(ord("→")))  # '0x2192'
print(chr(0x2192))    # '→'
```
len() counts code points, not bytes. ord() converts a single-character
string to its code point integer. chr() is the inverse.
Encoding and Decoding
The bytes type holds raw byte sequences. Converting between str and
bytes always requires specifying an encoding:
```python
# str → bytes (encoding)
text = "café"
raw = text.encode("utf-8")        # b'caf\xc3\xa9' (5 bytes for 4 chars)
raw16 = text.encode("utf-16-le")  # little-endian UTF-16

# bytes → str (decoding)
back = raw.decode("utf-8")        # "café"

# Handling errors
raw_bad = b"caf\xe9"  # Latin-1 byte, not valid UTF-8
safe = raw_bad.decode("utf-8", errors="replace")   # "caf\ufffd"
ignore = raw_bad.decode("utf-8", errors="ignore")  # "caf"
latin = raw_bad.decode("latin-1")                  # "café" (correct)
```
Common error handlers:
| Handler | Behaviour |
|---|---|
"strict" (default) |
Raises UnicodeDecodeError |
"replace" |
Substitutes U+FFFD REPLACEMENT CHARACTER |
"ignore" |
Silently drops the bad bytes |
"backslashreplace" |
Inserts \\xNN escape |
"xmlcharrefreplace" |
Inserts &#NNN; HTML entity |
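The escape-producing handlers are most useful on the encode side, when you must squeeze Unicode text through an ASCII-only channel. A small sketch:

```python
text = "café → ☕"

# Numeric character references (this handler works for encoding only)
print(text.encode("ascii", errors="xmlcharrefreplace"))
# b'caf&#233; &#8594; &#9749;'

# Backslash escapes keep the data recoverable in ASCII logs
print(text.encode("ascii", errors="backslashreplace"))
# b'caf\\xe9 \\u2192 \\u2615'
```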
Detecting an Unknown Encoding
The standard library has no built-in encoding detection, but the third-party charset-normalizer package (a chardet-compatible library used by requests) can guess:

```python
import charset_normalizer  # third-party: pip install charset-normalizer

with open("mystery.txt", "rb") as f:
    raw = f.read()

result = charset_normalizer.from_bytes(raw).best()
print(result.encoding)  # e.g. "utf-8" or "windows-1252"
```
The unicodedata Module
The unicodedata standard-library module exposes the Unicode Character
Database (UCD) for every code point:
```python
import unicodedata

# Name
print(unicodedata.name("→"))  # 'RIGHTWARDS ARROW'
print(unicodedata.name("é"))  # 'LATIN SMALL LETTER E WITH ACUTE'

# Category
print(unicodedata.category("A"))  # 'Lu' — Letter, uppercase
print(unicodedata.category(" "))  # 'Zs' — Separator, space
print(unicodedata.category("3"))  # 'Nd' — Number, decimal digit

# Lookup by name
arrow = unicodedata.lookup("RIGHTWARDS ARROW")  # '→'

# Numeric value
print(unicodedata.numeric("½"))  # 0.5
print(unicodedata.digit("7"))    # 7
```
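Categories make it easy to filter text by character class. A minimal sketch (the helper name remove_punctuation is ours, not part of unicodedata):

```python
import unicodedata

def remove_punctuation(text: str) -> str:
    # Keep every code point whose general category is not punctuation ("P*")
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )

print(remove_punctuation("Hello, world! ¿Qué tal?"))  # Hello world Qué tal
```

Because the test is category-based, it removes Unicode punctuation like ¿ and « as well as ASCII marks.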
Unicode Categories
| Code | Meaning | Example |
|---|---|---|
| Lu | Letter, uppercase | A, Ä, Σ |
| Ll | Letter, lowercase | a, ä, σ |
| Nd | Number, decimal digit | 0–9, Arabic-Indic digits |
| Po | Punctuation, other | ., !, ? |
| Sm | Symbol, math | +, =, ∞ |
| Zs | Separator, space | SPACE, NBSP |
| Cc | Other, control | NUL, TAB, LF |
unicodedata.east_asian_width
Useful for terminal width calculations:
```python
import unicodedata

unicodedata.east_asian_width("A")   # 'Na' — Narrow
unicodedata.east_asian_width("中")  # 'W' — Wide (2 columns)
```
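Building on this, here is a rough column-count helper. It is a sketch: it counts Wide and Fullwidth characters as 2 columns and everything else (including combining marks and ambiguous-width characters) as 1, which real terminals do not always agree with:

```python
import unicodedata

def display_width(text: str) -> int:
    # 'W' (Wide) and 'F' (Fullwidth) occupy two terminal columns
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )

print(display_width("Ab"))    # 2
print(display_width("中文"))  # 4
```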
Unicode Normalization
The same visual character can be stored as different byte sequences. For example, "é" can be:
- NFC: U+00E9 — precomposed LATIN SMALL LETTER E WITH ACUTE (1 code point)
- NFD: U+0065 + U+0301 — "e" followed by combining accent (2 code points)
Comparing unnormalized strings gives surprising results:
```python
import unicodedata

s1 = "\u00e9"   # é (NFC, 1 code point)
s2 = "e\u0301"  # é (NFD, 2 code points)

print(s1 == s2)  # False ← surprising!
print(s1 == unicodedata.normalize("NFC", s2))  # True
```
Recommended practice: normalize to NFC before storing, comparing, or displaying user-supplied text. NFC is the form recommended for the web; note that macOS's legacy HFS+ filesystem stores filenames in NFD, so filenames received from macOS may arrive decomposed.
```python
import unicodedata

def normalize(text: str) -> str:
    return unicodedata.normalize("NFC", text)
```
The four normalization forms:
| Form | Description |
|---|---|
| NFC | Canonical decomposition, then canonical composition (recommended for storage) |
| NFD | Canonical decomposition only |
| NFKC | Compatibility decomposition, then canonical composition |
| NFKD | Compatibility decomposition only |
NFKC/NFKD fold compatibility characters (e.g., the ligature ﬁ → fi, ① → 1). Use them when you want to normalize user input for search or comparison, but not for display.
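A quick illustration of compatibility folding:

```python
import unicodedata

print(unicodedata.normalize("NFKC", "ﬁle"))   # file (U+FB01 ligature folded)
print(unicodedata.normalize("NFKC", "①②③"))   # 123
print(unicodedata.normalize("NFKC", "２０"))   # 20 (fullwidth digits)
```

Note the folding is lossy: there is no way back from "file" to the original ligature, which is why NFKC belongs in search indexes, not in stored display text.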
String Methods and Unicode Awareness
Python's built-in string methods are Unicode-aware:
"café".upper() # 'CAFÉ'
"ΣΕΛΉΝΗ".lower() # 'σελήνη' (Greek)
"hello world".title() # 'Hello World'
"naïve".isalpha() # True (ï is a letter)
"٣".isnumeric() # True (Arabic-Indic digit THREE)
str.isidentifier() and Keywords
"café".isidentifier() # True — valid Python identifier
"class".isidentifier() # True — but it's a keyword
import keyword
keyword.iskeyword("class") # True
Sorting Unicode Strings
sorted() and list.sort() compare strings in code point order, not locale-aware collation. For proper language-sensitive sorting, use the locale module or the PyICU / natsort libraries:
```python
import locale

# Requires the de_DE.UTF-8 locale to be installed on the system
locale.setlocale(locale.LC_ALL, "de_DE.UTF-8")

words = ["Äpfel", "Orangen", "Bananen"]
words.sort(key=locale.strxfrm)  # German alphabetical order
```
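setlocale() is process-global and only works if the named locale is installed, which makes it awkward in libraries. When full collation isn't available, one stdlib-only approximation is to strip accents and casefold for the sort key. This is a sketch, not real collation, and sort_key is our own helper name:

```python
import unicodedata

def sort_key(s: str) -> str:
    # Decompose (NFD), drop combining marks (category Mn), then casefold
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(ch for ch in decomposed
                       if unicodedata.category(ch) != "Mn")
    return stripped.casefold()

words = ["Äpfel", "Orangen", "Bananen"]
print(sorted(words, key=sort_key))  # ['Äpfel', 'Bananen', 'Orangen']
```

Plain sorted(words) would put "Äpfel" last, because Ä (U+00C4) has a higher code point than any ASCII letter.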
Reading and Writing Files
Always specify an encoding. UTF-8 is the safe default:
```python
# Writing
with open("data.txt", "w", encoding="utf-8") as f:
    f.write("こんにちは\n")

# Reading
with open("data.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Reading a known legacy encoding, replacing anything that fails to decode
with open("legacy.txt", "r", encoding="cp1252", errors="replace") as f:
    text = f.read()
```
Python's open() uses the system default encoding if you omit encoding=.
On Windows this is often cp1252, not UTF-8. Always be explicit.
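To see what your interpreter would pick, a quick check (the second value is platform-dependent):

```python
import locale
import sys

print(sys.getdefaultencoding())            # 'utf-8' — the str/bytes default, always UTF-8 in Python 3
print(locale.getpreferredencoding(False))  # what open() falls back to; often 'cp1252' on Windows
```

Running Python with -X utf8 (UTF-8 mode, PEP 540) forces UTF-8 regardless of the locale.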
Common Pitfalls
1. Mixing str and bytes
```python
# This raises TypeError — you cannot concatenate str and bytes
"hello" + b" world"  # TypeError

# Always decode first or encode first
"hello" + b" world".decode("utf-8")  # "hello world"
```
2. Emoji and Supplementary Characters
Code points above U+FFFF (emoji, rare CJK) are single str elements but
two UTF-16 code units. Python's len() counts code points correctly, but
indexing on strings that cross surrogate-pair boundaries in UTF-16 data can
cause issues when interacting with C extensions or databases that use UTF-16
internally:
```python
emoji = "🐍"
len(emoji)                 # 1 (one code point)
emoji.encode("utf-16-le")  # b'=\xd8\r\xdc' (surrogate pair, 4 bytes)
```
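Relatedly, what a user perceives as one character (a grapheme cluster) can span several code points, and the standard library has no cluster-aware iteration, so len() can mislead:

```python
family = "👨\u200d👩\u200d👧"  # MAN + ZWJ + WOMAN + ZWJ + GIRL: renders as one glyph
print(len(family))            # 5 code points

flag = "🇯🇵"  # two REGIONAL INDICATOR code points
print(len(flag))  # 2
```

For real grapheme segmentation (UAX #29), third-party libraries such as regex (its \X pattern) or grapheme are the usual tools.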
3. JSON and Non-ASCII
json.dumps() escapes non-ASCII by default. Pass ensure_ascii=False to
keep Unicode characters readable:
```python
import json

data = {"name": "日本語"}
print(json.dumps(data))                      # {"name": "\u65e5\u672c\u8a9e"}
print(json.dumps(data, ensure_ascii=False))  # {"name": "日本語"}
```
4. Regular Expressions
re.UNICODE is on by default for str patterns in Python 3, so \w, \d, and \s match Unicode word characters, digits, and whitespace respectively. See the dedicated Unicode Regex guide for advanced patterns.
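A small sketch of that default Unicode behaviour:

```python
import re

text = "naïve café ٣"
print(re.findall(r"\w+", text))  # ['naïve', 'café', '٣']
print(re.findall(r"\d+", text))  # ['٣'] (\d covers Arabic-Indic digits too)
```

To restrict matching to ASCII, pass re.ASCII or use an explicit class like [0-9].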
Quick Reference
| Task | Code |
|---|---|
| Get code point | ord("→") → 8594 |
| Code point to char | chr(0x2192) → "→" |
| Encode to bytes | "café".encode("utf-8") |
| Decode bytes | b"caf\xc3\xa9".decode("utf-8") |
| Get character name | unicodedata.name("→") |
| Get character category | unicodedata.category("A") |
| Normalize NFC | unicodedata.normalize("NFC", text) |
| Find character by name | unicodedata.lookup("SNOWMAN") |
| Safe file read | open(f, encoding="utf-8") |
| JSON with Unicode | json.dumps(d, ensure_ascii=False) |
Python 3's Unicode support is thorough, but encoding bugs still happen at system boundaries: file I/O, network sockets, and third-party C extensions. Following the three golden rules — always specify encoding, normalize before comparing, and decode at the earliest opportunity — will prevent the vast majority of Unicode bugs.