💻 Unicode in Code

Unicode in Python

Python 3 uses Unicode strings by default, but correctly handling encoding, decoding, normalization, and grapheme clusters still requires careful attention. This guide covers everything developers need to know about Unicode in Python, from the str type to the unicodedata module and third-party libraries.

Python 3 made a decisive choice: every string (str) is a sequence of Unicode code points. There is no longer a separate unicode type as there was in Python 2. This design makes working with Unicode in Python 3 mostly transparent, until you need to cross the boundary between text and bytes, or until a subtle normalization or encoding bug surfaces.

Python 3 Strings Are Unicode by Default

In Python 3, string literals are Unicode. You can embed any character directly in source code (assuming your editor saves as UTF-8, which is the default Python source encoding since PEP 3120):

greeting = "こんにちは"          # Japanese: "Hello"
arrow     = "→"                   # U+2192 RIGHTWARDS ARROW
emoji     = "🐍"                  # U+1F40D SNAKE
price     = "€42.00"              # U+20AC EURO SIGN

print(len(greeting))              # 5 — five code points, not bytes
print(ord("→"))                   # 8594
print(hex(ord("→")))              # '0x2192'
print(chr(0x2192))                # '→'

len() counts code points, not bytes. ord() converts a single-character string to its code point integer. chr() is the inverse.

Encoding and Decoding

The bytes type holds raw byte sequences. Converting between str and bytes always requires specifying an encoding:

# str → bytes  (encoding)
text  = "café"
raw   = text.encode("utf-8")      # b'caf\xc3\xa9'  (5 bytes for 4 chars)
raw16 = text.encode("utf-16-le")  # little-endian UTF-16

# bytes → str  (decoding)
back  = raw.decode("utf-8")       # "café"

# Handling errors
raw_bad = b"caf\xe9"              # Latin-1 byte, not valid UTF-8
safe    = raw_bad.decode("utf-8", errors="replace")   # "caf\ufffd"
ignore  = raw_bad.decode("utf-8", errors="ignore")    # "caf"
latin   = raw_bad.decode("latin-1")                   # "café"  (correct)

Common error handlers:

Handler              Behaviour
"strict" (default)   Raises UnicodeDecodeError (or UnicodeEncodeError)
"replace"            Substitutes U+FFFD REPLACEMENT CHARACTER
"ignore"             Silently drops the bad bytes
"backslashreplace"   Inserts \xNN escapes
"xmlcharrefreplace"  Inserts &#NNN; character references (encoding only)

Detecting an Unknown Encoding

The third-party charset-normalizer package (used by requests) provides chardet-compatible encoding detection:

import charset_normalizer

with open("mystery.txt", "rb") as f:
    raw = f.read()

result = charset_normalizer.from_bytes(raw).best()
if result is not None:
    print(result.encoding)        # e.g. "utf-8" or "windows-1252"

The unicodedata Module

The unicodedata standard-library module exposes the Unicode Character Database (UCD) for every code point:

import unicodedata

# Name
print(unicodedata.name("→"))          # 'RIGHTWARDS ARROW'
print(unicodedata.name("é"))          # 'LATIN SMALL LETTER E WITH ACUTE'

# Category
print(unicodedata.category("A"))      # 'Lu'  — Letter, uppercase
print(unicodedata.category(" "))      # 'Zs'  — Separator, space
print(unicodedata.category("3"))      # 'Nd'  — Number, decimal digit

# Lookup by name
arrow = unicodedata.lookup("RIGHTWARDS ARROW")   # '→'

# Numeric value
print(unicodedata.numeric("½"))       # 0.5
print(unicodedata.digit("7"))         # 7

Unicode Categories

Code  Meaning                 Examples
Lu    Letter, uppercase       A, Ä, Σ
Ll    Letter, lowercase       a, ä, σ
Nd    Number, decimal digit   0–9, Arabic-Indic digits
Po    Punctuation, other      ., !, ?
Sm    Symbol, math            +, =, ∞
Zs    Separator, space        SPACE, NBSP
Cc    Other, control          NUL, TAB, LF
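Categories make quick text statistics straightforward; a small sketch using a hypothetical category_histogram helper:

```python
import unicodedata
from collections import Counter

def category_histogram(text: str) -> Counter:
    """Count characters per Unicode general category."""
    return Counter(unicodedata.category(ch) for ch in text)

hist = category_histogram("Hello, 世界! 42")
print(hist)   # e.g. Counter({'Ll': 4, ...}) — Lu, Ll, Lo, Nd, Po, Zs counts
```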

unicodedata.east_asian_width

Useful for terminal width calculations:

unicodedata.east_asian_width("A")   # 'Na'  — Narrow
unicodedata.east_asian_width("中")  # 'W'   — Wide (2 columns)

Unicode Normalization

The same visual character can be stored as different byte sequences. For example, "é" can be:

  • NFC: U+00E9 — precomposed LATIN SMALL LETTER E WITH ACUTE (1 code point)
  • NFD: U+0065 + U+0301 — "e" followed by combining accent (2 code points)

Comparing unnormalized strings gives surprising results:

import unicodedata

s1 = "\u00e9"           # é  (NFC, 1 code point)
s2 = "e\u0301"          # é  (NFD, 2 code points)

print(s1 == s2)                           # False  ← surprising!
print(s1 == unicodedata.normalize("NFC", s2))  # True

Recommended practice: normalize to NFC before storing, comparing, or displaying user-supplied text. NFC is the form the W3C recommends for the web; note that macOS's older HFS+ filesystem stores filenames in a decomposed (NFD-like) form, so filenames coming from macOS may themselves need normalizing.

def normalize(text: str) -> str:
    return unicodedata.normalize("NFC", text)

The four normalization forms:

Form Description
NFC Canonical decomposition, then canonical composition (recommended for storage)
NFD Canonical decomposition only
NFKC Compatibility decomposition, then canonical composition
NFKD Compatibility decomposition only

NFKC/NFKD additionally fold compatibility characters (e.g., the ligature ﬁ U+FB01 becomes "fi", superscript ¹ U+00B9 becomes "1"). Use them when normalizing user input for search or comparison, but not for display, since the folding is lossy.
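To see the difference between canonical and compatibility forms concretely:

```python
import unicodedata

s = "\ufb01le"  # starts with U+FB01 LATIN SMALL LIGATURE FI

# NFC preserves compatibility characters; NFKC folds them away.
print(unicodedata.normalize("NFC", s))    # 'ﬁle' (unchanged)
print(unicodedata.normalize("NFKC", s))   # 'file'

# Superscripts fold to plain digits under NFKC, losing the distinction:
print(unicodedata.normalize("NFKC", "x\u00b2"))   # 'x2'
```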

String Methods and Unicode Awareness

Python's built-in string methods are Unicode-aware:

"café".upper()              # 'CAFÉ'
"ΣΕΛΉΝΗ".lower()            # 'σελήνη'  (Greek)
"hello world".title()       # 'Hello World'
"naïve".isalpha()           # True  (ï is a letter)
"٣".isnumeric()             # True  (Arabic-Indic digit THREE)

str.isidentifier() and Keywords

"café".isidentifier()       # True  — valid Python identifier
"class".isidentifier()      # True  — but it's a keyword
import keyword
keyword.iskeyword("class")  # True

Sorting Unicode Strings

sorted() and list.sort() compare strings by code point order, not locale-aware collation. For proper language-sensitive sorting, use the locale module or a dedicated collation library such as PyICU:

import locale

# Requires the de_DE.UTF-8 locale to be installed on the system.
locale.setlocale(locale.LC_ALL, "de_DE.UTF-8")
words = ["Äpfel", "Orangen", "Bananen"]
words.sort(key=locale.strxfrm)   # German alphabetical order
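To see why code-point order goes wrong here, and as a locale-independent fallback when no suitable locale is installed, an accent-stripping sort key works (a sketch only; this is not true German collation):

```python
import unicodedata

words = ["Äpfel", "Orangen", "Bananen"]

# Code-point order puts Ä (U+00C4) after every ASCII uppercase letter:
print(sorted(words))   # ['Bananen', 'Orangen', 'Äpfel']

def strip_accents(s: str) -> str:
    """Decompose, then drop combining marks (category Mn)."""
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")

print(sorted(words, key=lambda w: strip_accents(w).casefold()))
# ['Äpfel', 'Bananen', 'Orangen']
```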

Reading and Writing Files

Always specify an encoding. UTF-8 is the safe default:

# Writing
with open("data.txt", "w", encoding="utf-8") as f:
    f.write("こんにちは\n")

# Reading
with open("data.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Reading a file with unknown encoding gracefully
with open("legacy.txt", "r", encoding="cp1252", errors="replace") as f:
    text = f.read()

Python's open() uses the platform's locale encoding if you omit encoding=. On Windows this is often cp1252, not UTF-8. Always be explicit. (Running Python with -X warn_default_encoding emits an EncodingWarning wherever the default is relied upon, and PEP 686 schedules UTF-8 as the future default.)
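You can inspect what Python would pick implicitly (the first value is platform-dependent):

```python
import locale
import sys

# The locale encoding open() falls back to when encoding= is omitted:
print(locale.getpreferredencoding(False))   # e.g. 'cp1252' on Windows

# Python's internal string machinery, by contrast, is always UTF-8 based:
print(sys.getdefaultencoding())             # 'utf-8'
```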

Common Pitfalls

1. Mixing str and bytes

# This raises TypeError — you cannot concatenate str and bytes
"hello" + b" world"   # TypeError

# Always decode first or encode first
"hello" + b" world".decode("utf-8")   # "hello world"

2. Emoji and Supplementary Characters

Code points above U+FFFF (emoji, rare CJK) are single str elements but two UTF-16 code units. Python's len() counts code points correctly, but indexing on strings that cross surrogate-pair boundaries in UTF-16 data can cause issues when interacting with C extensions or databases that use UTF-16 internally:

emoji = "🐍"
len(emoji)           # 1  (one code point)
emoji.encode("utf-16-le")  # b'=\xd8\r\xdc'  (surrogate pair, 4 bytes)
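Relatedly, what a user perceives as one character (a grapheme cluster) can span multiple code points; len() counts code points, not graphemes:

```python
# Thumbs-up emoji plus a skin-tone modifier: one on-screen glyph,
# but two Unicode code points.
thumbs = "\U0001F44D\U0001F3FD"   # 👍🏽
print(len(thumbs))                # 2

# Counting graphemes requires a segmentation library (e.g. the
# third-party "grapheme" or "regex" packages); the stdlib has no API for it.
```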

3. JSON and Non-ASCII

json.dumps() escapes non-ASCII by default. Pass ensure_ascii=False to keep Unicode characters readable:

import json
data = {"name": "日本語"}
print(json.dumps(data))                         # {"name": "\u65e5\u672c\u8a9e"}
print(json.dumps(data, ensure_ascii=False))     # {"name": "日本語"}

4. Regular Expressions

Python 3's re module matches Unicode by default for str patterns (re.UNICODE is implied): \w, \d, and \s match Unicode word characters, digits, and whitespace respectively. See the dedicated Unicode Regex guide for advanced patterns.
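A quick illustration (the re.ASCII flag restores ASCII-only semantics when you need them):

```python
import re

# \w matches Unicode word characters on str patterns by default:
print(re.findall(r"\w+", "naïve café 日本語 42"))
# ['naïve', 'café', '日本語', '42']

# re.ASCII restricts \w to [a-zA-Z0-9_]:
print(re.findall(r"\w+", "naïve café", flags=re.ASCII))
# ['na', 've', 'caf']
```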

Quick Reference

Task                     Code
Get code point           ord("→")                            # 8594
Code point to char       chr(0x2192)                         # '→'
Encode to bytes          "café".encode("utf-8")
Decode bytes             b"caf\xc3\xa9".decode("utf-8")
Get character name       unicodedata.name("→")
Get character category   unicodedata.category("A")
Normalize to NFC         unicodedata.normalize("NFC", text)
Find character by name   unicodedata.lookup("SNOWMAN")
Safe file read           open(path, encoding="utf-8")
JSON with Unicode        json.dumps(d, ensure_ascii=False)

Python 3's Unicode support is thorough, but encoding bugs still happen at system boundaries: file I/O, network sockets, and third-party C extensions. Following the three golden rules — always specify encoding, normalize before comparing, and decode at the earliest opportunity — will prevent the vast majority of Unicode bugs.
