Python Unicode
Python 3 uses Unicode strings by default (str = UTF-8 internally via PEP 393). Key features: \N{name} escapes, unicodedata module, str.encode()/bytes.decode() for encoding conversion.
What is Python 3 Unicode Handling?
Python 3 made a decisive architectural choice: the built-in str type is always a sequence of Unicode characters, never raw bytes. This broke backward compatibility with Python 2 but eliminated an entire class of text encoding bugs. In Python 3, there is no longer an ambiguous "native string" — str means Unicode text, and bytes means binary data.
str as a Sequence of Unicode Characters
A Python 3 str object holds Unicode code points. The length of a string is the number of code points, not the number of bytes:
s = "café"
len(s) # 4 — four code points (c, a, f, é)
len(s.encode("utf-8")) # 5 — five bytes (é = 2 bytes in UTF-8)
Python internally uses one of three representations for str, selected automatically based on the highest code point in the string: Latin-1 (1 byte/char), UCS-2 (2 bytes/char), or UCS-4 (4 bytes/char). This is called the PEP 393 compact representation, introduced in Python 3.3 to reduce memory use for ASCII-heavy strings.
Named Character Escapes
Python string literals support the \N{name} escape, which inserts a character by its official Unicode name. This is far more readable than raw hex values:
snowman = "\N{SNOWMAN}" # ☃ (U+2603)
euro = "\N{EURO SIGN}" # € (U+20AC)
black_heart = "\N{BLACK HEART SUIT}" # ♥ (U+2665)
The unicodedata Module
The standard library unicodedata module exposes Unicode Character Database (UCD) properties for any code point:
import unicodedata
unicodedata.name("\u00e9") # "LATIN SMALL LETTER E WITH ACUTE"
unicodedata.category("A") # "Lu" (Letter, uppercase)
unicodedata.bidirectional("\u0627") # "AL" (Arabic Letter)
unicodedata.combining("\u0301") # 230 (combining class for acute accent)
unicodedata.is_normalized("NFC", "e\u0301") # False
encode() and decode()
Converting between str and bytes is explicit in Python 3:
# str → bytes
"Hello, 世界".encode("utf-8") # b'Hello, \xe4\xb8\x96\xe7\x95\x8c'
"Hello, 世界".encode("utf-16") # b'\xff\xfeH\x00e\x00...'
# bytes → str
b"caf\xc3\xa9".decode("utf-8") # "café"
Always specify the encoding explicitly. Relying on the platform default (sys.getdefaultencoding()) leads to fragile code that breaks on systems with different locale settings.
Normalization
import unicodedata
s_nfd = "e\u0301" # e + combining acute
s_nfc = unicodedata.normalize("NFC", s_nfd) # → "é" (U+00E9)
unicodedata.normalize("NFKD", "\ufb01") # "fi" (ligature decomposed)
Use NFC for storage and display. Use NFKD for search indexing and text analysis where ligatures and compatibility characters should be treated as their base equivalents.
Quick Facts
| Feature | Detail |
|---|---|
str type |
Unicode code points (not bytes) |
| Internal encoding | PEP 393 compact (Latin-1 / UCS-2 / UCS-4 auto-selected) |
| Named escape | \N{UNICODE CHARACTER NAME} |
| Module | unicodedata (name, category, normalize, etc.) |
| Encode method | str.encode("utf-8") → bytes |
| Decode method | bytes.decode("utf-8") → str |
| Normalization forms | NFC, NFD, NFKC, NFKD via unicodedata.normalize() |
| Case folding | str.casefold() (Unicode-aware, better than .lower()) |
| Python version | Python 3.0+ (PEP 393 compact storage: 3.3+) |
Istilah Terkait
Lainnya di Pemrograman & Pengembangan
"Panjang" string Unicode bergantung pada satuan: code unit (JavaScript .length), code point …
Pola regex menggunakan properti Unicode: \p{L} (huruf apa pun), \p{Script=Greek} (aksara Yunani), …
Java strings use UTF-16 internally. char is 16-bit (only BMP). For supplementary …
U+0000 (NUL). Karakter Unicode/ASCII pertama, digunakan sebagai terminator string dalam C/C++. Risiko …
U+FFFD (�). Ditampilkan saat decoder menemukan urutan byte yang tidak valid — …
Karakter apa pun tanpa glyph yang terlihat: whitespace, karakter zero-width, karakter kontrol, …
Teks yang kacau akibat mendekode byte dengan encoding yang salah. Istilah Jepang …
Dua unit kode 16-bit (high surrogate U+D800–U+DBFF + low surrogate U+DC00–U+DFFF) yang …
Encoding mengonversi karakter ke byte (str.encode('utf-8')); decoding mengonversi byte ke karakter (bytes.decode('utf-8')). …
Rust strings (str/String) are guaranteed valid UTF-8. char type represents a Unicode …