Python Unicode
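Python 3 uses Unicode strings by default: str is a sequence of code points, stored internally in the PEP 393 flexible representation (Latin-1, UCS-2, or UCS-4 per string), not in UTF-8. Key features: \N{name} escapes, the unicodedata module, and str.encode()/bytes.decode() for encoding conversion.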
What is Python 3 Unicode Handling?
Python 3 made a decisive architectural choice: the built-in str type is always a sequence of Unicode characters, never raw bytes. This broke backward compatibility with Python 2 but eliminated an entire class of text encoding bugs. In Python 3, there is no longer an ambiguous "native string" — str means Unicode text, and bytes means binary data.
str as a Sequence of Unicode Characters
A Python 3 str object holds Unicode code points. The length of a string is the number of code points, not the number of bytes:
s = "café"
len(s) # 4 — four code points (c, a, f, é)
len(s.encode("utf-8")) # 5 — five bytes (é = 2 bytes in UTF-8)
CPython internally uses one of three fixed-width representations for str, selected automatically from the highest code point in the string: Latin-1 (1 byte per character, code points up to U+00FF), UCS-2 (2 bytes per character, up to U+FFFF), or UCS-4 (4 bytes per character, everything above that). This is the flexible string representation defined by PEP 393, introduced in Python 3.3 to reduce memory use for ASCII-heavy strings.
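The effect of the representation choice can be observed, roughly, by measuring how much a string grows per added character. The sketch below assumes CPython 3.3 or later; sys.getsizeof() reports implementation details, and bytes_per_code_point is a hypothetical helper name used only for illustration.

```python
import sys

def bytes_per_code_point(ch: str, n: int = 100) -> int:
    """Estimate per-character storage for strings made only of `ch`.

    Compares two freshly built strings of different lengths so that the
    fixed object header cancels out. CPython-specific (assumption).
    """
    short = ch * 2
    long = ch * (n + 2)
    return (sys.getsizeof(long) - sys.getsizeof(short)) // n

for ch in ("a", "é", "€", "😀"):
    width = bytes_per_code_point(ch)
    print(f"U+{ord(ch):04X}: {width} byte(s) per code point")
```

On CPython this should report 1 byte for "a" and "é", 2 for "€", and 4 for "😀". Other implementations may store strings differently.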
Named Character Escapes
Python string literals support the \N{name} escape, which inserts a character by its official Unicode name. This is far more readable than raw hex values:
snowman = "\N{SNOWMAN}" # ☃ (U+2603)
euro = "\N{EURO SIGN}" # € (U+20AC)
black_heart = "\N{BLACK HEART SUIT}" # ♥ (U+2665)
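The \N{name} escape only works in string literals. When the name is known only at run time, unicodedata.lookup() performs the same name-to-character mapping, and unicodedata.name() goes the other way. A minimal sketch, where char_from_name is a hypothetical wrapper added for illustration:

```python
import unicodedata

def char_from_name(name: str) -> str:
    """Return the character with the given Unicode name (hypothetical helper)."""
    return unicodedata.lookup(name)

snowman = char_from_name("SNOWMAN")
print(snowman, f"U+{ord(snowman):04X}")

# Reverse direction: character to name.
print(unicodedata.name("\N{EURO SIGN}"))

# Unknown names raise KeyError rather than returning a placeholder.
try:
    char_from_name("NO SUCH CHARACTER")
except KeyError as exc:
    print("unknown name:", exc)
```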
The unicodedata Module
The standard library unicodedata module exposes Unicode Character Database (UCD) properties for any code point:
import unicodedata
unicodedata.name("\u00e9") # "LATIN SMALL LETTER E WITH ACUTE"
unicodedata.category("A") # "Lu" (Letter, uppercase)
unicodedata.bidirectional("\u0627") # "AL" (Arabic Letter)
unicodedata.combining("\u0301") # 230 (combining class for acute accent)
unicodedata.is_normalized("NFC", "e\u0301") # False (is_normalized requires Python 3.8+)
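These lookups combine naturally into a small inspection tool for debugging strings that look identical but compare unequal. The following is a sketch only; describe is a hypothetical helper, and the "<unnamed>" fallback is an assumption for code points that have no name in the UCD (such as most control characters).

```python
import unicodedata

def describe(text: str) -> list[tuple[str, str, str]]:
    """Return (code point, name, general category) for each character."""
    rows = []
    for ch in text:
        code_point = f"U+{ord(ch):04X}"
        name = unicodedata.name(ch, "<unnamed>")  # default avoids ValueError
        rows.append((code_point, name, unicodedata.category(ch)))
    return rows

# A decomposed "é": base letter followed by a combining mark.
for row in describe("e\u0301"):
    print(row)
```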
encode() and decode()
Converting between str and bytes is explicit in Python 3:
# str → bytes
"Hello, 世界".encode("utf-8") # b'Hello, \xe4\xb8\x96\xe7\x95\x8c'
"Hello, 世界".encode("utf-16") # b'\xff\xfeH\x00e\x00...' (BOM first; little-endian byte order shown)
# bytes → str
b"caf\xc3\xa9".decode("utf-8") # "café"
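Always specify the encoding explicitly. str.encode() and bytes.decode() default to UTF-8, but open() without an encoding argument has traditionally used the locale's preferred encoding (locale.getpreferredencoding(False)), which leads to fragile code that breaks on systems with different locale settings. Note that sys.getdefaultencoding() always returns "utf-8" in Python 3 and does not reflect the locale.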
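Decoding with the wrong codec either raises an error or silently produces wrong text, depending on the codec and the errors argument. The sketch below uses a hypothetical byte string (raw) that was encoded as Latin-1 and shows what the standard error handlers do with it.

```python
raw = b"caf\xe9"  # "café" encoded as Latin-1, not UTF-8 (illustrative data)

# Default handler is "strict": invalid input raises UnicodeDecodeError.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict decoding failed:", exc.reason)

# "replace" substitutes U+FFFD for the bytes that cannot be decoded.
lossy = raw.decode("utf-8", errors="replace")

# Decoding with the codec the data was actually written in recovers the text.
correct = raw.decode("latin-1")

# On the encoding side, "backslashreplace" keeps unencodable characters visible.
escaped = "café".encode("ascii", errors="backslashreplace")

print(repr(lossy), repr(correct), escaped)
```

The "replace" handler is lossy, so it is best reserved for display or logging rather than for data that will be written back out.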
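Normalization
The same visible text can be represented by different code point sequences, for example a precomposed é (U+00E9) versus e followed by a combining acute accent (U+0301). unicodedata.normalize() converts a string to one canonical form so that equivalent strings compare equal: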
import unicodedata
s_nfd = "e\u0301" # e + combining acute
s_nfc = unicodedata.normalize("NFC", s_nfd) # → "é" (U+00E9)
unicodedata.normalize("NFKD", "\ufb01") # "fi" (ligature decomposed)
Use NFC for storage and display. Use NFKD for search indexing and text analysis where ligatures and compatibility characters should be treated as their base equivalents.
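As an illustration of both recommendations, the sketch below compares strings after NFC normalization and builds an accent-insensitive search key from NFKD. Both function names are hypothetical, and dropping combining marks from the key is an assumption about what a search index wants; it is not appropriate for every language.

```python
import unicodedata

def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def search_key(text: str) -> str:
    """Build a simplified key: NFKD, then drop combining marks (assumption)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

composed = "caf\u00e9"     # precomposed é
decomposed = "cafe\u0301"  # e + combining acute accent

print(composed == decomposed)           # raw comparison differs
print(nfc_equal(composed, decomposed))  # equal after normalization
print(search_key("\ufb01anc\u00e9"))    # ligature and accent removed
```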
Quick Facts
| Feature | Detail |
|---|---|
| str type | Unicode code points (not bytes) |
| Internal encoding | PEP 393 compact (Latin-1 / UCS-2 / UCS-4 auto-selected) |
| Named escape | \N{UNICODE CHARACTER NAME} |
| Module | unicodedata (name, category, normalize, etc.) |
| Encode method | str.encode("utf-8") → bytes |
| Decode method | bytes.decode("utf-8") → str |
| Normalization forms | NFC, NFD, NFKC, NFKD via unicodedata.normalize() |
| Case folding | str.casefold() (Unicode-aware; preferred over .lower() for caseless comparison) |
| Python version | Python 3.0+ (PEP 393 compact storage: 3.3+) |
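The case folding entry in the table is easiest to see with the German ß, which .lower() leaves unchanged but .casefold() expands. A short sketch, with caseless_equal as a hypothetical helper name:

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings ignoring case, using full Unicode case folding."""
    return a.casefold() == b.casefold()

print("Straße".lower())     # ß is already lowercase, so it stays
print("Straße".casefold())  # ß folds to "ss"

print("Straße".lower() == "STRASSE".lower())  # lower() misses the match
print(caseless_equal("Straße", "STRASSE"))    # casefold() finds it
```

Case folding does not normalize the input, so strings that may mix composed and decomposed forms still need unicodedata.normalize() before comparison.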
Related Terms
More in Programming & Development
Java strings use UTF-16 internally. char is 16-bit (only BMP). For supplementary …
Rust strings (str/String) are guaranteed valid UTF-8. char type represents a Unicode …
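Regular expression patterns that use Unicode properties: \p{L} (any letter), \p{Script=Greek} (Greek script), \p{Emoji}. Support varies by language and regex engine.
Syntax for representing Unicode characters in source code, which varies by language: \u2713 (Python/Java/JS), \u{2713} (JS/Ruby/Rust), \U00012345 (Python/C).
Characters with no visible glyph: whitespace, zero-width characters, control characters, and format characters. They can cause security problems such as spoofing and text steganography.
Garbled text produced when bytes are decoded with the wrong encoding; the term comes from the Japanese word 文字化け (mojibake). For example, 'café' stored as UTF-8 but read as Latin-1 displays as 'cafÃ©'.
The two 16-bit code units that together encode a supplementary character in UTF-16 (a high surrogate U+D800–U+DBFF followed by a low surrogate U+DC00–U+DFFF). 😀 = D83D DE00.
A sequence of characters in a programming language. The internal representation varies: UTF-8 (Go, Rust, newer Python), UTF-16 (Java, JavaScript, C#), or UTF-32 (Python).
The "length" of a Unicode string depends on the unit of measurement: code units (JavaScript .length), code points (Python len()), or grapheme clusters. 👨‍👩‍👧‍👦 = 7 code points, 1 grapheme.
U+FFFD (�), the character a decoder displays when it encounters an invalid byte sequence; the universal sign that decoding went wrong.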