Python Unicode
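Python 3 strings are Unicode by default: str is a sequence of code points, stored internally in the PEP 393 flexible representation (Latin-1, UCS-2, or UCS-4 as needed), not as UTF-8. Key features: \N{name} escapes, the unicodedata module, and str.encode()/bytes.decode() for encoding conversion.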
What is Python 3 Unicode Handling?
Python 3 made a decisive architectural choice: the built-in str type is always a sequence of Unicode characters, never raw bytes. This broke backward compatibility with Python 2 but eliminated an entire class of text encoding bugs. In Python 3, there is no longer an ambiguous "native string" — str means Unicode text, and bytes means binary data.
str as a Sequence of Unicode Characters
A Python 3 str object holds Unicode code points. The length of a string is the number of code points, not the number of bytes:
s = "café"
len(s) # 4 — four code points (c, a, f, é)
len(s.encode("utf-8")) # 5 — five bytes (é = 2 bytes in UTF-8)
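The same distinction matters for combining sequences and emoji, where one visible character can span several code points. The sketch below is illustrative only: the measure helper and the sample strings are not from the original text. It reports the code point count, the UTF-8 byte count, and the UTF-16 code unit count for each sample.

```python
# Code points vs. encoded size for a few sample strings.
samples = {
    "precomposed": "\u00e9",            # é as a single code point
    "decomposed": "e\u0301",            # e + COMBINING ACUTE ACCENT
    "emoji": "\U0001F44D\U0001F3FD",    # thumbs up + skin tone modifier
}

def measure(s):
    """Return (code points, UTF-8 bytes, UTF-16 code units) for s."""
    utf16_units = len(s.encode("utf-16-le")) // 2  # 2 bytes per code unit
    return len(s), len(s.encode("utf-8")), utf16_units

for label, text in samples.items():
    print(label, measure(text))
```

Note that len() counts code points, so the decomposed é and the emoji with a modifier both report a length of 2 even though each renders as one character.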
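CPython stores each str in one of three fixed-width layouts, chosen automatically from the highest code point in the string: Latin-1 (1 byte per code point, for code points up to U+00FF), UCS-2 (2 bytes, up to U+FFFF), or UCS-4 (4 bytes, for everything else). This is the flexible string representation defined by PEP 393, introduced in Python 3.3 to reduce memory use for ASCII-heavy strings. It is an implementation detail: code always sees a sequence of code points, whatever the storage width.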
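One way to observe the storage width is to compare sys.getsizeof() for two strings built from the same character. This is a rough sketch that assumes CPython 3.3 or later; the bytes_per_code_point helper is hypothetical, and other implementations such as PyPy may report different numbers.

```python
import sys

def bytes_per_code_point(ch, n=1000):
    """Estimate storage per code point by comparing two string sizes.

    Subtracting the sizes cancels the fixed object header, leaving only
    the cost of the n extra code points (CPython-specific behaviour).
    """
    short, long_ = ch * n, ch * (2 * n)
    return (sys.getsizeof(long_) - sys.getsizeof(short)) // n

for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}: {bytes_per_code_point(ch)} byte(s) per code point")
```

A single character outside the Latin-1 range widens the whole string, so one emoji in a long ASCII text makes every code point cost 4 bytes.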
Named Character Escapes
Python string literals support the \N{name} escape, which inserts a character by its official Unicode name. This is far more readable than raw hex values:
snowman = "\N{SNOWMAN}" # ☃ (U+2603)
euro = "\N{EURO SIGN}" # € (U+20AC)
black_heart = "\N{BLACK HEART SUIT}" # ♥ (U+2665)
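The \N{name} escape is resolved when the literal is compiled, so it cannot take a name held in a variable. For names known only at run time, unicodedata.lookup() performs the same name-to-character mapping. A minimal sketch follows; char_from_name is a hypothetical helper, not part of the standard library.

```python
import unicodedata

# Three spellings of the same character.
assert "\N{SNOWMAN}" == "\u2603" == chr(0x2603)

def char_from_name(name):
    """Return the character for a Unicode name, or None if unknown."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        # lookup() raises KeyError for names not in the database
        return None

print(char_from_name("EURO SIGN"))
print(char_from_name("NOT A REAL NAME"))
```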
The unicodedata Module
The standard library unicodedata module exposes Unicode Character Database (UCD) properties for any code point:
import unicodedata
unicodedata.name("\u00e9") # "LATIN SMALL LETTER E WITH ACUTE"
unicodedata.category("A") # "Lu" (Letter, uppercase)
unicodedata.bidirectional("\u0627") # "AL" (Arabic Letter)
unicodedata.combining("\u0301") # 230 (combining class for acute accent)
unicodedata.is_normalized("NFC", "e\u0301") # False (is_normalized requires Python 3.8+)
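These lookups combine naturally into a small inspection tool for debugging unexpected characters. The describe function below is a hypothetical helper, sketched on the assumption that a code point, category, and name per character is enough detail. It passes a default to unicodedata.name() because some code points, such as control characters, have no name.

```python
import unicodedata

def describe(text):
    """Return (code point, category, name) for each code point in text."""
    rows = []
    for ch in text:
        # Without a default, name() raises ValueError for unnamed characters.
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((f"U+{ord(ch):04X}", unicodedata.category(ch), name))
    return rows

for row in describe("e\u0301\n"):
    print(row)
```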
encode() and decode()
Converting between str and bytes is explicit in Python 3:
# str → bytes
"Hello, 世界".encode("utf-8") # b'Hello, \xe4\xb8\x96\xe7\x95\x8c'
"Hello, 世界".encode("utf-16") # b'\xff\xfeH\x00e\x00...' (BOM first, then native byte order; little-endian shown)
# bytes → str
b"caf\xc3\xa9".decode("utf-8") # "café"
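Always specify the encoding explicitly. str.encode() and bytes.decode() default to UTF-8 in Python 3 (sys.getdefaultencoding() always returns "utf-8"), but open() and other text I/O functions typically fall back to the locale's preferred encoding (locale.getpreferredencoding()) when none is given. Code that relies on that fallback is fragile and breaks on systems with different locale settings.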
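Both methods also accept an errors argument that controls what happens when data does not fit the chosen encoding. The sketch below uses an illustrative byte string, "café" encoded as Latin-1 and then wrongly treated as UTF-8, to show the strict, replace, and ignore handlers. It ends with a file round trip that names its encoding explicitly; the temporary file name is arbitrary.

```python
import os
import tempfile

data = b"caf\xe9"  # "café" encoded as Latin-1, not valid UTF-8

# The default handler ("strict") raises on invalid input.
try:
    data.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

replaced = data.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD
ignored = data.decode("utf-8", errors="ignore")    # bad byte is dropped
correct = data.decode("latin-1")                   # the right codec recovers the text

# Encoding can fail too: é has no ASCII representation.
ascii_bytes = "café".encode("ascii", errors="replace")

# Text files: pass encoding= on both the write and the read.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("Hello, 世界")
    with open(path, encoding="utf-8") as f:
        round_trip = f.read()
```

The replace and ignore handlers lose information, so they suit logging or display better than data that will be stored.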
Normalization
The same visible text can be written as different code point sequences, such as a precomposed é or an e followed by a combining accent. unicodedata.normalize() converts a string to one of four standard forms so that such sequences compare equal:
import unicodedata
s_nfd = "e\u0301" # e + combining acute
s_nfc = unicodedata.normalize("NFC", s_nfd) # → "é" (U+00E9)
unicodedata.normalize("NFKD", "\ufb01") # "fi" (ligature decomposed)
Use NFC for storage and display. Use NFKD for search indexing and text analysis where ligatures and compatibility characters should be treated as their base equivalents.
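As a rough illustration of both recommendations, the sketch below compares strings after NFC normalization and builds a simplified search key from NFKD plus str.casefold(). Both same_text and search_key are hypothetical helpers. The key is a simplification: Unicode's full compatibility caseless matching applies more normalization passes than this.

```python
import unicodedata

def same_text(a, b):
    """Compare two strings for canonical equivalence via NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def search_key(s):
    """Simplified key for indexing: compatibility-decompose, then case-fold."""
    return unicodedata.normalize("NFKD", s).casefold()

print("e\u0301" == "\u00e9")            # raw comparison of the two spellings of é
print(same_text("e\u0301", "\u00e9"))   # comparison after normalization
print(search_key("\ufb01le") == search_key("FILE"))
```

NFKC and NFKD discard distinctions such as ligatures and width variants, so a key like this belongs in an index alongside the original text rather than in place of it.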
Quick Facts
| Feature | Detail |
|---|---|
| str type | Unicode code points (not bytes) |
| Internal encoding | PEP 393 compact (Latin-1 / UCS-2 / UCS-4 auto-selected) |
| Named escape | \N{UNICODE CHARACTER NAME} |
| Module | unicodedata (name, category, normalize, etc.) |
| Encode method | str.encode("utf-8") → bytes |
| Decode method | bytes.decode("utf-8") → str |
| Normalization forms | NFC, NFD, NFKC, NFKD via unicodedata.normalize() |
| Case folding | str.casefold() (Unicode-aware; preferred over .lower() for caseless comparison) |
| Python version | Python 3.0+ (PEP 393 compact storage: 3.3+) |
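To illustrate the case folding row, the sketch below deduplicates three spellings of the same German word. The word list is purely illustrative. str.lower() leaves ß unchanged, while str.casefold() maps it to "ss".

```python
words = ["Straße", "STRASSE", "strasse"]

# lower() keeps ß, so two distinct forms remain.
lowered = {w.lower() for w in words}

# casefold() maps ß to "ss", so all three spellings collapse to one.
folded = {w.casefold() for w in words}

print(sorted(lowered))
print(sorted(folded))
```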