Python Unicode

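Python 3 strings are Unicode by default: str is a sequence of code points, stored internally in the PEP 393 flexible representation (Latin-1, UCS-2, or UCS-4, not UTF-8). Key features: \N{name} escapes, the unicodedata module, and str.encode()/bytes.decode() for converting between text and bytes.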

What is Python 3 Unicode Handling?

Python 3 made a decisive architectural choice: the built-in str type is always a sequence of Unicode code points, never raw bytes. This broke backward compatibility with Python 2 but eliminated an entire class of text encoding bugs. In Python 3 there is no longer an ambiguous "native string": str means Unicode text, and bytes means binary data.
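
A short sketch of what that separation means in practice: mixing the two types raises an error instead of silently converting, and crossing the boundary takes an explicit encode or decode. The variable names are illustrative only.

```python
text = "naïve"                # str: Unicode text
data = text.encode("utf-8")   # bytes: binary data

# Mixing the two types fails loudly instead of guessing an encoding.
try:
    text + data
    mixing_error = None
except TypeError as exc:
    mixing_error = type(exc).__name__

# str and bytes never compare equal, even for pure-ASCII content.
ascii_equal = "abc" == b"abc"

# Crossing the boundary explicitly round-trips without loss.
round_trip = data.decode("utf-8") == text

print(mixing_error, ascii_equal, round_trip)  # TypeError False True
```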

str as a Sequence of Unicode Characters

A Python 3 str object holds Unicode code points. The length of a string is the number of code points, not the number of bytes:

s = "café"
len(s)         # 4 — four code points (c, a, f, é)
len(s.encode("utf-8"))  # 5 — five bytes (é = 2 bytes in UTF-8)

Internally, CPython stores each str in one of three fixed-width forms, selected automatically from the highest code point in the string: Latin-1 (1 byte per code point), UCS-2 (2 bytes), or UCS-4 (4 bytes). This is the flexible string representation defined by PEP 393, introduced in Python 3.3 to reduce memory use for strings that contain only ASCII or Latin-1 characters.
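
The effect of the three storage widths can be observed, roughly, with sys.getsizeof(). The sketch below assumes CPython 3.3 or later; other implementations may store strings differently, and bytes_per_code_point is a hypothetical helper written for this example.

```python
import sys

def bytes_per_code_point(ch, n=1000):
    """Estimate storage per code point by comparing two string sizes.

    Subtracting the sizes cancels out the fixed object header.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

narrow = bytes_per_code_point("a")           # fits the 1-byte form
medium = bytes_per_code_point("\u20ac")      # € forces the 2-byte form
wide = bytes_per_code_point("\U0001F600")    # emoji forces the 4-byte form

print(narrow, medium, wide)  # 1 2 4 on CPython 3.3+
```

A single wide code point is enough to switch the whole string to the wider form, which is why the width is chosen per string rather than per character.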

Named Character Escapes

Python string literals support the \N{name} escape, which inserts a character by its official Unicode name. This is far more readable than raw hex values:

snowman      = "\N{SNOWMAN}"                     # ☃ (U+2603)
euro         = "\N{EURO SIGN}"                   # € (U+20AC)
black_heart  = "\N{BLACK HEART SUIT}"            # ♥ (U+2665)
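
The \N{name} escape is resolved when the literal is compiled. When the name is only known at run time, unicodedata.lookup() performs the same name-to-character mapping; the snippet below is a small illustration of that equivalence.

```python
import unicodedata

# Resolved at compile time, as part of the string literal.
snowman = "\N{SNOWMAN}"

# Resolved at run time from a name held in a variable.
name = "SNOWMAN"
looked_up = unicodedata.lookup(name)

# Both are the same single code point as the hex escape.
same = snowman == looked_up == "\u2603"

# lookup() raises KeyError for a name that does not exist.
try:
    unicodedata.lookup("NOT A REAL CHARACTER NAME")
    unknown_name_error = None
except KeyError:
    unknown_name_error = "KeyError"

print(same, unknown_name_error)  # True KeyError
```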

The unicodedata Module

The standard library unicodedata module exposes Unicode Character Database (UCD) properties for any code point:

import unicodedata

unicodedata.name("\u00e9")          # "LATIN SMALL LETTER E WITH ACUTE"
unicodedata.category("A")           # "Lu" (Letter, uppercase)
unicodedata.bidirectional("\u0627") # "AL" (Arabic Letter)
unicodedata.combining("\u0301")     # 230 (combining class for acute accent)
unicodedata.is_normalized("NFC", "e\u0301")  # False (is_normalized requires Python 3.8+)
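
These lookups combine naturally into a small inspection helper. The function below, describe, is a hypothetical name used only for this sketch; it lists the code point, name, and general category of every character in a string, which is handy when two strings look identical but compare unequal.

```python
import unicodedata

def describe(text):
    """Return (code point, name, category) for each character in text."""
    rows = []
    for ch in text:
        # name() raises ValueError for unnamed characters (such as
        # control codes) unless a default is supplied, so pass one.
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((f"U+{ord(ch):04X}", name, unicodedata.category(ch)))
    return rows

for row in describe("e\u0301"):
    print(row)
# ('U+0065', 'LATIN SMALL LETTER E', 'Ll')
# ('U+0301', 'COMBINING ACUTE ACCENT', 'Mn')
```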

encode() and decode()

Converting between str and bytes is explicit in Python 3:

# str → bytes
"Hello, 世界".encode("utf-8")    # b'Hello, \xe4\xb8\x96\xe7\x95\x8c'
"Hello, 世界".encode("utf-16")   # b'\xff\xfeH\x00e\x00...' (BOM first; shown for a little-endian platform)

# bytes → str
b"caf\xc3\xa9".decode("utf-8")  # "café"

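Always specify the encoding explicitly. str.encode() and bytes.decode() default to UTF-8 on every platform (sys.getdefaultencoding() always returns "utf-8" in Python 3), but open() and other text I/O default to the locale's preferred encoding (locale.getpreferredencoding(False)) unless UTF-8 mode is enabled. Code that relies on that default breaks on systems with different locale settings.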
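
Naming the encoding also makes failures visible. The sketch below decodes bytes that were produced with Latin-1 as if they were UTF-8, and shows how the errors argument of decode() and encode() changes the outcome. The sample bytes are an assumption chosen for the illustration.

```python
raw = b"caf\xe9"  # "café" encoded as Latin-1, not UTF-8

# The default handler, errors="strict", refuses invalid input.
try:
    raw.decode("utf-8")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "UnicodeDecodeError"

# errors="replace" substitutes U+FFFD for the undecodable byte.
replaced = raw.decode("utf-8", errors="replace")

# Decoding with the encoding that actually produced the bytes works.
correct = raw.decode("latin-1")

# Encoding offers the same choice: é has no ASCII representation.
ascii_bytes = "café".encode("ascii", errors="backslashreplace")

print(strict_result)  # UnicodeDecodeError
print(replaced == "caf\ufffd", correct == "café", ascii_bytes)
```

The strict default is usually the right choice for data that will be stored; the lossy handlers are better suited to logging and display.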

Normalization

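The same visible text can be represented by different code point sequences. unicodedata.normalize() converts a string to one of the four standard normalization forms:

import unicodedata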

s_nfd = "e\u0301"                          # e + combining acute
s_nfc = unicodedata.normalize("NFC", s_nfd)   # → "é" (U+00E9)
unicodedata.normalize("NFKD", "\ufb01")    # "fi" (ligature decomposed)

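Use NFC for storage and display. Use NFKD (or NFKC) for search indexing and text analysis, where ligatures and other compatibility characters should match their base equivalents. Compatibility normalization discards distinctions, so apply it to search keys rather than to the stored text itself.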
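
One way to apply this advice is to normalize at comparison time. The two helpers below, same_text and search_key, are hypothetical names for this sketch: the first compares strings by canonical equivalence, the second builds a looser key for matching.

```python
import unicodedata

def same_text(a, b):
    """Compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def search_key(s):
    """Build a loose matching key: NFKD, then case folding."""
    return unicodedata.normalize("NFKD", s).casefold()

print("caf\u00e9" == "cafe\u0301")                   # False: different code points
print(same_text("caf\u00e9", "cafe\u0301"))          # True
print(search_key("\ufb01le") == search_key("FILE"))  # True: ligature and case ignored
```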

Quick Facts

Feature	Detail
`str` type	Unicode code points (not bytes)
Internal encoding	PEP 393 compact (Latin-1 / UCS-2 / UCS-4 auto-selected)
Named escape	`\N{UNICODE CHARACTER NAME}`
Module	`unicodedata` (name, category, normalize, etc.)
Encode method	`str.encode("utf-8")` → `bytes`
Decode method	`bytes.decode("utf-8")` → `str`
Normalization forms	NFC, NFD, NFKC, NFKD via `unicodedata.normalize()`
Case folding	`str.casefold()` (Unicode-aware; preferred over `.lower()` for caseless comparison)
Python version	Python 3.0+ (PEP 393 compact storage: 3.3+)
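
The case-folding row can be illustrated with the German sharp s, which lower() leaves unchanged but casefold() expands to "ss". The sample words are chosen for the illustration.

```python
a, b = "Straße", "STRASSE"

lower_match = a.lower() == b.lower()        # "straße" vs "strasse"
fold_match = a.casefold() == b.casefold()   # "strasse" vs "strasse"

print(lower_match, fold_match)  # False True
```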

Related Terms

String, UTF-8, Unicode Escape Sequence, Encoding / Decoding
