What is character encoding?

A system that maps characters to byte sequences for digital storage and transmission. Every text file has an encoding; the question is whether it is declared correctly.

What is UTF-8?

A variable-length Unicode encoding that uses 1 to 4 bytes per character. It is the dominant encoding on the web (over 98% of sites) and is fully backward compatible with ASCII.
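
As a small illustration of the 1-to-4-byte range, the sketch below encodes four sample characters and reports how many bytes each one takes. The sample characters are arbitrary picks, not taken from any specification.

```python
# Byte length of individual characters when encoded as UTF-8.
# The samples are illustrative: ASCII letter, accented Latin letter,
# euro sign, and an emoji from outside the Basic Multilingual Plane.
samples = ["A", "é", "€", "😀"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {ch!r}: {len(encoded)} byte(s) -> {encoded.hex(' ')}")

# Pure ASCII text yields identical bytes under both codecs.
ascii_compatible = "plain".encode("utf-8") == "plain".encode("ascii")
```

The `bytes.hex(' ')` separator form needs Python 3.8 or later.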

What is Windows-1252?

Microsoft's superset of ISO 8859-1, which adds typographic quotation marks, the em dash, and the euro sign in the 0x80–0x9F range. The most common legacy "Latin" encoding.
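
The difference in the 0x80–0x9F range can be seen by decoding the same bytes with both codecs. This is a minimal sketch; the four bytes are chosen only for illustration.

```python
# The same bytes decoded two ways. Windows-1252 assigns printable characters
# to most of 0x80-0x9F, while ISO 8859-1 (Latin-1) maps that range to C1 controls.
raw = bytes([0x80, 0x93, 0x94, 0x97])

as_cp1252 = raw.decode("windows-1252")  # euro sign, two curly quotes, em dash
as_latin1 = raw.decode("latin-1")       # four invisible control characters

# Outside that range the two encodings agree, for example on 0xE9 (é).
shared = bytes([0xE9]).decode("windows-1252") == bytes([0xE9]).decode("latin-1")
```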

Programming and Development

Mojibake

Unreadable text that results from decoding bytes with the wrong encoding. The term is Japanese (文字化け). Example: 'café' stored as UTF-8 but read as Latin-1 → 'cafÃ©'.

2024-05-13 · Updated 2025-03-28

What Is Mojibake?

Mojibake (文字化け, pronounced "mo-ji-ba-ke") is Japanese for "character transformation" or "garbled characters." It describes the phenomenon where text is displayed or processed using the wrong character encoding, causing it to appear as a meaningless jumble of symbols.

The word breaks down into two parts: 文字 (moji) means "character" or "letter," and 化け (bake) comes from the verb 化ける, "to change form," which is often said of a ghost or monster taking on a disguise.

How Mojibake Occurs

Mojibake happens when the encoding used to decode bytes differs from the encoding used to produce them. The bytes are correct, but the interpretation is wrong.

UTF-8 source → stored as UTF-8 bytes → read back as Latin-1
"café"  →  63 61 66 C3 A9  →  "cafÃ©"

The bytes C3 A9 are the UTF-8 encoding of é (U+00E9). In Latin-1, C3 = Ã and A9 = ©. Result: cafÃ©.
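
The same round trip can be reproduced in a few lines of Python. This is only a sketch of the mismatch described above; the variable names are illustrative.

```python
original = "café"

# Writer side: the text is encoded as UTF-8.
stored = original.encode("utf-8")

# Reader side: the same bytes are decoded with the wrong codec.
misread = stored.decode("latin-1")

print(stored.hex(" "))  # 63 61 66 c3 a9
print(misread)          # cafÃ©
```

No byte changes between the two steps; only the codec used to interpret them differs.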

Classic Mojibake Patterns

Original → Mojibake (UTF-8 read as Latin-1 / Windows-1252)
é → Ã©     (U+00E9 → C3 A9)
ü → Ã¼     (U+00FC → C3 BC)
– → â€“    (U+2013 EN DASH → E2 80 93)
“ → â€œ   (U+201C LEFT DOUBLE QUOTATION MARK → E2 80 9C)
” → â€      (U+201D RIGHT DOUBLE QUOTATION MARK → E2 80 9D; 9D is unassigned in Windows-1252, so the third character is usually invisible or dropped)
™ → â„¢    (U+2122 TRADE MARK SIGN → E2 84 A2)
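
The rows above can be regenerated rather than memorized. The helper below is a hypothetical sketch (`simulate_misread` is not a library function): it encodes with one codec and decodes with another, substituting U+FFFD where a byte has no mapping.

```python
def simulate_misread(text: str, written_as: str, read_as: str) -> str:
    """Encode with one codec, decode with another; undecodable bytes become U+FFFD."""
    return text.encode(written_as).decode(read_as, errors="replace")


# UTF-8 text read as Windows-1252, matching the table rows.
for ch in ["é", "ü", "\u2013", "\u201c", "\u201d", "\u2122"]:
    print(f"{ch} -> {simulate_misread(ch, 'utf-8', 'windows-1252')}")

# The same helper works for other pairs, such as Shift-JIS read as UTF-8.
garbled_japanese = simulate_misread("日本語", "shift_jis", "utf-8")
print(garbled_japanese)
```

With `errors="replace"`, the unassigned byte 0x9D in the U+201D row shows up as U+FFFD instead of raising `UnicodeDecodeError`.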

Original → Mojibake (Shift-JIS read as UTF-8)
"日本語" → mostly U+FFFD replacement characters (�) or decode errors, because Shift-JIS byte sequences are rarely valid UTF-8

Detecting and Fixing Mojibake in Python

# Common pattern: UTF-8 bytes decoded as Latin-1, then re-encoded
mangled = "cafÃ©"  # UTF-8 decoded as Latin-1

# Fix: re-encode as Latin-1, then decode as UTF-8
fixed = mangled.encode("latin-1").decode("utf-8")
print(fixed)  # "café"

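# For Windows-1252 mojibake: "€" and "™" do not exist in Latin-1,
# so .encode("latin-1") would raise UnicodeEncodeError on this string
windows_mangled = "itâ€™s"  # UTF-8 bytes E2 80 99 decoded as Windows-1252
fixed2 = windows_mangled.encode("windows-1252").decode("utf-8")
print(fixed2)  # "it’s"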

# ftfy library: automatically fixes most common mojibake
import ftfy  # pip install ftfy
ftfy.fix_text("â€œHello worldâ€")  # '"Hello world"'
ftfy.fix_text("cafÃ©")             # "café"
ftfy.fix_text("日本語")            # unchanged (already correct)
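
The manual pattern raises an exception when the input was never mangled, which doubles as a crude detection step. The helper below is a hypothetical sketch under that assumption (`try_fix_mojibake` is not part of ftfy or the standard library), and it reverses only a single layer of UTF-8 read as a one-byte codec.

```python
def try_fix_mojibake(text: str, wrong_codec: str = "windows-1252") -> str:
    """Reverse one UTF-8-read-as-`wrong_codec` mix-up; return text unchanged if that fails."""
    try:
        return text.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text holds characters the wrong codec cannot produce,
        # or the resulting bytes are not valid UTF-8: assume it was not mangled.
        return text


print(try_fix_mojibake("cafÃ©"))   # café
print(try_fix_mojibake("café"))    # café (left alone)
print(try_fix_mojibake("日本語"))  # 日本語 (left alone)
```

The guard is not foolproof: correct text that happens to look like a valid mix-up, such as a literal "Ã©", would still be rewritten. That is the kind of case heuristic libraries try to weigh more carefully.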

Database Mojibake

A common scenario: a MySQL connection whose charset is set to latin1 while the client actually sends UTF-8 data:

-- MySQL: wrong connection charset
SET NAMES latin1;  -- server assumes the client sends Latin-1
-- Insert UTF-8 encoded bytes...
-- Each byte of a UTF-8 multi-byte sequence is treated as a separate Latin-1 character

-- Query shows mojibake:
SELECT name FROM users;  -- "cafÃ©"

-- Fix: declare the charset the client really uses. This works when the stored
-- bytes are intact UTF-8; rows that were double-encoded on insert need a separate repair.
SET NAMES utf8mb4;
SELECT name FROM users;  -- "café"
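
What happens to a row written over a mislabeled connection can be simulated in Python. This sketch assumes one particular setup, a UTF-8 column behind a connection declared as latin1, and uses Python's `latin-1` codec as a stand-in for the server's conversion; real behavior depends on the column and connection settings.

```python
original = "café"
client_bytes = original.encode("utf-8")  # what the client really sends

# Assumed server behavior: treat each incoming byte as a Latin-1 character,
# then convert those characters to UTF-8 for storage (double encoding).
stored_bytes = client_bytes.decode("latin-1").encode("utf-8")

# A correctly configured UTF-8 reader now sees mojibake.
seen = stored_bytes.decode("utf-8")

# Undoing the extra layer recovers the original text.
repaired = seen.encode("latin-1").decode("utf-8")

print(stored_bytes.hex(" "))  # 63 61 66 c3 83 c2 a9
print(seen, "->", repaired)   # cafÃ© -> café
```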

Prevention

Use UTF-8 everywhere: source code, database connection, HTTP headers, file I/O.
Declare encoding explicitly: HTTP Content-Type: text/html; charset=UTF-8, HTML <meta charset="UTF-8">, MySQL CHARSET=utf8mb4.
Validate at boundaries: decode input bytes at the first opportunity; encode output bytes at the last moment.
Test with non-ASCII content: include accented characters and CJK in test data.
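
The "explicit encoding" and "validate at boundaries" points can be sketched for file I/O as follows. `read_text_strict` is a hypothetical helper, and the temporary file exists only to keep the example self-contained.

```python
import tempfile
from pathlib import Path


def read_text_strict(path: Path, encoding: str = "utf-8") -> str:
    """Decode at the boundary and raise on invalid bytes instead of guessing."""
    return path.read_bytes().decode(encoding, errors="strict")


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"

    # Write with an explicit encoding, using non-ASCII test content.
    sample.write_text("café 日本語", encoding="utf-8")
    round_tripped = read_text_strict(sample)

    # A file written as Latin-1 is rejected rather than silently misread.
    sample.write_bytes("café".encode("latin-1"))
    try:
        read_text_strict(sample)
        rejected = False
    except UnicodeDecodeError:
        rejected = True
```

Failing loudly at the point of input keeps a wrong-encoding file from spreading mojibake into later processing steps.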

Mojibake in Different Contexts

Context	Common cause	Fix
Web page	Missing/wrong charset declaration	`<meta charset="UTF-8">` + HTTP header
Database	MySQL latin1 connection + UTF-8 data	`SET NAMES utf8mb4`
File	Wrong `encoding` argument in `open()`	`open(f, encoding="utf-8")`
Terminal	Terminal encoding ≠ process encoding	`PYTHONIOENCODING=utf-8`
Email	Missing MIME charset	Proper MIME headers
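
For the email row, one way to get a declared charset is to let Python's standard `email` package build the MIME headers. This is a minimal sketch with a made-up subject and body, not a complete mail-sending example.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Menu"
# set_content() adds the Content-Type header with the requested charset
# and picks a transfer encoding suitable for the non-ASCII body.
msg.set_content("Un café, s'il vous plaît.", charset="utf-8")

declared = msg.get_content_charset()
body = msg.get_content()

print(msg["Content-Type"])
print(declared)
```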

Quick Facts

Property	Value
Japanese meaning	文字化け — "character transformation"
Root cause	Encoding mismatch between write and read
Most common pattern	UTF-8 bytes read as Latin-1 or Windows-1252
Python fix library	`ftfy` (fixes text for you)
Manual fix pattern	`.encode("latin-1").decode("utf-8")`
Prevention	UTF-8 everywhere + explicit encoding declarations

Related terms

Character encoding · UTF-8 · Windows-1252
