Mojibake
Texte illisible résultant du décodage d'octets avec le mauvais encodage. Terme japonais (文字化け). Exemple : 'café' stocké en UTF-8 mais lu en Latin-1 → 'café'.
What Is Mojibake?
Mojibake (文字化け, pronounced "mo-ji-ba-ke") is Japanese for "character transformation" or "garbled characters." It describes the phenomenon where text is displayed or processed using the wrong character encoding, causing it to appear as a meaningless jumble of symbols.
The word itself is elegant evidence of the problem it describes: 文字 (moji) means "character/letter" and 化け (bake) means "to transform" or "to turn into a ghost/monster."
How Mojibake Occurs
Mojibake happens when the encoding used to decode bytes differs from the encoding used to produce them. The bytes are correct, but the interpretation is wrong.
UTF-8 source → stored as UTF-8 bytes → read back as Latin-1
"café" → 63 61 66 C3 A9 → "café"
The bytes C3 A9 are the UTF-8 encoding of é (U+00E9). In Latin-1, C3 = à and A9 = ©. Result: café.
Classic Mojibake Patterns
Original → Mojibake (UTF-8 read as Latin-1 / Windows-1252)
é → é (U+00E9 → C3 A9)
ü → ü (U+00FC → C3 BC)
– → â€" (U+2013 EN DASH → E2 80 93)
" → “ (U+201C LEFT DOUBLE QUOTATION MARK → E2 80 9C)
" → †(U+201D RIGHT DOUBLE QUOTATION MARK → E2 80 9D)
™ → â„¢ (U+2122 TRADE MARK SIGN → E2 84 A2)
Original → Mojibake (Shift-JIS read as UTF-8)
"日本語" → various errors or replacement chars
Detecting and Fixing Mojibake in Python
# Common pattern: UTF-8 bytes decoded as Latin-1, then re-encoded
mangled = "café" # UTF-8 decoded as Latin-1
# Fix: re-encode as Latin-1, then decode as UTF-8
fixed = mangled.encode("latin-1").decode("utf-8")
print(fixed) # "café"
# For Windows-1252 mojibake
windows_mangled = "caf\u00c3\u00a9" # Windows-1252 misread
fixed2 = windows_mangled.encode("windows-1252").decode("utf-8")
# ftfy library: automatically fixes most common mojibake
import ftfy # pip install ftfy
ftfy.fix_text("“Hello worldâ€") # '"Hello world"'
ftfy.fix_text("café") # "café"
ftfy.fix_text("日本語") # unchanged (already correct)
Database Mojibake
A common scenario: a MySQL database set to latin1 connection charset, receiving UTF-8 data:
-- MySQL: wrong connection charset
SET NAMES latin1; -- database thinks client sends Latin-1
-- Insert UTF-8 encoded bytes...
-- Each UTF-8 multi-byte sequence stored as multiple Latin-1 chars
-- Query shows mojibake:
SELECT name FROM users; -- "café"
-- Fix: database actually stores UTF-8 bytes, just needs correct charset
SET NAMES utf8mb4;
SELECT name FROM users; -- "café"
Prevention
- Use UTF-8 everywhere: source code, database connection, HTTP headers, file I/O.
- Declare encoding explicitly: HTTP
Content-Type: text/html; charset=UTF-8, HTML<meta charset="UTF-8">, MySQLCHARSET=utf8mb4. - Validate at boundaries: decode input bytes at the first opportunity; encode output bytes at the last moment.
- Test with non-ASCII content: include accented characters and CJK in test data.
Mojibake in Different Contexts
| Context | Common cause | Fix |
|---|---|---|
| Web page | Missing/wrong charset declaration | <meta charset="UTF-8"> + HTTP header |
| Database | MySQL latin1 connection + UTF-8 data | SET NAMES utf8mb4 |
| File | Wrong encoding argument in open() |
open(f, encoding="utf-8") |
| Terminal | Terminal encoding ≠ process encoding | PYTHONIOENCODING=utf-8 |
| Missing MIME charset | Proper MIME headers |
Quick Facts
| Property | Value |
|---|---|
| Japanese meaning | 文字化け — "character transformation" |
| Root cause | Encoding mismatch between write and read |
| Most common pattern | UTF-8 bytes read as Latin-1 or Windows-1252 |
| Python fix library | ftfy (fixes text for you) |
| Manual fix pattern | .encode("latin-1").decode("utf-8") |
| Prevention | UTF-8 everywhere + explicit encoding declarations |
Termes associés
Plus dans Programmation et développement
La « longueur » d'une chaîne Unicode dépend de l'unité : unités …
U+FFFD (�). Affiché lorsqu'un décodeur rencontre des séquences d'octets invalides — le …
Tout caractère sans glyphe visible : espaces blancs, caractères de largeur nulle, …
U+0000 (NUL). Le premier caractère Unicode/ASCII, utilisé comme terminateur de chaîne en …
Une séquence de caractères dans un langage de programmation. La représentation interne …
L'encodage convertit les caractères en octets (str.encode('utf-8')) ; le décodage convertit les …
Modèles de regex utilisant les propriétés Unicode : \p{L} (toute lettre), \p{Script=Greek} …
Java strings use UTF-16 internally. char is 16-bit (only BMP). For supplementary …
Deux unités de code de 16 bits (un substitut haut U+D800–U+DBFF + …
Python 3 uses Unicode strings by default (str = UTF-8 internally via …