How to Fix Mojibake (Garbled Text)
Mojibake is the garbled text you see when a file encoded in one character set is interpreted as another, producing strings like â€™ instead of the apostrophe ’ or Ã« instead of ë.
You open a file and instead of readable text you see Ã© where there should be an "é",
or ?????? where Japanese text should appear, or РїСЂРёРІРµС‚ instead of Russian "привет".
This garbled output is called mojibake (文字化け, Japanese for "character
transformation"), and it is one of the most common and frustrating text encoding problems.
This guide explains why mojibake happens, how to identify the original encoding from the garbled output, and how to fix and prevent it in your projects.
What Causes Mojibake
Mojibake occurs when text encoded in one character encoding is decoded using a different, incompatible encoding. The bytes are valid in both encodings, but they map to completely different characters.
The Mechanism
Consider the French word "été" (summer). In UTF-8, the character é (U+00E9) is
stored as two bytes: 0xC3 0xA9. If a program reads those bytes as Latin-1 (ISO 8859-1)
instead of UTF-8, it interprets each byte separately:
| Byte | UTF-8 Interpretation | Latin-1 Interpretation |
|---|---|---|
| 0xC3 | (first byte of a two-byte sequence) | Ã (A with tilde) |
| 0xA9 | (second byte, completes é) | © (copyright sign) |
So "été" becomes "Ã©tÃ©" -- classic mojibake.
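The byte-level mechanism above can be reproduced in two lines of Python: encode "été" to UTF-8, then decode those same bytes as Latin-1.

```python
text = "été"
data = text.encode("utf-8")
print([hex(b) for b in data])  # ['0xc3', '0xa9', '0x74', '0xc3', '0xa9']

# Decode the same bytes with the wrong encoding
garbled = data.decode("latin-1")
print(garbled)  # Ã©tÃ©
```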
Common Mojibake Patterns
Recognizing the pattern tells you which encoding mismatch occurred:
| You See | Original | What Happened |
|---|---|---|
| Ã© | é | UTF-8 read as Latin-1 |
| Ã¼ | ü | UTF-8 read as Latin-1 |
| Ã± | ñ | UTF-8 read as Latin-1 |
| Ã§ | ç | UTF-8 read as Latin-1 |
| Ã¶ | ö | UTF-8 read as Latin-1 |
| ��� | (any non-ASCII) | Undecodable bytes replaced with U+FFFD |
| ?????? | (any non-ASCII) | Encoding unsupported, replaced with ? |
| æ–‡å­— | 文字 | UTF-8 CJK read as Windows-1252 |
| РїСЂРёРІРµС‚ | привет | UTF-8 Cyrillic read as Windows-1251 |
The most common pattern worldwide is UTF-8 misread as Latin-1 / Windows-1252, because UTF-8 multi-byte sequences happen to contain bytes that are valid Latin-1 characters.
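Each row of the table above is just an encode with the correct codec followed by a decode with the wrong one, so you can regenerate the patterns yourself:

```python
# UTF-8 bytes misread as Latin-1
for original in ["é", "ü", "ñ", "ç", "ö"]:
    print(original, "->", original.encode("utf-8").decode("latin-1"))

# UTF-8 Cyrillic misread as Windows-1251
print("привет", "->", "привет".encode("utf-8").decode("cp1251"))  # РїСЂРёРІРµС‚
```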
How to Identify the Original Encoding
Step 1: Examine the Garbled Text
Look for telltale byte patterns:
- Ã followed by another character (Ã©, Ã¼, Ã±, Ã¶): almost certainly UTF-8 read as Latin-1
- Â followed by a symbol (Â©, Â®, Â°): UTF-8 read as Latin-1 (the Â is byte 0xC2)
- Å followed by a character (Å¡, Å¾): UTF-8 of Central European text read as Latin-1
- Double mojibake (ÃƒÂ©): UTF-8 that was "fixed" by a Latin-1/Windows-1252 round trip, twice
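These heuristics are easy to automate with a substring check. A minimal sketch; the function name and marker list are illustrative, not a standard API:

```python
def looks_like_utf8_mojibake(text: str) -> bool:
    # Lead characters produced when UTF-8 multi-byte sequences
    # are misread as Latin-1 / Windows-1252
    markers = ("Ã", "Â", "Å", "â€")
    return any(m in text for m in markers)

print(looks_like_utf8_mojibake("Ã©tÃ©"))  # True
print(looks_like_utf8_mojibake("été"))    # False
```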
Step 2: Try Re-encoding
The fix is to reverse the process: take the garbled text, encode it back to the wrong encoding, then decode it with the correct one.
# Python: Fix UTF-8 mojibake caused by Latin-1 misread
garbled = "Ã©tÃ©"
fixed = garbled.encode("latin-1").decode("utf-8")
print(fixed)  # "été"
# Fix double mojibake (garbled twice)
# "ƒ" (byte 0x83) exists only in Windows-1252, so round-trip through cp1252
double_garbled = "ÃƒÂ©"
step1 = double_garbled.encode("cp1252").decode("utf-8")  # "Ã©"
step2 = step1.encode("cp1252").decode("utf-8")
print(step2)  # "é"
Step 3: Use Detection Tools
When you cannot identify the encoding by pattern, use automatic detection:
# pip install chardet
import chardet
with open("mystery_file.txt", "rb") as f:
    raw = f.read()
result = chardet.detect(raw)
print(result)
# {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
# Decode with detected encoding
text = raw.decode(result["encoding"])
# pip install charset-normalizer (more accurate alternative)
from charset_normalizer import from_bytes
raw = b"\xc3\xa9t\xc3\xa9"  # UTF-8 bytes for "été"
results = from_bytes(raw)
best = results.best()
print(best.encoding)  # 'utf_8'
print(str(best))  # 'été'
Command-Line Tools
# Detect encoding with file command
file -i mystery_file.txt
# mystery_file.txt: text/plain; charset=utf-8
# Convert encoding with iconv
iconv -f WINDOWS-1252 -t UTF-8 input.txt > output.txt
# Convert encoding with recode
recode WINDOWS-1252..UTF-8 input.txt
# Detect with uchardet (more accurate than file)
uchardet mystery_file.txt
# UTF-8
The ftfy Library (Python)
The Python library ftfy ("fixes text for you") is specifically designed to repair mojibake automatically. It recognizes dozens of common mojibake patterns and reverses them.
# pip install ftfy
import ftfy
# Single mojibake
print(ftfy.fix_text("Ã©tÃ©"))   # "été"
print(ftfy.fix_text("schÃ¶n"))  # "schön"
# Double mojibake
print(ftfy.fix_text("ÃƒÂ©"))  # "é"
# Mixed mojibake and correct text
print(ftfy.fix_text("I love Ã©clairs"))  # "I love éclairs"
# CJK mojibake
print(ftfy.fix_text("æ–‡å­—"))  # "文字"
# Show what each codepoint in a string is
print(ftfy.explain_unicode("Ã©"))
ftfy is the single best tool for fixing mojibake in Python. It handles edge cases that simple encode/decode cycles miss, including partial mojibake, mixed encodings, and HTML entity confusion.
Common Mojibake Scenarios and Fixes
Scenario 1: Database Mojibake
You inserted UTF-8 text into a MySQL database with latin1 encoding. The data is stored
as raw bytes, but MySQL treats them as Latin-1 characters.
-- Check current encoding
SHOW VARIABLES LIKE 'character_set_%';
-- Fix: Tell MySQL the column is actually binary, then convert
ALTER TABLE articles MODIFY content BLOB;
ALTER TABLE articles MODIFY content TEXT CHARACTER SET utf8mb4;
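If altering the table is not an option, you can also repair already-garbled values in application code before re-inserting them. A small sketch; the helper name is mine, and it returns the input unchanged when the text is not actually mojibake (the reverse round trip fails for ordinary accented text):

```python
def repair_latin1_mojibake(value: str) -> str:
    """Reverse a UTF-8-stored-as-Latin-1 round trip, if one happened."""
    try:
        return value.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return value  # not mojibake, leave it alone

print(repair_latin1_mojibake("Ã©tÃ©"))  # été
print(repair_latin1_mojibake("été"))    # été (already correct)
```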
Scenario 2: CSV File Mojibake
You exported a CSV from Excel on Windows and opened it on Linux. Windows likely used Windows-1252 encoding; your Linux tool expected UTF-8.
import csv
# Read with the correct encoding
with open("export.csv", encoding="windows-1252") as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
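Going the other direction (producing a CSV that Excel on Windows opens correctly) means writing UTF-8 with a BOM, which Python exposes as the utf-8-sig codec. The filename and rows below are placeholders:

```python
import csv

rows = [["name", "city"], ["José", "Zürich"]]
with open("export_utf8.csv", "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)

# The BOM bytes EF BB BF now lead the file, which is what Excel looks for
with open("export_utf8.csv", "rb") as f:
    print(f.read(3))  # b'\xef\xbb\xbf'
```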
Scenario 3: Email Mojibake
An email header says Content-Type: text/plain; charset=iso-8859-1 but the body is
actually UTF-8. Your email client decodes using the declared charset and produces mojibake.
Fix: Save the raw email source (.eml file) and re-decode the body:
raw_body = garbled_text.encode("iso-8859-1")
fixed = raw_body.decode("utf-8")
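Python's standard email package can do the parsing for you. A sketch with an inline message whose header deliberately lies about the charset (the message content here is made up):

```python
from email import message_from_bytes, policy

# Body bytes are UTF-8 for "été", but the header claims iso-8859-1
raw = (b"Content-Type: text/plain; charset=iso-8859-1\r\n"
       b"\r\n"
       b"\xc3\xa9t\xc3\xa9\r\n")

msg = message_from_bytes(raw, policy=policy.default)
garbled = msg.get_content().rstrip("\r\n")  # decoded with the declared (wrong) charset
fixed = garbled.encode("iso-8859-1").decode("utf-8")
print(garbled)  # Ã©tÃ©
print(fixed)    # été
```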
Scenario 4: Web Page Mojibake
A web page declares <meta charset="iso-8859-1"> but serves UTF-8 content, or vice versa.
Fix for your own site:
<!-- Always declare UTF-8 as the very first thing in <head> -->
<meta charset="utf-8">
Fix for reading someone else's site:
import requests
r = requests.get("https://example.com")
# Override detected encoding
r.encoding = "utf-8"
print(r.text)
Prevention Checklist
The best fix for mojibake is to never create it in the first place. Follow this checklist:
1. Use UTF-8 Everywhere
| Layer | Setting |
|---|---|
| Files | Save as UTF-8 (with or without BOM) |
| HTML | <meta charset="utf-8"> as first child of <head> |
| HTTP | Content-Type: text/html; charset=utf-8 |
| Database | utf8mb4 (MySQL) or UTF8 (PostgreSQL) |
| Connection | SET NAMES utf8mb4 (MySQL) |
| CSV export | UTF-8 with BOM for Excel compatibility |
| JSON | Always UTF-8 (per RFC 8259) |
| XML | <?xml version="1.0" encoding="UTF-8"?> |
2. Declare Encoding Explicitly
Never rely on defaults or auto-detection. Always declare the encoding at every layer:
# Python: always specify encoding when opening files
with open("data.txt", encoding="utf-8") as f:
    text = f.read()
// Java: specify charset in InputStreamReader
BufferedReader reader = new BufferedReader(
    new InputStreamReader(new FileInputStream("data.txt"), StandardCharsets.UTF_8)
);
3. Validate at Boundaries
Check encoding at every system boundary -- file I/O, network I/O, database connections, API calls:
def safe_decode(data: bytes, declared_encoding: str = "utf-8") -> str:
    try:
        return data.decode(declared_encoding)
    except UnicodeDecodeError:
        # Fallback: detect the encoding
        import chardet
        detected = chardet.detect(data)
        return data.decode(detected["encoding"] or "utf-8", errors="replace")
4. Test with Non-ASCII Data
Always test your software with text that includes:
- Accented Latin characters: é, ü, ñ, ç
- CJK characters: 中文, 日本語, 한국어
- Cyrillic: Привет
- Arabic: مرحبا
- Emoji: 😀🚀🌍
If all these round-trip correctly through your system, your encoding handling is solid.
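A quick self-test along these lines writes the samples to a temporary file and reads them back:

```python
import os
import tempfile

samples = ["é ü ñ ç", "中文 日本語 한국어", "Привет", "مرحبا", "😀🚀🌍"]

path = os.path.join(tempfile.mkdtemp(), "roundtrip.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(samples))

with open(path, encoding="utf-8") as f:
    restored = f.read().split("\n")

print(restored == samples)  # True
```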