How to Fix Mojibake (Garbled Text)

You open a file and instead of readable text you see é where there should be an "é", or ?????? where Japanese text should appear, or РїСЂРёРІРµС‚ instead of Russian "привет". This garbled output is called mojibake (文字化け, Japanese for "character transformation"), and it is one of the most common and frustrating text encoding problems.

This guide explains why mojibake happens, how to identify the original encoding from the garbled output, and how to fix and prevent it in your projects.

What Causes Mojibake

Mojibake occurs when text encoded in one character encoding is decoded using a different, incompatible encoding. The bytes are valid in both encodings, but they map to completely different characters.

The Mechanism

Consider the French word "été" (summer). In UTF-8, the character é (U+00E9) is stored as two bytes: 0xC3 0xA9. If a program reads those bytes as Latin-1 (ISO 8859-1) instead of UTF-8, it interprets each byte separately:

Byte   UTF-8 Interpretation                 Latin-1 Interpretation
0xC3   first byte of a two-byte sequence    Ã (A with tilde)
0xA9   second byte, completes é             © (copyright sign)

So "été" becomes "été" -- classic mojibake.

Common Mojibake Patterns

Recognizing the pattern tells you which encoding mismatch occurred:

You See        Original          What Happened
é             é                 UTF-8 read as Latin-1
ü             ü                 UTF-8 read as Latin-1
ñ             ñ                 UTF-8 read as Latin-1
Ã§             ç                 UTF-8 read as Latin-1
Ã¶             ö                 UTF-8 read as Latin-1
’           '                 UTF-8 read as Windows-1252
���            (any non-ASCII)   Undecodable bytes replaced with U+FFFD
??????         (any non-ASCII)   Encoding unsupported, characters replaced with ?
æ–‡å­—         文字              UTF-8 CJK read as Windows-1252
РїСЂРёРІРµС‚   привет            UTF-8 Cyrillic read as Windows-1251

The most common pattern worldwide is UTF-8 misread as Latin-1 / Windows-1252, because UTF-8 multi-byte sequences happen to contain bytes that are valid Latin-1 characters.

How to Identify the Original Encoding

Step 1: Examine the Garbled Text

Look for telltale byte patterns:

  • Ã followed by another character (é, ü, ñ, ö): Almost certainly UTF-8 read as Latin-1
  • Â followed by a symbol (Â©, Â®, Â°): UTF-8 read as Latin-1 (the Â is byte 0xC2)
  • Å followed by a character (Å¡ for š, Å¾ for ž): UTF-8 of Central European text read as Latin-1
  • Double mojibake (ÃƒÂ©): UTF-8 that went through the Latin-1/Windows-1252 misread twice

Step 2: Try Re-encoding

The fix is to reverse the process: take the garbled text, encode it back to the wrong encoding, then decode it with the correct one.

# Python: Fix UTF-8 mojibake caused by Latin-1 misread
garbled = "été"
fixed = garbled.encode("latin-1").decode("utf-8")
print(fixed)  # "été"

# Fix double mojibake (misread twice); \x83 is an invisible
# control character that the double misread leaves behind
double_garbled = "Ã\x83Â©"
step1 = double_garbled.encode("latin-1").decode("utf-8")
step2 = step1.encode("latin-1").decode("utf-8")
print(step2)  # "é"
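Since single and double mojibake differ only in how many times the misread happened, the reversal can be wrapped in a loop that stops when the text stabilizes or the bytes stop being valid UTF-8. This is an illustrative sketch (the function name is made up); apply it only to text you have already diagnosed as mojibake, because the round trip can mangle legitimate text:

```python
def undo_mojibake(text: str, max_rounds: int = 3) -> str:
    """Repeatedly reverse a Latin-1-for-UTF-8 misread until the
    text stops changing or can no longer be reversed."""
    for _ in range(max_rounds):
        try:
            candidate = text.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # bytes are not valid UTF-8, or chars not in Latin-1
        if candidate == text:
            break  # stable: nothing left to undo
        text = candidate
    return text

print(undo_mojibake("été"))      # "é­té" fixed to "été"
print(undo_mojibake("Ã\x83Â©"))   # double mojibake fixed to "é"
```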

Step 3: Use Detection Tools

When you cannot identify the encoding by pattern, use automatic detection:

# pip install chardet
import chardet

with open("mystery_file.txt", "rb") as f:
    raw = f.read()
    result = chardet.detect(raw)
    print(result)
    # {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

# Decode with detected encoding
text = raw.decode(result["encoding"])

# pip install charset-normalizer (more accurate alternative)
from charset_normalizer import from_bytes

raw = b"\xc3\xa9t\xc3\xa9"  # UTF-8 bytes for "été"
results = from_bytes(raw)
best = results.best()
print(best.encoding)  # 'utf-8'
print(str(best))      # 'été'

Command-Line Tools

# Detect encoding with file command
file -i mystery_file.txt
# mystery_file.txt: text/plain; charset=utf-8

# Convert encoding with iconv
iconv -f WINDOWS-1252 -t UTF-8 input.txt > output.txt

# Convert encoding with recode
recode WINDOWS-1252..UTF-8 input.txt

# Detect with uchardet (more accurate than file)
uchardet mystery_file.txt
# UTF-8

The ftfy Library (Python)

The Python library ftfy ("fixes text for you") is specifically designed to repair mojibake automatically. It recognizes dozens of common mojibake patterns and reverses them.

# pip install ftfy
import ftfy

# Single mojibake
print(ftfy.fix_text("été"))              # "été"
print(ftfy.fix_text("schön"))             # "schön"

# Double mojibake
print(ftfy.fix_text("ÃƒÂ©"))              # "é"

# Mixed mojibake and correct text
print(ftfy.fix_text("I love éclairs"))    # "I love éclairs"

# CJK mojibake
print(ftfy.fix_text("æ–‡å­—"))            # "文字"

# Inspect the code points behind a suspicious string
print(ftfy.explain_unicode("é"))

ftfy is the single best tool for fixing mojibake in Python. It handles edge cases that simple encode/decode cycles miss, including partial mojibake, mixed encodings, and HTML entity confusion.

Common Mojibake Scenarios and Fixes

Scenario 1: Database Mojibake

You inserted UTF-8 text into a MySQL database with latin1 encoding. The data is stored as raw bytes, but MySQL treats them as Latin-1 characters.

-- Check current encoding
SHOW VARIABLES LIKE 'character_set_%';

-- Fix: Tell MySQL the column is actually binary, then convert
ALTER TABLE articles MODIFY content BLOB;
ALTER TABLE articles MODIFY content TEXT CHARACTER SET utf8mb4;

Scenario 2: CSV File Mojibake

You exported a CSV from Excel on Windows and opened it on Linux. Windows likely used Windows-1252 encoding; your Linux tool expected UTF-8.

import csv

# Read with the correct encoding
with open("export.csv", encoding="windows-1252") as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
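Reading with the right encoding fixes one script; converting the file once fixes it for every downstream tool. A sketch along those lines, with illustrative file names, that re-saves the CSV as UTF-8 with a BOM (Python's utf-8-sig codec, which Excel also reads correctly):

```python
# Illustrative: create a sample Windows-1252 CSV, then convert it
# to UTF-8 with BOM so Excel and Linux tools agree on the encoding.
with open("export.csv", "w", encoding="windows-1252", newline="") as f:
    f.write("name,city\nRenée,Zürich\n")

with open("export.csv", encoding="windows-1252") as src, \
     open("export-utf8.csv", "w", encoding="utf-8-sig", newline="") as dst:
    dst.write(src.read())

# The converted file starts with the UTF-8 BOM (EF BB BF).
print(open("export-utf8.csv", "rb").read()[:3])  # b'\xef\xbb\xbf'
```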

Scenario 3: Email Mojibake

An email header says Content-Type: text/plain; charset=iso-8859-1 but the body is actually UTF-8. Your email client decodes using the declared charset and produces mojibake.

Fix: Save the raw email source (.eml file) and re-decode the body:

raw_body = garbled_text.encode("iso-8859-1")
fixed = raw_body.decode("utf-8")
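Python's standard email module can do the same directly from the raw source: message_from_bytes parses the message, and get_payload(decode=True) returns the body bytes before any charset decoding, so you can apply the correct encoding yourself. The sample message below is constructed inline for illustration:

```python
import email

# A minimal message whose header lies: the body is UTF-8,
# but the Content-Type claims ISO-8859-1.
raw = (b"Content-Type: text/plain; charset=iso-8859-1\r\n"
       b"\r\n"
       + "été".encode("utf-8"))

msg = email.message_from_bytes(raw)
body_bytes = msg.get_payload(decode=True)  # raw bytes, charset not applied
print(msg.get_content_charset())           # 'iso-8859-1' (the lie)
print(body_bytes.decode("utf-8"))          # 'été' (the truth)
```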

Scenario 4: Web Page Mojibake

A web page declares <meta charset="iso-8859-1"> but serves UTF-8 content, or vice versa.

Fix for your own site:

<!-- Always declare UTF-8 as the very first thing in <head> -->
<meta charset="utf-8">

Fix for reading someone else's site:

import requests

r = requests.get("https://example.com")
# Override detected encoding
r.encoding = "utf-8"
print(r.text)

Prevention Checklist

The best fix for mojibake is to never create it in the first place. Follow this checklist:

1. Use UTF-8 Everywhere

Layer        Setting
Files        Save as UTF-8 (with or without BOM)
HTML         <meta charset="utf-8"> as first child of <head>
HTTP         Content-Type: text/html; charset=utf-8
Database     utf8mb4 (MySQL) or UTF8 (PostgreSQL)
Connection   SET NAMES utf8mb4 (MySQL)
CSV export   UTF-8 with BOM for Excel compatibility
JSON         Always UTF-8 (per RFC 8259)
XML          <?xml version="1.0" encoding="UTF-8"?>

2. Declare Encoding Explicitly

Never rely on defaults or auto-detection. Always declare the encoding at every layer:

# Python: always specify encoding when opening files
with open("data.txt", encoding="utf-8") as f:
    text = f.read()

// Java: specify charset in InputStreamReader
BufferedReader reader = new BufferedReader(
    new InputStreamReader(new FileInputStream("data.txt"), StandardCharsets.UTF_8)
);

3. Validate at Boundaries

Check encoding at every system boundary -- file I/O, network I/O, database connections, API calls:

def safe_decode(data: bytes, declared_encoding: str = "utf-8") -> str:
    try:
        return data.decode(declared_encoding)
    except UnicodeDecodeError:
        # Fallback: detect encoding
        import chardet
        detected = chardet.detect(data)
        return data.decode(detected["encoding"] or "utf-8", errors="replace")

4. Test with Non-ASCII Data

Always test your software with text that includes:

  • Accented Latin characters: é, ü, ñ, ç
  • CJK characters: 中文, 日本語, 한국어
  • Cyrillic: Привет
  • Arabic: مرحبا
  • Emoji: 😀🚀🌍

If all these round-trip correctly through your system, your encoding handling is solid.
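A minimal round-trip check along these lines (the sample set and temp-file handling are illustrative) is easy to drop into a test suite:

```python
import os
import tempfile

samples = ["été", "中文", "日本語", "한국어", "Привет", "مرحبا", "😀🚀🌍"]

for s in samples:
    # Bytes round-trip: encode to UTF-8 and back.
    assert s.encode("utf-8").decode("utf-8") == s

    # File round-trip: write and read with an explicit encoding.
    with tempfile.NamedTemporaryFile("w", encoding="utf-8",
                                     suffix=".txt", delete=False) as f:
        f.write(s)
        path = f.name
    with open(path, encoding="utf-8") as f:
        assert f.read() == s
    os.unlink(path)

print("all samples round-trip OK")
```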
