What is Character Encoding?

A system that maps characters to byte sequences for storing and transmitting digital data. Every text file has an encoding; the question is whether it is declared correctly.

UTF-8 is a variable-length Unicode encoding that uses 1 to 4 bytes per character. It is the most common encoding on the web (over 98% of websites) and is fully backward compatible with ASCII.
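
The variable length is easy to observe directly. The following is a minimal sketch; the four sample characters are chosen here for illustration, one for each possible length.

```python
# Sample characters picked for illustration: one per UTF-8 length (1 to 4 bytes).
samples = ["A", "é", "日", "😀"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {ch!r}: {len(encoded)} byte(s) -> {encoded.hex(' ')}")

# Backward compatibility: pure ASCII text yields identical bytes in both encodings.
ascii_text = "plain ASCII"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
print(same_bytes)
```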

What is Windows-1252?

A Microsoft character set that extends ISO 8859-1, adding curly quotes, the em dash, and the euro sign in the 0x80–0x9F range. It is the most common legacy "Latin" encoding.
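
The difference from ISO 8859-1 (Latin-1) only shows up in that 0x80–0x9F range. A small sketch, using four bytes picked for illustration:

```python
# Four bytes from the 0x80-0x9F range, chosen for illustration.
raw = bytes([0x80, 0x93, 0x94, 0x97])

as_cp1252 = raw.decode("windows-1252")  # printable punctuation
as_latin1 = raw.decode("latin-1")       # invisible C1 control characters

print(as_cp1252)                                # euro sign, curly quotes, em dash
print([f"U+{ord(c):04X}" for c in as_latin1])   # ['U+0080', 'U+0093', 'U+0094', 'U+0097']
```

This is why the same misread bytes can look different depending on whether a tool treats them as Latin-1 or as Windows-1252.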

Programming and Development

Mojibake

Text corrupted by decoding bytes with the wrong encoding. The term is Japanese (文字化け). Example: 'café' stored as UTF-8 but read as Latin-1 → 'cafÃ©'.

2024-05-13 · Updated 2025-03-28

What Is Mojibake?

Mojibake (文字化け, pronounced "mo-ji-ba-ke") is Japanese for "character transformation" or "garbled characters." It describes text that has been decoded with the wrong character encoding and therefore appears as a meaningless jumble of symbols.

The word itself mirrors the problem it describes: 文字 (moji) means "character" or "letter," and 化け (bake) means "to transform," with the connotation of turning into a ghost or monster.

How Mojibake Occurs

Mojibake happens when the encoding used to decode bytes differs from the encoding used to produce them. The bytes are correct, but the interpretation is wrong.

UTF-8 source → stored as UTF-8 bytes → read back as Latin-1
"café"  →  63 61 66 C3 A9  →  "cafÃ©"

The bytes C3 A9 are the UTF-8 encoding of é (U+00E9). In Latin-1, C3 = Ã and A9 = ©. Result: cafÃ©.
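
The same sequence of events can be reproduced in a few lines of Python. This is only a sketch of the mechanism; the variable names are illustrative.

```python
original = "café"

stored = original.encode("utf-8")    # the writer uses UTF-8
misread = stored.decode("latin-1")   # the reader wrongly assumes Latin-1

print(stored.hex(" "))  # 63 61 66 c3 a9
print(misread)          # cafÃ©
print(len(original), len(misread))  # 4 5: one character became two
```

Latin-1 assigns a character to every byte value, so the wrong decode never raises an error. That is what lets this kind of corruption pass unnoticed.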

Classic Mojibake Patterns

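Original → Mojibake (UTF-8 read as Latin-1 / Windows-1252)
é → Ã©     (U+00E9 → C3 A9)
ü → Ã¼     (U+00FC → C3 BC)
– → â€“    (U+2013 EN DASH → E2 80 93)
“ → â€œ    (U+201C LEFT DOUBLE QUOTATION MARK → E2 80 9C)
” → â€     (U+201D RIGHT DOUBLE QUOTATION MARK → E2 80 9D)
™ → â„¢    (U+2122 TRADE MARK SIGN → E2 84 A2)

The last four rows only appear with Windows-1252, because Latin-1 maps bytes 80 to 9F to invisible control characters. Byte 93 decodes to “ (U+201C), so the third character of the en dash pattern is a curly quote. Byte 9D is unassigned in Windows-1252, so the ” pattern usually shows only â€ followed by an invisible or missing character.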

Original → Mojibake (Shift-JIS read as UTF-8)
"日本語" → decode errors or U+FFFD replacement characters, because most Shift-JIS byte sequences are not valid UTF-8

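The patterns above can be regenerated rather than memorized. The helper below is a hypothetical name introduced for this sketch; it encodes text one way and decodes it another, substituting U+FFFD for bytes the reading codec cannot decode.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode with one codec, decode with another; bad bytes become U+FFFD."""
    return text.encode(written_as).decode(read_as, errors="replace")

# UTF-8 read as Windows-1252: always "succeeds", producing accented debris.
for ch in ["é", "ü", "–", "“", "”", "™"]:
    print(f"{ch} -> {misread(ch, 'utf-8', 'windows-1252')}")

# Shift-JIS read as UTF-8: most bytes are invalid, so replacements appear.
garbled = misread("日本語", "shift_jis", "utf-8")
print(garbled)
```

Python's `windows-1252` codec treats byte 9D as undefined, so the ” row ends in a replacement character here rather than an invisible control character.
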
Detecting and Fixing Mojibake in Python

# Common pattern: UTF-8 bytes decoded as Latin-1, then re-encoded
mangled = "cafÃ©"  # UTF-8 decoded as Latin-1

# Fix: re-encode as Latin-1, then decode as UTF-8
fixed = mangled.encode("latin-1").decode("utf-8")
print(fixed)  # "café"

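# For Windows-1252 mojibake (contains characters such as „ that Latin-1 cannot encode)
windows_mangled = "Acmeâ„¢"  # "Acme™" written as UTF-8, decoded as Windows-1252
fixed2 = windows_mangled.encode("windows-1252").decode("utf-8")
print(fixed2)  # "Acme™"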

# ftfy library: automatically fixes most common mojibake
import ftfy  # pip install ftfy
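# \x9d is the invisible third character of the mangled closing quote
ftfy.fix_text("â€œHello worldâ€\x9d")  # '"Hello world"' (quotes are straightened by default)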
ftfy.fix_text("cafÃ©")             # "café"
ftfy.fix_text("日本語")            # unchanged (already correct)
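
When adding a dependency is not an option, the manual pattern can be wrapped so that it fails safely. The function below is a hypothetical helper, not part of any library; it assumes a single layer of UTF-8 read as Windows-1252 and leaves the text alone when the round trip is impossible.

```python
def try_repair(text: str, wrong_encoding: str = "windows-1252") -> str:
    """Undo one layer of 'UTF-8 decoded as wrong_encoding' mojibake.

    Returns the input unchanged when the round trip fails, which usually
    means the text was not mangled in this particular way.
    """
    try:
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

print(try_repair("cafÃ©"))   # café
print(try_repair("café"))    # café (already correct, left alone)
print(try_repair("日本語"))   # 日本語 (not encodable in Windows-1252, left alone)
```

This is a heuristic: text that legitimately contains a sequence such as Ã© would be altered, so it is best applied only to data already known to be mangled.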

Database Mojibake

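A common scenario: a MySQL connection declared as latin1 while the application actually sends UTF-8 data. The example assumes the name column itself is utf8mb4:

-- MySQL: wrong connection charset
SET NAMES latin1;  -- server assumes the client sends latin1
-- Insert UTF-8 encoded bytes...
-- Each byte of a UTF-8 multi-byte sequence is converted as a separate latin1 character

-- A correctly configured connection then shows mojibake:
SET NAMES utf8mb4;
SELECT name FROM users;  -- "cafÃ©"

-- Prevention: declare utf8mb4 on every connection before writing.
-- Changing the connection charset does not repair rows that are already mangled.
-- Repair (back up first, and do not run it on rows that are already correct):
UPDATE users
SET name = CONVERT(CAST(CONVERT(name USING latin1) AS BINARY) USING utf8mb4);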
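
What happens on the server can be modeled in Python. This is a rough sketch under two assumptions that are not stated in the original: the column is utf8mb4, and Python's `latin-1` codec stands in for MySQL's latin1 (which is closer to Windows-1252, though the two agree for this string).

```python
client_bytes = "café".encode("utf-8")  # what the application really sends

# The server trusts the latin1 declaration and converts each byte
# to a character before storing it as UTF-8: double encoding.
stored_bytes = client_bytes.decode("latin-1").encode("utf-8")
print(stored_bytes.hex(" "))  # 63 61 66 c3 83 c2 a9

# A correctly configured UTF-8 client now reads mojibake.
seen = stored_bytes.decode("utf-8")
print(seen)  # cafÃ©

# Reversing the extra layer recovers the original text.
repaired = seen.encode("latin-1").decode("utf-8")
print(repaired)  # café
```

In this model the stored bytes themselves are wrong, so the data needs a one-time repair in addition to a corrected connection charset.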

Prevention

Use UTF-8 everywhere: source code, database connection, HTTP headers, file I/O.
Declare encoding explicitly: HTTP Content-Type: text/html; charset=UTF-8, HTML <meta charset="UTF-8">, MySQL CHARSET=utf8mb4.
Validate at boundaries: decode input bytes at the first opportunity; encode output bytes at the last moment.
Test with non-ASCII content: include accented characters and CJK in test data.
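
A minimal sketch of the boundary rule for file I/O follows. The helper name and file names are illustrative; the point is to name the encoding on every read and write and to let strict decoding reject bytes that are not UTF-8.

```python
import tempfile
from pathlib import Path

def read_text_strict(path: Path) -> str:
    """Decode at the input boundary; raise on bytes that are not valid UTF-8."""
    return path.read_bytes().decode("utf-8")  # errors="strict" is the default

with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "sample.txt"
    # Always pass encoding explicitly; the platform default varies.
    good.write_text("café 日本語", encoding="utf-8")
    text = read_text_strict(good)

    # A legacy file written as Windows-1252 is rejected instead of misread.
    legacy = Path(tmp) / "legacy.txt"
    legacy.write_bytes("café".encode("windows-1252"))
    rejected = False
    try:
        read_text_strict(legacy)
    except UnicodeDecodeError:
        rejected = True

print(text, rejected)
```

Failing loudly at the boundary is a design choice: a decode error at read time is easier to trace than mojibake discovered later in a database or report.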

Mojibake in Different Contexts

Context	Common cause	Fix
Web page	Missing/wrong charset declaration	`<meta charset="UTF-8">` + HTTP header
Database	MySQL latin1 connection + UTF-8 data	`SET NAMES utf8mb4`
File	Wrong `encoding` argument in `open()`	`open(f, encoding="utf-8")`
Terminal	Terminal encoding ≠ process encoding	`PYTHONIOENCODING=utf-8`
Email	Missing MIME charset	Proper MIME headers

Quick Facts

Property	Value
Japanese meaning	文字化け — "character transformation"
Root cause	Encoding mismatch between write and read
Most common pattern	UTF-8 bytes read as Latin-1 or Windows-1252
Python fix library	`ftfy` (fixes text for you)
Manual fix pattern	`.encode("latin-1").decode("utf-8")`
Prevention	UTF-8 everywhere + explicit encoding declarations