What is 文字エンコーディング?

文字をデジタル保存・送信用のバイト列にマッピングするシステム。すべてのテキストファイルにはエンコーディングがあり、正しく宣言されているかどうかが重要です。

1文字あたり1〜4バイトを使う可変長Unicode エンコーディング。Webの主流エンコーディング（98%以上）で、ASCIIと完全な後方互換性があります。

What is Windows-1252?

ISO 8859-1のMicrosoft拡張版で、0x80〜0x9Fの範囲にスマートクォート・ダッシュ・ユーロ記号を追加します。最も一般的なレガシーラテンエンコーディングです。

プログラミングと開発

文字化け

誤ったエンコーディングでバイトをデコードした際に生じる文字化けテキスト。日本語の用語（文字化け）。例：'café'をUTF-8で保存してLatin-1で読むと'cafÃ©'になります。

2024-05-13 · Updated 2025-03-28

What Is Mojibake?

Mojibake (文字化け, pronounced "mo-ji-ba-ke") is Japanese for "character transformation" or "garbled characters." It describes the phenomenon where text is displayed or processed using the wrong character encoding, causing it to appear as a meaningless jumble of symbols.

The word itself is elegant evidence of the problem it describes: 文字 (moji) means "character/letter" and 化け (bake) means "to transform" or "to turn into a ghost/monster."

How Mojibake Occurs

Mojibake happens when the encoding used to decode bytes differs from the encoding used to produce them. The bytes are correct, but the interpretation is wrong.

UTF-8 source → stored as UTF-8 bytes → read back as Latin-1
"café"  →  63 61 66 C3 A9  →  "cafÃ©"

The bytes C3 A9 are the UTF-8 encoding of é (U+00E9). In Latin-1, C3 = Ã and A9 = ©. Result: cafÃ©.

Classic Mojibake Patterns

Original → Mojibake (UTF-8 read as Latin-1 / Windows-1252)
é → Ã©     (U+00E9 → C3 A9)
ü → Ã¼     (U+00FC → C3 BC)
– → â€"    (U+2013 EN DASH → E2 80 93)
" → â€œ   (U+201C LEFT DOUBLE QUOTATION MARK → E2 80 9C)
" → â€      (U+201D RIGHT DOUBLE QUOTATION MARK → E2 80 9D)
™ → â„¢    (U+2122 TRADE MARK SIGN → E2 84 A2)

Original → Mojibake (Shift-JIS read as UTF-8)
"日本語" → various errors or replacement chars

Detecting and Fixing Mojibake in Python

# Common pattern: UTF-8 bytes decoded as Latin-1, then re-encoded
mangled = "cafÃ©"  # UTF-8 decoded as Latin-1

# Fix: re-encode as Latin-1, then decode as UTF-8
fixed = mangled.encode("latin-1").decode("utf-8")
print(fixed)  # "café"

# For Windows-1252 mojibake
windows_mangled = "caf\u00c3\u00a9"  # Windows-1252 misread
fixed2 = windows_mangled.encode("windows-1252").decode("utf-8")

# ftfy library: automatically fixes most common mojibake
import ftfy  # pip install ftfy
ftfy.fix_text("â€œHello worldâ€")  # '"Hello world"'
ftfy.fix_text("cafÃ©")             # "café"
ftfy.fix_text("日本語")            # unchanged (already correct)

Database Mojibake

A common scenario: a MySQL database set to latin1 connection charset, receiving UTF-8 data:

-- MySQL: wrong connection charset
SET NAMES latin1;  -- database thinks client sends Latin-1
-- Insert UTF-8 encoded bytes...
-- Each UTF-8 multi-byte sequence stored as multiple Latin-1 chars

-- Query shows mojibake:
SELECT name FROM users;  -- "cafÃ©"

-- Fix: database actually stores UTF-8 bytes, just needs correct charset
SET NAMES utf8mb4;
SELECT name FROM users;  -- "café"

Prevention

Use UTF-8 everywhere: source code, database connection, HTTP headers, file I/O.
Declare encoding explicitly: HTTP Content-Type: text/html; charset=UTF-8, HTML <meta charset="UTF-8">, MySQL CHARSET=utf8mb4.
Validate at boundaries: decode input bytes at the first opportunity; encode output bytes at the last moment.
Test with non-ASCII content: include accented characters and CJK in test data.

Mojibake in Different Contexts

Context	Common cause	Fix
Web page	Missing/wrong charset declaration	`<meta charset="UTF-8">` + HTTP header
Database	MySQL latin1 connection + UTF-8 data	`SET NAMES utf8mb4`
File	Wrong `encoding` argument in `open()`	`open(f, encoding="utf-8")`
Terminal	Terminal encoding ≠ process encoding	`PYTHONIOENCODING=utf-8`
Email	Missing MIME charset	Proper MIME headers

Quick Facts

Property	Value
Japanese meaning	文字化け — "character transformation"
Root cause	Encoding mismatch between write and read
Most common pattern	UTF-8 bytes read as Latin-1 or Windows-1252
Python fix library	`ftfy` (fixes text for you)
Manual fix pattern	`.encode("latin-1").decode("utf-8")`
Prevention	UTF-8 everywhere + explicit encoding declarations

プログラミングと開発のその他の用語

Java Unicode

Java strings use UTF-16 internally. char is 16-bit (only BMP). For supplementary …

Python Unicode

Python 3 uses Unicode strings by default (str = UTF-8 internally via …

Rust Unicode

Rust strings (str/String) are guaranteed valid UTF-8. char type represents a Unicode …

Unicode エスケープシーケンス

ソースコードでUnicode文字を表す構文。言語によって異なります：\u2713（Python/Java/JS）・\u{2713}（JS/Ruby/Rust）・\U00012345（Python/C）。

Unicode 正規表現

Unicodeプロパティを使う正規表現パターン：\p{L}（任意の文字）・\p{Script=Greek}（ギリシャ語スクリプト）・\p{Emoji}。言語や正規表現エンジンによってサポートが異なります。

エンコーディング / デコーディング

エンコーディングは文字をバイトに変換し（str.encode('utf-8')）、デコーディングはバイトを文字に変換します（bytes.decode('utf-8')）。正しく行えば文字化けを防げます。

サロゲートペア

UTF-16で補助文字をエンコードするために使われる2つの16ビットコード単位（上位サロゲートU+D800〜U+DBFF ＋下位サロゲートU+DC00〜U+DFFF）。😀 = D83D DE00。

ヌル文字

U+0000（NUL）。最初のUnicode/ASCII文字で、C/C++では文字列ターミネータとして使われます。セキュリティリスク：ヌルバイト挿入は脆弱なシステムで文字列を切り捨てる可能性があります。

不可視文字

目に見えるグリフを持たない文字：空白・ゼロ幅文字・制御文字・書式文字。スプーフィングやテキスト密輸などのセキュリティ問題を引き起こす可能性があります。

文字列

プログラミング言語における文字のシーケンス。内部表現はさまざまです：UTF-8（Go・Rust・新しいPython）・UTF-16（Java・JavaScript・C#）・UTF-32（Python）。

← 用語集へ