The Mojibake Problem: A History
Mojibake — Japanese for 'character transformation' — is the garbled text that results from encoding mismatches, a problem that plagued global computing for decades before UTF-8 became dominant. This article tells the history of mojibake, the chaos of incompatible national encodings, and how Unicode gradually solved a problem that once made international email a lottery.
Mojibake (文字化け) is a Japanese word that translates roughly as "character transformation" or "character ghost." It describes the visual result of text being decoded with the wrong character encoding: a stream of symbols that are recognizably characters but convey no meaning — the digital equivalent of a letter arriving in a language you cannot read.
Mojibake is not merely an aesthetic annoyance. In the era before UTF-8 dominated the web, it was a persistent obstacle to international communication, data exchange, and software interoperability. Understanding its history is understanding a decades-long failure of coordination.
The Root Cause
Every text file is a sequence of bytes. To display it as text, software must know which encoding maps those bytes to characters. When the encoding used to write the file differs from the encoding used to read it, characters are mismatched — and the result is mojibake.
The problem is compounded by the fact that many early text formats had no reliable way to declare their encoding. A file was just bytes. The software that opened it had to guess — or, worse, assume a default that might be wrong.
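The core failure mode can be sketched in a few lines of Python (the strings here are purely illustrative):

```python
text = "文字化け"  # "mojibake", written in Japanese

# The same string produces different bytes under different encodings.
as_utf8 = text.encode("utf-8")      # 3 bytes per character here
as_sjis = text.encode("shift_jis")  # 2 bytes per character

# Nothing in the bytes themselves says which encoding was used.
# A reader that guesses wrong gets characters, just not the right ones.
garbled = as_utf8.decode("latin-1")
assert garbled != text  # recognizably text, meaning nothing
```

The last line is the whole phenomenon in miniature: the decode succeeds without error, so the software has no idea anything went wrong.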
Shift_JIS vs. EUC-JP: The Japanese Encoding War
Japan developed digital text encoding in the late 1970s and early 1980s, and two encodings competed for dominance:
- Shift_JIS: Developed primarily by Microsoft and ASCII Corporation. Used a variable-length scheme where Latin characters used 1 byte and Japanese characters used 2 bytes. Supported on early PCs and widely used in Microsoft's Japanese Windows.
- EUC-JP (Extended Unix Code for Japanese): Used primarily on Unix workstations and servers. Also variable-length, but with a different byte structure.
Both encodings covered roughly the same JIS character set, but their byte representations were incompatible. A document written in Shift_JIS and read as EUC-JP produced mojibake and vice versa.
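The incompatibility is easy to reproduce with Python's built-in codecs (a sketch, not a period-accurate recreation of 1980s software):

```python
text = "日本語"  # "Japanese"

sjis = text.encode("shift_jis")
euc = text.encode("euc_jp")

# Same JIS characters, incompatible byte sequences.
assert sjis != euc

# Decoding Shift_JIS bytes as EUC-JP garbles the text; a strict
# decoder would often fail outright, so we substitute U+FFFD here.
garbled = sjis.decode("euc_jp", errors="replace")
assert garbled != text
```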
This was not a small-scale problem. Japan had one of the world's largest PC markets, and the split between MS-DOS/Windows (Shift_JIS) and Unix (EUC-JP) meant that Japanese text constantly moved between contexts where the encoding assumption differed. Email clients, FTP servers, bulletin boards, and early web servers were all potential mojibake generators.
A third encoding, ISO-2022-JP (also called JIS), was used specifically for email and used escape sequences to switch between ASCII and Japanese modes. A mail reader that did not support escape sequence switching would display the escape characters as literal symbols, producing a characteristically garbled result.
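Python's iso2022_jp codec makes the escape-sequence mechanism visible; the bytes below are what a non-supporting mail reader would attempt to display literally:

```python
encoded = "あ".encode("iso2022_jp")  # a single hiragana character
print(encoded)  # b'\x1b$B$"\x1b(B'

# ESC $ B switches into JIS X 0208 mode; ESC ( B switches back to
# ASCII. Strip the non-printing ESC bytes, as a naive reader
# effectively did, and only mode-switch debris remains on screen.
print(encoded.replace(b"\x1b", b"").decode("ascii"))  # $B$"(B
```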
Latin-1 vs. UTF-8: The European Problem
In Europe, the dominant encoding for Western European languages was ISO-8859-1 (Latin-1), which extended ASCII with accented characters for languages like French, German, Spanish, and Portuguese. Windows used a slightly different variant called Windows-1252 (also known as cp1252), which added printable characters in the range 0x80–0x9F that ISO-8859-1 left undefined.
As UTF-8 adoption spread in the late 1990s and 2000s, documents began arriving in UTF-8 where Latin-1 was expected, and vice versa. The characteristic symptom: a character like "é" (U+00E9) is encoded in UTF-8 as the two bytes 0xC3 0xA9, and when a UTF-8 document is read as Latin-1, those two bytes are displayed as "Ã©" — recognizable mojibake that appeared constantly on early-2000s web forums. Worse still was double-encoded UTF-8: when the misread text was re-encoded as UTF-8 and saved, the damage was baked permanently into the data.
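The classic pattern, reproduced in Python:

```python
# 'é' (U+00E9) becomes two bytes in UTF-8; read as Latin-1, each
# byte is shown as its own character.
assert "é".encode("utf-8") == b"\xc3\xa9"
print("é".encode("utf-8").decode("latin-1"))  # Ã©

# Double encoding: the misread text is re-encoded as UTF-8 and
# stored, making the corruption permanent.
double = "é".encode("utf-8").decode("latin-1").encode("utf-8")
print(double)                  # b'\xc3\x83\xc2\xa9'
print(double.decode("utf-8"))  # still Ã©, even with correct decoding
```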
The Windows-1252 variant created its own specific mojibake pattern. Characters in the 0x80–0x9F range — including the euro sign (€, 0x80 in cp1252) and the typographic quotes (U+201C and U+201D, at 0x93 and 0x94) — map to invisible C1 control characters when cp1252 text is read as ISO-8859-1, which assigns nothing printable to that range.
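The cp1252/Latin-1 gap can be shown directly in Python:

```python
# The euro sign lives at 0x80 in cp1252 but not in ISO-8859-1.
assert "€".encode("cp1252") == b"\x80"

# Read as Latin-1, that byte becomes the invisible C1 control
# character U+0080 rather than any visible symbol.
assert b"\x80".decode("latin-1") == "\u0080"

# The typographic quotes occupy 0x93 and 0x94 in cp1252.
assert "\u201c\u201d".encode("cp1252") == b"\x93\x94"
```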
Email: A Particularly Hostile Environment
Email was the technology most severely affected by encoding failures. The original SMTP and MIME specifications allowed email headers to declare an encoding, but many mail systems ignored the declarations, mangled the headers, or re-encoded content during transit.
A classic scenario: a user on a Japanese Windows system (Shift_JIS) sends email to a Unix server (EUC-JP). The mail client correctly labels the message as Shift_JIS. The server strips or ignores the Content-Type header. The recipient's mail client uses EUC-JP as its default and displays mojibake.
Worse, some mail gateways actively converted encoding without updating the declaration, producing messages labeled as Shift_JIS but actually containing EUC-JP bytes — a state that no automatic detection could reliably resolve.
Web Page Charset Chaos
Early web pages had no standard mechanism for declaring their encoding. The HTTP Content-Type header could include a charset parameter, and a declaration could be embedded in the document itself via <meta http-equiv="Content-Type"> (HTML5 later added the shorter <meta charset>), but many web servers did not set headers correctly, and many documents included no declaration at all.
Browsers implemented increasingly sophisticated charset detection algorithms to guess the encoding of undeclared documents. Netscape and Internet Explorer developed different heuristics. A page that Netscape correctly identified as Shift_JIS might be misidentified by Internet Explorer as ISO-2022-JP or Latin-1.
This created a peculiar incentive: rather than declaring the correct encoding explicitly, web developers would sometimes structure their pages so that the detection algorithm of the most popular browser would guess the way they wanted, arranging the content's bytes to "look like" the encoding they intended.
The UTF-8 Resolution (Partial)
UTF-8 was designed specifically to be self-synchronizing and ASCII-compatible. Lead bytes announce the length of each multi-byte sequence, continuation bytes carry a distinctive 10xxxxxx bit pattern, and every byte of a multi-byte sequence has the high bit set, so it can never be confused with ASCII. Text in a legacy multi-byte encoding is statistically very unlikely to form valid UTF-8, which makes UTF-8 far more resistant to misidentification than its competitors.
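That resistance is easy to check in practice; a minimal validity probe in Python:

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes form valid UTF-8.

    Because lead and continuation bytes follow strict bit patterns,
    legacy multi-byte text almost never passes by accident.
    """
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

assert looks_like_utf8("文字化け".encode("utf-8"))

# The Shift_JIS bytes for the same text begin with 0x95, which has
# the 10xxxxxx continuation-byte pattern and is invalid as a lead.
assert not looks_like_utf8("文字化け".encode("shift_jis"))
```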
The HTML5 specification (2014) pushed the web decisively toward UTF-8, and the current HTML standard requires new documents to use it, eliminating the largest single source of charset ambiguity on the web. As of the mid-2020s, roughly 98% of websites declare UTF-8, and the share of actual web traffic in UTF-8 is higher still.
But mojibake has not disappeared entirely. Legacy databases storing Latin-1 or Shift_JIS data continue to generate encoding mismatches when queried by UTF-8-expecting applications. Old email archives contain decades of incorrectly labeled messages. And any system that handles user-generated text without rigorous encoding validation can still produce 文字化け.
The Cultural Significance of a Word
The fact that Japanese has a specific, widely-used term for this phenomenon — mojibake — reflects the disproportionate impact of encoding failures on non-Latin writing systems. A document in English would survive most encoding mismatches intact, because ASCII bytes are the same across nearly all encodings. A document in Japanese, Chinese, Arabic, or Russian had no such luck. The people most harmed by the Babel of character encodings were those who were not already writing in English — the exact people whose languages Unicode was ultimately designed to serve.