The Encoding Wars · Chapter 3
The Code Page Explosion
When 128 characters weren't enough, everyone extended ASCII differently. IBM code pages, ISO 8859, Windows-1252, Shift_JIS, Big5 — this chapter chronicles the incompatibility nightmare of the 1980s and 1990s.
The 1980s were the golden age of personal computing and the dark age of character encoding. As computers escaped the mainframe rooms and landed on desks in France, Germany, Japan, Saudi Arabia, and everywhere else, a crisis of Babylonian proportions erupted. ASCII's 95 printable characters were sufficient for English. They were catastrophically insufficient for everyone else. And in the absence of any coordinated solution, every country, every company, every operating system vendor solved the problem in their own incompatible way. The result was a decade-long encoding catastrophe that made digital communication across language boundaries an exercise in frustration, occasional comedy, and enormous financial waste.
The chaos had a name: code pages. And understanding the code page explosion requires understanding both the technical constraint that created it and the political-commercial reality that prevented any clean solution.
The Eighth Bit
ASCII used 7 bits, leaving the 8th bit of a byte unused. Every byte transmitted or stored had, effectively, a free slot — one bit that ASCII never touched. Hardware vendors in the late 1970s began enabling the 8th bit on their terminals and printers. Operating system vendors began allowing 8-bit characters through their I/O layers. The question of what those extra 128 characters should be became urgent.
The ISO 8859 family was the most organized attempt to manage this expansion. ISO 8859-1 (Latin-1), finalized in 1987, added the Western European characters needed for French, German, Spanish, Portuguese, Dutch, Danish, Norwegian, Swedish, and Italian — the accented vowels, the ç, the ñ, the ü, the ß, the Danish ø, the Icelandic þ, and the various currency symbols. ISO 8859-2 covered Central European languages (Polish, Czech, Slovak, Hungarian, Romanian). ISO 8859-3 covered Southern European languages. ISO 8859-4 covered Baltic languages. ISO 8859-5 covered Cyrillic (Russian, Bulgarian, Serbian). ISO 8859-6 covered Arabic. ISO 8859-7 covered Greek. ISO 8859-8 covered Hebrew. ISO 8859-9 covered Turkish. By the time the family was complete in the late 1990s, there were 15 parts, each covering a different script or regional variant.
The ISO 8859 family had clear logic: it preserved ASCII in the first 128 positions (codepoints 0-127) and used the upper 128 positions (128-255) for language-specific characters. A file containing only ASCII characters would be correctly interpreted regardless of which ISO 8859 variant was assumed. A file containing accented French characters would be correct in ISO 8859-1, ISO 8859-3, and ISO 8859-15, but would appear as garbage in ISO 8859-5 (Cyrillic) or ISO 8859-8 (Hebrew), because the upper 128 positions were mapped to completely different characters.
The critical flaw was that there was no mandatory, machine-readable way to declare which code page a file or data stream used. You had to know, through some out-of-band mechanism, what encoding the author had used. When that information was missing or wrong — which happened constantly, especially across organizational and national boundaries — text became mojibake, the Japanese term for the garbled characters that appear when text is decoded with the wrong encoding. The word mojibake (文字化け) literally means "character transformation," and it became the universal term for encoding errors because Japanese speakers encountered the problem more acutely and more frequently than almost anyone else.
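The failure mode is easy to reproduce today. A minimal Python sketch, using the standard library's codecs for two of the ISO 8859 variants described above:

```python
# The same four bytes, read under two different ISO 8859 variants.
data = "café".encode("iso8859_1")       # b'caf\xe9' — 0xE9 is é in Latin-1

as_latin1 = data.decode("iso8859_1")    # correct guess: 'café'
as_cyrillic = data.decode("iso8859_5")  # wrong table: 0xE9 maps to Cyrillic 'щ'

print(as_latin1)    # café
print(as_cyrillic)  # cafщ — mojibake
```

Nothing in the four bytes themselves says which reading is right; only out-of-band knowledge of the author's encoding can disambiguate them.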
Windows and the Proprietary Code Pages
Microsoft, as it became the dominant PC operating system vendor in the 1980s and 1990s, made the code page situation worse in one respect and more consistent in another. Worse, because Microsoft created its own code pages that diverged from ISO standards in subtle but important ways. More consistent, because within the Windows ecosystem, at least, a single set of code pages was used uniformly.
Windows Code Page 1252 (Windows-1252) was Microsoft's Western European code page. It was derived from ISO 8859-1 but added characters in the range 0x80-0x9F that ISO 8859-1 left as undefined control characters. Microsoft used these 32 positions for typographically useful characters: the Euro sign (€, added later), curly quotation marks (“ ”), the en dash (–), the em dash (—), the ellipsis (…), and others. The result was that "Latin-1" text could mean either ISO 8859-1 or Windows-1252, and the two were incompatible in the 0x80-0x9F range. A file created on Windows with smart quotes (curly “ ” marks) would display as garbage on a system expecting pure ISO 8859-1, because ISO 8859-1 left those positions undefined.
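The divergence in the 0x80-0x9F range can be shown directly; 0x93 and 0x94 are the left and right curly double quotes in Windows-1252:

```python
# 0x93/0x94 are curly quotes in Windows-1252 but C1 control codes in ISO 8859-1.
data = b"\x93quoted\x94"

print(data.decode("cp1252"))     # “quoted” — what the Windows author intended
print(data.decode("iso8859_1"))  # U+0093/U+0094: invisible control characters
```

(Python's iso8859_1 codec maps 0x80-0x9F to the C1 control characters rather than raising an error, which mirrors what most ISO 8859-1 software did: the bytes "decode" silently into nothing visible.)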
Web developers still encounter this confusion today. The HTML specification for years specified that a page declared as "iso-8859-1" should actually be decoded as Windows-1252, because so much content labeled as Latin-1 was actually Windows-1252. This workaround — officially treating "iso-8859-1" as an alias for "windows-1252" — is a direct scar from the code page explosion era, preserved today in the WHATWG Encoding Standard that browsers implement.
The East Asian Encoding Wars
For East Asian languages, the code page situation was far more severe. Japanese, Chinese, and Korean require thousands of characters — Japanese alone uses roughly 2,000 characters in everyday educated writing, and the full repertoire needed for literature, history, and technical writing runs to tens of thousands. These simply cannot fit in 256 positions. The industry responded with multi-byte encodings, where some bytes were interpreted as single characters and others as the first byte of a two-byte sequence.
Shift_JIS (developed by ASCII Corporation and Microsoft, standardized in the 1980s) was arguably the most technically clever and most problematic encoding ever designed. It used a scheme where bytes in the ranges 0x81-0x9F and 0xE0-0xEF were the first bytes of two-byte character sequences, while other bytes were single characters (ASCII in the 0x00-0x7F range, half-width katakana in the 0xA1-0xDF range). The result was an encoding that could represent approximately 7,000 Japanese characters while remaining backward-compatible with ASCII in the lower 128 positions.
The problem was subtle and devastating: the second byte of a two-byte Shift_JIS sequence could contain the byte value 0x5C, which is the ASCII backslash. A naive program scanning through a Shift_JIS string looking for the backslash as a path separator or escape character would find a byte 0x5C that was actually the second byte of a Japanese character, misidentify it as a backslash, and do the wrong thing. This bug — sometimes called the "Shift_JIS backslash problem" — affected countless Windows programs. It was a bug that could not be fixed in a single program; it required every string-processing function to be aware that it was handling Shift_JIS and treat backslash-search accordingly. Many programs got this wrong. Some still do.
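The hazard takes only a few lines to reproduce. The kanji 表 (U+8868) is one of the classic offending characters, because its Shift_JIS encoding ends in 0x5C:

```python
# 表 encodes to 0x95 0x5C in Shift_JIS — the trailing byte is ASCII backslash.
encoded = "表".encode("shift_jis")
assert encoded == b"\x95\x5c"

# A byte-oriented scan now finds a "backslash" that is not one.
path = ("C:" + "\\" + "表").encode("shift_jis")  # bytes: 43 3A 5C 95 5C
naive_parts = path.split(b"\\")                  # splits inside the kanji
print(naive_parts)  # [b'C:', b'\x95', b''] — the character has been torn apart
```

A correct split requires walking the string with knowledge of Shift_JIS lead bytes, so that the 0x5C inside a two-byte sequence is skipped — exactly the per-function awareness the paragraph above describes.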
Japan had three competing incompatible encodings simultaneously in widespread use. Shift_JIS was used by Microsoft and most PC software. EUC-JP (Extended Unix Code for Japanese, standardized by the Open Software Foundation) was used by Unix systems and was easier to process correctly than Shift_JIS. ISO-2022-JP was used in email because it was 7-bit safe — it used escape sequences to switch between ASCII and Japanese character sets, avoiding the 8-bit transmission issues that affected other encodings. Japanese users routinely encountered the question "Which encoding is this file in?" as a part of daily computing life, and the answer was rarely obvious.
China had its own parallel chaos, compounded by the political division between mainland China, which used simplified Chinese characters, and Taiwan and Hong Kong, which used traditional Chinese characters. GB2312 (1980) was the mainland Chinese national standard, covering 6,763 simplified Chinese characters arranged in 94 rows of 94 columns. GBK (1993, also known as Codepage 936) extended it to cover additional characters including traditional Chinese characters and the Euro sign. GB18030 (2000, 2005, 2022) further extended the encoding to eventually cover all of Unicode. Taiwan and Hong Kong used Big5, an industry-developed encoding from 1984 that covered traditional Chinese characters with a completely different byte structure. The two encoding families were mutually incompatible, and sending a document from Shanghai to Taipei or vice versa required conversion — if you knew which encoding each side was using.
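The 94×94 row/column arrangement of GB2312 translates directly into bytes: the character at row r, column c is encoded as the byte pair (0xA0+r, 0xA0+c). A small sketch — 啊, the first character of the main hanzi block, sits at row 16, column 1:

```python
def gb2312_bytes(row: int, col: int) -> bytes:
    # GB2312 maps a cell of the 94x94 grid to two bytes with the high bit set.
    assert 1 <= row <= 94 and 1 <= col <= 94
    return bytes([0xA0 + row, 0xA0 + col])

# 啊 occupies row 16, column 1 of the GB2312 grid → bytes 0xB0 0xA1.
assert gb2312_bytes(16, 1) == "啊".encode("gb2312") == b"\xb0\xa1"
```

Big5 used a completely different arithmetic over different byte ranges, which is why the two families could not be mechanically distinguished, let alone interconverted, without tables.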
Korea used EUC-KR and Windows Code Page 949 (MS949) — similar but not identical encodings. EUC-KR covered the 2,350 precomposed Hangul syllables of the KS X 1001 standard (plus hanja and symbols), while MS949 extended it to cover all 11,172 possible Hangul syllable combinations. A file created on a Korean Windows machine might contain syllables that had no representation in EUC-KR, making the file unreadable on Unix-based Korean systems expecting EUC-KR. These were not theoretical incompatibilities; they caused real data loss in international business communications.
The Email Catastrophe
International email in the 1980s and early 1990s was a study in encoding entropy. SMTP (Simple Mail Transfer Protocol), which carried email across the early internet, was designed for 7-bit ASCII. A foundational assumption of SMTP was that every byte in a message would have its high bit clear — that is, every byte would be in the range 0x00 to 0x7F. When 8-bit characters appeared in an email body (because a German speaker included an ü or a French speaker included a é or a Russian speaker included a Cyrillic character), the message might transit through a mail relay that stripped the high bit, converting every accented character to its 7-bit equivalent and turning readable text into garbage.
The stripping of high bits was not malicious — it was a consequence of relay software that was written to the SMTP specification and interpreted "8-bit characters are not allowed" as permission to silently strip the offending bits. The original sender had no way to know whether their message would survive transit intact. Email to international recipients was genuinely unreliable for character-rich text throughout the 1980s.
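What a bit-stripping relay did to a message is easy to simulate, since the damage is simply an AND with 0x7F on every byte:

```python
# Simulate a 7-bit mail relay clearing the high bit of every byte.
def strip_high_bit(data: bytes) -> bytes:
    return bytes(b & 0x7F for b in data)

msg = "Grüße".encode("iso8859_1")           # ü = 0xFC, ß = 0xDF
print(strip_high_bit(msg).decode("ascii"))  # 'Gr|_e' — greetings, mangled
```

Note that the result is still perfectly valid ASCII, which is why the corruption was silent: no error was raised anywhere along the path, and only a human reader could tell that 'Gr|_e' was once 'Grüße'.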
The MIME standard (Multipurpose Internet Mail Extensions), developed in the early 1990s primarily by Nathaniel Borenstein at Bellcore and Ned Freed at Innosoft, introduced the charset parameter specifically to address this problem. A MIME-encoded email could declare Content-Type: text/plain; charset=iso-8859-1 and email clients were supposed to decode accordingly. MIME also defined the Quoted-Printable and Base64 content transfer encodings to carry 8-bit data over 7-bit SMTP links.
Quoted-Printable encoded 8-bit characters as =XX hex sequences — the German ü (ISO 8859-1 value 0xFC) would become =FC in the email body. A message written in German with QP encoding might look like: Ich m=F6chte ein Fr=FChst=FCck — barely readable to a human looking at the raw message, but reliably decoded by a QP-aware email client. Base64 encoding was more reliable but completely opaque: the entire message body was re-encoded using a 64-character alphabet (plus = for padding) that required decoding before any human could read it. Neither solution was satisfying, but both were necessary.
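Both transfer encodings survive in the Python standard library, so the trade-off is easy to inspect:

```python
import base64
import quopri

text = "Ich möchte ein Frühstück".encode("iso8859_1")

qp = quopri.encodestring(text)   # b'Ich m=F6chte ein Fr=FChst=FCck'
b64 = base64.b64encode(text)     # opaque until decoded

print(qp)   # mostly readable: only the accented bytes are escaped
print(b64)  # unreadable, but passes through any 7-bit channel intact

# Both round-trip losslessly over a 7-bit link.
assert quopri.decodestring(qp) == text
assert base64.b64decode(b64) == text
```

Quoted-Printable optimizes for human inspection of mostly-ASCII text; Base64 optimizes for robustness and density on binary or non-Latin content. MIME kept both precisely because neither dominates the other.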
The Web's Charset Wars
When the World Wide Web emerged in the early 1990s and rapidly became the dominant information medium, it inherited all the encoding chaos of the pre-web era and added new dimensions. An HTML page could declare its encoding in a <meta> tag: <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> (or, in HTML5, the simpler <meta charset="utf-8">). But the browser had to read this declaration, which was inside the page, before it knew what encoding to use to read the page. This circular dependency was a fundamental design flaw: you needed to decode the page to find the charset declaration, but you needed the charset declaration to decode the page.
Browsers solved this with "charset sniffing" — heuristic algorithms that guessed the encoding from statistical patterns in the byte sequence before the explicit declaration was found or in cases where it was absent. A byte sequence with many values in the range 0xA0-0xBF was probably Western European text. A sequence with many bytes in the range 0x81-0x9F followed by bytes in the range 0x40-0x7E was probably Shift_JIS. A sequence with many consecutive bytes above 0x80 might be Big5 or GB2312 depending on the statistical distribution of values.
Charset detection was both necessary and never fully reliable. It worked well for common cases and failed catastrophically for edge cases. Japanese text in an ambiguous encoding combination could be misdetected as Chinese. Russian text could appear as Greek. Short texts were especially vulnerable because the statistical patterns needed for confident detection weren't present. Browser vendors spent enormous engineering effort on charset detection heuristics, published competing detection algorithms, and still produced wrong results often enough that "encoding bug" remained a standard category in browser bug trackers well into the 2000s.
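A toy version of such a heuristic shows both the idea and its fragility. This is a deliberately simplified sketch — real detectors used large statistical models over character-pair frequencies — that only distinguishes "looks like Shift_JIS pairs" from "has stray high bytes":

```python
def sniff(data: bytes) -> str:
    """Toy charset sniffer: guess from byte-range patterns. Not production code."""
    i, sjis_pairs, stray_high = 0, 0, 0
    while i < len(data):
        b = data[i]
        # Shift_JIS lead byte followed by a plausible trail byte?
        if (0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF) and i + 1 < len(data):
            nxt = data[i + 1]
            if 0x40 <= nxt <= 0xFC and nxt != 0x7F:
                sjis_pairs += 1
                i += 2
                continue
        if b >= 0x80:
            stray_high += 1  # high byte that doesn't fit the pair pattern
        i += 1
    if sjis_pairs and sjis_pairs >= stray_high:
        return "shift_jis?"
    return "latin-1?" if stray_high else "ascii"

print(sniff("日本語".encode("shift_jis")))       # shift_jis?
print(sniff("café au lait".encode("iso8859_1")))  # latin-1?
print(sniff(b"hello"))                            # ascii
```

Even this tiny heuristic misfires readily: a Latin-1 é followed by an ASCII letter (as in "déjà") matches the Shift_JIS pair pattern, which is exactly the class of ambiguity that made short texts so vulnerable.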
The code page explosion was not a failure of intelligence or goodwill. It was the predictable consequence of an industry growing faster than its standards could manage, combined with the economic incentives of vendors who benefited from proprietary encodings that locked users into their platforms. Every code page was a rational local solution to a real problem. The problem was that the sum of all those local solutions was a global catastrophe.
The solution, when it came, would have to be both technically superior and politically viable across an industry that had spent a decade building incompatible systems. It would need the vision of people who could see, from above the chaos, what a truly universal encoding would look like — and the political skill to bring the industry along.