The Unicode Odyssey · Chapter 1

The Problem: Why We Need Unicode

Before Unicode, every language needed its own encoding — leading to the chaos of mojibake and incompatible systems. This chapter explores the fragmented world of code pages and why a universal standard became essential.

~3,500 words · ~14 min. read

Imagine opening an email from a Japanese colleague and seeing a screen full of question marks, boxes, and random Latin characters where kanji should be. Or downloading a Russian novel and finding every Cyrillic letter replaced by garbage like "ïðèâåò". Or writing software for a French government agency only to discover that the é in "données" becomes a ÿ on some machines and an ∩ on others. This was not a thought experiment — it was the everyday reality of computing before Unicode arrived to save us from ourselves.

The Tower of Babel Problem

Before the mid-1990s, computers stored text as sequences of bytes, and every region of the world essentially invented its own mapping from byte values to characters. A byte value of 0xE9 might mean "é" (e with acute accent) in Western Europe, "й" (Cyrillic short i) in Russia, the first half of a two-byte kanji sequence in Tokyo, and something else entirely in Seoul. The byte itself was meaningless without knowing which code page was in use — and that metadata was almost never reliably transmitted alongside the data.
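This ambiguity is easy to reproduce today, because modern runtimes still ship the legacy codecs. A minimal Python sketch, decoding the same 0xE9 byte under several of the code pages discussed in this chapter:

```python
raw = b"\xe9"  # one byte, several meanings

print(raw.decode("latin-1"))     # 'é'  (Western European, ISO 8859-1)
print(raw.decode("cp1251"))      # 'й'  (Russian Windows)
print(raw.decode("iso8859-5"))   # 'щ'  (ISO Cyrillic)

# In Shift_JIS, 0xE9 is only the FIRST byte of a two-byte kanji
# sequence, so on its own it is not even a complete character:
try:
    raw.decode("shift_jis")
except UnicodeDecodeError:
    print("0xE9 alone is an incomplete Shift_JIS sequence")
```

The same byte yields three different letters and one decoding error, depending entirely on which table the decoder assumes.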

The fundamental issue was that a single byte (8 bits) can only represent 256 distinct values. ASCII, the American Standard Code for Information Interchange, defined 128 of those positions (0–127) for English letters, digits, punctuation, and control characters. That left 128 slots (128–255) open, and the world filled them differently.

ASCII: Great for Americans, Terrible for Everyone Else

ASCII (finalized in 1963, updated in 1967) mapped the characters that mattered to American English: the 26 lowercase and 26 uppercase Latin letters, 10 digits, punctuation, and 33 control characters like tab, newline, and carriage return. Code point 65 was 'A', 97 was 'a', 10 was newline. Clean, simple, and completely useless for anybody who needed to write "Ångström", "façade", "Zürich", "naïve", or literally any word from any non-English language.
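The code assignments above, and ASCII's hard 128-character limit, can be checked directly in Python:

```python
# The ASCII codes mentioned above
print(ord("A"), ord("a"), ord("\n"))   # 65 97 10

# Anything outside the 128 ASCII positions simply cannot be encoded:
try:
    "façade".encode("ascii")
except UnicodeEncodeError as err:
    print("not ASCII:", err.reason)    # 'ordinal not in range(128)'
```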

American engineers at the time largely shrugged — the United States had only one natural language to worry about, and ASCII served it perfectly. The rest of the world had to fend for itself.

The Great Code Page Wars

What followed was decades of incompatible regional encoding standards, each solving the local problem while creating chaos at every international boundary.

The ISO 8859 Family

The International Organization for Standardization created the ISO 8859 series to extend ASCII for different language groups:

  • ISO 8859-1 (Latin-1): Western European languages — French, German, Spanish, Portuguese, Italian, Dutch. Covered é, à, ü, ñ, and similar characters.
  • ISO 8859-2 (Latin-2): Central European languages — Czech, Polish, Hungarian, Croatian. Covered ě, ą, ő, č.
  • ISO 8859-5: Cyrillic alphabet for Russian, Bulgarian, Serbian.
  • ISO 8859-6: Arabic.
  • ISO 8859-7: Greek.
  • ISO 8859-8: Hebrew.
  • ISO 8859-9 (Latin-5): Turkish (Latin-1 variant with Turkish-specific letters).
  • ISO 8859-15: A revision of Latin-1 that added the Euro sign (€) — because Latin-1 was defined before the Euro existed.

The ISO 8859 family covered more ground than ASCII, but it fragmented the world into silos. A document encoded in ISO 8859-2 (Czech) used byte values in the 0x80–0xFF range for Slavic characters. The same byte values in ISO 8859-1 (Western European) meant completely different characters. There was no way for software to know which was which unless the encoding was declared — and it rarely was.
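A representative sketch of that silo problem in Python: the Czech word "čeština" ("Czech language") written out as ISO 8859-2 bytes, then misread under the ISO 8859-1 table:

```python
word = "čeština"                       # č -> 0xE8, š -> 0xB9 in ISO 8859-2
as_bytes = word.encode("iso8859-2")

# The same bytes, read with the Western European table:
print(as_bytes.decode("iso8859-1"))    # 'èe¹tina'
```

Nothing fails and no error is raised; the reader simply sees the wrong letters.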

East Asian Complexity: Multi-Byte Solutions

The situation was even more complex for East Asian languages. Chinese, Japanese, and Korean contain tens of thousands of characters — far beyond what a single byte could represent. These languages required multi-byte encoding schemes.

Shift_JIS ("Shift JIS", where JIS stands for Japanese Industrial Standards) was the dominant encoding for Japanese on Windows and early Japanese computers. It used a mixture of single-byte characters (for ASCII-compatible content and half-width katakana) and double-byte characters for kanji, hiragana, and full-width katakana. The "shift" referred to the way the JIS character codes were shifted around the single-byte katakana range; a byte's value determined whether it began a two-byte sequence. Parsing Shift_JIS correctly required state-machine logic — and getting it wrong was common.
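That state-machine logic can be sketched in a few lines of Python. This is a simplified walker, assuming the common lead-byte ranges (0x81–0x9F and 0xE0–0xEF per JIS X 0208, with 0xFC as the upper bound to cover vendor-extended rows):

```python
def is_sjis_lead(b: int) -> bool:
    """Is this byte the first of a two-byte Shift_JIS sequence?"""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC

def sjis_char_lengths(data: bytes) -> list[int]:
    """Walk a Shift_JIS byte string, reporting 1 or 2 bytes per character."""
    lengths, i = [], 0
    while i < len(data):
        n = 2 if is_sjis_lead(data[i]) else 1
        lengths.append(n)
        i += n
    return lengths

data = "abc漢字".encode("shift_jis")   # 3 ASCII letters + 2 kanji
print(sjis_char_lengths(data))         # [1, 1, 1, 2, 2]
```

Note how a decoder that starts reading mid-stream, or miscounts a single lead byte, desynchronizes and misreads everything that follows — exactly the failure mode the chapter describes.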

EUC-JP (Extended Unix Code for Japanese) was the Unix world's answer to Japanese encoding, using a different byte-range convention than Shift_JIS but covering similar characters. Japanese text from one system often rendered as garbage on another.

Big5 served Traditional Chinese (used in Taiwan and Hong Kong), while GB 2312 and its extension GBK covered Simplified Chinese (mainland China). A Chinese document's meaning depended entirely on whether it was encoded in Big5 or GBK — and the byte sequences overlapped in ways that could produce plausible-looking garbage in the wrong decoder.
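The overlap is real: Big5's lead and trail byte ranges fall inside GBK's, so any valid Big5 sequence also decodes "successfully" as GBK, just as different characters. A quick Python illustration:

```python
tc = "中文"   # "Chinese", in Traditional characters
garbled = tc.encode("big5").decode("gbk")

# No decoding error is raised: the bytes are valid in both encodings,
# they just name different characters in each.
print(garbled != tc, len(garbled))   # True 2
```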

KOI8-R served Russian on Unix systems, while Windows machines used Windows-1251 for Cyrillic — a Microsoft layout compatible with neither KOI8-R nor ISO 8859-5. A KOI8-R file opened in a Windows-1251 context didn't just show wrong characters — it showed different wrong characters depending on a lookup table that nobody had memorized.
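The effect is still reproducible. In Python, the Russian word for "hello", stored as KOI8-R and read back as Windows-1251, comes out as entirely different Cyrillic letters in the wrong case:

```python
hello = "привет"
# KOI8-R bytes interpreted through the Windows-1251 table:
print(hello.encode("koi8-r").decode("cp1251"))   # 'РТЙЧЕФ'
```

Still Cyrillic, still plausible-looking, completely wrong: the signature KOI8-R/1251 fingerprint.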

Mojibake: The Symptom of the Disease

The Japanese coined a perfect term for the garbled text that resulted from encoding mismatches: 文字化け (mojibake, literally "character transformation"). The word captured something important — the corrupted text wasn't random noise, it was transformed text, coherent data rendered through the wrong lens.

Mojibake became so common that programmers developed an almost instinctive pattern recognition for it. Cyrillic text turning into runs of "Ð" and "Ñ" followed by stray symbols typically meant UTF-8 Russian being read as Windows-1252. The sequence "â€™" appearing where an apostrophe should be? A UTF-8 smart quote being read as Windows-1252. Every encoding mismatch had its own characteristic fingerprint of corruption.
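Two classic fingerprints of this kind can be reproduced directly in Python:

```python
# UTF-8 Russian text misread as Windows-1252:
print("привет".encode("utf-8").decode("cp1252"))   # 'Ð¿Ñ€Ð¸Ð²ÐµÑ‚'

# A UTF-8 right single quotation mark (U+2019) misread as Windows-1252:
print("\u2019".encode("utf-8").decode("cp1252"))   # 'â€™'
```

The "Ð"/"Ñ" prefixes come straight from UTF-8's 0xD0/0xD1 lead bytes for Cyrillic, which is why the pattern was so recognizable.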

The economic cost was real. Customer databases corrupted beyond recovery. Legal documents submitted in the wrong encoding and rendered illegible by the receiving court's systems. E-commerce transactions failing because a customer's name contained a diacritic. International software projects where the French and German teams' source code comments were unreadable to each other — or worse, appeared to compile but introduced subtle bugs because an identifier contained a byte sequence that meant something different in the other developer's editor.

Windows Code Pages: Microsoft's Patch Over the Wound

Microsoft's approach to the encoding problem was pragmatic but contributed to fragmentation. Windows assigned a "code page" number to each regional encoding standard, plus several proprietary ones:

Code Page   Encoding                         Region
1252        Windows Western European         US, UK, Western Europe
1251        Windows Cyrillic                 Russia, Eastern Europe
1250        Windows Central European         Czech, Polish, Hungarian
1256        Windows Arabic                   Arabic-speaking regions
932         Shift_JIS (extended)             Japan
936         GBK / Simplified Chinese         China
949         UHC / Korean (EUC-KR superset)   Korea

The system worked reasonably well within a single country and within the Microsoft ecosystem, but it meant that text created on a Western European Windows machine and sent to a Cyrillic Windows machine would corrupt every character above ASCII value 127. The code page was stored as a system setting, not per-file — so opening a file created on another machine was a gamble.
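That system-wide setting survives in today's APIs. Python, for instance, still exposes the platform default that a legacy "ANSI" program would assume when no encoding is declared:

```python
import locale

# 'cp1252' on a classic Western Windows install, usually 'UTF-8' on
# modern systems; the point is that it is a machine-wide default,
# not a property of any particular file.
print(locale.getpreferredencoding())
```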

The Notepad Test: A Classic Encoding Bug

A famous (and somewhat embarrassing) bug in Windows Notepad persisted for years: if you typed the string Bush hid the facts and saved it as ANSI, then reopened it, the text would display as Chinese characters. The reason was that Notepad's encoding auto-detection (the Win32 IsTextUnicode function) misidentified the byte sequence as UTF-16 little-endian, so pairs of ASCII bytes were read as single CJK code points. The bug was harmless but illustrative — even the operating system's own text editor couldn't reliably handle encoding detection.
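The effect is easy to reproduce outside Notepad: interpret those 18 ASCII bytes as UTF-16 little-endian, and each byte pair happens to land in the CJK ideograph range:

```python
data = b"Bush hid the facts"          # 18 bytes, an even number
misread = data.decode("utf-16-le")    # every 2 bytes -> one code unit

print(len(misread))                   # 9 characters
# Every one of them falls inside the CJK Unified Ideographs block:
print(all("\u4e00" <= c <= "\u9fff" for c in misread))   # True
```

Any ASCII string of even length whose byte pairs fall in that range triggers the same misdetection, which is why several similar "magic sentences" circulated at the time.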

The Internet Changes Everything

The chaos of the code page era was manageable when computers were isolated machines or part of regional networks. The internet demolished those regional boundaries. Email crossed continental lines. Web pages were served from servers in one country to browsers in another. FTP transfers carried files between operating systems with no mechanism for encoding metadata.

The MIME standard attempted to address email encoding with headers like Content-Type: text/plain; charset=iso-8859-1, but these headers were often missing, wrong, or ignored. HTML gained its own charset declaration (<meta http-equiv="Content-Type" ...>, later shortened to <meta charset>) for the same reason — and it was optional, frequently absent, and sometimes set incorrectly.
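When the header was actually present, parsing it was the easy part. Python's standard email machinery still extracts the declared charset from exactly this kind of message:

```python
from email import message_from_string

raw = "Content-Type: text/plain; charset=iso-8859-1\n\nBonjour"
msg = message_from_string(raw)
print(msg.get_content_charset())   # 'iso-8859-1'
```

The hard part was never the parsing; it was that nothing forced the declared charset to match the bytes that followed it.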

Early web browsers had encoding menus with dozens of options (Internet Explorer's "View > Encoding" submenu listed over 30 choices) because auto-detection was unreliable and users frequently had to manually select the right encoding to read a page. This was not a power-user feature — ordinary people had to know what ISO 8859-2 was.

The Human Cost of Encoding Chaos

Beyond the technical inconvenience, the encoding wars had real human consequences:

Marginalization of non-ASCII languages: Software that "just worked" for English speakers frequently failed for speakers of other languages. This created a perception — sometimes internalized by non-English speakers themselves — that computing was an English-language activity, and that their languages were second-class citizens of the digital world.

Data loss at borders: Government databases, healthcare records, and legal systems routinely corrupted names containing diacritics or non-Latin characters. People named "Björn" or "Ángela" or "Müller" would find their names stored as "Bjorn", "Angela", or "Muller" — or as corrupted garbage — after passing through systems that couldn't handle their character sets.

The ASCII fallback habit: Developers learned to avoid non-ASCII characters in identifiers, file names, and configuration files — a habit that shaped programming culture for decades and still influences coding standards today, even when there's no longer any technical justification for it.

The Moment the Industry Had to Act

By the late 1980s, the problem was clearly unsolvable through further code page proliferation. The number of encoding standards in use was already in the dozens. Harmonizing them was politically impossible — too many vendors had too much invested in their particular solutions. The only path forward was to start fresh with a universal system.

In 1987, engineers at Xerox and Apple began collaborating on what would become Unicode. The core insight was radical: instead of 256 character positions shared among competing regional standards, what if you had a single encoding with enough positions for every character in every human writing system, ever?

That audacious ambition would require solving not just the byte-count problem, but fundamental questions about what a "character" even is, how to efficiently encode a 100,000+ character set without wasting memory, and how to handle writing systems as different as left-to-right Latin, right-to-left Arabic, top-to-bottom Chinese, and the mixed-direction bidirectionality of multilingual text.

The chaos of the code page era was the problem. Unicode was — and continues to be — the answer. And as we'll explore in the chapters ahead, that answer turned out to be far more interesting, and far more complex, than anyone anticipated.