The Encoding Wars · Chapter 4

The Unicode Vision

In 1987, Joe Becker at Xerox and Lee Collins and Mark Davis at Apple imagined a single encoding for all the world's characters. This chapter tells the story of the Unicode Consortium's founding and its merger with ISO 10646.


In 1987, three engineers had the same impossible idea at roughly the same time. Joe Becker at Xerox, Lee Collins at Apple, and Mark Davis at Apple looked at the chaos of incompatible character encodings — the dozens of code pages, the Shift_JIS and EUC-JP incompatibilities, the ISO 8859 family's fifteen-part non-answer — and proposed something that seemed, at first hearing, almost grandiose: a single character encoding that would contain every character used in every language in the world, with no incompatibilities and no ambiguity.

They called it Unicode.

The name was, Becker later explained, intended to capture the three core properties of the proposed encoding: universal (covering all the world's characters), uniform (fixed-width characters for easy processing), and unique (each character has exactly one encoding). Whether the name perfectly captured the eventual design is debatable — Unicode turned out to be variable-width in practice, via UTF-8 and UTF-16 — but the vision it expressed was clear. The encoding chaos that had plagued computing since ASCII's limitation to 95 printable characters had to end. Only a truly universal standard could end it.

The 1987 Vision Documents

Joe Becker's internal Xerox memo, circulated in 1987 and later published in the Unicode newsletter, opened with a sentence that crystallized the problem: "The net result of all the proposed standards and proposals I've seen is a Tower of Babel of incompatible character set definitions, each with its own merits and with its own problems." Becker called for a 16-bit character encoding — 65,536 possible code points — as "clearly enough for all practical purposes." That phrase would come back to haunt the Unicode project when it discovered, years later, that 65,536 code points were not in fact enough.

Becker's memo was specific about what Unicode needed to accomplish: it needed to cover all the characters in commercial and scientific use across all major world languages. It needed to be simple enough for software developers to implement correctly. It needed to be efficient enough for the hardware of the era to handle without special coprocessors. And it needed to be backward-compatible with ASCII, because breaking ASCII compatibility would make Unicode impossible to adopt in practice.

Collins and Davis at Apple were working on similar ideas, driven by Apple's specific business need. The Macintosh was being sold worldwide, and Apple's software — particularly its operating system and its productivity applications — needed to display and process text in Japanese, Chinese, Arabic, and dozens of other languages. Apple's internal multilingual text system, while technically sophisticated, was built on proprietary encodings that didn't interoperate with non-Apple systems. The company needed a standard that the whole industry would adopt.

The three men met, combined their work, and brought in other collaborators. The first public presentation of Unicode was at the Unicode workshop at a Xerox research center in January 1988. By 1991, with the founding of the Unicode Consortium as a formal nonprofit organization, the project had the organizational structure needed to negotiate with ISO, recruit member companies, and publish authoritative specifications. Early members included Apple, Aldus, Go Corporation, Hewlett-Packard, IBM, Microsoft, NeXT, Oracle, the Research Libraries Group, Sun Microsystems, Taligent, WordPerfect, and Xerox — a cross-industry coalition of unusual breadth united by the recognition that the encoding chaos was costing everyone money and credibility.

The Founding Design Principles

The Unicode Standard, First Edition (1991) articulated a set of design principles that have shaped every subsequent version. These principles were not derived from abstract theory but from the hard-won experience of the founding members with real encoding problems in real software.

Universality meant that Unicode would encode every character in every script used for written human communication, past or present. Not just the major world languages, but minority languages, historical scripts (Egyptian hieroglyphs, Sumerian cuneiform), mathematical notation, musical notation, game symbols, and technical symbols. Every human writing system that had been documented and that could be described in terms of discrete characters was a candidate for encoding. This was a more ambitious goal than any previous encoding had attempted, and it would prove to require far more than the original 65,536 code points.

Efficiency meant that the most common operations on text — comparison, searching, sorting, copying, iteration — should be as fast as possible on the hardware of the era. This requirement drove the early "16-bit is enough" assumption: if every character had the same width (16 bits, or two bytes), then counting characters, indexing into strings, and iterating over text would all be simple array operations with O(1) random access. Variable-width encodings like Shift_JIS required complex state machines to parse correctly because you couldn't determine the character boundary without reading from the beginning or a known synchronization point. A fixed-width 16-bit encoding would process as fast as a fixed-width 8-bit ASCII encoding, just with 2× the memory.
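The performance argument above can be made concrete with a small sketch (the function names here are mine, chosen for illustration): with fixed-width units, finding the nth character is pure arithmetic, while a variable-width stream must be scanned to locate character boundaries.

```python
def nth_char_fixed(buf: bytes, n: int, width: int = 2) -> bytes:
    # Fixed-width units: the nth character is pure index arithmetic, O(1).
    return buf[n * width:(n + 1) * width]

def nth_char_variable(buf: bytes, n: int) -> bytes:
    # Variable-width (UTF-8 here as a stand-in): boundaries can only be
    # found by scanning, because a byte's role depends on its neighbors.
    # Lead bytes are those NOT matching the 10xxxxxx continuation pattern.
    starts = [i for i, b in enumerate(buf) if (b & 0xC0) != 0x80]
    starts.append(len(buf))
    return buf[starts[n]:starts[n + 1]]

text = "héllo"
utf16 = text.encode("utf-16-be")  # every character is exactly 2 bytes here
utf8 = text.encode("utf-8")       # "é" takes 2 bytes, the others take 1

print(nth_char_fixed(utf16, 1).decode("utf-16-be"))  # é
print(nth_char_variable(utf8, 1).decode("utf-8"))    # é
```

Note the historical irony: the O(1) promise held only as long as every character fit in a single 16-bit unit, an assumption the next section shows breaking down.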

Characters, not glyphs: Unicode explicitly distinguishes between abstract characters (the units of the encoding) and the visual glyphs used to render them. The letter "A" is a single Unicode character (U+0041) regardless of whether it is rendered in Times New Roman, Helvetica, Comic Sans, or a handwritten font. This principle allows the same text to be rendered differently in different contexts without the encoding changing. It also means that Unicode does not try to encode every typographic variation — that's the job of font technologies like OpenType and AAT.

Unification was the most controversial principle. Unicode's designers decided that a single code point should represent a character regardless of the language using it — a concept called Han unification. The Chinese character for "person" (人), the Japanese character 人, and the Korean character 人 all have the same origin, the same meaning, and (with minor calligraphic variations) the same visual form. Unicode assigned them a single code point: U+4EBA.
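Unification is easy to observe directly: whatever the surrounding language of a document, the "person" ideograph is the same code point.

```python
# One code point, three writing traditions: the "person" ideograph
# is U+4EBA whether it appears in Chinese, Japanese, or Korean text.
person = "人"
print(f"U+{ord(person):04X}")  # U+4EBA
```

Which national glyph form the reader actually sees is decided downstream, by fonts and language tagging, not by the encoding.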

Han unification was logical from an information-theory perspective and politically explosive from a cultural one. Japanese typographers and scholars pointed out that the "same" character has different standard forms in Chinese versus Japanese contexts — differences that are invisible in everyday use but important in formal typography and in educational materials. The character 直 ("straight," U+76F4), for example, is drawn with visibly different strokes under mainland Chinese and Japanese typographic standards, yet occupies a single code point. These are not mere font variations; they reflect different national typographic standards with the force of law and tradition behind them. The unification controversy remains the single most contested technical and cultural argument in Unicode's history.

The 16-Bit Miscalculation

The original Unicode design, as published in 1991, assumed a 16-bit code space: 65,536 code points, of which roughly 28,000 were assigned once the unified CJK ideographs were added in 1992. This seemed generous. Prior encodings had covered at most a few thousand characters. Even accounting for all of Chinese, Japanese, Korean, and all the world's alphabets and syllabaries, 65,536 seemed like it would accommodate everything.

It was not enough.

As Unicode's coverage expanded — as the project added more obscure Latin letters, as it addressed historical scripts like Linear B and Gothic, as mathematical symbols multiplied (and, later, as emoji began their explosive growth) — the 16-bit ceiling approached. By the mid-1990s it was clear that 65,536 code points would not be sufficient to encode all of humanity's written expression. Unicode 2.0 (1996) addressed this by expanding the code space to 1,114,112 code points — organized into 17 "planes" of 65,536 code points each. The original 65,536 code points became Plane 0, the Basic Multilingual Plane (BMP). Planes 1 through 16 were available for supplementary characters.
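The plane arithmetic is simple enough to sketch (the `plane` helper here is mine): a plane is just the bits of a code point above the low 16.

```python
def plane(cp: int) -> int:
    # Each plane holds 0x10000 (65,536) code points, so the plane
    # number is the code point shifted right by 16 bits.
    return cp >> 16

NUM_PLANES = 17
assert NUM_PLANES * 0x10000 == 1_114_112  # the full Unicode code space

print(plane(0x0041))    # 0  -> Basic Multilingual Plane ("A")
print(plane(0x1F600))   # 1  -> a supplementary plane (emoji)
print(plane(0x10FFFF))  # 16 -> the last code point, in the last plane
```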

This expansion had cascading consequences. UTF-16, the encoding that directly represents Unicode code points as 16-bit units, had to introduce "surrogate pairs" — a mechanism for encoding non-BMP characters using two 16-bit units: a high surrogate in the range 0xD800-0xDBFF followed by a low surrogate in the range 0xDC00-0xDFFF. These code points were permanently set aside for the mechanism and can never be assigned to characters. Any software that treated UTF-16 as a fixed-width encoding (as the original Unicode promise implied) broke when it encountered supplementary characters encoded as surrogate pairs. The surrogate pair problem affected Java's String.length(), JavaScript's String.prototype.length, and numerous other string operations across multiple programming languages — a slow-motion compatibility crisis that took decades to address.
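The surrogate-pair construction itself is short, and a sketch makes the String.length problem concrete (the function name is mine; the bit layout follows the UTF-16 definition):

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    # UTF-16 encodes code points above U+FFFF by subtracting 0x10000
    # and splitting the remaining 20 bits into two 10-bit halves.
    assert 0x10000 <= cp <= 0x10FFFF
    offset = cp - 0x10000
    high = 0xD800 | (offset >> 10)   # high surrogate: 0xD800-0xDBFF
    low = 0xDC00 | (offset & 0x3FF)  # low surrogate:  0xDC00-0xDFFF
    return high, low

# U+1F600 (a grinning-face emoji) becomes the pair D83D DE00:
print([hex(u) for u in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']

# A language whose string length counts UTF-16 code units (Java,
# JavaScript) reports 2 for this single character; Python counts
# code points and reports 1.
assert len("\U0001F600") == 1
assert len("\U0001F600".encode("utf-16-le")) // 2 == 2
```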

The ISO 10646 Merger

While Unicode was being developed in the United States, the International Organization for Standardization had been working in parallel on ISO 10646, the Universal Coded Character Set (UCS). The two projects had the same fundamental goal but were being developed by different organizations with different processes and different constituencies. They risked becoming competing incompatible standards — precisely the outcome Unicode was designed to prevent.

Negotiations between the Unicode Consortium and ISO/IEC JTC 1/SC 2 (the committee responsible for character sets) were conducted between 1989 and 1991. They were, by all accounts, difficult. ISO had its own process, its own structure, and its own institutional inertia. The Unicode Consortium was more nimble but had less international legitimacy. After extended negotiations, the two organizations agreed in 1991 to synchronize their work: the same characters would receive the same code points in both standards, and future character additions would be coordinated. The two standards have remained synchronized ever since, with each new Unicode version keeping its character repertoire and code point assignments aligned with the corresponding edition and amendments of ISO/IEC 10646.

The ISO partnership was crucial for adoption outside the United States. Without it, Unicode could have been dismissed by European governments and Japanese standards bodies as a proprietary American technology. With it, Unicode had the backing of the international standards establishment. Governments that required ISO compliance for public procurement could now adopt Unicode-based systems.

Early Adoption Struggles

Having a brilliant standard is not the same as having an adopted standard. Unicode 1.0 was published in 1991, but uptake was slow for years. The computing industry had enormous investments in ASCII-based infrastructure — billions of lines of code, thousands of database schemas, countless file formats — all written assuming that a character was a byte and a byte was a character. Converting to Unicode meant rewriting string libraries, database kernels, operating system APIs, and application code. The cost was high and the immediate benefit was modest, because most applications' users were monolingual.

Microsoft made the boldest early commitment. Windows NT 3.1 (1993) was designed with Unicode (specifically UCS-2, the fixed-width 16-bit form later extended into UTF-16) as the native character encoding for the operating system kernel. The Win32 API offered both "A" (ANSI) versions of functions that accepted 8-bit strings and "W" (Wide) versions that accepted 16-bit Unicode strings. Applications could use either set of APIs, with Windows translating between the current ANSI code page and Unicode internally. The CreateFileA and CreateFileW functions — two versions of the same operation — became symbolic of this transition period, where the industry knew where it needed to go (Unicode) but was not yet there.

Apple was slower. Classic Mac OS used WorldScript, its own multilingual text architecture, which handled Arabic, Hebrew, and East Asian scripts but was not Unicode. Only with Mac OS X (2001), built on the Unix foundation that came with Apple's NeXT acquisition, did Apple commit to Unicode as the primary text encoding.

Linux and the Unix world were even slower. The C char type was fundamentally one byte, and the assumption that text was a sequence of bytes was deeply embedded in Unix culture and tooling. The GNU C Library's internationalization support was incomplete and inconsistently used throughout the 1990s. Only as web applications and Unicode-aware databases demanded proper Unicode support did the Unix/Linux world fully embrace the standard, a process that stretched into the 2000s.

The vision of Becker, Collins, and Davis in 1987 was vindicated. Their "impossible idea" — one encoding for all of humanity's written expression — had moved from a Xerox research memo to the foundation of the modern internet in the span of fifteen years. But the real test of the vision was the encoding that would actually carry Unicode text across the world's networks. That encoding would not be the fixed-width UCS-2 that the original vision described. It would be something invented at a diner in New Jersey, on a paper placemat, by two engineers who wanted to keep their Unix system from breaking.