History of Unicode
Unicode began in 1987 as a collaboration between engineers at Xerox and Apple who wanted to replace hundreds of incompatible encoding systems with a single universal standard. The result is now the foundation of virtually all modern text processing, yet the story of how it came to exist is one of frustration, ambition, and decades of painstaking committee work. This guide traces the full arc of Unicode from the pre-Unicode chaos of the 1980s through its creation, early controversies, and evolution into the sprawling standard that now defines over 154,000 characters across 168 scripts.
Before Unicode: The Encoding Tower of Babel
By the mid-1980s, computers around the world were drowning in incompatible character encodings. ASCII, the American Standard Code for Information Interchange, had been the lingua franca of English-language computing since 1963 -- but it only covered 128 characters. Every country and vendor that needed more characters invented its own extension:
| Encoding | Region / Language | Characters |
|---|---|---|
| ASCII | English (US) | 128 |
| ISO 8859-1 (Latin-1) | Western Europe | 256 |
| ISO 8859-5 | Cyrillic (Russian, etc.) | 256 |
| Windows-1252 | Western Europe (Microsoft) | 256 |
| KOI8-R | Russian | 256 |
| Shift_JIS | Japanese | ~7,000 |
| EUC-KR | Korean | ~8,000 |
| GB2312 / GBK | Simplified Chinese | ~7,000 / ~21,000 |
| Big5 | Traditional Chinese | ~13,000 |
| TIS-620 | Thai | 256 |
These encodings were mutually incompatible. A byte value of 0xC0 meant "A with grave accent"
in Latin-1, a Cyrillic letter in ISO 8859-5, and part of a multi-byte Japanese character in
Shift_JIS. The result was mojibake -- garbled text that appeared when data crossed encoding
boundaries. The Japanese term (文字化け, literally "character transformation") became the
universal name for this universal problem.
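The effect is easy to reproduce today: the same bytes decode to different text under different legacy encodings. A minimal Python sketch (the codec names are Python's standard aliases for these encodings):

```python
# The single byte 0xC0 means different things in different legacy encodings.
raw = bytes([0xC0])
print(raw.decode("latin-1"))      # 'À'  -- Latin capital A with grave
print(raw.decode("iso8859-5"))    # 'Р'  -- Cyrillic capital letter Er

# Classic mojibake: UTF-8 bytes misread as Latin-1.
text = "café"
garbled = text.encode("utf-8").decode("latin-1")
print(garbled)                    # 'cafÃ©'
```

Every web user of the 1990s saw output like that last line whenever a page declared the wrong charset.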
International organizations like ISO had attempted to create a unified encoding (the draft ISO 10646 project started in 1984), but the early approach tried to use a 32-bit code space with a complex multi-group, multi-plane, multi-row structure that was widely criticized as overengineered and impractical.
1987 -- 1991: The Birth of Unicode
The Xerox-Apple Collaboration
The Unicode project began in 1987 when Joe Becker, a software engineer at Xerox, wrote an internal draft titled "Unicode 88" proposing a universal 16-bit character set. Becker argued that 65,536 code points would be sufficient to encode every character needed for modern communication. He was soon joined by Lee Collins (also at Xerox) and Mark Davis (at Apple), who began collaborating on a concrete specification.
Their key design principles were:
- Universality -- cover all modern scripts in active use
- Efficiency -- use a fixed-width 16-bit encoding (two bytes per character)
- Unambiguity -- every code point maps to exactly one character, and vice versa
- Uniformity -- all characters are treated the same way by protocols, whether they are Latin letters, CJK ideographs, or symbols
The Founding of the Unicode Consortium
In 1988 and 1989, Becker and Davis circulated drafts among interested companies. By 1990, representatives from Xerox, Apple, Sun Microsystems, NeXT, Microsoft, and other companies had formed a working group. On January 3, 1991, the Unicode Consortium was formally incorporated as a non-profit organization in California.
The founding members included:
- Apple Computer -- Mark Davis was a key architect
- Xerox -- Joe Becker and Lee Collins contributed the original design
- Sun Microsystems -- provided engineering support for CJK unification
- NeXT -- Steve Jobs's company adopted Unicode early in NeXTSTEP
- Microsoft -- Windows NT was being designed around a 16-bit character model
- IBM -- brought experience from its EBCDIC and double-byte character set work
Unicode 1.0 (October 1991)
The first version of the Unicode Standard was published in October 1991 as a two-volume book. Unicode 1.0 defined 7,161 characters on a single plane of 65,536 code points (the Basic Multilingual Plane, or BMP). It included:
- Latin, Greek, Cyrillic, Armenian, Georgian, Hebrew, Arabic
- Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam
- Thai, Lao
- CJK Unified Ideographs -- the 20,902-character unified Han block, the largest single block, which followed with volume 2 of the books and Unicode 1.0.1 (June 1992)
- Mathematical operators, technical symbols, dingbats
- Control characters inherited from ASCII
The decision to unify Chinese, Japanese, and Korean ideographs (CJK Unification) into a single block was one of the most controversial choices in the standard's history. Characters that were historically distinct in China, Japan, and Korea were merged into single code points based on their abstract shape, even when their local typographic forms differed. This decision saved thousands of code points but generated lasting criticism from East Asian scholars and users who felt that important cultural distinctions were being erased.
1992 -- 1995: Merging with ISO 10646
The Great Merger
In parallel with Unicode, ISO was developing ISO/IEC 10646, a separate universal character set. Having two competing universal standards would have defeated the purpose of universality, so in 1991 the two groups began negotiations to synchronize their repertoires.
The merger was completed by 1993. The deal was:
- ISO 10646 and Unicode would always assign the same code points to the same characters
- ISO 10646 would define the raw character set (the "what")
- The Unicode Standard would add normative properties, algorithms, and implementation guidelines (the "how")
This division of labor persists today. When people say "Unicode," they usually mean the full package: character assignments, properties, encoding forms, and algorithms. ISO 10646 provides the character repertoire that underpins it.
Unicode 1.1 (June 1993)
Version 1.1 aligned Unicode with the merged ISO/IEC 10646-1:1993 and brought the total to 34,168 characters. The main additions were for Korean: thousands more pre-composed Hangul syllables (expanding that block to 6,656) along with the Hangul Jamo.
Unicode 2.0 (July 1996) -- Beyond 16 Bits
Unicode 2.0 was a watershed moment. The original design assumption that 65,536 code points would be enough proved wrong. Ancient scripts, rare CJK characters, and the growing demand for completeness required more space.
Unicode 2.0 introduced the surrogate mechanism: a range of code points (U+D800 to U+DFFF)
was reserved so that pairs of 16-bit code units could address characters beyond the BMP. This
effectively expanded the code space to 17 planes of 65,536 code points each, giving a total of
1,114,112 possible code points (U+0000 to U+10FFFF).
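The surrogate arithmetic is simple enough to verify by hand. A sketch in Python, cross-checked against the built-in UTF-16 codec:

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point (>= U+10000) into a UTF-16 surrogate pair."""
    v = cp - 0x10000                    # 20 bits remain after subtracting the BMP
    high = 0xD800 + (v >> 10)           # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (v & 0x3FF)          # bottom 10 bits -> low (trail) surrogate
    return high, low

high, low = to_surrogate_pair(0x1F600)  # U+1F600 GRINNING FACE
print(hex(high), hex(low))              # 0xd83d 0xde00

# Cross-check against Python's UTF-16 encoder (big-endian, no BOM).
assert chr(0x1F600).encode("utf-16-be") == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```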
Version 2.0 also completely redesigned the Hangul block, replacing the 6,656 pre-composed Hangul syllables from Unicode 1.1 with 11,172 algorithmically arranged syllables at new code points. Moving assigned characters broke backward compatibility -- the only time Unicode has done so. The surrogate mechanism, for its part, was formalized in a new encoding form: UTF-16.
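The redesigned Hangul block is algorithmic: a syllable's code point is computed from the indices of its leading consonant (L), vowel (V), and optional trailing consonant (T) using the standard's Hangul composition formula. A quick Python check, using the indices for 한 = ㅎ + ㅏ + ㄴ:

```python
S_BASE, V_COUNT, T_COUNT = 0xAC00, 21, 28

def compose_hangul(l: int, v: int, t: int = 0) -> str:
    """Compose a modern Hangul syllable from jamo indices per the Unicode formula."""
    return chr(S_BASE + (l * V_COUNT + v) * T_COUNT + t)

# HIEUH (L index 18) + A (V index 0) + NIEUN (T index 4) -> 한 (U+D55C)
print(compose_hangul(18, 0, 4))  # 한
```

Because the arrangement is algorithmic, decomposition (syllable back to jamo) is just the inverse arithmetic, and no lookup table is needed for either direction.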
1996 -- 2000: Encoding Wars and Industry Adoption
UTF-8 Changes Everything
While Unicode defined the characters, the question of how to store them in bytes was hotly debated. Three encoding forms competed:
| Encoding | Unit Size | BMP Characters | Supplementary Characters | ASCII Compatibility |
|---|---|---|---|---|
| UTF-32 | 4 bytes | 4 bytes each | 4 bytes each | No |
| UTF-16 | 2 bytes | 2 bytes each | 4 bytes (surrogate pair) | No |
| UTF-8 | 1 byte | 1--3 bytes each | 4 bytes each | Yes |
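The trade-offs in the table are easy to measure with Python's built-in codecs (the `-le` variants avoid emitting a BOM):

```python
# Compare storage cost per character across the three encoding forms.
for ch in "A", "é", "€", "😀":
    print(f"U+{ord(ch):05X} {ch!r}:"
          f" utf-8={len(ch.encode('utf-8'))}B"
          f" utf-16={len(ch.encode('utf-16-le'))}B"
          f" utf-32={len(ch.encode('utf-32-le'))}B")
# U+00041 'A': utf-8=1B utf-16=2B utf-32=4B
# U+000E9 'é': utf-8=2B utf-16=2B utf-32=4B
# U+020AC '€': utf-8=3B utf-16=2B utf-32=4B
# U+1F600 '😀': utf-8=4B utf-16=4B utf-32=4B
```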
UTF-8, originally designed by Ken Thompson and Rob Pike in 1992 for the Plan 9 operating system, had a critical advantage: it was backward-compatible with ASCII. Every valid ASCII file was also a valid UTF-8 file, byte for byte. This meant existing English-language software and protocols could adopt UTF-8 without modification.
UTF-8 was slow to gain traction in the 1990s (Microsoft bet on UTF-16 for Windows NT, Java chose UTF-16 internally), but by the early 2000s the tide turned decisively. HTML 4.0 (1997) recommended UTF-8. XML 1.0 (1998) made UTF-8 the default. The web made UTF-8 the winner.
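UTF-8's ASCII compatibility falls directly out of its bit layout: code points below 0x80 are stored as themselves in a single byte, and longer sequences use distinct lead-byte prefixes. A sketch of the encoder for valid scalar values (a real codec would also reject surrogates and out-of-range input):

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 (no surrogate/range validation)."""
    if cp < 0x80:                        # 0xxxxxxx -- identical to ASCII
        return bytes([cp])
    if cp < 0x800:                       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12, 0x80 | (cp >> 6) & 0x3F, 0x80 | cp & 0x3F])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | cp >> 18, 0x80 | (cp >> 12) & 0x3F,
                  0x80 | (cp >> 6) & 0x3F, 0x80 | cp & 0x3F])

# Matches Python's built-in codec for every character tried.
for ch in "A", "é", "€", "😀":
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
```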
Platform Adoption
- Windows NT (1993) was the first major OS to use Unicode internally, storing strings as 16-bit code units (UCS-2 at first, UTF-16 from Windows 2000 onward). Every version of Windows since has been Unicode-native.
- Java (1995) made `char` a 16-bit UTF-16 code unit. Java 5 (2004) added supplementary character support via surrogate pairs.
- Python 2 (2000) added a `unicode` type alongside `str`. Python 3 (2008) made all strings Unicode by default.
- ICU (International Components for Unicode) -- an open-source library from IBM, first released in 1999 -- provided production-quality Unicode algorithms (collation, normalization, bidirectional text) that most platforms eventually adopted.
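The difference between these string models is visible from Python: Python 3's `str` counts code points, while a UTF-16 platform such as Java or Windows sees two code units for the same supplementary character. A small illustration:

```python
s = "😀"                                  # U+1F600, outside the BMP
print(len(s))                             # 1 -- Python 3 counts code points
print(len(s.encode("utf-16-le")) // 2)    # 2 -- UTF-16 platforms see a surrogate pair
print(len(s.encode("utf-8")))             # 4 -- bytes on the wire in UTF-8
```

This is why `String.length()` in Java returns 2 for an emoji that users perceive as one character.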
2000 -- 2010: Expanding the World's Scripts
Unicode 3.0 -- 4.0: Historic Scripts and Supplementary Planes
| Version | Year | Total Characters | Notable Additions |
|---|---|---|---|
| 3.0 | 1999 | 49,259 | Cherokee, Ethiopic, Khmer, Mongolian, Myanmar, Sinhala |
| 3.1 | 2001 | 94,205 | First supplementary plane characters: Deseret, Gothic, Old Italic, musical symbols, CJK Extension B (42,711 ideographs) |
| 3.2 | 2002 | 95,221 | Philippine scripts (Buhid, Hanunoo, Tagalog, Tagbanwa) |
| 4.0 | 2003 | 96,447 | Cypriot, Limbu, Linear B, Osmanya, Shavian, Tai Le, Ugaritic |
Unicode 3.1 (2001) was a landmark: it was the first version to assign characters outside the BMP, using Plane 1 (the Supplementary Multilingual Plane) and Plane 2 (the Supplementary Ideographic Plane). This validated the expansion to 17 planes that Unicode 2.0 had made possible.
Unicode 5.0 -- 6.0: Accelerating Coverage
| Version | Year | Total Characters | Notable Additions |
|---|---|---|---|
| 5.0 | 2006 | 99,089 | N'Ko, Balinese, Phags-pa, Phoenician, Cuneiform, currency symbol additions |
| 5.1 | 2008 | 100,713 | Carian, Cham, Kayah Li, Lepcha, Lycian, Lydian, Ol Chiki, Rejang, Saurashtra, Sundanese, Vai |
| 5.2 | 2009 | 107,361 | Egyptian Hieroglyphs (1,071 characters), Bamum, Javanese, Lisu, Meetei Mayek, Samaritan, Tai Tham, Tai Viet |
| 6.0 | 2010 | 109,449 | Indian Rupee Sign (U+20B9), Mandaic, Batak, Brahmi, emoji (first official batch) |
Unicode 5.2's addition of Egyptian Hieroglyphs was symbolically important: it demonstrated that Unicode was committed to encoding not just living languages but the full breadth of human writing history.
2010 -- 2020: The Emoji Era
How Emoji Changed Unicode Forever
Unicode 6.0 (2010) was a turning point. Japanese mobile carriers (NTT DoCoMo, KDDI, SoftBank) had been using proprietary emoji sets since the late 1990s, and when Apple added emoji support to iOS in 2008 (initially for the Japanese market), demand exploded globally. Google and Apple petitioned the Unicode Consortium to standardize emoji, arguing that interoperability required a single universal encoding.
The Consortium agreed, and 722 emoji were added in Unicode 6.0. This decision transformed Unicode from an obscure technical standard into a cultural phenomenon. Suddenly, debates about which new emoji to encode made international headlines, and the Consortium's annual emoji announcements became mainstream news events.
Emoji Skin Tones and ZWJ Sequences
Unicode 8.0 (2015) introduced skin tone modifiers (based on the Fitzpatrick dermatology scale), allowing five skin tone variants for human emoji. Unicode 11.0 (2018) and later versions expanded ZWJ (Zero Width Joiner) sequences, enabling combinations like family groups (👨\u200D👩\u200D👧\u200D👦), professions (👩\u200D🔬 "woman scientist"), and a few flags such as the rainbow flag (white flag + ZWJ + rainbow).
ZWJ sequences are significant because they allow new "characters" to be created by combining existing code points, without consuming new code point slots. This mechanism has become the primary way new emoji are introduced.
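Under the hood a ZWJ sequence is just ordinary code points; Python can show the pieces of the "woman scientist" emoji mentioned above:

```python
import unicodedata

scientist = "\U0001F469\u200D\U0001F52C"   # WOMAN + ZWJ + MICROSCOPE
print(len(scientist))                      # 3 code points, one on-screen glyph
for ch in scientist:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+1F469 WOMAN
# U+200D ZERO WIDTH JOINER
# U+1F52C MICROSCOPE
```

A renderer that understands the sequence draws one glyph; one that does not falls back to showing the component emoji side by side, which is the designed degradation behavior.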
The Emoji Subcommittee
The enormous public interest in emoji led the Consortium to create a dedicated Emoji Subcommittee (now called the Emoji Standing Committee) to manage proposals. The committee evaluates proposals based on criteria including expected usage frequency, distinctiveness, and compatibility. Thousands of emoji proposals are submitted each year; only a few dozen are approved per release.
2020 -- Present: Maturity and Ongoing Challenges
Recent Milestones
| Version | Year | Total Characters | Notable Additions |
|---|---|---|---|
| 13.0 | 2020 | 143,859 | Yezidi, Chorasmian, Elymaic, Nandinagari, Wancho, 55 new emoji |
| 14.0 | 2021 | 144,697 | Toto, Cypro-Minoan, Old Uyghur, Tangsa, Vithkuqi, 37 new emoji |
| 15.0 | 2022 | 149,186 | Kawi, Nag Mundari, 31 new emoji, CJK Extension I (622 ideographs) |
| 15.1 | 2023 | 149,813 | 118 new CJK ideographs, no new emoji (maintenance release) |
| 16.0 | 2024 | 154,998 | Garay, Gurung Khema, Kirat Rai, Ol Onal, Sunuwar, Todhri, Tulu-Tigalari, Egyptian Hieroglyphs Extended-A (991 characters), 7 new emoji |
Unicode 16.0 brought the total to nearly 155,000 characters covering 168 scripts. The standard now encodes scripts ranging from the oldest known writing (Sumerian Cuneiform, ~3400 BCE) to scripts invented within living memory (Adlam, devised in the late 1980s for the Fulani language and encoded in Unicode 9.0).
Ongoing Controversies and Challenges
CJK Unification remains contentious. While the technical arguments for unification are sound (it avoids duplicating tens of thousands of code points), users in China, Japan, and Korea continue to argue that locale-specific glyph variants should be encoded separately rather than relying on font selection.
Script equity is another concern. European scripts were encoded first and most completely, while many African, Southeast Asian, and indigenous scripts waited decades. Organizations like the Script Encoding Initiative (based at UC Berkeley) work to prepare proposals for underrepresented scripts.
Emoji governance has attracted criticism from those who feel the Consortium spends disproportionate resources on emoji at the expense of script encoding. The Consortium has pushed back, noting that emoji generate significant public interest and funding that supports all encoding work.
Security is an ongoing battle. Unicode's enormous repertoire enables homograph attacks (using look-alike characters from different scripts to spoof domain names) and invisible text attacks (using zero-width characters to hide content). The Unicode Consortium publishes Technical Reports on security considerations, but the arms race between attackers and defenders continues.
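Both attack classes are easy to demonstrate in a few lines; the spoofed strings below are illustrative examples, not real domains:

```python
import unicodedata

# Homograph attack: Cyrillic 'а' (U+0430) is visually identical to Latin 'a'.
legit = "paypal.com"
spoof = "pаypаl.com"          # contains two Cyrillic а's
print(legit == spoof)          # False, though many fonts render them identically
print(unicodedata.name(spoof[1]))  # CYRILLIC SMALL LETTER A

# Invisible text attack: a zero-width space hides inside "admin".
hidden = "ad\u200Bmin"
print(hidden == "admin")       # False -- naive string comparisons and filters miss it
```

Defenses typically involve restricting mixed-script identifiers and stripping or rejecting zero-width characters in security-sensitive contexts.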
The People Behind Unicode
While Unicode is a collective effort, several individuals deserve special recognition:
- Joe Becker (Xerox) -- conceived the original Unicode idea and wrote "Unicode 88"
- Lee Collins (Xerox) -- co-designed the first specification
- Mark Davis (Apple, later Google) -- co-founder and long-serving president of the Consortium, primary author of many Unicode algorithms, and a driving force behind the CLDR and ICU projects
- Ken Whistler -- has worked on Unicode since the earliest days, maintains the Unicode Character Database
- Asmus Freytag -- major contributor to Unicode algorithms and the ISO 10646 synchronization
- Ken Thompson and Rob Pike -- designed UTF-8 at Bell Labs in 1992
- Michael Everson -- prolific proposer of scripts, responsible for dozens of historic and minority script additions
Key Takeaways
- Unicode was born from the encoding chaos of the 1980s, when hundreds of incompatible character sets made international text interchange nearly impossible.
- The standard was created through collaboration between engineers at Xerox, Apple, Sun, and Microsoft starting in 1987, and the Unicode Consortium was incorporated in 1991.
- The original assumption that 65,536 code points would suffice was wrong; Unicode 2.0 (1996) expanded the code space to 1,114,112 code points across 17 planes.
- UTF-8 -- designed by Ken Thompson and Rob Pike in 1992 -- won the encoding war thanks to its backward compatibility with ASCII, and today accounts for over 98% of web pages.
- The addition of emoji starting in Unicode 6.0 (2010) transformed the Consortium from a niche standards body into a cultural institution.
- As of Unicode 16.0 (2024), the standard defines nearly 155,000 characters across 168 scripts, with ongoing work to encode the remaining unrepresented writing systems of the world.
- The synchronization between Unicode and ISO/IEC 10646 ensures that there is truly one universal character set, even though the two standards are maintained by different organizations.