📚 Unicode Fundamentals

History of Unicode

Unicode began in 1987 as a collaboration between engineers at Apple and Xerox who wanted to replace hundreds of incompatible encoding systems with a single universal standard. This guide tells the story of how Unicode was created, standardized, and adopted to become the foundation of modern text.


The Unicode Standard is the foundation of virtually all modern text processing, yet the story of how it came to exist is one of frustration, ambition, and decades of painstaking committee work. This guide traces the full arc of Unicode from the pre-Unicode chaos of the 1980s through its creation, early controversies, and evolution into the sprawling standard that now defines over 154,000 characters across 168 scripts.

Before Unicode: The Encoding Tower of Babel

By the mid-1980s, computers around the world were drowning in incompatible character encodings. ASCII, the American Standard Code for Information Interchange, had been the lingua franca of English-language computing since 1963 -- but it only covered 128 characters. Every country and vendor that needed more characters invented its own extension:

Encoding              Region / Language            Characters
--------              -----------------            ----------
ASCII                 English (US)                 128
ISO 8859-1 (Latin-1)  Western Europe               256
ISO 8859-5            Cyrillic (Russian, etc.)     256
Windows-1252          Western Europe (Microsoft)   256
KOI8-R                Russian                      256
Shift_JIS             Japanese                     ~7,000
EUC-KR                Korean                       ~8,000
GB2312 / GBK          Simplified Chinese           ~7,000 / ~21,000
Big5                  Traditional Chinese          ~13,000
TIS-620               Thai                         256

These encodings were mutually incompatible. A byte value of 0xC0 meant "A with grave accent" in Latin-1, a Cyrillic letter in ISO 8859-5, and a half-width katakana (or the trailing byte of a two-byte character) in Shift_JIS. The result was mojibake -- garbled text that appeared when data crossed encoding boundaries. The Japanese term (文字化け, literally "character transformation") became the universal name for this universal problem.
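The ambiguity is easy to demonstrate. A minimal sketch in Python (using the standard library's codec names) decodes that same 0xC0 byte under three of the encodings above:

```python
# One byte, three legacy encodings, three different letters.
raw = bytes([0xC0])

print(raw.decode("latin-1"))    # 'À'  LATIN CAPITAL LETTER A WITH GRAVE
print(raw.decode("iso8859-5"))  # 'Р'  CYRILLIC CAPITAL LETTER ER
print(raw.decode("koi8-r"))     # 'ю'  CYRILLIC SMALL LETTER YU
```

Decoding with the wrong table silently succeeds, which is exactly why mojibake spread: nothing in the byte stream says which encoding was intended.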

International organizations like ISO had attempted to create a unified encoding (the draft ISO 10646 project started in 1984), but the early approach tried to use a 32-bit code space with a complex multi-group, multi-plane, multi-row structure that was widely criticized as overengineered and impractical.

1987 -- 1991: The Birth of Unicode

The Xerox-Apple Collaboration

The Unicode project began in 1987, when Joe Becker, a software engineer at Xerox, started exploring a universal character set together with Lee Collins (also at Xerox) and Mark Davis (at Apple). In August 1988 Becker circulated a draft titled "Unicode 88" proposing a universal 16-bit character set, arguing that 65,536 code points would be sufficient to encode every character needed for modern communication. The three then began collaborating on a concrete specification.

Their key design principles were:

  1. Universality -- cover all modern scripts in active use
  2. Efficiency -- use a fixed-width 16-bit encoding (two bytes per character)
  3. Unambiguity -- every code point maps to exactly one character, and vice versa
  4. Uniformity -- all characters are treated the same way by protocols, whether they are Latin letters, CJK ideographs, or symbols

The Founding of the Unicode Consortium

In 1988 and 1989, Becker and Davis circulated drafts among interested companies. By 1990, representatives from Xerox, Apple, Sun Microsystems, NeXT, Microsoft, and other companies had formed a working group. On January 3, 1991, the Unicode Consortium was formally incorporated as a non-profit organization in California.

The founding members included:

  • Apple Computer -- Mark Davis was a key architect
  • Xerox -- Joe Becker and Lee Collins contributed the original design
  • Sun Microsystems -- provided engineering support for CJK unification
  • NeXT -- Steve Jobs's company adopted Unicode early in NeXTSTEP
  • Microsoft -- Windows NT was being designed around a 16-bit character model
  • IBM -- brought experience from its EBCDIC and double-byte character set work

Unicode 1.0 (October 1991)

The first version of the Unicode Standard was published in October 1991 as Volume 1 of a two-volume book (Volume 2 followed in 1992). Unicode 1.0 defined 7,161 characters on a single plane of 65,536 code points (the Basic Multilingual Plane, or BMP). Its repertoire included:

  • Latin, Greek, Cyrillic, Armenian, Georgian, Hebrew, Arabic
  • Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam
  • Thai, Lao
  • CJK Unified Ideographs -- the 20,902-character unified block, the largest in the standard, which arrived with Volume 2 and the 1.0.1 update in June 1992
  • Mathematical operators, technical symbols, dingbats
  • Control characters inherited from ASCII

The decision to unify Chinese, Japanese, and Korean ideographs (CJK Unification) into a single block was one of the most controversial choices in the standard's history. Characters that were historically distinct in China, Japan, and Korea were merged into single code points based on their abstract shape, even when their local typographic forms differed. This decision saved thousands of code points but generated lasting criticism from East Asian scholars and users who felt that important cultural distinctions were being erased.

1992 -- 1995: Merging with ISO 10646

The Great Merger

In parallel with Unicode, ISO was developing ISO/IEC 10646, a separate universal character set. Having two competing universal standards would have defeated the purpose of universality, so in 1991 the two groups began negotiations to synchronize their repertoires.

The merger was completed by 1993. The deal was:

  • ISO 10646 and Unicode would always assign the same code points to the same characters
  • ISO 10646 would define the raw character set (the "what")
  • The Unicode Standard would add normative properties, algorithms, and implementation guidelines (the "how")

This division of labor persists today. When people say "Unicode," they usually mean the full package: character assignments, properties, encoding forms, and algorithms. ISO 10646 provides the character repertoire that underpins it.

Unicode 1.1 (June 1993)

Version 1.1 (June 1993) brought the total to 34,168 characters. The intervening 1.0.1 update (June 1992) had delivered the 20,902 CJK Unified Ideographs; 1.1's own headline additions were an expanded set of precomposed Hangul syllables and conjoining Hangul Jamo for Korean.

Unicode 2.0 (July 1996) -- Beyond 16 Bits

Unicode 2.0 was a watershed moment. The original design assumption that 65,536 code points would be enough proved wrong. Ancient scripts, rare CJK characters, and the growing demand for completeness required more space.

Unicode 2.0 introduced the surrogate mechanism: a range of code points (U+D800 to U+DFFF) was reserved so that pairs of 16-bit code units could address characters beyond the BMP. This effectively expanded the code space to 17 planes of 65,536 code points each, for a total of 1,114,112 possible code points (U+0000 to U+10FFFF), and it turned the original fixed-width 16-bit encoding into a new variable-width encoding form: UTF-16.
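The split is pure bit arithmetic. A minimal sketch (the constants are the ones defined by the standard; the function name is illustrative):

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into
    a high and low surrogate, per the Unicode 2.0 mechanism."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000                 # 20 bits remain after the offset
    high = 0xD800 + (v >> 10)        # top 10 bits -> high surrogate
    low = 0xDC00 + (v & 0x3FF)       # bottom 10 bits -> low surrogate
    return high, low

# U+1F600 GRINNING FACE lies on Plane 1:
print([hex(u) for u in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']
```

The same pair falls out of any UTF-16 encoder: `"\U0001F600".encode("utf-16-be")` yields the bytes D8 3D DE 00.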

Version 2.0 also completely redesigned the Hangul block, replacing the 6,656 pre-composed Hangul syllables from Unicode 1.1 with 11,172 algorithmically composed syllables at a new location. This "Hangul mess" was the only time Unicode moved already-assigned characters, and the stability policies adopted afterward forbid any such change from happening again.
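The 11,172 syllables are generated rather than listed: each one is computed from a leading consonant, a vowel, and an optional trailing consonant (19 × 21 × 28 = 11,172). A sketch of the composition arithmetic (function name illustrative; the constants come from the standard):

```python
S_BASE, V_COUNT, T_COUNT = 0xAC00, 21, 28  # constants from the Unicode standard

def compose_hangul(l: int, v: int, t: int = 0) -> str:
    """Compose a syllable from leading (0-18), vowel (0-20),
    and optional trailing (0-27) jamo indices."""
    return chr(S_BASE + (l * V_COUNT + v) * T_COUNT + t)

# HAN as in 한글: leading ㅎ (index 18), vowel ㅏ (0), trailing ㄴ (4)
print(compose_hangul(18, 0, 4))  # '한' (U+D55C)
```

Because syllables are computed, fonts and algorithms never need a lookup table for the block, only this formula.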

1996 -- 2000: Encoding Wars and Industry Adoption

UTF-8 Changes Everything

While Unicode defined the characters, the question of how to store them in bytes was hotly debated. Three encoding forms competed:

Encoding  Unit Size  BMP Characters    Supplementary Characters   ASCII-Compatible
--------  ---------  --------------    ------------------------   ----------------
UTF-32    4 bytes    4 bytes each      4 bytes each               No
UTF-16    2 bytes    2 bytes each      4 bytes (surrogate pair)   No
UTF-8     1 byte     1--3 bytes each   4 bytes each               Yes

UTF-8, originally designed by Ken Thompson and Rob Pike in 1992 for the Plan 9 operating system, had a critical advantage: it was backward-compatible with ASCII. Every valid ASCII file was also a valid UTF-8 file, byte for byte. This meant existing English-language software and protocols could adopt UTF-8 without modification.
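UTF-8's design is easiest to see in its bit patterns. A simplified sketch of an encoder for a single code point (validation of surrogates and out-of-range values omitted); note the first branch, which is the whole ASCII-compatibility story -- code points below 0x80 encode as their ASCII byte, unchanged:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point to UTF-8 (no surrogate/range checks)."""
    if cp < 0x80:                 # 0xxxxxxx -- identical to ASCII
        return bytes([cp])
    if cp < 0x800:                # 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:              # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12, 0x80 | (cp >> 6) & 0x3F,
                      0x80 | cp & 0x3F])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | cp >> 18, 0x80 | (cp >> 12) & 0x3F,
                  0x80 | (cp >> 6) & 0x3F, 0x80 | cp & 0x3F])

print(utf8_encode(ord("A")))   # b'A' -- one byte, a valid ASCII file stays valid
print(utf8_encode(ord("€")))   # three bytes: e2 82 ac
print(utf8_encode(0x1F600))    # four bytes: f0 9f 98 80
```

The leading byte announces the sequence length and every continuation byte starts with 10, so a decoder can resynchronize mid-stream -- another property the competing encodings lacked.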

UTF-8 was slow to gain traction in the 1990s (Microsoft bet on UTF-16 for Windows NT, Java chose UTF-16 internally), but by the early 2000s the tide turned decisively. HTML 4.0 (1997) recommended UTF-8. XML 1.0 (1998) made UTF-8 the default. The web made UTF-8 the winner.

Platform Adoption

  • Windows NT (1993) was the first major OS to use Unicode internally, storing strings as UCS-2 (later upgraded to UTF-16). Every version of Windows since has been Unicode-native.
  • Java (1995) made char a 16-bit Unicode code unit (UTF-16). Java 5 (2004) added supplementary character support via surrogate pairs.
  • Python 2 (2000) added a unicode type alongside str. Python 3 (2008) made all strings Unicode by default.
  • ICU (International Components for Unicode) -- an open-source library from IBM, first released in 1999 -- provided production-quality Unicode algorithms (collation, normalization, bidirectional text) that most platforms eventually adopted.

2000 -- 2010: Expanding the World's Scripts

Unicode 3.0 -- 4.0: Historic Scripts and Supplementary Planes

Version  Year  Total Characters  Notable Additions
-------  ----  ----------------  -----------------
3.0      1999  49,259            Cherokee, Ethiopic, Khmer, Mongolian, Myanmar, Sinhala
3.1      2001  94,205            First supplementary-plane characters: Deseret, Gothic, Old Italic, musical symbols, CJK Extension B (42,711 ideographs)
3.2      2002  95,221            Philippine scripts (Buhid, Hanunoo, Tagalog, Tagbanwa)
4.0      2003  96,447            Cypriot, Limbu, Linear B, Osmanya, Shavian, Tai Le, Ugaritic

Unicode 3.1 (2001) was a landmark: it was the first version to assign characters outside the BMP, using Plane 1 (the Supplementary Multilingual Plane) and Plane 2 (the Supplementary Ideographic Plane). This validated the expansion to 17 planes that Unicode 2.0 had made possible.
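Which plane a code point belongs to is simply its high-order bits. A tiny illustration:

```python
def plane(cp: int) -> int:
    """Return the Unicode plane (0-16) of a code point."""
    return cp >> 16   # each plane holds 65,536 code points

print(plane(0x0041))    # 0 -- BMP (LATIN CAPITAL LETTER A)
print(plane(0x1D11E))   # 1 -- SMP (MUSICAL SYMBOL G CLEF, added in 3.1)
print(plane(0x20000))   # 2 -- SIP (start of CJK Extension B, added in 3.1)
```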

Unicode 5.0 -- 6.0: Accelerating Coverage

Version  Year  Total Characters  Notable Additions
-------  ----  ----------------  -----------------
5.0      2006  99,089            N'Ko, Balinese, Phags-pa, Phoenician, Cuneiform, currency symbol additions
5.1      2008  100,713           Carian, Cham, Kayah Li, Lepcha, Lycian, Lydian, Ol Chiki, Rejang, Saurashtra, Sundanese, Vai
5.2      2009  107,361           Egyptian Hieroglyphs (1,071 characters), Bamum, Javanese, Lisu, Meetei Mayek, Samaritan, Tai Tham, Tai Viet
6.0      2010  109,449           Indian Rupee Sign (U+20B9), Mandaic, Batak, Brahmi, emoji (first official batch)

Unicode 5.2's addition of Egyptian Hieroglyphs was symbolically important: it demonstrated that Unicode was committed to encoding not just living languages but the full breadth of human writing history.

2010 -- 2020: The Emoji Era

How Emoji Changed Unicode Forever

Unicode 6.0 (2010) was a turning point. Japanese mobile carriers (NTT DoCoMo, KDDI, SoftBank) had been using proprietary emoji sets since the late 1990s, and when Apple added emoji support to iOS in 2008 (initially for the Japanese market), demand exploded globally. Google and Apple petitioned the Unicode Consortium to standardize emoji, arguing that interoperability required a single universal encoding.

The Consortium agreed, and 722 emoji were added in Unicode 6.0. This decision transformed Unicode from an obscure technical standard into a cultural phenomenon. Suddenly, debates about which new emoji to encode made international headlines, and the Consortium's annual emoji announcements became mainstream news events.

Emoji Skin Tones and ZWJ Sequences

Unicode 8.0 (2015) introduced skin tone modifiers (based on the Fitzpatrick dermatology scale), allowing five skin tone variants for human emoji. Unicode 11.0 (2018) and later versions expanded ZWJ (Zero Width Joiner) sequences, enabling combinations like family groups (👨\u200D👩\u200D👧\u200D👦), professions (👩\u200D🔬 "woman scientist"), and flags such as the pirate flag (most country flags instead use pairs of regional indicator symbols, not ZWJ).

ZWJ sequences are significant because they allow new "characters" to be created by combining existing code points, without consuming new code point slots. This mechanism has become the primary way new emoji are introduced.
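At the code point level a ZWJ sequence is nothing exotic -- just ordinary code points with U+200D between them; whether it renders as one glyph is up to the font and text renderer. A sketch:

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER

# WOMAN (U+1F469) + ZWJ + MICROSCOPE (U+1F52C) = "woman scientist"
woman_scientist = "\U0001F469" + ZWJ + "\U0001F52C"
print(woman_scientist)        # one emoji in supporting fonts
print(len(woman_scientist))   # 3 -- still three code points underneath

# Skin tone works differently: the modifier follows the base directly, no ZWJ.
thumbs_up_medium = "\U0001F44D\U0001F3FD"  # THUMBS UP + MEDIUM SKIN TONE
print(len(thumbs_up_medium))  # 2
```

This gap between what the user sees (one grapheme cluster) and what the string contains (several code points) is why emoji routinely break naive length and slicing code.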

The Emoji Subcommittee

The enormous public interest in emoji led the Consortium to create a dedicated Emoji Subcommittee (now called the Emoji Standing Committee) to manage proposals. The committee evaluates proposals based on criteria including expected usage frequency, distinctiveness, and compatibility. Thousands of emoji proposals are submitted each year; only a few dozen are approved per release.

2020 -- Present: Maturity and Ongoing Challenges

Recent Milestones

Version  Year  Total Characters  Notable Additions
-------  ----  ----------------  -----------------
13.0     2020  143,859           Yezidi, Chorasmian, Dives Akuru, Khitan Small Script, 55 new emoji
14.0     2021  144,697           Toto, Cypro-Minoan, Old Uyghur, Tangsa, Vithkuqi, 37 new emoji
15.0     2022  149,186           Kawi, Nag Mundari, 31 new emoji, CJK Extension H (4,192 ideographs)
15.1     2023  149,813           CJK Extension I (622 ideographs), no new emoji (maintenance release)
16.0     2024  154,998           Garay, Gurung Khema, Kirat Rai, Ol Onal, Sunuwar, Todhri, Tulu-Tigalari, Egyptian Hieroglyphs Extended-A (3,995 characters), 7 new emoji

Unicode 16.0 brought the total to nearly 155,000 characters covering 168 scripts. The standard now encodes scripts ranging from the oldest known writing (Sumerian Cuneiform, ~3400 BCE) to scripts invented within living memory (Adlam, devised in 1989 for the Fulani language and encoded in 2016).

Ongoing Controversies and Challenges

CJK Unification remains contentious. While the technical arguments for unification are sound (it avoids duplicating tens of thousands of code points), users in China, Japan, and Korea continue to argue that locale-specific glyph variants should be encoded separately rather than relying on font selection.

Script equity is another concern. European scripts were encoded first and most completely, while many African, Southeast Asian, and indigenous scripts waited decades. Organizations like the Script Encoding Initiative (based at UC Berkeley) work to prepare proposals for underrepresented scripts.

Emoji governance has attracted criticism from those who feel the Consortium spends disproportionate resources on emoji at the expense of script encoding. The Consortium has pushed back, noting that emoji generate significant public interest and funding that supports all encoding work.

Security is an ongoing battle. Unicode's enormous repertoire enables homograph attacks (using look-alike characters from different scripts to spoof domain names) and invisible text attacks (using zero-width characters to hide content). The Unicode Consortium publishes Technical Reports on security considerations, but the arms race between attackers and defenders continues.
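Both attack classes cost almost nothing to mount, which is why careful software inspects code points rather than trusting what a string looks like. A small illustration (the domain names are made up):

```python
import unicodedata

# Homograph attack: Cyrillic U+0430 renders like Latin 'a' in most fonts.
real = "apple.com"
fake = "\u0430pple.com"
print(real == fake)               # False -- different code points
print(unicodedata.name(fake[0]))  # CYRILLIC SMALL LETTER A

# Invisible text: a zero-width space hides inside an ordinary-looking word.
hidden = "pass\u200Bword"
print(hidden)                             # displays as 'password'
print(len(hidden), len("password"))       # 9 vs 8
```

Checking `unicodedata.name()` (or the script property) on each character is the basic building block behind the confusable-detection techniques the Unicode Technical Reports describe.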

The People Behind Unicode

While Unicode is a collective effort, several individuals deserve special recognition:

  • Joe Becker (Xerox) -- conceived the original Unicode idea and wrote "Unicode 88"
  • Lee Collins (Xerox) -- co-designed the first specification
  • Mark Davis (Apple, later Google) -- co-founder and long-serving president of the Consortium, primary author of many Unicode algorithms, and a driving force behind the CLDR and ICU projects
  • Ken Whistler -- has worked on Unicode since the earliest days, maintains the Unicode Character Database
  • Asmus Freytag -- major contributor to Unicode algorithms and the ISO 10646 synchronization
  • Ken Thompson and Rob Pike -- designed UTF-8 at Bell Labs in 1992
  • Michael Everson -- prolific proposer of scripts, responsible for dozens of historic and minority script additions

Key Takeaways

  • Unicode was born from the encoding chaos of the 1980s, when hundreds of incompatible character sets made international text interchange nearly impossible.
  • The standard was created through collaboration between engineers at Xerox, Apple, Sun, and Microsoft starting in 1987, and the Unicode Consortium was incorporated in 1991.
  • The original assumption that 65,536 code points would suffice was wrong; Unicode 2.0 (1996) expanded the code space to 1,114,112 code points across 17 planes.
  • UTF-8 -- designed by Ken Thompson and Rob Pike in 1992 -- won the encoding war thanks to its backward compatibility with ASCII, and today accounts for over 98% of web pages.
  • The addition of emoji starting in Unicode 6.0 (2010) transformed the Consortium from a niche standards body into a cultural institution.
  • As of Unicode 16.0 (2024), the standard defines nearly 155,000 characters across 168 scripts, with ongoing work to encode the remaining unrepresented writing systems of the world.
  • The synchronization between Unicode and ISO/IEC 10646 ensures that there is truly one universal character set, even though the two standards are maintained by different organizations.
