The Encoding Wars · Chapter 7
Unicode Today and Tomorrow
Unicode 16.0 covers 154,998 characters across 168 scripts. This chapter surveys the current state of Unicode, undeciphered scripts waiting to be encoded, and AI-assisted encoding proposals.
Unicode 16.0, released in September 2024, contains 154,998 assigned characters. These span 168 scripts, from the ancient Phoenician alphabet that gave rise to most of the world's writing systems to Gurung Khema, a script used for the Gurung language of Nepal that was encoded for the first time in this release. The 1,114,112-codepoint code space — U+0000 through U+10FFFF, organized in 17 planes of 65,536 code points each — is approximately 14% utilized, leaving room for roughly 960,000 additional characters. The Encoding Wars, in their original sense — the battle between incompatible proprietary systems for digital supremacy — have been decisively won by Unicode. But the peace that followed has its own tensions, its own unsolved problems, and its own frontiers.
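The code-space arithmetic above can be checked in a few lines; a quick sketch (the 154,998 figure is the Unicode 16.0 assigned-character count quoted in this paragraph):

```python
# Unicode code space: 17 planes of 65,536 code points each.
planes = 17
plane_size = 0x10000                    # 65,536
code_space = planes * plane_size        # U+0000 through U+10FFFF
assigned = 154_998                      # Unicode 16.0

print(code_space)                                 # 1114112
print(round(100 * assigned / code_space, 1))      # 13.9 (% utilized)
print(code_space - assigned)                      # 959114 unassigned
```

Strictly, the 2,048 surrogate code points and 66 noncharacters can never be assigned, so the truly available headroom is slightly smaller than the unassigned count suggests.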
Understanding where Unicode stands today means understanding both what it has accomplished and what remains genuinely difficult. The 154,998-character figure represents decades of scholarly work, political negotiation, cultural compromise, and technical innovation. The 960,000 unassigned code points are not empty space waiting to be filled arbitrarily — they represent the boundary between what humanity has encoded and what it has not yet been able to agree on, understand well enough, or even document.
The Landscape of Unicode 16.0
The 168 scripts in Unicode 16.0 span a vast range of historical periods, geographical distributions, and current vitality. At one end of the spectrum are the actively used writing systems with billions of daily users: Latin (used to write English, Spanish, French, German, Portuguese, Italian, and dozens of other languages), CJK Unified Ideographs (the Han characters used to write Chinese and Japanese, and historically Korean), Devanagari (Hindi, Sanskrit, and related languages), Arabic (Arabic, Persian, Urdu, and many other languages), and Cyrillic (Russian, Ukrainian, Bulgarian, Serbian, and others). These scripts have been in Unicode since version 1.0 and receive incremental additions as new characters are documented.
At the other end of the spectrum are historical scripts encoded for scholarly and archival purposes, with few or no living speakers or writers. Linear B, the syllabic writing system used for an early form of Greek and deciphered from Bronze Age tablets in 1952, was encoded in Unicode 4.0 (2003). Egyptian Hieroglyphs, the writing system of ancient Egypt used for over 3,000 years, was encoded in Unicode 5.2 (2009). Cuneiform, the wedge-shaped writing system used across ancient Mesopotamia for nearly 3,000 years, was encoded in Unicode 5.0 (2006). The presence of these scripts in Unicode enables scholarly databases, digital editions of ancient texts, and educational resources that were previously limited to specialized fonts and non-interoperable encodings.
Between these extremes are the minority and endangered living scripts — writing systems for languages with hundreds of thousands or millions of speakers but limited digital infrastructure. The Rohingya people of Myanmar developed the Hanifi Rohingya script in the 20th century as a way of writing their language without dependence on Arabic, Burmese, or Latin scripts. It was encoded in Unicode 11.0 (2018). The Gurung (Tamu) people of Nepal had their own script, Gurung Khema, which was encoded in Unicode 16.0. Each of these additions represents advocacy work by community members, scholarly documentation by linguists, and formal review by the Unicode Technical Committee.
CJK: The Ongoing Frontier
The CJK (Chinese-Japanese-Korean) Unified Ideographs represent both Unicode's greatest triumph and its most persistent challenge. The main CJK block (U+4E00 to U+9FFF) contains the 20,902 characters of the original Unified Repertoire and Ordering. Extension blocks have been added steadily since: Extension A (6,582 characters, Unicode 3.0), Extension B (42,711 characters, Unicode 3.1), Extension C (4,149 characters, Unicode 5.2), Extension D (222 characters, Unicode 6.0), Extension E (5,762 characters, Unicode 8.0), Extension F (7,473 characters, Unicode 10.0), Extension G (4,939 characters, Unicode 13.0), Extension H (4,192 characters, Unicode 15.0), and Extension I (622 characters, Unicode 15.1). The total CJK unified ideograph count exceeds 97,000 characters.
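A quick tally of the block sizes listed above (counts copied from this paragraph, not recomputed from the Unicode Character Database):

```python
# CJK unified ideograph counts per block, as listed in the text.
cjk_blocks = {
    "URO (U+4E00..U+9FFF)": 20_902,
    "Extension A": 6_582,
    "Extension B": 42_711,
    "Extension C": 4_149,
    "Extension D": 222,
    "Extension E": 5_762,
    "Extension F": 7_473,
    "Extension G": 4_939,
    "Extension H": 4_192,
    "Extension I": 622,
}

total = sum(cjk_blocks.values())
print(f"{total:,}")  # 97,554 ("exceeds 97,000")
```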
Each CJK extension addition represents characters found in historical dictionaries, classical literature, specialized technical terminology, and regional variants. The Kangxi Dictionary (1716), the standard reference for classical Chinese characters, contains 47,035 entries. The Dai Kanwa Jiten (1955-1960), the comprehensive Japanese character dictionary, contains approximately 50,000 entries. Many of these characters are rare — used in fewer than a handful of surviving texts — but scholarship requires their representation, and once a character has appeared in a text that linguists study, Unicode is responsible for encoding it.
The Han unification debate persists in specific high-stakes contexts. The most practical issue is in electronic government systems in East Asia. Japan's government maintains official standards for how thousands of characters are to be rendered in government documents — the "Koseki Moji" standards for family registry characters, which specify precise stroke forms that differ from both simplified Chinese and generic Unicode rendering. Taiwan and the People's Republic of China maintain different official standards for the same characters. Software that renders Han characters for government use must select glyphs based on the national context, which requires language-tagging infrastructure that is not uniformly implemented.
The ongoing solution is the development of better font technology. The OpenType specification's locl (localized forms) feature allows fonts to substitute different glyph shapes for the same code point based on a language tag in the text. When text is tagged as Japanese (OpenType language system tag JAN), a compliant font can serve Japanese-standard glyph forms; when tagged as Simplified Chinese (ZHS), it serves Chinese-standard forms. This requires both correctly tagged text and compliant fonts, and the infrastructure for both is improving but not yet universal.
Emoji Governance in the 2020s
The annual emoji addition cycle has evolved from a simple technical process into something resembling cultural policy governance. The Unicode Emoji Subcommittee, established in 2015, evaluates proposals against published criteria: expected usage frequency (measured partly by social media survey data), image distinctiveness (can the proposed emoji be visually distinguished from existing emoji at small sizes), available design approaches (can it be rendered consistently across platforms), and completeness (if the proposed emoji fits a category that has a limited set of members, should the whole category be encoded together).
These criteria are applied to a growing volume of proposals. Unicode receives hundreds of emoji proposals annually. Most are rejected, and the reasons for rejection have become a subject of public discussion, sometimes controversy. Proposals representing foods popular in specific regions but unknown globally may be rejected on expected-usage grounds — a decision that critics argue perpetuates the dominance of already-digitally-represented cultures. Proposals for flags of territories with contested political status raise questions about whether the Unicode Consortium, as a private organization, should effectively be making political recognitions through its character encoding decisions.
The major platform vendors — Apple, Google, Meta, Microsoft, Samsung, X (formerly Twitter) — are all Unicode Consortium voting members and participate in emoji decisions. Their commercial interests in emoji are substantial: popular emoji drive user engagement on platforms, and platforms compete to implement newly announced emoji quickly as a product differentiator. Apple typically releases iOS updates including newly approved Unicode emoji within months of the Unicode version's publication. The speed of implementation has created a product-development rhythm around Unicode releases that is unusual for a standards body.
AI and Unicode: A New Frontier
The rapid growth of large language models (LLMs) from 2020 onward has introduced a dimension to Unicode's importance that the consortium's founders could not have anticipated. LLMs don't process text as Unicode code points; they process text as tokens, produced by a tokenization algorithm — typically Byte Pair Encoding (BPE) or SentencePiece — trained on large text corpora. The token vocabulary might contain 50,000 to 200,000 tokens, each representing a frequently occurring sequence of characters.
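A toy sketch of what BPE training does: start from individual characters and repeatedly merge the most frequent adjacent pair into a new vocabulary entry. (Production tokenizers such as those in the tiktoken and SentencePiece libraries add byte-level fallback, pre-tokenization, and heavy optimization; this is only the core loop.)

```python
from collections import Counter

def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Toy BPE trainer: returns the learned merge rules in order."""
    words = [list(w) for w in corpus]   # each word as a list of symbols
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)     # most frequent pair
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite every word, replacing occurrences of the pair.
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges

print(train_bpe(["low", "low", "lower", "newest", "newest"], 2))
# [('l', 'o'), ('lo', 'w')]
```

Frequent sequences ("lo", then "low") become single tokens; rare sequences stay split into many tokens, which is exactly the source of the multilingual disparity discussed next.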
Because the text corpora on which tokenizers are trained are heavily skewed toward English (and secondarily toward other Western European languages), English words tend to receive short token sequences — often a single token for a common word. Words in languages with less representation in training data, or in scripts that require more bytes per character in UTF-8, are tokenized into longer sequences. A word in Thai or Amharic might be tokenized into 10-20 tokens while a comparable English word gets 1-2 tokens.
The computational consequence is that LLMs effectively process multilingual text with different efficiencies. In a transformer architecture, every token costs the same computation: attention is O(n²) in sequence length, and more tokens mean more computation and memory. Users querying LLMs in underrepresented languages pay a real computational cost — and in systems that charge per token, a real financial cost — compared to English speakers. A 1,000-token English query might require 3,000-5,000 tokens to express the same content in a language with poor tokenization coverage. The encoding system designed to give all languages equal representation at the character level has been adopted into a computational framework that still privileges some languages over others.
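The disparity is easy to see at the byte level. A byte-level BPE that has learned few merges for a script spends roughly one token per UTF-8 byte, so byte counts approximate worst-case token counts (the Thai and Amharic words below are everyday greetings, chosen only for illustration):

```python
samples = {
    "English": "hello",
    "Thai": "สวัสดี",    # 6 code points, 3 UTF-8 bytes each
    "Amharic": "ሰላም",   # 3 code points, 3 UTF-8 bytes each
}

for language, word in samples.items():
    n_chars = len(word)                      # code points
    n_bytes = len(word.encode("utf-8"))      # worst-case byte-level tokens
    print(f"{language}: {n_chars} code points, {n_bytes} UTF-8 bytes")
```

Five bytes for the English word versus eighteen for the Thai one: before a single merge is learned, the Thai text already starts at more than a 3x token disadvantage.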
The Unicode properties attached to characters — script, category, bidirectional class, numeric value, combining class — are potentially useful for building better tokenizers and for guiding LLM reasoning about text. Some research groups have explored tokenizers that use Unicode character properties explicitly, creating token boundaries at Unicode word-break positions or treating all characters in a script as belonging to the same token class. These approaches show promise for improving multilingual performance but are not yet mainstream.
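A minimal sketch of a property-aware boundary rule: break wherever the major class of the Unicode general category changes. (A real implementation would use the UAX #29 word-break properties and the Script property; the standard-library unicodedata module exposes neither, so this is only a crude stand-in.)

```python
import unicodedata
from itertools import groupby

def segment(text: str) -> list[str]:
    # Group consecutive characters by the major class of their Unicode
    # general category: L (letter), N (number), Z (separator), P, S, M, C.
    return ["".join(group) for _, group in
            groupby(text, key=lambda c: unicodedata.category(c)[0])]

print(segment("abc123 αβγ"))  # ['abc', '123', ' ', 'αβγ']
```

Note the limitation: because the rule sees only the category, adjacent Latin and Greek letters would land in a single segment; telling them apart requires the Script property from the Unicode Character Database.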
A second AI-Unicode interaction involves character-level reasoning. LLMs are known to struggle with tasks requiring awareness of individual characters: counting the letters in a word, finding anagrams, identifying spelling errors character by character. This failure is directly attributable to tokenization: an LLM that represents the word "strawberry" as two or three tokens doesn't "know" that it contains three R's, because the character composition is not directly represented in its input. Byte-level or character-level models avoid this failure by tokenizing at finer granularity, at the cost of longer token sequences. The tradeoff between tokenization granularity and sequence length is an active research problem, and Unicode's character properties are part of the solution space.
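The tradeoff is visible in a toy comparison (the two-token split of "strawberry" is hypothetical, standing in for whatever an actual BPE vocabulary would produce):

```python
word = "strawberry"

coarse = ["straw", "berry"]        # hypothetical subword tokens: short
fine = list(word.encode("utf-8"))  # byte-level tokens: long but transparent

print(len(coarse), len(fine))      # 2 vs 10: the granularity tradeoff
print(word.count("r"))             # 3; trivial once characters are visible
```

A model that only ever sees the coarse tokens has no direct access to the fact computed on the last line; a byte-level model does, but pays five times the sequence length for it.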
What Remains: Unencoded Scripts and Characters
Despite 97,000+ CJK characters and scripts ranging from Phoenician to Zanabazar Square, significant writing systems remain unencoded or incompletely encoded in Unicode. The consortium's script encoding roadmap, reviewed by the Script Ad Hoc Group, tracks proposed scripts under consideration, and the proposals fall into several categories of cases.
Some scripts are abundantly attested but not yet understood well enough to encode. The Indus Valley script, used in the Bronze Age Indus Valley Civilization roughly 2600-1900 BCE, consists of several hundred distinct signs on thousands of seals and tablets — but it has not been deciphered, and there is scholarly debate about whether it is a full writing system, a partial writing system, or a proto-writing system. Until the Indus script is better understood, encoding it in Unicode would assign code points without knowing what they represent.
Some scripts are actively used by small communities and have been proposed but have not yet completed the review cycle. Certain characters used in Coptic dialectal and liturgical texts remain unencoded. Various regional scripts for African languages have been proposed but not yet standardized. The process of moving from "proposed" to "accepted" requires community documentation, scholarly review, and sometimes years of committee discussion.
Constructed scripts — scripts created by individuals for constructed languages or artistic purposes — occupy an ambiguous position. Tengwar and Cirth, J.R.R. Tolkien's fictional scripts for Elvish languages, are used by a substantial community of Tolkien enthusiasts and have been proposed multiple times. The Klingon pIqaD script, used in Star Trek media, has been proposed and rejected. These scripts are genuinely used for written communication by real communities. The question of whether Unicode should encode scripts for fictional or constructed languages — which by definition lack the "historical or current use in human communication" criterion — has been debated but not definitively resolved. Several constructed scripts, including Deseret and Shavian, have been encoded on the grounds that they were seriously proposed for natural-language writing and have small but real user communities.
The Consortium's Evolving Role
The Unicode Consortium, founded in 1991 with a specific technical mission, now governs a standard that affects cultural representation, language documentation, political recognition (through flags and script choices), and the economics of AI systems. Its governance structure — voting membership weighted by financial contribution, with full voting rights for large technology companies and non-voting roles for governments and civil society organizations — was designed for a technical standards body, not for a global cultural institution.
Pressure to broaden governance has been building. Linguistic communities whose scripts are being encoded have an obvious stake in encoding decisions but limited formal standing in the process. The emoji governance controversy has made visible how decisions about which expressions are included in the universal character standard can reinforce or challenge cultural hierarchies. Academic researchers who study writing systems and scripts are crucial contributors to the technical work but secondary to commercial members in formal governance.
The consortium has responded incrementally. The Unicode Script Ad Hoc Group includes academic experts. The Emoji Subcommittee has expanded its consultation processes. The consortium's leadership has acknowledged publicly that governance for a truly global standard should be more inclusive. But changing the governance structure of an organization whose decisions affect multi-billion-dollar product lines is slow and politically complex.
The encoding wars began in 1838, with Morse code. They ended, in the sense of the original conflict, in the late 1990s when UTF-8's adoption made Unicode's victory inevitable. What has followed is not peace but a different kind of ongoing negotiation — between languages and cultures and commercial interests and scholarly traditions, all fighting for representation in the 1,114,112-position space that Ken Thompson and Rob Pike's elegant encoding makes accessible to the world. The war is over. The work of deciding what it means to represent all of human writing continues, and will continue as long as humans write.