The Unicode Odyssey · Chapter 10
The Future of Unicode
Unicode continues to evolve with new characters, scripts, and emoji. This chapter looks at Unicode 16.0+, the emoji submission process, AI and Unicode, and the quest to encode the world's undeciphered scripts.
Unicode is not a finished project — it's a living standard maintained by an active consortium, released on a roughly annual schedule, and continuously grappling with challenges that didn't exist when it was created. The future of Unicode is shaped by the momentum of human communication, the political dynamics of script standardization, the explosive growth of emoji culture, the demands of artificial intelligence, and some genuinely hard unsolved problems. Here is where the standard is heading.
Unicode 16.0 and the Release Cadence
Unicode 16.0, released in September 2024, added approximately 5,185 new characters, bringing the total to around 155,000 assigned codepoints. The release included:
- Garay script: A script created in 1961 for the Wolof language of Senegal, representing a community actively using a 20th-century invented script
- Kirat Rai: A script used in Nepal
- Sunuwar: A script used for the Sunuwar language, spoken in Nepal and in Sikkim, India
- Todhri: A historical alphabet used to write Albanian, chiefly in the Elbasan region, from the 18th century
- Additional CJK Unified Ideographs: Continuing the expansion of CJK Extension blocks
- New emoji: Including several symbols reflecting modern culture and accessibility needs
The annual release cycle is a deliberate choice by the Unicode Consortium to balance the need for stability (software and fonts need time to implement new characters) with responsiveness to legitimate encoding requests. Proposals go through a multi-year process: initial submission, evaluation by the Script Encoding Initiative or emoji subcommittee, Unicode Technical Committee review, balloting, and finally publication.
The Emoji Explosion: A Story of Unintended Consequences
When Unicode 6.0 (2010) first standardized a set of emoji drawn from the Japanese carrier character sets of NTT DoCoMo, KDDI, and SoftBank, the Unicode Consortium couldn't have anticipated that emoji would become one of the most active areas of Unicode development — or one of its most politically fraught.
The emoji approval process, managed by the Unicode Emoji Subcommittee, receives hundreds of proposals annually. Acceptance criteria include:
- Frequency of use: Is there evidence the proposed emoji would be widely used?
- Distinctiveness: Is the emoji different enough from existing ones?
- Completeness: Does it fill a gap in a related set?
- Image compatibility: Can it be rendered clearly at small sizes?
Critics argue the criteria have been applied inconsistently, resulting in some emoji being approved (🫙 jar, 🫶 heart hands) while others with strong use cases are rejected. The process is also slow — proposals can take three or more years from submission to standard.
The ZWJ sequence mechanism has become the primary path for expanding emoji without assigning new codepoints: existing base emoji combined with ZWJ create new semantic entities. Gender variations (👩💻 woman technologist), skin tones (five modifier codepoints), and family compositions all use this mechanism. It's clever engineering, but it means that emoji "characters" are increasingly sequences rather than single codepoints — complicating counting, display, and processing.
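The mechanics are easy to demonstrate. Here is a minimal Python sketch that assembles the woman-technologist sequence from its parts (whether it renders as a single glyph depends on the font and platform):

```python
# Build the "woman technologist" emoji as a ZWJ sequence.
ZWJ = "\u200D"                 # ZERO WIDTH JOINER
woman = "\U0001F469"           # U+1F469 WOMAN
laptop = "\U0001F4BB"          # U+1F4BB PERSONAL COMPUTER

woman_technologist = woman + ZWJ + laptop

# One user-perceived emoji, but three codepoints under the hood.
print(len(woman_technologist))   # 3
```

This is exactly why naive length counting breaks for modern emoji: the "character" a user sees is a sequence, and only grapheme-cluster-aware segmentation (UAX #29) counts it as one.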
A harder question looms: emoji have grown to over 3,600 distinct entities (counting all variations). At what point does emoji growth become a maintenance burden exceeding its benefit? The Unicode Consortium has signaled it will slow emoji additions, focusing on completeness in existing sets rather than perpetual expansion.
New Scripts: The Ongoing Documentation Project
The Unicode Standard's most valuable long-term contribution may be its role in preserving scripts that would otherwise be lost to digitization gaps. Several categories of scripts continue to be added:
Minority and endangered language scripts: Many languages spoken by small communities have traditional writing systems that have been excluded from computing infrastructure, effectively limiting digital literacy in those communities. The Script Encoding Initiative (SEI), funded by Google and other organizations, works with communities worldwide to document and propose such scripts.
Historical scripts: Archaeologists continue to discover or re-examine ancient inscriptions. Proto-Elamite, Maya hieroglyphs, and other systems are at various stages of proposal and study. Rongorongo (Easter Island) remains undeciphered and Proto-Sinaitic only partially understood, making encoding impractical — you can't confidently encode characters whose identity and repertoire are unknown.
Modern invented scripts: Some scripts created in the 19th, 20th, or 21st century have achieved sufficient usage to qualify for encoding. The Cherokee syllabary (encoded in Unicode 3.0) and the Deseret alphabet are examples; proposals for others appear periodically.
CJK unification ongoing: The debate over Han unification — whether visually similar characters in Chinese, Japanese, and Korean writing traditions should share codepoints — is not resolved and may never be fully settled. Each Unicode release adds more CJK characters to resolve specific gaps identified by scholars and publishers.
Unicode and Artificial Intelligence
The relationship between Unicode and AI is complex and increasingly important:
Tokenization: Large language models (LLMs) operate on tokens, not characters. Most modern tokenizers (BPE — Byte Pair Encoding, WordPiece, SentencePiece) work at the UTF-8 byte or subword level. The interaction between Unicode normalization and tokenization is non-trivial: the same word in different normalizations may produce different token sequences, affecting model behavior.
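A small illustration of why this matters, using only the standard library's unicodedata module — the same word in NFC and NFD produces different UTF-8 byte sequences, so a byte-level tokenizer sees different inputs:

```python
import unicodedata

word = "caf\u00e9"   # "café" with precomposed é (U+00E9)

nfc = unicodedata.normalize("NFC", word)   # é as one codepoint
nfd = unicodedata.normalize("NFD", word)   # e + combining acute (U+0301)

print(nfc == nfd)             # False: different codepoint sequences
print(nfc.encode("utf-8"))    # b'caf\xc3\xa9'  (4 codepoints, 5 bytes)
print(nfd.encode("utf-8"))    # b'cafe\xcc\x81' (5 codepoints, 6 bytes)
```

A model that never normalizes its input may learn two unrelated token sequences for what is, to every reader, the same word.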
Multilingual models: Training models on multilingual corpora requires handling the full Unicode range correctly. Models trained predominantly on English data have well-documented weaknesses for scripts with smaller representation in training data — a reflection of the imbalance in digital text across languages.
Character-level models: Models that operate at the character level must handle the gap between codepoints and grapheme clusters correctly, or they learn incorrect character boundaries.
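As a rough illustration of that gap, here is a deliberately simplified Python sketch that approximates grapheme boundaries using combining classes alone — real UAX #29 segmentation also handles ZWJ sequences, Hangul jamo, regional indicators, and more:

```python
import unicodedata

def count_graphemes_simplified(s: str) -> int:
    # Simplified approximation: a new cluster starts at every codepoint
    # with canonical combining class 0. This is NOT full UAX #29 --
    # it ignores ZWJ, Hangul conjoining jamo, regional indicators, etc.
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

nfd = unicodedata.normalize("NFD", "caf\u00e9")   # "café" decomposed

print(len(nfd))                         # 5 codepoints: c a f e + U+0301
print(count_graphemes_simplified(nfd))  # 4 user-perceived characters
```

A character-level model trained on raw codepoints would treat the combining acute as a "character" in its own right — exactly the boundary error described above.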
Unicode and hallucination: LLMs have been observed to hallucinate characters — generating sequences that are not valid Unicode, or generating characters that look plausible but don't exist. Understanding Unicode's structure helps diagnose these failures.
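A few of these failure modes can be reproduced directly in Python, which enforces the standard's limits:

```python
import unicodedata

# 1. Codepoints beyond U+10FFFF do not exist.
try:
    chr(0x110000)
except ValueError:
    print("no codepoint above U+10FFFF")

# 2. Surrogate codepoints (U+D800..U+DFFF) are malformed in UTF-8.
try:
    b"\xed\xa0\x80".decode("utf-8")   # byte pattern of lone surrogate U+D800
except UnicodeDecodeError:
    print("lone surrogates are invalid UTF-8")

# 3. Unassigned codepoints decode fine but have no character name (yet).
print(unicodedata.name("\U0003FFFD", "<unassigned>"))
```

Generated text that trips any of these checks is, by definition, not valid Unicode — a cheap diagnostic for the hallucination cases above.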
CLDR's role in AI: The Unicode Common Locale Data Repository (CLDR) provides locale-specific data for pluralization rules, number formats, date formats, and other language conventions. As LLMs are increasingly used for localization tasks, CLDR becomes relevant infrastructure for ensuring model outputs conform to locale conventions.
The CLDR: Unicode's Underappreciated Twin
The Common Locale Data Repository is a Unicode project distinct from the character standard itself, but essential to i18n (internationalization). CLDR contains:
- Locale data for 900+ languages and regions: number formats, date formats, currency formats, sort orders, plural rules
- Language matching: Which locales are close enough to serve as fallbacks for each other?
- Likely subtags: Given a language code, what script and region are most likely?
- Transliteration rules: How to convert between scripts (Cyrillic to Latin, Arabic to Latin)
- Emoji annotation data: Human-readable names and keywords for emoji in 80+ languages
Every major operating system, browser, and programming framework uses CLDR data. When your browser formats a date as "February 25, 2026" for an English user and "25 de febrero de 2026" for a Spanish user, it's CLDR providing the format strings and the translated month names.
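To give a flavor of the data CLDR carries, here is a hand-written Python sketch of two of its cardinal plural rules (English and Russian, integers only). Real applications should consume CLDR data through ICU or a similar library rather than re-implement the rules:

```python
def plural_en(n: int) -> str:
    # English: "one" for exactly 1, "other" for everything else.
    return "one" if n == 1 else "other"

def plural_ru(n: int) -> str:
    # Russian (integers): "one" for 1, 21, 31, ... (but not 11);
    # "few" for 2-4, 22-24, ... (but not 12-14); "many" otherwise.
    if n % 10 == 1 and n % 100 != 11:
        return "one"
    if n % 10 in (2, 3, 4) and n % 100 not in (12, 13, 14):
        return "few"
    return "many"

print(plural_ru(1), plural_ru(3), plural_ru(5), plural_ru(21))
# one few many one
```

Two plural categories suffice for English; Russian needs at least three — which is precisely why this data lives in a shared repository instead of being re-guessed by every application.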
ICU: The Implementation Layer
The International Components for Unicode library is the reference implementation of Unicode algorithms and CLDR data, available for C/C++, Java, and (through wrappers) many other languages. ICU implements:
- Collation (locale-aware string sorting)
- Date, time, and number formatting
- Text segmentation (grapheme clusters, words, sentences)
- Normalization
- BiDi algorithm
- Transliteration
- Regular expressions with full Unicode property support
ICU is maintained by the Unicode Consortium and forms the foundation of Unicode support in Android, macOS/iOS, many Linux distributions, Google Chrome, and countless enterprise applications. Its correctness is essential — a bug in ICU propagates to billions of devices.
Capacity: Are We Running Out of Space?
With ~155,000 characters assigned out of 1,114,112 possible codepoints, roughly 960,000 positions remain unallocated — about 820,000 once surrogates and the private-use areas are set aside. At the current pace of allocation (roughly 5,000–10,000 characters per year, and likely slowing), the codepoint space will not be exhausted for centuries.
However, the BMP (Basic Multilingual Plane, U+0000–U+FFFF) is getting crowded. Only about 6,000 positions remain unassigned in the BMP after accounting for surrogates and reserved ranges. Future characters will increasingly go into supplementary planes, which means increasing reliance on surrogate pairs in UTF-16 systems — an ongoing source of software bugs in Java and JavaScript environments.
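The surrogate-pair mechanics are easy to see from Python, which exposes UTF-16 as just another codec:

```python
# U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP, so UTF-16
# must represent it as a surrogate pair.
clef = "\U0001D11E"

utf16 = clef.encode("utf-16-be")
print(utf16.hex())       # d834dd1e: high surrogate D834 + low surrogate DD1E
print(len(utf16) // 2)   # 2 UTF-16 code units -- what Java's and
                         # JavaScript's String.length report here
print(len(clef))         # 1 codepoint in Python
```

Code that slices or indexes UTF-16 strings by code unit can split such a pair in half — the classic supplementary-plane bug.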
There is no serious proposal to extend Unicode beyond U+10FFFF. The current limit is large enough for any foreseeable need, and the cost of breaking UTF-16 compatibility would be enormous.
The Persistent Challenges
Some problems with Unicode have no clean solution on the horizon:
The emoji diversity problem: As emoji add skin tone modifiers, gender sequences, and disability sequences, the combinatorial explosion of valid emoji sequences grows. Tools that need to enumerate or display all emoji face a moving target.
Font rendering complexity: As more complex scripts are encoded (with contextual forms, conjuncts, bidirectional behavior), the shaping requirements on rendering engines grow. HarfBuzz, the dominant open-source text shaping engine, continuously adds script support, but coverage lags behind character encoding.
Machine-readable semantics: Unicode tells you what a character is, but not what it means in context. The word "apple" means the same thing in the encoded sense whether you're discussing fruit or technology — but applications that need semantic disambiguation get no help from Unicode.
The digitization gap: Despite 30+ years of work, a significant fraction of the world's writing systems remain poorly supported in digital typography. A language can have Unicode encoding but no good fonts, no spell-checkers, no keyboards — making digital literacy inaccessible to its community even in theory.
What Stays Constant
Amid the evolution, Unicode's core commitments remain stable:
- Codepoints never move: Once assigned, a character's codepoint is permanent. No character will be reassigned to a different meaning. This stability is the foundation that makes long-term data archival possible.
- Backward compatibility: New versions of Unicode add characters and properties but do not change the meaning of previously assigned characters.
- The encoding triarchy: UTF-8, UTF-16, and UTF-32 will remain the three encodings, with UTF-8 continuing to dominate for file storage and network transmission.
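The trade-offs among the three encodings are visible in a few lines of Python:

```python
# One string mixing ASCII, BMP CJK, and a supplementary-plane emoji.
s = "Hi, \u4e16\u754c \U0001F44B"   # "Hi, 世界 👋"

print(len(s))                       # 8 codepoints
print(len(s.encode("utf-8")))       # 15 bytes: 1 per ASCII, 3 per CJK, 4 for emoji
print(len(s.encode("utf-16-be")))   # 18 bytes: 2 per BMP char, 4 for the emoji
print(len(s.encode("utf-32-be")))   # 32 bytes: a flat 4 per codepoint
```

UTF-8's dominance on the wire follows directly from numbers like these: it is the most compact for the markup-heavy, ASCII-leaning text that dominates storage and transmission, while UTF-32 buys fixed width at a steep size cost.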
More than three decades after version 1.0, Unicode is more active than ever. The project of encoding human language in a universal, interoperable standard is far more complex than its founders anticipated — and far more successful than anyone had a right to expect. The journey through Unicode that this series has traced — from the chaos of code pages, through the architecture of codepoints and planes, through encoding algorithms and grapheme clusters and normalization and security — is ultimately a journey through the effort to represent human thought in a medium machines can share. That effort is, by its nature, never complete.