The Unicode Odyssey · Chapter 5

The World's Writing Systems in Unicode

From Latin to CJK, from right-to-left Arabic to vertical Mongolian — Unicode encodes the world's writing systems. This chapter surveys the diversity of scripts and the technical challenges each presents.


One of Unicode's most extraordinary achievements is that it encodes writing systems spanning five millennia of human civilization — from ancient Sumerian cuneiform to recently added emoji. Encoding that diversity required not just collecting characters but understanding the fundamentally different structures of the world's writing systems. A system designed only for alphabets would fail for syllabaries. A system designed only for left-to-right scripts would break for Arabic. Unicode had to generalize across all of them.

The Typology of Writing Systems

Linguists classify writing systems by what their symbols represent:

Alphabets represent individual phonemes (consonants and vowels). One symbol ≈ one sound. Latin, Greek, Cyrillic, Armenian, Georgian, and others fall into this category. The Latin script alone (with its many extended variants) covers most of Europe, the Americas, and large parts of Africa and Oceania.

Abjads represent only consonants; vowels are implied or written as optional diacritics. Arabic, Hebrew, Phoenician, and Aramaic are abjads. A reader of Arabic already knows the vowels from context; learners and formal texts use small diacritical marks (harakat) to make vowels explicit.

Abugidas (also called alphasyllabaries) use consonant letters as the base, with vowels written as mandatory modifications to the consonant. Devanagari (used for Hindi, Sanskrit, Nepali), Thai, Tibetan, and most South and Southeast Asian scripts are abugidas. The inherent vowel of each consonant letter can be changed or suppressed by attaching vowel signs.

Syllabaries use one symbol per syllable. Japanese kana (hiragana and katakana) are syllabaries, with 46 base symbols covering all Japanese syllables. Hangul, often described as a syllabary, is more accurately an alphabetic syllabary — each syllable block is assembled from individual phoneme letters (jamo) arranged in a fixed spatial pattern.

Logographies use symbols representing morphemes or words rather than sounds. Chinese characters (hànzì), inherited by Japanese (kanji) and historically by Korean and Vietnamese, are logographic — though in practice all these systems mix logographic elements with phonetic components.

Featural systems encode phonetic features directly in the shape of symbols. Korean hangul is the most prominent example: voiced consonants are visually related to their unvoiced counterparts, and consonant shapes systematically reflect articulatory position.

How Unicode Handles Alphabets

The Latin script blocks (U+0000–U+007F for Basic Latin; U+0080–U+024F spanning the Latin-1 Supplement and Latin Extended-A/B blocks) contain the characters needed for virtually all Latin-alphabet languages: accented vowels, consonants with cedillas and carons, and specialized letters used in phonetic transcription, African languages, and archaic European writing.

Unicode uses a per-script block approach: each script gets its own contiguous range of codepoints. Cyrillic occupies U+0400–U+04FF (and extensions at U+0500–U+052F, U+1C80–U+1C8F). Greek lives at U+0370–U+03FF. This organization makes it easy to identify which script a character belongs to and to subset fonts by script.
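Python's standard library does not expose the formal Script property (defined in UAX #24), but as a rough sketch, a character's script can often be inferred from the prefix of its Unicode name. The function name `rough_script` is illustrative, not a standard API:

```python
import unicodedata

def rough_script(ch: str) -> str:
    """Approximate a character's script from its Unicode name prefix.

    This is a workable proxy for many scripts, not a substitute for
    the real Script property (UAX #24).
    """
    name = unicodedata.name(ch, "UNKNOWN")
    return name.split(" ")[0]

print(rough_script("A"))   # LATIN
print(rough_script("Я"))   # CYRILLIC
print(rough_script("Ω"))   # GREEK
print(rough_script("क"))   # DEVANAGARI
```

Libraries that need the true Script property (e.g. for font subsetting) typically consult the Unicode Character Database's Scripts.txt rather than name strings.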

For alphabets, Unicode generally assigns separate codepoints for uppercase and lowercase variants. Case conversion (uppercase, lowercase, titlecase) is handled through Unicode properties — and it's more complex than a simple offset, since some languages have multi-character case equivalents (the German sharp S ß uppercases to SS by default, or to the capital ẞ, U+1E9E, which German orthography has officially permitted since 2017).
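The ß case illustrates why case conversion is a property lookup rather than arithmetic. A quick demonstration in Python, whose string methods follow the Unicode case mappings:

```python
# Uppercasing can change string length: ß maps to the two-letter SS.
word = "straße"
print(word.upper())                   # STRASSE
print(len(word), len(word.upper()))   # 6 7

# Titlecase is a third, distinct mapping for some characters:
# the digraph ǆ (U+01C6) titlecases to ǅ (U+01C5), not to uppercase Ǆ.
print("ǆ".title())                    # ǅ
```

Because length can change, code that assumes `upper()` preserves string length (for example, fixed-size buffers) is subtly wrong for Unicode text.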

Arabic and Hebrew: Right-to-Left Scripts

Arabic (U+0600–U+06FF) and Hebrew (U+0590–U+05FF) present a fundamental challenge: they are written right-to-left, while Latin is left-to-right. Unicode doesn't reorder characters in memory to accommodate reading direction — text is always stored in logical order (the order you'd encounter characters when reading), and the rendering system is responsible for visual reordering.

Arabic introduces an additional complexity: most Arabic letters have up to four different shapes (isolated, initial, medial, final) depending on their position within a word. Unicode assigns each letter a single codepoint for its identity, and the OpenType font system uses contextual substitution rules to select the correct glyph. This separation of character identity from visual form is a core Unicode principle.
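The separation of identity from form is visible in the legacy "presentation form" blocks, which encode the positional shapes as compatibility characters for round-tripping with old encodings. NFKC normalization folds them back to the base letter — a small sketch:

```python
import unicodedata

# ARABIC LETTER BEH is one codepoint, U+0628, whatever its position.
# U+FE90 is the legacy compatibility codepoint for its final form.
beh_final = "\uFE90"
print(unicodedata.name(beh_final))    # ARABIC LETTER BEH FINAL FORM

# NFKC replaces the presentation form with the base letter:
print(unicodedata.normalize("NFKC", beh_final) == "\u0628")   # True
```

New text should always use the base codepoints and let the font's contextual rules pick the shape; the presentation forms exist only for compatibility.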

Arabic also has combining diacritical marks — the harakat vowel marks — that float above or below consonant letters. These are separate combining codepoints, just like Latin diacritics.
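The harakat carry the same machine-readable properties as other combining marks. A brief check with Python's `unicodedata`:

```python
import unicodedata

fatha = "\u064E"   # ARABIC FATHA, the short /a/ vowel mark
print(unicodedata.category(fatha))    # Mn (nonspacing mark)
print(unicodedata.combining(fatha))   # nonzero canonical combining class
```

A nonzero combining class means the mark attaches to the preceding base character and participates in canonical ordering during normalization, exactly as a Latin acute accent would.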

The Unicode Bidirectional Algorithm (UAX #9) handles text that mixes Latin (LTR) and Arabic (RTL). It takes a sequence of characters with their bidi properties (L = strongly left-to-right, R = strongly right-to-left, AN = Arabic number, EN = European number, etc.) and computes the visual display order. Getting bidi right is notoriously difficult; getting it wrong creates text that reads incorrectly or can be exploited for attacks (covered in the security chapter).
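The bidi classes mentioned above are queryable from Python's `unicodedata` module; the full reordering algorithm is much more involved, but inspecting the per-character input to it is a one-liner:

```python
import unicodedata

# Bidi_Class values (UAX #9) for a mix of scripts and digit systems:
for ch in ["A", "א", "ا", "1", "٣"]:
    print(ch, unicodedata.bidirectional(ch))
# A L    (strong left-to-right)
# א R    (strong right-to-left, Hebrew)
# ا AL   (Arabic letter)
# 1 EN   (European number)
# ٣ AN   (Arabic number)
```

Note that European and Arabic digits get distinct classes (EN vs. AN) precisely so the algorithm can place runs of numbers correctly inside RTL text.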

Devanagari: The Complexity of Abugidas

Devanagari (U+0900–U+097F) is used for Hindi, Sanskrit, Marathi, Nepali, and many other South Asian languages. It's an excellent example of abugida complexity.

Each consonant in Devanagari has an inherent vowel /a/. To write a consonant with a different vowel, a vowel sign (matra) is attached. To write a consonant with no following vowel, a virama (U+094D, the halant) is added. When two consonants occur together without an intervening vowel, they typically form a conjunct — a fused or modified form of both consonants.

For example, the syllable "kta" (क + virama + त + vowel a) can render as a conjunct form where k and t merge into a single visual shape. Unicode stores the logical sequence (k, virama, t) and relies on OpenType shaping engines (like HarfBuzz) to produce the correct conjunct rendering.
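The gap between stored codepoints and rendered glyphs is easy to see by listing the sequence. A small sketch:

```python
import unicodedata

# "kta": KA + VIRAMA + TA, stored as three codepoints in logical order.
kta = "\u0915\u094D\u0924"   # क ् त
print(len(kta))              # 3 codepoints
for ch in kta:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+0915 DEVANAGARI LETTER KA
# U+094D DEVANAGARI SIGN VIRAMA
# U+0924 DEVANAGARI LETTER TA
```

A shaping engine such as HarfBuzz consumes this three-codepoint sequence and may emit a single conjunct glyph; the underlying text is never modified, so search and comparison still operate on the logical sequence.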

This means that the visual rendering of Devanagari text can look dramatically different from the byte sequence, and that accurate text rendering requires a sophisticated shaping engine — not just font rendering.

Hangul: Algorithmic Composition

Korean Hangul takes a unique approach in Unicode. The Unicode Standard encodes all 11,172 precomposed Hangul syllable blocks in the range U+AC00–U+D7A3 — but it also encodes the individual jamo (consonant and vowel components) separately.

The syllable structure is (onset consonant) + (vowel) + (optional coda consonant), with the following inventories:

  • 19 onset consonants (choseong)
  • 21 vowels (jungseong)
  • 28 coda positions (jongseong, including the "no coda" case)

This gives 19 × 21 × 28 = 11,172 possible syllable blocks, exactly matching what's in U+AC00–U+D7A3. Moreover, the position of any syllable block can be calculated algorithmically:

syllable_index = (choseong_index * 21 + jungseong_index) * 28 + jongseong_index
codepoint = U+AC00 + syllable_index

This makes Hangul decomposition and composition purely computational — no lookup table needed. It's one of the most elegant pieces of Unicode's architecture.
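The formula above can be sketched directly, with Unicode's own NFD decomposition as a cross-check (constant names here are illustrative):

```python
import unicodedata

S_BASE, V_COUNT, T_COUNT = 0xAC00, 21, 28

def compose(l: int, v: int, t: int = 0) -> str:
    """Compose choseong/jungseong/jongseong indices into a syllable."""
    return chr(S_BASE + (l * V_COUNT + v) * T_COUNT + t)

def decompose(syllable: str) -> tuple[int, int, int]:
    """Recover (choseong, jungseong, jongseong) indices from a syllable."""
    index = ord(syllable) - S_BASE
    return (index // (V_COUNT * T_COUNT),
            (index // T_COUNT) % V_COUNT,
            index % T_COUNT)

# 한 = HIEUH (onset index 18) + A (vowel index 0) + NIEUN coda (index 4)
print(compose(18, 0, 4))    # 한  (U+D55C)
print(decompose("한"))      # (18, 0, 4)

# Unicode's canonical decomposition yields the same three jamo:
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", "한")])
# ['U+1112', 'U+1161', 'U+11AB']
```

This is why Hangul composition and decomposition are built directly into the normalization algorithms (NFC/NFD) as arithmetic rather than table lookups.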

CJK: The Unification Controversy

The CJK (Chinese-Japanese-Korean) Unified Ideographs block (U+4E00–U+9FFF, with large extensions) contains over 92,000 characters shared across Chinese, Japanese, and Korean. The "unification" refers to the controversial decision to assign a single Unicode codepoint to what are considered the "same" character across these three writing systems, even when the conventional written forms in each country differ slightly.

For example, the character meaning "nation" (国/國) has a simplified mainland Chinese form (国) and a traditional Chinese/Japanese/Korean form (國) that were assigned separate codepoints because they differ significantly. But many characters with smaller typographic differences between countries share a single codepoint, and the appropriate glyph is determined by the locale and the font.
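The two forms of "nation" really are distinct codepoints, which a quick check confirms:

```python
# 国 (simplified) and 國 (traditional) differ enough to be encoded
# separately; each has its own codepoint.
print(f"U+{ord('国'):04X}")   # U+56FD
print(f"U+{ord('國'):04X}")   # U+570B
# By contrast, many smaller regional glyph differences share one
# codepoint, and the displayed form depends on font and locale.
```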

This design decision — the Han Unification — remains controversial. Many Japanese and Chinese typographers and linguists argue that the "same" character in different writing traditions are culturally distinct entities deserving separate codepoints. The Unicode Consortium argues that unified encoding is linguistically correct and practically necessary, given that separate encoding would require tens of thousands of additional codepoints.

The practical consequence: high-quality CJK text rendering requires locale-appropriate fonts. Japanese text in a Chinese font, or vice versa, may display characters in visually foreign forms even if technically correct.

Writing Direction: A Property, Not a Layout

Unicode supports four writing directions:

  • Left to right (LTR): Latin, Cyrillic, Devanagari
  • Right to left (RTL): Arabic, Hebrew
  • Top to bottom, columns right to left: traditional Chinese/Japanese/Korean in vertical mode
  • Top to bottom, columns left to right: Mongolian script

Direction is not stored in the text data — it's determined by the script's bidi properties and by the rendering environment. OpenType fonts and CSS writing-mode properties handle vertical text layout.

Scripts Currently in Unicode

As of Unicode 16.0, the standard includes 168 scripts. A sampling:

Ancient/Historic: Cuneiform, Egyptian Hieroglyphs, Linear A, Linear B, Phoenician, Old Turkic, Runic, Gothic, Glagolitic, Coptic

Modern but less widely known: N'Ko, Tifinagh (Berber), Cherokee (famously invented by a single person, Sequoyah, in the early 19th century), Vai, Bamum, Mandaic

Actively being added: In each Unicode release, new scripts are added — typically lesser-documented minority languages, historical scripts from recent archaeological finds, or scripts that were previously omitted due to complexity or controversy.

What's Not in Unicode (Yet)

Despite its comprehensiveness, Unicode still lacks some scripts:

  • Rongorongo (Easter Island): Not yet deciphered, cannot be properly encoded
  • Proto-Sinaitic and Proto-Canaanite: Early Semitic scripts under scholarly debate
  • Khitan Large Script: Partially deciphered; encoding is in progress
  • Various signary systems: Some scripts used in limited contexts lack sufficient documentation for standardization

The process of adding a new script requires formal proposals, scholarly consensus on character identity, and Unicode Technical Committee approval — typically taking several years from proposal to inclusion.

The breadth of Unicode's script coverage is a testament to the global collaboration among linguists, typographers, software engineers, and community representatives that has sustained the Unicode project for over three decades. Each script block represents not just a technical specification but a commitment to preserving human linguistic heritage in the digital medium.