The Unicode Odyssey · Chapter 2

The Solution: How Unicode Works

Unicode assigns a unique number — a code point — to every character in every language. This chapter explains the elegant structure of planes, blocks, and the Basic Multilingual Plane.

~4,000 words · ~16 min read

If you've ever wondered how a single standard manages to contain Latin letters, Arabic script, ancient cuneiform, musical notation, alchemical symbols, and every emoji ever approved by a committee of linguists and tech company representatives — the answer lies in Unicode's elegant core architecture. Understanding that architecture turns Unicode from a mysterious black box into a comprehensible (if vast) system.

The Codepoint: Unicode's Fundamental Unit

The foundational concept in Unicode is the codepoint — an integer that serves as a unique identifier for a specific character or element. Every character in Unicode has exactly one codepoint (with some important nuances we'll address shortly), and every codepoint is written in the conventional format U+XXXX, where XXXX is the hexadecimal representation of the integer.

Some examples:

| Character | Codepoint | Name |
| --- | --- | --- |
| A | U+0041 | LATIN CAPITAL LETTER A |
| é | U+00E9 | LATIN SMALL LETTER E WITH ACUTE |
| € | U+20AC | EURO SIGN |
| 中 | U+4E2D | CJK UNIFIED IDEOGRAPH-4E2D |
| 😀 | U+1F600 | GRINNING FACE |
| 𓂀 | U+13080 | EGYPTIAN HIEROGLYPH D010 |

Notice that codepoints aren't all four digits: the last two examples use five hex digits each, and the notation allows up to six (the highest codepoint is U+10FFFF). That's because Unicode's codepoint space extends far beyond what four hex digits can represent.
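The U+XXXX notation can be reproduced in a few lines of Python. This is a minimal sketch; the helper names `format_codepoint` and `parse_codepoint` are made up for illustration, while `ord` and `chr` are standard built-ins.

```python
def format_codepoint(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character."""
    # ord() yields the integer codepoint; pad to at least four hex digits,
    # letting longer values grow to five or six digits as needed.
    return f"U+{ord(ch):04X}"


def parse_codepoint(label: str) -> str:
    """Turn a label such as 'U+1F600' back into the character it names."""
    if not label.upper().startswith("U+"):
        raise ValueError(f"not a codepoint label: {label!r}")
    return chr(int(label[2:], 16))


for ch in ["A", "\u00e9", "\u20ac", "\u4e2d", "\U0001F600"]:
    print(ch, format_codepoint(ch))
```

Padding to four digits with `:04X` matches the convention in the table above: short values get leading zeros, longer ones are printed in full.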

The Codepoint Space: 1,114,112 Positions

Unicode defines a codepoint space from U+0000 to U+10FFFF, a total of 1,114,112 positions (that's 0x110000 in hex). Not all of these are assigned to characters; the Unicode Consortium has been deliberately conservative about assignments, leaving large regions unassigned to allow for future use. As of Unicode 16.0, approximately 155,000 characters are assigned, leaving over 950,000 positions without an assigned character. Roughly 820,000 of those are genuinely open for future assignment; the rest are set aside as private-use, surrogate, or noncharacter codepoints.

The upper limit of U+10FFFF wasn't chosen arbitrarily. It's the maximum value that can be encoded using the UTF-16 surrogate pair mechanism (covered in depth in the next chapter): 1,024 high surrogates × 1,024 low surrogates address 1,048,576 supplementary positions, which together with the 65,536 basic positions total exactly 1,114,112.
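The arithmetic behind that limit can be checked directly. The sketch below assumes the standard split of the surrogate range into high surrogates (U+D800–U+DBFF) and low surrogates (U+DC00–U+DFFF), which this chapter does not spell out; the constant names are illustrative.

```python
import sys

BMP_SIZE = 0x10000                       # 65,536 basic positions
HIGH_SURROGATES = 0xDBFF - 0xD800 + 1    # 1,024 possible first halves
LOW_SURROGATES = 0xDFFF - 0xDC00 + 1     # 1,024 possible second halves

# Every (high, low) pair addresses one supplementary codepoint.
SUPPLEMENTARY = HIGH_SURROGATES * LOW_SURROGATES

TOTAL = BMP_SIZE + SUPPLEMENTARY
MAX_CODEPOINT = TOTAL - 1

print(f"{SUPPLEMENTARY:,} + {BMP_SIZE:,} = {TOTAL:,}")
print(f"highest codepoint: U+{MAX_CODEPOINT:X}")
# Python exposes the same ceiling as sys.maxunicode.
print(sys.maxunicode == MAX_CODEPOINT)
```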

Planes: Organizing 1.1 Million Codepoints

The codepoint space is divided into 17 planes, each containing 65,536 (0x10000) codepoints:

| Plane | Range | Name | Contents |
| --- | --- | --- | --- |
| 0 | U+0000–U+FFFF | Basic Multilingual Plane (BMP) | Most modern scripts, common symbols |
| 1 | U+10000–U+1FFFF | Supplementary Multilingual Plane (SMP) | Historic scripts, musical notation, emoji |
| 2 | U+20000–U+2FFFF | Supplementary Ideographic Plane (SIP) | Rare CJK unified ideographs |
| 3 | U+30000–U+3FFFF | Tertiary Ideographic Plane (TIP) | Very rare/archaic CJK |
| 4–13 | U+40000–U+DFFFF | Unassigned | Reserved for future use |
| 14 | U+E0000–U+EFFFF | Supplementary Special-purpose Plane | Language tags, variation selectors |
| 15–16 | U+F0000–U+10FFFF | Private Use Areas (PUA) | User/vendor-defined characters |
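Because each plane holds exactly 0x10000 codepoints, the plane number is simply the codepoint with its low 16 bits dropped. A minimal sketch follows; the function names are hypothetical.

```python
def plane_of(cp: int) -> int:
    """Return the plane number (0-16) that a codepoint falls in."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"outside the Unicode codepoint space: {cp:#x}")
    # Each plane spans 0x10000 positions, so shifting right by 16 bits
    # is the same as integer division by 65,536.
    return cp >> 16


def offset_in_plane(cp: int) -> int:
    """Return the position of a codepoint within its plane."""
    return cp & 0xFFFF


for ch in ["A", "\u4e2d", "\U0001F600", "\U00013080"]:
    cp = ord(ch)
    print(f"U+{cp:04X} is in plane {plane_of(cp)} at offset {offset_in_plane(cp):#06x}")
```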

The Basic Multilingual Plane (BMP)

The BMP is where the action is. It contains essentially all characters needed for modern text processing: Latin, Greek, Cyrillic, Arabic, Hebrew, Devanagari, the most commonly used CJK ideographs, and dozens of other modern scripts. When Unicode was designed, the BMP was intended to contain all necessary characters; the supplementary planes were an insurance policy.

That insurance policy turned out to be necessary. Rare Chinese, Japanese, and Korean ideographs exceeded what the BMP could hold, and emoji (which nobody anticipated in the 1980s) ended up largely in the SMP.

The Supplementary Multilingual Plane (SMP)

The SMP is a fascinating museum of human writing and notation:

  • Linear B Syllabary (U+10000–U+1007F): The syllabic script used to write Mycenaean Greek in administrative records, centuries before the Greek alphabet appeared
  • Egyptian Hieroglyphs (U+13000–U+1342F): Over 1,000 hieroglyphs
  • Musical Notation (U+1D100–U+1D1FF): Western staff notation symbols
  • Mathematical Alphanumeric Symbols (U+1D400–U+1D7FF): Mathematical bold, italic, script, Fraktur, double-struck variants of Latin and Greek letters
  • Emoji (scattered through U+1F300–U+1FAFF and beyond): Faces, animals, objects, flags, and more

Blocks: Neighborhoods Within Planes

Within planes, codepoints are organized into blocks — contiguous ranges assigned to a particular script or category. The Unicode Standard defines over 300 named blocks. Some examples:

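| Block | Range | Size |
| --- | --- | --- |
| Basic Latin | U+0000–U+007F | 128 |
| Latin-1 Supplement | U+0080–U+00FF | 128 |
| Latin Extended-A | U+0100–U+017F | 128 |
| Arabic | U+0600–U+06FF | 256 |
| CJK Unified Ideographs | U+4E00–U+9FFF | 20,992 |
| Hangul Syllables | U+AC00–U+D7AF | 11,184 (11,172 syllables assigned, U+AC00–U+D7A3) |
| Emoticons | U+1F600–U+1F64F | 80 |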

Blocks are a navigational convenience, not a semantic constraint. A character's behavior is determined by its properties (described below), not by which block it falls in.
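Since blocks are contiguous and non-overlapping, finding a character's block is a sorted-range lookup. The sketch below hard-codes a handful of blocks from the table above rather than loading the full block list, so `SAMPLE_BLOCKS` and `block_of` are illustrative only; Python's standard library does not expose block names.

```python
from bisect import bisect_right
from typing import Optional

# (start, end, name) for a few blocks, sorted by start codepoint.
SAMPLE_BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0600, 0x06FF, "Arabic"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x1F600, 0x1F64F, "Emoticons"),
]
_STARTS = [start for start, _, _ in SAMPLE_BLOCKS]


def block_of(ch: str) -> Optional[str]:
    """Return the block name for ch, or None if it is not in the sample table."""
    cp = ord(ch)
    # Find the last block whose start is <= cp, then check its upper bound.
    i = bisect_right(_STARTS, cp) - 1
    if i >= 0:
        start, end, name = SAMPLE_BLOCKS[i]
        if cp <= end:
            return name
    return None


def block_size(start: int, end: int) -> int:
    """Number of codepoints in an inclusive range."""
    return end - start + 1
```

The lookup tells you only where a character lives in the codepoint space; it says nothing about how the character behaves.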

Unicode Categories: The Grammar of Characters

Every Unicode character belongs to a General Category, which determines how it behaves in text processing. Categories are organized in a two-level hierarchy:

Major categories:

  • L: Letter
  • M: Mark (combining character)
  • N: Number
  • P: Punctuation
  • S: Symbol
  • Z: Separator
  • C: Other

Subcategories (selected):

| Code | Name | Example |
| --- | --- | --- |
| Lu | Uppercase Letter | A, Γ, Д |
| Ll | Lowercase Letter | a, γ, д |
| Lo | Other Letter | 中, ก, ا |
| Mn | Non-spacing Mark | U+0301 (combining acute accent) |
| Nd | Decimal Digit Number | 0-9, ٠-٩ (Arabic-Indic), ০-৯ (Bengali) |
| Ps | Open Punctuation | (, [, { |
| Pe | Close Punctuation | ), ], } |
| Sm | Math Symbol | +, =, ∑, ∞ |
| So | Other Symbol | ©, ™, ♠ |
| Zs | Space Separator | U+0020, U+00A0 (non-breaking), U+3000 (ideographic space) |
| Cc | Control | U+0000–U+001F (tab, newline, etc.) and U+007F–U+009F |
| Cf | Format | U+200B (zero-width space), U+FEFF (BOM) |
| Co | Private Use | U+E000–U+F8FF, plus the private-use ranges in planes 15 and 16 |

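These categories matter enormously for text processing. Regex engines use them: in engines that support Unicode property escapes, \p{Lu} matches any uppercase letter. Identifier rules in programming languages and many tokenizers are defined in terms of them. Word breaking, case conversion, collation, and bidirectional display build on the same per-character data, though they mostly rely on the more specialized properties described in the next section rather than on General Category alone.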
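Python's standard `unicodedata` module exposes the General Category through `unicodedata.category`. The helpers below are a minimal sketch with illustrative names. Filtering on category `Lu` is used here as a stand-in for a `\p{Lu}` match, on the assumption that the regex engine at hand lacks Unicode property escapes.

```python
import unicodedata
from collections import Counter


def general_category(ch: str) -> str:
    """Two-letter General Category, e.g. 'Lu' or 'Nd'."""
    return unicodedata.category(ch)


def major_category(ch: str) -> str:
    """First letter of the category: L, M, N, P, S, Z or C."""
    return unicodedata.category(ch)[0]


def count_by_major(text: str) -> Counter:
    """Tally the characters of text by major category."""
    return Counter(major_category(ch) for ch in text)


def uppercase_letters(text: str) -> list:
    """Characters whose category is Lu, in order of appearance."""
    return [ch for ch in text if unicodedata.category(ch) == "Lu"]


for ch in ["A", "a", "\u4e2d", "\u0301", "(", "+", "\u00a9", " ", "\t", "\u200b", "\ue000"]:
    print(f"U+{ord(ch):04X} {general_category(ch)}")
```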

Unicode Properties: Beyond Category

Category is just one of over 100 properties that Unicode defines for each character. Some particularly important ones:

Script: Which writing system does this character belong to? (Latin, Arabic, Han, Devanagari, etc.)

Bidi_Class: How does this character behave in bidirectional text layout? (Left-to-Right, Right-to-Left, Neutral, etc.)

Case_Folding: What is the case-insensitive equivalent? (More nuanced than simple uppercase/lowercase)

Decomposition: Does this character decompose into simpler components? (é = e + combining acute accent)

Line_Break: How should line-breaking algorithms treat this character?

Numeric_Value: For characters that represent numbers, what is their numeric value? (Useful for digits in various scripts — the Bengali digit ৭ has Numeric_Value 7)

Age: In which version of Unicode was this character first assigned? (U+1F600, the grinning face, has Age 6.1)
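Several of these properties can be inspected from Python's standard library. The sketch below covers only what `unicodedata` and `str` actually expose (bidirectional class, decomposition mapping, numeric value, case folding); Script, Line_Break and Age are not available there and would need a third-party package or the Unicode data files. The `describe` helper is a hypothetical name.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few Unicode properties of a single character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),
        "bidi": unicodedata.bidirectional(ch),            # e.g. 'L', 'R', 'AL'
        "decomposition": unicodedata.decomposition(ch),   # '' if none
        "numeric": unicodedata.numeric(ch, None),         # None if not numeric
    }


print(describe("\u09ed"))   # Bengali digit seven
print(describe("\u00e9"))   # e with acute
print(describe("\u0627"))   # Arabic letter alef

# Decomposition in action: NFD splits the precomposed letter into
# a base letter plus a combining mark.
decomposed = unicodedata.normalize("NFD", "\u00e9")
print([f"U+{ord(c):04X}" for c in decomposed])

# Case folding goes beyond lowercasing: the German sharp s folds to "ss".
print("Stra\u00dfe".casefold())
```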

Characters vs. Glyphs: A Crucial Distinction

One of the most important concepts in Unicode is the distinction between characters and glyphs.

A character is an abstract unit of meaning — an entry in the Unicode database with a codepoint, a name, and properties. The character U+0041 LATIN CAPITAL LETTER A is an abstract entity.

A glyph is the visual representation of a character as rendered by a specific font. The glyph for U+0041 in Times New Roman looks different from its glyph in Helvetica, in Comic Sans, or in a handwriting font — but they all represent the same character.

This distinction matters for several reasons:

  • One character, many glyphs: The character U+0041 can have thousands of different visual representations across fonts.
  • Multiple characters, one glyph: Ligatures are single glyphs that represent multiple characters. The "fi" ligature combines f (U+0066) and i (U+0069) into one visual form — two codepoints, one glyph.
  • Font coverage: Not every font contains glyphs for every Unicode character. A font might cover Latin perfectly but have no glyphs for Devanagari, causing boxes or question marks to appear.
  • Rendering complexity: Arabic characters have different shapes depending on whether they appear at the beginning, middle, or end of a word. A single character may have four different glyphs (initial, medial, final, isolated).

Scalar Values and Surrogates

A technical distinction worth understanding: not all codepoints in the U+0000–U+10FFFF range are scalar values (Unicode's term for every codepoint except the surrogates, that is, the codepoints that can actually represent a character). The range U+D800–U+DFFF is reserved for UTF-16 surrogate pairs. These 2,048 codepoints are never assigned to characters and cannot appear on their own in well-formed Unicode text; they exist solely as a UTF-16 encoding mechanism.

Valid Unicode scalar values are: U+0000–U+D7FF and U+E000–U+10FFFF.
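The two ranges translate into a simple predicate. This is a minimal sketch with hypothetical helper names. The final helper relies on Python's UTF-8 codec refusing to encode a lone surrogate, which is how the rule surfaces in practice.

```python
def is_surrogate(cp: int) -> bool:
    """True for the 2,048 codepoints reserved for UTF-16 surrogates."""
    return 0xD800 <= cp <= 0xDFFF


def is_scalar_value(cp: int) -> bool:
    """True for U+0000-U+D7FF and U+E000-U+10FFFF."""
    return 0 <= cp <= 0x10FFFF and not is_surrogate(cp)


# Whole codepoint space minus the surrogate range.
SCALAR_COUNT = 0x110000 - (0xDFFF - 0xD800 + 1)


def can_encode_utf8(text: str) -> bool:
    """Report whether text is encodable as well-formed UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        # Raised when the string contains a lone surrogate.
        return False


print(f"{SCALAR_COUNT:,} scalar values")
print(can_encode_utf8("A"), can_encode_utf8(chr(0xD800)))
```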

The Private Use Area (PUA)

Three regions of the codepoint space are designated as Private Use Areas:

  • BMP PUA: U+E000–U+F8FF (6,400 codepoints)
  • Plane 15 PUA: U+F0000–U+FFFFD (65,534 codepoints; U+FFFFE and U+FFFFF are noncharacters)
  • Plane 16 PUA: U+100000–U+10FFFD (65,534 codepoints; U+10FFFE and U+10FFFF are noncharacters)

These are intentionally left unassigned by the Unicode Consortium. Organizations can use them for proprietary characters: Apple, for example, has used the BMP PUA extensively, placing its logo at U+F8FF (which still renders as the Apple logo in macOS/iOS fonts) along with various UI icons. The downside: PUA characters have no universal meaning, so text using them is not portable between systems that haven't agreed on the same PUA assignments.
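A private-use check can be written either from the ranges or from the General Category `Co`. The sketch below assumes the two supplementary ranges end at ...FFFD, since the last two codepoints of every plane are noncharacters; that is what yields 65,534 rather than 65,536. The names `PRIVATE_USE_RANGES` and `is_private_use` are illustrative.

```python
import unicodedata

PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),        # BMP PUA
    (0xF0000, 0xFFFFD),      # plane 15
    (0x100000, 0x10FFFD),    # plane 16
]

# Inclusive range sizes, in the same order as above.
PUA_SIZES = [end - start + 1 for start, end in PRIVATE_USE_RANGES]


def is_private_use(ch: str) -> bool:
    """True if ch falls in one of the three Private Use Areas."""
    cp = ord(ch)
    return any(start <= cp <= end for start, end in PRIVATE_USE_RANGES)


print(PUA_SIZES, sum(PUA_SIZES))
# The range check and the General Category agree on U+F8FF.
print(is_private_use("\uf8ff"), unicodedata.category("\uf8ff"))
```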

How 1,114,112 Positions Are Organized in Practice

To give a sense of scale: if you printed every assigned Unicode character in a book, using Unicode 16.0's ~155,000 characters, the book would still run to roughly 400 pages in fine print at several hundred characters per page. The codepoint space without an assigned character, over 950,000 positions, is for the most part a vast terra incognita reserved for writing systems yet to be documented, characters yet to be recognized, and use cases yet to be imagined.
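A rough census of the codepoint space can be taken by walking every position and bucketing it by General Category. This is a sketch, not an official count: the result depends on the Unicode version bundled with your Python build (reported by `unicodedata.unidata_version`), and the "assigned" bucket here includes control codes. The function name and bucket labels are made up for illustration.

```python
import sys
import unicodedata
from collections import Counter


def tally_codepoints() -> Counter:
    """Bucket every codepoint from U+0000 to U+10FFFF by its status."""
    tally = Counter()
    for cp in range(sys.maxunicode + 1):
        category = unicodedata.category(chr(cp))
        if category == "Cn":
            tally["unassigned or noncharacter"] += 1
        elif category == "Cs":
            tally["surrogate"] += 1
        elif category == "Co":
            tally["private use"] += 1
        else:
            tally["assigned"] += 1
    return tally


census = tally_codepoints()
print(f"Unicode data version: {unicodedata.unidata_version}")
for label, count in census.most_common():
    print(f"{label:28s} {count:>9,}")
```

The loop visits all 1,114,112 positions, so it takes a moment to run.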

The 17-plane, 1,114,112-position architecture isn't just a technical specification — it's a statement about the scope of human linguistic diversity. It says: we expect to need to represent at least this many distinct characters, and we've built a system capacious enough to accommodate them all, now and for the foreseeable future.

In the next chapter, we'll explore how these abstract codepoint numbers get turned into actual bytes on disk and wire — the encoding question that gave rise to UTF-8, UTF-16, and UTF-32, and the fascinating engineering trade-offs each one makes.