The Unicode Odyssey · Chapter 3

Encoding the Codepoints: UTF-8, UTF-16, UTF-32

Code points need to be encoded into bytes for storage and transmission. This chapter demystifies the three main Unicode encodings, their trade-offs, and why UTF-8 won the web.

~5,000 words · ~20 min read

Unicode defines what characters exist and gives each one a number. But computers store and transmit data as bytes, not as abstract integers. The question of how to convert Unicode codepoints into bytes — and back again — is the encoding problem, and it has three major solutions: UTF-8, UTF-16, and UTF-32. Each embodies different trade-offs, and understanding those trade-offs explains why UTF-8 dominates the modern web while UTF-16 lives inside Java and JavaScript string implementations.

Why Not Just Store the Number Directly?

The naive approach: store each codepoint as a 4-byte (32-bit) integer, since 32 bits can represent values up to 4,294,967,295, which is more than enough for U+10FFFF. This is essentially what UTF-32 does, and it has the virtue of simplicity — every character takes exactly 4 bytes, so indexing is O(1).

The problem: wasteful. ASCII characters (U+0000–U+007F) need only 7 bits, but UTF-32 wastes 25 bits on zeros for each one. An English-language document stored in UTF-32 is 4× larger than its ASCII equivalent. For a web serving billions of pages daily, that's an enormous cost.

Moreover, ASCII text is fundamentally incompatible with UTF-32: the byte sequence for 'A' in ASCII is 0x41, but in UTF-32-LE it's 0x41 0x00 0x00 0x00. Legacy software that processed ASCII can't process UTF-32 without modification.

UTF-32: Fixed Width, Maximum Simplicity

UTF-32 encodes each codepoint as exactly 4 bytes (32 bits). It comes in two byte orderings:

  • UTF-32-LE (Little-Endian): low byte first
  • UTF-32-BE (Big-Endian): high byte first

For the codepoint U+1F600 (😀, GRINNING FACE):

  • Integer value: 0x0001F600
  • UTF-32-LE bytes: 00 F6 01 00
  • UTF-32-BE bytes: 00 01 F6 00
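These byte sequences can be checked with Python's built-in UTF-32 codecs (a quick sketch; note that the explicit -le/-be codec names omit the BOM):

```python
# U+1F600 as a 32-bit integer, in both UTF-32 byte orders.
ch = "\U0001F600"  # 😀 GRINNING FACE

print(hex(ord(ch)))                     # 0x1f600
print(ch.encode("utf-32-le").hex(" "))  # 00 f6 01 00
print(ch.encode("utf-32-be").hex(" "))  # 00 01 f6 00
```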

The Byte Order Mark (BOM)

When the byte ordering isn't specified out of band, UTF-32 (and UTF-16) files often begin with a Byte Order Mark: codepoint U+FEFF. The BOM is a zero-width no-break space that happens to have an asymmetric byte pattern:

  • U+FEFF in UTF-32-BE: 00 00 FE FF
  • U+FEFF in UTF-32-LE: FF FE 00 00

A reader seeing 00 00 FE FF at the start knows the file is UTF-32-BE. Seeing FF FE 00 00 means UTF-32-LE. This elegant self-identification mechanism is baked into the encoding standard.

UTF-32 use cases: Internal string representations in some programming languages and runtimes, situations where O(1) codepoint indexing is critical and memory cost is acceptable. Rarely used for file storage or network transmission.

UTF-16: The Compromise That Became a Legacy

UTF-16 was designed when the Unicode Consortium believed all necessary characters would fit within the BMP (U+0000–U+FFFF). The idea: encode every character as exactly 2 bytes. Simple, efficient for Asian characters, and half the size of UTF-32 across the board — though still 2× the size of ASCII for ASCII-range text.

For characters in the BMP (U+0000–U+FFFF), UTF-16 encoding is straightforward: the codepoint value is stored directly as a 16-bit integer. U+4E2D (中) in UTF-16-BE: 4E 2D.
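This identity — the code unit is the codepoint — is easy to confirm with Python's utf-16-be codec:

```python
# For BMP codepoints, the UTF-16 code unit is the codepoint value itself.
ch = "\u4e2d"  # 中
print(hex(ord(ch)))                     # 0x4e2d
print(ch.encode("utf-16-be").hex(" "))  # 4e 2d — the same digits
```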

Surrogate Pairs: The Complication

Then the BMP turned out not to be enough. When supplementary planes were added, UTF-16 needed a way to encode codepoints above U+FFFF. The solution was surrogate pairs — using two 16-bit code units to represent one codepoint.

Codepoints U+D800–U+DBFF (1,024 values) are high surrogates, and U+DC00–U+DFFF (1,024 values) are low surrogates. A high surrogate followed by a low surrogate encodes a supplementary codepoint using this formula:

# For codepoint C where C > U+FFFF:
C_prime = C - 0x10000
high_surrogate = 0xD800 + (C_prime >> 10)     # top 10 bits
low_surrogate  = 0xDC00 + (C_prime & 0x3FF)   # bottom 10 bits

Example — encoding U+1F600 (😀):

C_prime = 0x1F600 - 0x10000 = 0xF600
high_surrogate = 0xD800 + (0xF600 >> 10)  = 0xD800 + 0x3D = 0xD83D
low_surrogate  = 0xDC00 + (0xF600 & 0x3FF) = 0xDC00 + 0x200 = 0xDE00

# UTF-16-BE bytes: D8 3D DE 00

Decoding reverses the process:

C = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
  = 0x10000 + (0x3D << 10) + 0x200
  = 0x10000 + 0xF400 + 0x200
  = 0x10000 + 0xF600
  = 0x1F600 ✓
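The encode and decode formulas above translate directly into Python (function names are illustrative):

```python
# Surrogate-pair encoding and decoding, following the formulas above.
def to_surrogates(cp: int) -> tuple[int, int]:
    assert cp > 0xFFFF, "BMP codepoints are encoded directly"
    cp -= 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

def from_surrogates(high: int, low: int) -> int:
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

hi, lo = to_surrogates(0x1F600)
print(hex(hi), hex(lo))              # 0xd83d 0xde00
print(hex(from_surrogates(hi, lo)))  # 0x1f600
```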

Why this matters: JavaScript and Java represent strings internally as sequences of UTF-16 code units. When you call .length on a JavaScript string containing 😀, you get 2 (two surrogate code units), not 1 (one character). This is a source of countless bugs in string processing code. The emoji.length === 2 phenomenon bites developers who write code like:

// Wrong: counts UTF-16 code units, not characters
const len = str.length;

// Correct: spread into array of codepoints (Unicode-aware)
const len = [...str].length;

UTF-16 use cases: the Windows API (all of Win32/WinRT uses UTF-16), Java, JavaScript, C# — platforms and languages that committed to UTF-16 strings in the early Unicode era and can't easily change now.

UTF-8: The Variable-Length Champion

UTF-8 was designed by Ken Thompson and Rob Pike in September 1992 — famously sketched out in a single evening, on a placemat at a New Jersey diner, according to Rob Pike's account. The design goals were clear: backward compatibility with ASCII, no null bytes in multibyte sequences (to avoid breaking C strings), and self-synchronization (you can find character boundaries without reading from the beginning of the stream).

UTF-8 uses 1 to 4 bytes per codepoint, with this encoding structure:

Codepoint Range      Byte Pattern                           Bytes Used
U+0000–U+007F        0xxxxxxx                               1
U+0080–U+07FF        110xxxxx 10xxxxxx                      2
U+0800–U+FFFF        1110xxxx 10xxxxxx 10xxxxxx             3
U+10000–U+10FFFF     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx    4

The leading bits of each byte indicate its role:

  • 0xxxxxxx — single-byte character (ASCII)
  • 110xxxxx — start of 2-byte sequence
  • 1110xxxx — start of 3-byte sequence
  • 11110xxx — start of 4-byte sequence
  • 10xxxxxx — continuation byte
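These lead-byte patterns can be turned into a small classifier. The sketch below only reads the sequence length; a production decoder must also reject overlong encodings, surrogate codepoints, and values above U+10FFFF:

```python
# Determine a UTF-8 sequence's length from its lead byte.
def utf8_seq_len(lead: int) -> int:
    if lead < 0x80:      # 0xxxxxxx — single-byte (ASCII)
        return 1
    if lead < 0xC0:      # 10xxxxxx — continuation byte, not a lead
        raise ValueError("continuation byte cannot start a sequence")
    if lead < 0xE0:      # 110xxxxx
        return 2
    if lead < 0xF0:      # 1110xxxx
        return 3
    return 4             # 11110xxx

print(utf8_seq_len(0xE4))  # 3 — the first byte of 中 (E4 B8 AD)
```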

Encoding Examples

U+0041 (A, Latin Capital Letter A):

  • Binary: 01000001
  • Fits in 1 byte (≤ U+007F): 0xxxxxxx → 01000001 → 0x41
  • Result: 41 (identical to ASCII)

U+00E9 (é, Latin Small Letter E with Acute):

  • Binary: 11101001 (padded to 11 bits: 00011 101001)
  • Fits in 2 bytes (U+0080–U+07FF): 110xxxxx 10xxxxxx
  • 00011 101001 → 110 00011 | 10 101001 → 0xC3 0xA9
  • Result: C3 A9

U+4E2D (中, CJK Unified Ideograph):

  • Hex: 4E2D = 0100 111000 101101 binary
  • Fits in 3 bytes (U+0800–U+FFFF): 1110xxxx 10xxxxxx 10xxxxxx
  • 0100 111000 101101 → 1110 0100 | 10 111000 | 10 101101 → 0xE4 0xB8 0xAD
  • Result: E4 B8 AD

U+1F600 (😀, Grinning Face):

  • Hex: 1F600 = 000 011111 011000 000000 binary
  • Fits in 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  • 000 011111 011000 000000 → 11110 000 | 10 011111 | 10 011000 | 10 000000 → 0xF0 0x9F 0x98 0x80
  • Result: F0 9F 98 80
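All four worked examples can be checked against Python's own UTF-8 encoder:

```python
# Verify the hand-encoded examples above with the stdlib encoder.
examples = [("A", "41"), ("é", "c3a9"), ("中", "e4b8ad"), ("😀", "f09f9880")]
for ch, expected in examples:
    assert ch.encode("utf-8").hex() == expected
print("all four examples verified")
```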

Self-Synchronization: A Killer Feature

One of UTF-8's most valuable properties is self-synchronization. If you land at a random byte in a UTF-8 stream, you can immediately tell whether you're at the start of a character (byte starts with 0 or 11) or in the middle of a multibyte sequence (byte starts with 10). To find the next character boundary, you scan forward until you find a non-continuation byte. This makes random access and stream parsing much more robust than earlier multi-byte encodings.
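The boundary-finding trick reads directly in code. A minimal sketch (char_start is an illustrative name) backs up from an arbitrary offset to the start of the character containing it:

```python
# Resynchronize by skipping backward over continuation bytes (10xxxxxx).
def char_start(data: bytes, i: int) -> int:
    while i > 0 and (data[i] & 0xC0) == 0x80:  # continuation byte?
        i -= 1
    return i

data = "中文".encode("utf-8")  # e4 b8 ad e6 96 87
print(char_start(data, 4))     # 3 — offset 4 is mid-character; 文 starts at 3
```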

The BOM in UTF-8

UTF-8 doesn't need a BOM because it has no byte-order ambiguity — bytes are bytes. However, some tools (notably Microsoft's Notepad, historically) would prepend the bytes EF BB BF to UTF-8 files as a "UTF-8 BOM." These bytes are the UTF-8 encoding of U+FEFF, and they serve only as a file signature. They're not part of the text content, and they cause problems when tools that don't expect them process the file: a PHP file starting with EF BB BF <?php emits those three bytes as output before anything else, which breaks any subsequent attempt to set HTTP headers.
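Python exposes the signature-stripping behavior as the utf-8-sig codec, which makes the difference easy to see:

```python
# "utf-8-sig" treats a leading EF BB BF as a signature and strips it;
# plain "utf-8" decodes it as a real U+FEFF character in the text.
raw = b"\xef\xbb\xbf<?php"
print(repr(raw.decode("utf-8")))      # '\ufeff<?php'
print(repr(raw.decode("utf-8-sig")))  # '<?php'
```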

UTF-8 use cases: The web (97%+ of web pages), JSON (required by RFC 8259), XML, Linux/macOS file systems, Python 3 source files, Go source files, Rust, most modern programming languages and protocols.

Why UTF-8 Won the Web

The victory of UTF-8 over other encodings wasn't inevitable — it required deliberate advocacy. The key advantages:

  1. ASCII compatibility: The first 128 codepoints encode identically to ASCII. Legacy ASCII software processes UTF-8 correctly as long as it doesn't manipulate individual bytes in multibyte sequences.

  2. No null bytes: ASCII null (U+0000) encodes as 0x00 in UTF-8, just like ASCII. No other codepoint produces a null byte in a UTF-8 sequence, so C-style null-terminated strings work correctly for ASCII content.

  3. Space efficiency for Western text: English and most European-language text uses primarily U+0000–U+007F and U+0080–U+07FF, requiring 1–2 bytes per character. For CJK text, UTF-8 uses 3 bytes vs. UTF-16's 2 — slightly worse, but acceptable.

  4. Error resilience: Because continuation bytes have a distinctive pattern (10xxxxxx), a UTF-8 decoder that encounters an invalid byte can skip to the next character boundary without parsing the entire stream from the beginning.

  5. Network byte order irrelevant: No byte-order issues to negotiate. One encoding, one representation.

The web's adoption of UTF-8 as the default encoding (mandated by the HTML5 specification) was the decisive moment. Once HTTP servers started defaulting to UTF-8 and browsers defaulted to assuming UTF-8, the encoding wars for web content were essentially over.

Comparison Summary

Feature               UTF-8              UTF-16                     UTF-32
Space (ASCII)         1 byte             2 bytes                    4 bytes
Space (CJK)           3 bytes            2 bytes                    4 bytes
Space (Emoji)         4 bytes            4 bytes (surrogate pair)   4 bytes
Fixed width?          No                 No (surrogates)            Yes
ASCII compatible?     Yes                No                         No
BOM needed?           No                 Recommended                Recommended
Self-synchronizing?   Yes                Partial                    Yes
Dominant use case     Web, files, Unix   Windows, Java, JS          Internal processing

Understanding which encoding your system uses — and where encoding conversion happens — is the first step to avoiding the encoding bugs that still plague real-world software. In the next chapter, we'll discover that even once encoding is solved, the question of what constitutes a "character" is far more complicated than you'd expect.