The Encoding Wars · Chapter 5

UTF-8: The Encoding That Won

Ken Thompson and Rob Pike designed UTF-8 on a placemat at a New Jersey diner in 1992. This chapter traces UTF-8's path from invention to dominance, including HTML5's mandate and the web adoption curve.

~4,000 words · ~16 min. read

The date was September 1992. The place was a diner in New Jersey — specifically, according to the most commonly told version of the story, somewhere near Bell Labs in Murray Hill. Ken Thompson and Rob Pike, two of the most influential figures in computing history (Thompson a primary creator of Unix; both, years later, co-creators of Go), were working on Plan 9 from Bell Labs, Bell Labs' ambitious successor to Unix that had been in development since the late 1980s. They needed to figure out how Plan 9 would handle text in a world that was moving toward Unicode.

The problem was acute. Plan 9 wanted to use Unicode for its internal text representation, enabling multilingual computing in a way that Unix had never supported. But Unicode at that time meant UCS-2, a fixed-width 16-bit encoding. Making Plan 9 use UCS-2 internally would break everything — every Unix utility, every text-processing tool, every protocol that had been designed around ASCII's property that every character was exactly one byte and every byte in the range 0-127 meant exactly what ASCII said. The Unix philosophy — stream everything through text pipes, make every tool speak the same character language — depended on that compatibility. Replace single bytes with double bytes and you had to rewrite cat, grep, sed, awk, diff, the shell, every language interpreter, every network protocol handler.

The story goes that Thompson sketched the key insight on a placemat — the encoding design that would solve the problem — and that within days they had an implementation running in the Plan 9 kernel. Whether or not the placemat is literal history, the speed of the implementation was real: Thompson and Pike had the UTF-8 encoding running in Plan 9 within a week of conceiving it.

The Genius of the Design

UTF-8's encoding rules are remarkably clean. Every Unicode code point is encoded as a sequence of 1 to 4 bytes. The number of bytes in the sequence is determined by the high bits of the first byte. All continuation bytes — the second, third, and fourth bytes in multi-byte sequences — begin with the bit pattern 10, which distinguishes them from any leading byte.

For code points U+0000 through U+007F (the ASCII range): 1 byte, format 0xxxxxxx. The 7 data bits directly encode the code point value.

For code points U+0080 through U+07FF (Latin Extended, Greek, Cyrillic, Hebrew, Arabic, and others): 2 bytes, format 110xxxxx 10xxxxxx. The 5 data bits of the first byte and the 6 data bits of the second byte, concatenated, encode the 11-bit code point value.

For code points U+0800 through U+FFFF (most of the Basic Multilingual Plane, including CJK ideographs): 3 bytes, format 1110xxxx 10xxxxxx 10xxxxxx. The 4+6+6=16 data bits encode the code point.

For code points U+10000 through U+10FFFF (supplementary planes, including historic scripts, musical notation, emoji, and others): 4 bytes, format 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. The 3+6+6+6=21 data bits encode the code point.
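These layouts translate almost mechanically into code. Here is a minimal Python sketch (surrogate rejection omitted for brevity) that encodes a single code point using the bit patterns above and cross-checks itself against Python's built-in codec:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point using the RFC 3629 bit layouts."""
    if cp <= 0x7F:                        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp <= 0xFFFF:                      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    if cp <= 0x10FFFF:                    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                      0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    raise ValueError("code point out of Unicode range")

# One probe per length tier, checked against Python's built-in codec:
for ch in "A", "Ω", "中", "😀":           # U+0041, U+03A9, U+4E2D, U+1F600
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
```

Note how each branch simply peels six-bit groups off the code point, from most significant to least, and prefixes them with the continuation marker 10.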

This design has properties that seem almost too good to be coincidental — but they follow naturally from the structure.

ASCII Compatibility

Every ASCII character (U+0000 to U+007F) has exactly the same encoding in UTF-8 as in ASCII. A byte in the range 0x00-0x7F always represents the same character it did in ASCII. No multi-byte UTF-8 sequence contains a byte in the range 0x00-0x7F — the structure of the encoding ensures this. This means that any program handling ASCII text will handle UTF-8 text correctly as long as it doesn't need to perform character-level operations (character counting, character indexing). The enormous installed base of ASCII-based Unix software — including C's standard library string functions, which operate byte by byte — could process UTF-8 text without modification for most common operations.

This was not a given. It was the result of a deliberate, careful design decision. Thompson and Pike wanted UTF-8 to work transparently in a Unix environment, and they achieved it. strchr(s, '/') looking for a slash would find it correctly in UTF-8 text because the byte 0x2F never appears as anything other than itself in UTF-8. Any other encoding of Unicode, including UTF-16 and UCS-2, would have broken this property.
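The same transparency is easy to demonstrate in Python (the file names here are invented for illustration): splitting a UTF-8 path on the raw slash byte is safe, because 0x2F never occurs inside a multi-byte sequence.

```python
# A path whose components mix ASCII and non-ASCII text:
path = "/home/müller/日記.txt".encode("utf-8")

# Splitting on the raw byte 0x2F ('/') is safe: that byte value only ever
# appears in UTF-8 as the slash character itself.
components = path.split(b"/")
assert [c.decode("utf-8") for c in components] == ["", "home", "müller", "日記.txt"]

# No byte below 0x80 appears anywhere in the encoding of non-ASCII characters:
assert all(b >= 0x80 for b in "üé日記".encode("utf-8"))
```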

Self-Synchronization

If you receive a stream of UTF-8 bytes and one byte is corrupted or dropped, you can resynchronize at the very next character boundary without knowing where the sequence started. The rule is simple: to find the start of a character, scan forward until you find a byte that does not begin with 10. Any byte beginning with 0 is a complete single-byte character. Any byte beginning with 11 is the start of a multi-byte sequence. Any byte beginning with 10 is a continuation byte — skip forward.
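The scan rule fits in a few lines. This Python sketch finds every character boundary in a byte stream and resynchronizes after landing mid-character:

```python
def char_starts(data: bytes) -> list[int]:
    # A character boundary is any byte NOT of the form 10xxxxxx (0x80-0xBF).
    return [i for i, b in enumerate(data) if b & 0xC0 != 0x80]

text = "aé中😀".encode("utf-8")          # 1 + 2 + 3 + 4 = 10 bytes
assert char_starts(text) == [0, 1, 3, 6]

# Resynchronizing: suppose we land mid-character at byte 4 (inside 中).
# Scan forward past continuation bytes to reach the next boundary:
i = 4
while i < len(text) and text[i] & 0xC0 == 0x80:
    i += 1
assert i == 6
assert text[i:].decode("utf-8") == "😀"  # the rest of the stream is intact
```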

This property is critical for network protocols and file systems where data can be truncated, corrupted, or delivered in fragments. In contrast, a single dropped byte in a UTF-16 stream knocks the entire remainder one byte off phase: every subsequent 16-bit code unit is assembled from the wrong pair of bytes, and nothing in the data itself marks where the next character begins.

Sort-Order Preservation

UTF-8 encodes code points in a way that preserves lexicographic order. If code point A has a lower numeric value than code point B, then the UTF-8 encoding of A sorts lexicographically before the UTF-8 encoding of B. This follows from the structure: longer sequences begin with numerically larger lead bytes (0xxxxxxx < 110xxxxx < 1110xxxx < 11110xxx), and within each length the data bits are laid out from most significant to least significant, so byte sequences grow lexicographically as code point values grow.

This means that byte-level sorting and comparison of UTF-8 strings gives the same result as Unicode code point ordering. Sorting UTF-8 strings by comparing them as byte arrays — which every language runtime and database already does correctly for arbitrary byte strings — gives Unicode code point order. UTF-16 does not have this property: because of how surrogate pairs work, sorting UTF-16 strings as 16-bit unit arrays does not give correct Unicode code point order for supplementary characters.
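Both halves of the claim can be checked in a short Python sketch, using a handful of arbitrary characters drawn from different UTF-8 length tiers:

```python
# Characters from all four UTF-8 length tiers, deliberately shuffled:
strings = ["😀", "中", "A", "Ω", "é", "z"]

# Sorting the raw UTF-8 bytes gives the same order as sorting by code point
# (Python compares str values by code point):
by_bytes = sorted(strings, key=lambda s: s.encode("utf-8"))
assert by_bytes == sorted(strings) == ["A", "z", "é", "Ω", "中", "😀"]

# UTF-16 lacks this property: U+1F600 encodes as the surrogate pair
# D83D DE00, and 0xD83D < 0xFFFD, so in raw UTF-16 order the emoji sorts
# BEFORE the BMP character U+FFFD despite having a larger code point:
assert "\U0001F600".encode("utf-16-be") < "\uFFFD".encode("utf-16-be")
assert "\U0001F600" > "\uFFFD"
```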

No Byte Order Mark Required

UTF-16 has a fundamental ambiguity: a 16-bit value can be stored with the most significant byte first (big-endian) or the least significant byte first (little-endian). To resolve this, Unicode defines a Byte Order Mark (BOM): the code point U+FEFF, if it appears at the start of a UTF-16 stream, identifies the byte order. If the first two bytes are 0xFE 0xFF, you're reading big-endian; if they're 0xFF 0xFE, you're reading little-endian. UTF-8 has no byte-order issue because each unit is a single byte. The UTF-8 BOM (the byte sequence 0xEF 0xBB 0xBF, which is the UTF-8 encoding of U+FEFF) exists as an optional signature for editors but serves no technical purpose. Its presence in UTF-8 files causes problems in contexts that don't expect it; the classic case is a Unix shell script, where a BOM before the #! line prevents the kernel from recognizing the interpreter directive.
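Python exposes the distinction directly through its utf-8 and utf-8-sig codecs; a quick sketch:

```python
data = b"\xef\xbb\xbfHello"                 # UTF-8 BOM followed by ASCII text

# Decoding as plain utf-8 keeps the BOM as the code point U+FEFF:
assert data.decode("utf-8") == "\ufeffHello"

# The utf-8-sig codec strips one leading BOM if present, and is harmless
# when no BOM is there:
assert data.decode("utf-8-sig") == "Hello"
assert b"Hello".decode("utf-8-sig") == "Hello"
```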

RFC 3629 and Security Considerations

Plan 9 adopted UTF-8 internally in 1992, but the encoding needed formal standardization for internet use. Thompson and Pike's design circulated through X/Open under the name FSS-UTF and was presented at the Winter 1993 USENIX conference, after which it was gradually adopted by other systems. The Internet Engineering Task Force first standardized UTF-8 as RFC 2044 in 1996 and revised it as RFC 2279 in January 1998. An updated version, RFC 3629, was published in November 2003 and remains the current standard.

RFC 3629 made two important clarifications. First, it restricted the valid code point range to U+0000 through U+10FFFF, matching the Unicode standard's range. (The original UTF-8 design theoretically allowed encoding up to U+7FFFFFFF using 6-byte sequences, but this was unnecessary and confusing.) Second, it explicitly prohibited "overlong encodings" — encoding a code point with more bytes than the minimum required.

The security importance of the overlong encoding prohibition deserves explanation. In UTF-8, any code point has exactly one valid encoding: the shortest possible byte sequence. But the bit patterns technically allow encoding small code points with more bytes. The code point U+0041 (Latin capital letter A, ASCII value 65) has the canonical 1-byte encoding 0x41, but the 2-byte sequence 0xC1 0x81 carries the same data bits and would decode to the same code point. Before RFC 3629, some implementations accepted overlong encodings, and attackers used this to bypass security filters. A web server that blocked path traversal by scanning for the byte 0x2F (ASCII slash) would not catch the overlong encoding 0xC0 0xAF of U+002F; the Microsoft IIS directory-traversal vulnerability of 2000 was exactly this class of bug, and several CVEs were assigned for it. RFC 3629's prohibition of overlong encodings closes this attack vector: a conforming UTF-8 implementation must reject them.
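Modern decoders enforce the prohibition. A quick Python check of the overlong slash example:

```python
# 0xC0 0xAF is an overlong two-byte encoding of U+002F '/'.
# A conforming decoder must reject it (RFC 3629):
try:
    b"\xc0\xaf".decode("utf-8")
    accepted = True
except UnicodeDecodeError:
    accepted = False
assert not accepted

# The canonical one-byte form decodes fine:
assert b"\x2f".decode("utf-8") == "/"
```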

The Web's Conversion

The adoption curve of UTF-8 on the web is one of the most dramatic technology transitions in internet history. W3C's internationalization work in the mid-1990s recommended UTF-8, and HTML 4.0 (1997) made UTF-8 a recommended encoding for new HTML content. But actual adoption was slow in the late 1990s — the web was dominated by ISO 8859-1 and Shift_JIS content created with existing tools that defaulted to those encodings.

By 2008, Google's web crawl data showed UTF-8 had become the single most common encoding on the web, displacing ASCII (which is technically a subset of UTF-8, so the transition was largely invisible) and ISO 8859-1. Growth accelerated: 2010 saw UTF-8 at roughly 50% of web pages; 2012 saw it above 60%; 2015 above 80%; 2020 above 95%; 2025 above 98%. The hockey-stick adoption curve was driven by HTML5's specification of UTF-8 as the default encoding for HTML documents, by the major web frameworks (Rails, Django, Express, Spring) adopting UTF-8 as their default, and by database systems (MySQL, PostgreSQL) making UTF-8 their recommended character set for new databases.

Python's transition from Python 2 to Python 3 was intimately tied to UTF-8. In Python 2, the str type was a byte string — it could hold ASCII or any 8-bit encoding depending on what you put in it, with no explicit tracking of the encoding. The unicode type held Unicode text but required explicit u"" literals and could not be seamlessly mixed with str. This dual-type system was a constant source of UnicodeDecodeError and UnicodeEncodeError exceptions. Python 3 unified the types: str is always Unicode (stored internally, since CPython 3.3, as Latin-1, UCS-2, or UCS-4 depending on the widest character present, with UTF-8 as the default encoding for source files and the usual encoding for I/O). Python 3.0 was released in 2008, but the transition from Python 2 was slow and contentious — Python 2 reached official end-of-life on January 1, 2020, forcing the last holdouts to convert. The encoding clarity of Python 3 is now considered one of its defining improvements.
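A brief sketch of the Python 3 model, where text and its UTF-8 serialization are distinct types:

```python
s = "naïve café"                 # str: a sequence of code points
b = s.encode("utf-8")            # bytes: its UTF-8 serialization

assert len(s) == 10              # 10 characters...
assert len(b) == 12              # ...but 12 bytes: ï and é take 2 bytes each
assert b.decode("utf-8") == s    # the round trip is lossless

# Python 3 refuses to mix the two types silently; Python 2 would have
# attempted an implicit ASCII conversion and often produced mojibake.
try:
    s + b
    mixed = True
except TypeError:
    mixed = False
assert not mixed
```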

Why UTF-8 Beat Its Rivals

UTF-16 offers a real advantage for text composed primarily of characters in the Basic Multilingual Plane: 2 bytes per character is more compact than UTF-8's 3 bytes per character for CJK ideographs. Java, C#/.NET, Windows NT, and JavaScript all chose UTF-16 as their internal string encoding on this basis. For systems processing large volumes of Chinese, Japanese, or Korean text, UTF-16 is measurably more efficient.

But for the web, where markup (HTML tags, attribute names, URLs) and programming keywords are ASCII, UTF-8's 1-byte ASCII encoding wins decisively. An HTML page is perhaps 40-60% ASCII content by byte count (tags, attributes, whitespace). UTF-8 encodes that 40-60% at 1 byte per character; UTF-16 would encode it at 2 bytes per character. The result is that UTF-8 HTML files are typically 20-30% smaller than the equivalent UTF-16 files. At web scale — billions of page views per day — this matters.
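The arithmetic is easy to verify on a toy page (this HTML snippet is invented for illustration):

```python
# ASCII markup around CJK text:
html = '<p class="greeting">你好，世界</p>'
u8 = html.encode("utf-8")
u16 = html.encode("utf-16-le")   # -le: no BOM, so a pure payload comparison

# The 24 ASCII characters cost 1 byte each in UTF-8 but 2 in UTF-16;
# the 5 CJK/fullwidth characters cost 3 bytes in UTF-8 vs 2 in UTF-16.
assert len(u8) == 24 * 1 + 5 * 3     # 39 bytes
assert len(u16) == 29 * 2            # 58 bytes
assert len(u8) < len(u16)
```

For markup-heavy pages the ASCII savings dominate; only for documents that are almost entirely CJK prose does UTF-16 pull ahead.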

UTF-32 (4 bytes per character, always) provides constant-time random access by character index, but at a cost of 4× the space of ASCII text and 2× the space of UTF-16 for BMP text. It is used internally in some systems where character-level indexing must be O(1) — CPython, for instance, falls back to a 4-byte-per-character representation for strings containing supplementary-plane characters — but it is essentially never used for web transmission or storage.

The victory of UTF-8 was not merely a matter of technical elegance; each of its technical properties translated directly into an adoption advantage. ASCII compatibility allowed incremental adoption without breaking existing systems. Self-synchronization made it robust to the kind of data corruption that was common on the early internet. Sort-order preservation made it compatible with existing database indexing infrastructure. And simplicity made it easy to implement correctly — the encoding rules fit on a single printed page, a property that mattered enormously in the early days when every networking library was a new implementation.

Ken Thompson and Rob Pike designed UTF-8 in an evening, implemented it in a week, and published it in a technical report that most of the industry initially ignored. A decade later, it was the most widely used character encoding on earth. The engineers who created Unix, who created Go, who inspired generations of systems programmers — their casual solution to an encoding problem that was annoying them on a Tuesday is now the silent, invisible infrastructure under every web page, every text file, every program source file in the world.