The Encoding Wars · Chapter 1
Morse, Baudot, and the First Codes
The story of character encoding begins with the telegraph. This chapter traces the evolution from Morse code to Baudot's 5-bit teletypewriter code, laying the foundation for the digital age.
Before there were bytes, before there were bits in the modern sense, before anyone had conceived of a computer, human civilization had already solved a profound problem: how do you reduce the infinite variety of human language into discrete electrical pulses that can travel across a wire? The answer came not from a mathematician or an engineer in the formal sense, but from a painter with an obsession, a French telegraph operator with a dream, and the relentless pressure of commerce demanding speed.
The story of character encoding is, at its deepest roots, a story about constraint. Every encoding system ever devised — from the first click of a telegraph key to the 154,998 characters of Unicode 16.0 — has been shaped by the tension between the richness of human expression and the brutal limitations of the physical medium carrying it. Understanding that tension is understanding everything that came after.
Samuel Morse and the Birth of Machine-Readable Text
In 1838, Samuel Finley Breese Morse demonstrated his electromagnetic telegraph in New York. Morse was, by training and passion, a portrait painter — he had spent years in Europe studying the masters, and had painted the Marquis de Lafayette in 1825. But on a ship voyage from France in 1832, he overheard a conversation about electromagnetism that consumed the rest of his life. The result was a system that would, within two decades, bind the American continent together with an invisible nervous system of wire.
Morse code was not quite what we think of it today. The original Morse code of 1838, which he called American Morse or Railroad Morse, was messier than the International Morse Code standardized in 1851 — it included long dashes, short spaces, and several letters encoded with awkward gaps that made automatic machine reading difficult. But the core insight was revolutionary: assign variable-length sequences of short signals (dots) and long signals (dashes) to each letter, using shorter sequences for more common letters. The letter E was a single dot. T was a single dash. This frequency-based compression was, in a real sense, the first data compression algorithm in history, predating Huffman coding by over a century.
The economic stakes were immediate and enormous. A transatlantic telegraph message in the 1860s cost ten dollars per word — more than two hundred dollars in today's money. Every extra dot or dash added cost. Morse's frequency-based assignment, where the most common English letters received the shortest codes, was not aesthetic cleverness but financial necessity. Operators sending high volumes of business messages paid close attention to which letters were cheap to transmit and which were expensive. The word "the" — a single dash for T, four dots for H, a single dot for E — was a bargain. A word laden with X's and Q's cost more.
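The arithmetic behind those prices can be made concrete. Here is a minimal Python sketch using the standard International Morse timing conventions — a dot lasts 1 unit, a dash 3 units, with a 1-unit gap between the elements of a letter and a 3-unit gap between letters. The code table holds only the few letters needed for the comparison:

```python
# Standard International Morse code for a handful of letters.
MORSE = {
    "E": ".", "T": "-", "H": "....", "A": ".-", "Q": "--.-", "X": "-..-",
}

def letter_units(code: str) -> int:
    """Time units for one letter: its dots and dashes, plus 1-unit gaps between them."""
    symbols = sum(1 if s == "." else 3 for s in code)
    intra_gaps = len(code) - 1
    return symbols + intra_gaps

def word_units(word: str) -> int:
    """Time units for a word, with 3-unit gaps between letters."""
    letters = [letter_units(MORSE[c]) for c in word.upper()]
    return sum(letters) + 3 * (len(letters) - 1)
```

Running `word_units("the")` against a Q- and X-heavy word like "qat" shows the cheap word taking noticeably less wire time, which is exactly the asymmetry telegraph customers paid attention to.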
What Morse code could not do was be read by a machine. A trained telegrapher could decode perhaps 25-30 words per minute, but the speed of a human brain was the bottleneck. The wire could carry signals far faster than any person could receive them. Something more mechanically systematic was needed.
Émile Baudot and the Five-Bit Revolution
In 1870, a young French telegraph operator named Jean-Maurice-Émile Baudot filed a patent for a revolutionary new system. Where Morse had used variable-length codes, Baudot used fixed-length codes — every character was exactly 5 bits, creating 32 possible combinations. The operator typed on a keyboard of just five keys, pressing combinations of keys simultaneously like chords on a piano, and the machine encoded and transmitted the result. The system was demonstrated publicly in 1874 and adopted by the French telegraph administration in 1875.
The 5-bit limit was not arbitrary — it was the practical limit of what a single operator's two hands could comfortably chord. But 32 combinations (2^5) had to cover 26 letters, 10 digits, and some punctuation — more symbols than codes. The arithmetic simply did not work unless Baudot was clever about it.
His solution was a "shift" mechanism: two of the 32 codes (he called them "letter shift" and "figure shift") switched the interpretation of all subsequent codes. In letter mode, code 00011 might represent the letter C. In figure mode, the same code might represent the number 3. This single insight doubled the effective character set from 32 to roughly 60 distinguishable symbols and introduced a concept — the state-dependent code — that would echo through computing history in ways Baudot could never have imagined. State-dependent codes appear in the escape-driven charset switching of ISO-2022, in UTF-7's shifts into and out of a modified Base64, in virtually every compact encoding scheme ever designed.
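The shift mechanism is easy to sketch as a toy decoder. The letter and figure assignments below are invented for illustration — they are not the real ITA2 tables — though the two shift-code values themselves (11111 for letters, 11011 for figures) follow the ITA2 convention:

```python
# Shift codes: these two 5-bit values follow the ITA2 convention.
LTRS, FIGS = 0b11111, 0b11011

# Hypothetical code assignments, for illustration only.
LETTERS = {0b00011: "C", 0b01010: "R", 0b10000: "E"}
FIGURES = {0b00011: "3", 0b01010: "4", 0b10000: "$"}

def decode(codes):
    """Decode a stream of 5-bit codes, tracking the current shift state."""
    table = LETTERS  # start in letters mode
    out = []
    for code in codes:
        if code == LTRS:
            table = LETTERS      # all subsequent codes read as letters
        elif code == FIGS:
            table = FIGURES      # all subsequent codes read as figures
        else:
            out.append(table.get(code, "?"))
    return "".join(out)
```

The key property — and the key fragility — is that the meaning of any code depends on every shift code that preceded it. Lose one shift character to line noise and everything after it decodes as the wrong alphabet.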
Baudot's original system required five operators working in synchrony, each responsible for one bit, and used a special clock mechanism to sample all five lines simultaneously. This was impractical for widespread deployment. It was Donald Murray, a New Zealand-born inventor who later emigrated to Britain, who in 1899 redesigned the system to work with a typewriter-style keyboard and punched paper tape. Murray observed that rather than five synchronized operators, you could have one operator type on a keyboard that punched holes in paper tape, then feed that tape automatically into a transmitter. This separation of composition from transmission was a crucial conceptual leap.
Murray's version, which the International Telecommunication Union standardized as ITA2 (International Telegraph Alphabet No. 2) in 1932, became the basis for the Teletype machine and remained in active use for over a century. ITA2 is not some historical curiosity — it is the direct ancestor of the terminal control characters that every programmer uses today.
ITA2 and the Tyranny of 5 Bits
ITA2 defined 32 codes, split between letters mode and figures mode, with a few codes — space, carriage return, line feed, and the two shift codes themselves — meaning the same thing in both. The codes were, by modern standards, chaotic — the relationship between a letter and its bit pattern followed no obvious mathematical logic. There was no notion of alphabetical ordering in the codes, no relationship between uppercase and lowercase (ITA2 had no lowercase at all), no systematic organization by frequency of use. Characters were assigned codes based on the mechanical convenience of early Baudot keyboards, and those assignments were frozen for decades.
This had consequences that lasted well into the computer era. The BEL character (ASCII code 7, which rings a bell), the BS (backspace), CR (carriage return), LF (line feed), HT (horizontal tab) — all of these trace their existence directly to the physical actions of the Teletype machine. When you press Enter in a terminal today and your code needs to handle \r\n line endings on Windows, you are dealing with the ghost of a mechanical carriage that slapped back to the left margin (CR) and a separate mechanism that advanced the paper up one line (LF). The physical actions became the codes. The codes became the standard. The standard became the legacy that every programmer must manage.
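That legacy is still load-bearing in everyday code. A minimal Python illustration: `str.splitlines()` treats \r\n, bare \n, and bare \r alike as line boundaries, which is why it is the usual way to handle text of unknown provenance:

```python
# The same two logical lines, carrying different mechanical ghosts.
windows_text = "first\r\nsecond\r\n"  # CR+LF: carriage return, then line feed
unix_text = "first\nsecond\n"        # LF only

# splitlines() recognizes \r\n, \n, and \r as line breaks alike,
# so both inputs yield the same logical lines.
assert windows_text.splitlines() == ["first", "second"]
assert unix_text.splitlines() == ["first", "second"]
```

Naively splitting on "\n" instead would leave a trailing \r on every line of the Windows text — a bug class that traces directly back to the Teletype's two separate mechanisms.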
The 5-bit limit of ITA2 meant that every system built on it was, by design, English-centric and Latin-script-centric. There was no room for accented characters. The French, who had invented the system, couldn't properly type their own language on it — Baudot code had no é, è, ê, à, ù, or ç. The Swedes had no å or ö. Germans had no ü or ß. This irony — a French invention that could not represent French — would repeat itself throughout encoding history, always in the same pattern: the engineers building the system optimized for their immediate needs and left accents, diacritics, and non-Latin characters as someone else's problem.
The Punch Card Parallel
Running in parallel with the telegraph encoding tradition was a completely separate encoding tradition: punch cards. Herman Hollerith developed his punch card system in the 1880s for the United States Census, and IBM standardized the 80-column punch card format in 1928. The IBM card code, later extended into EBCDIC (Extended Binary Coded Decimal Interchange Code), had an entirely different character encoding from Baudot/ITA2 — and when IBM dominated the computing industry in the 1950s and 1960s, EBCDIC competed directly with the Baudot-descended ASCII for the right to become the universal standard. The encoding wars began long before anyone called them that.
The Teletype Legacy
The Teletype Corporation's Model 33 ASR (Automatic Send-Receive), introduced in 1963, was perhaps the most historically significant terminal in computing history. Universities and research labs acquired these machines by the thousands as cheap interfaces to early timesharing systems — they cost about $700 in 1963, compared to thousands of dollars for other terminals, making them the gateway through which a generation of programmers first touched computers. The Model 33 communicated at 110 baud — 110 bits per second, which with an 11-bit character frame (one start bit, seven data bits, a parity bit, and two stop bits) works out to 10 characters per second — and this speed was so foundational that "10 characters per second" became the implicit mental model of interactive computing for years.
The Model 33's keyboard layout, its 110-baud serial connection, its physical uppercase-only constraint — all of these shaped the early Unix environment in ways that persisted for decades. Ken Thompson and Dennis Ritchie designed early Unix to work with the Model 33's limitations. The result was a coding culture built around short variable names (to save typing on a slow machine), the pipe concept (to chain small programs rather than needing to display everything at once), and a philosophy of terseness that still pervades Unix and its descendants.
When you type grep or awk or sed, you are partly solving the problem of a telegraph operator in the 1870s who needed to transmit messages quickly and cheaply. The bandwidth constraint that shaped Baudot's 5-bit codes, which shaped Murray's paper tape system, which shaped the Model 33 Teletype, which shaped early Unix, which shaped the Internet's command-line tools — the line of causation is long, but it is real.
The Speed-Bandwidth-Character Tradeoff
Every encoding system from Morse to Baudot to ASCII was shaped by a single fundamental tradeoff: more characters require more bits per character, but more bits per character means slower transmission and higher cost. In the telegraph era, wire time was expensive. This economic pressure explains why Morse used variable-length codes (common letters got short codes, saving money on the most-used characters) while Baudot used fixed-length codes (because fixed-length enabled mechanical automation, saving labor costs). Different constraints, different solutions — but both solutions were optimal given their context.
The same tradeoff would appear again and again in encoding history. ASCII used 7 bits because 7 bits fit in a 7-track paper tape that was cheaper to manufacture than 8-track tape, and because the parity check bit in the 8th position was more valuable than the extra 128 characters it could have provided. UTF-8 used variable-length encoding specifically to give ASCII characters a 1-byte encoding while still supporting all of Unicode — the same insight Morse had in 1838, now applied to a digital alphabet of 154,998 characters. Even today, when storage and bandwidth are effectively free by the standards of the telegraph era, encoding efficiency matters in high-volume systems: Twitter's character limits, SMS's 160-character constraint, the structure of QR codes — all are shaped by the tension between expressive completeness and physical limitation.
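Morse's insight surviving in UTF-8 is directly observable in any modern runtime. A quick Python check of how many bytes UTF-8 spends per character as you climb the Unicode range:

```python
# UTF-8's variable-length design in action: ASCII stays at 1 byte,
# while characters further up the Unicode range cost 2, 3, or 4 bytes.
samples = {
    "E": 1,   # U+0045, plain ASCII
    "é": 2,   # U+00E9, Latin-1 Supplement
    "€": 3,   # U+20AC, Basic Multilingual Plane
    "🙂": 4,  # U+1F642, outside the BMP
}
for char, expected_bytes in samples.items():
    assert len(char.encode("utf-8")) == expected_bytes
```

The characters an English-centric byte stream uses most get the shortest encodings — the same frequency-driven economy that made E a single dot.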
The telegraph operators who tapped out messages in the 1860s, the Teletype operators who chorded ITA2 codes in the 1930s, the programmers who debug Unicode normalization issues today — they are all working on the same problem, at different scales, with different technology, but always under the pressure of the same fundamental constraint: infinite human expression, finite physical medium.
What Morse and Baudot established was not just a pair of encoding systems. They established the discipline of encoding itself — the idea that every character must be reducible to a finite, unambiguous sequence of discrete signals, and that the design of that reduction has profound consequences for everything built on top of it. The choices you make at the encoding layer echo through every abstraction built above it, for decades or centuries. Choose poorly, and you create debt that must be paid by every programmer who touches the system afterward. Choose well, and you build a foundation that supports structures you never imagined.
Every encoding decision made in the following 185 years would be, consciously or not, a response to the choices Morse and Baudot made and the constraints they faced. The Encoding Wars were not a conflict that began in the 1980s with code pages. They began in 1838, with a painter sitting at a telegraph key, deciding that E should be a single dot.