
UTF-8 Encoding Explained


UTF-8 is the dominant text encoding on the internet — as of 2024, over 98% of web pages declare it — yet most developers treat it as a black box. You know to use it, you know what it is called, but do you know why a simple ASCII letter takes one byte while a snowman (☃) takes three? This guide opens the box and shows you exactly how UTF-8 works, byte by byte.

What UTF-8 Actually Is

UTF-8 is a variable-width character encoding capable of representing every Unicode code point (U+0000 through U+10FFFF). It encodes each code point using between one and four bytes, choosing the minimum number of bytes needed for each character.

It was designed by Ken Thompson and Rob Pike on a restaurant placemat in September 1992, and first deployed in Plan 9 from Bell Labs. Its design goals were clear:

  1. Be backwards-compatible with 7-bit ASCII.
  2. Use as few bytes as possible for common characters.
  3. Be self-synchronizing — you can find the start of any character without scanning from the beginning of the string.
  4. Be unambiguous — no byte sequence is a prefix of another valid sequence.

The Byte Structure

The encoding scheme assigns code points to byte ranges as follows:

Code Point Range    Byte Count   Byte 1     Byte 2     Byte 3     Byte 4
U+0000–U+007F       1 byte       0xxxxxxx
U+0080–U+07FF       2 bytes      110xxxxx   10xxxxxx
U+0800–U+FFFF       3 bytes      1110xxxx   10xxxxxx   10xxxxxx
U+10000–U+10FFFF    4 bytes      11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

The x bits are filled with the binary representation of the code point value.
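The table translates directly into a small hand-rolled encoder. The sketch below (the name encode_utf8 is ours, purely for illustration; in real code use str.encode("utf-8")) shifts and masks the code point value into the x slots of each template byte:

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point by hand, following the table above."""
    if cp <= 0x7F:                        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b111111)])
    if cp <= 0xFFFF:                      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    if cp <= 0x10FFFF:                    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0b11110000 | (cp >> 18),
                      0b10000000 | ((cp >> 12) & 0b111111),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    raise ValueError("code point out of Unicode range")

# Sanity check against Python's built-in encoder:
for ch in "Aé☃😀":
    assert encode_utf8(ord(ch)) == ch.encode("utf-8")
```

Note that a conforming encoder must also reject the surrogate range U+D800–U+DFFF; this sketch omits that check for brevity.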

The leading byte tells you — and any parser — exactly how many bytes to read:

  • Starts with 0: single-byte (ASCII range).
  • Starts with 110: two-byte sequence.
  • Starts with 1110: three-byte sequence.
  • Starts with 11110: four-byte sequence.
  • Starts with 10: continuation byte (not a valid start; part of a multi-byte sequence).

This is the "self-synchronizing" property. If you're dropped into the middle of a UTF-8 stream, you can skip forward until you see a byte that doesn't start with 10, and you've found the start of a character.
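The same property lets you scan backward as well as forward. A minimal sketch (the helper name find_char_start is ours): starting from an arbitrary byte offset, back up past continuation bytes until you hit a lead byte.

```python
def find_char_start(data: bytes, pos: int) -> int:
    """From an arbitrary offset, back up to the start of the current character."""
    while pos > 0 and (data[pos] & 0b11000000) == 0b10000000:
        pos -= 1          # 10xxxxxx is a continuation byte: keep scanning back
    return pos

data = "a☃b".encode("utf-8")           # b'a\xe2\x98\x83b'; the snowman spans bytes 1..3
assert find_char_start(data, 2) == 1   # dropped mid-snowman, we recover its lead byte
assert find_char_start(data, 4) == 4   # 'b' is already a lead (ASCII) byte
```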

Walking Through Examples

Example 1: ASCII letter "A" (U+0041)

Code point U+0041 = decimal 65 = binary 0100 0001.

It falls in the range U+0000–U+007F, so it uses one byte:

Binary:  0 1000001
Byte:    0x41

This is identical to ASCII. A pure ASCII file is, byte-for-byte, valid UTF-8.
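In Python, mirroring the snippets in the later examples:

```python
# 'A' encodes to the same single byte as in ASCII.
print("A".encode("utf-8"))        # b'A'
print(len("A".encode("utf-8")))   # 1
```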

Example 2: "é" — LATIN SMALL LETTER E WITH ACUTE (U+00E9)

Code point U+00E9 = decimal 233 = binary 1110 1001.

It falls in U+0080–U+07FF, so two bytes are needed. The bit pattern is 110xxxxx 10xxxxxx, which leaves 11 x bits for the 8-bit code point value, zero-padded on the left:

Code point: 000 11101001   (11 bits, zero-padded)
Split:      00011  101001
Template:   110xxxxx  10xxxxxx
Filled:     11000011  10101001
Hex:        0xC3      0xA9

In Python:

>>> "é".encode("utf-8")
b'\xc3\xa9'
>>> len("é".encode("utf-8"))
2

Example 3: "☃" — SNOWMAN (U+2603)

Code point U+2603 = decimal 9731 = binary 0010 0110 0000 0011.

It falls in U+0800–U+FFFF, so three bytes are needed. Template: 1110xxxx 10xxxxxx 10xxxxxx (16 available x bits):

Code point: 0010 011000 000011   (16 bits)
Split:      0010  011000  000011
Template:   1110xxxx  10xxxxxx  10xxxxxx
Filled:     11100010  10011000  10000011
Hex:        0xE2      0x98      0x83

In Python:

>>> "☃".encode("utf-8")
b'\xe2\x98\x83'
>>> len("☃".encode("utf-8"))
3

Example 4: "😀" — GRINNING FACE (U+1F600)

Code point U+1F600 = decimal 128512 = binary 0001 1111 0110 0000 0000.

It falls in U+10000–U+10FFFF, so four bytes are needed. Template: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (21 available x bits):

Code point: 000 011111 011000 000000  (21 bits)
Split:      000  011111  011000  000000
Template:   11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
Filled:     11110000  10011111  10011000  10000000
Hex:        0xF0      0x9F      0x98      0x80

In Python:

>>> "😀".encode("utf-8")
b'\xf0\x9f\x98\x80'
>>> len("😀".encode("utf-8"))
4

ASCII Compatibility

The single most important design decision in UTF-8 is its backwards compatibility with ASCII. Every byte in the range 0x00–0x7F represents the same character in both ASCII and UTF-8. This means:

  • Any valid ASCII file is simultaneously valid UTF-8.
  • Any tool or protocol that processes ASCII text (HTTP headers, JSON keys, HTML tags, most source code) works transparently with UTF-8 without modification.
  • No null bytes appear except for the actual U+0000 NULL character — unlike UTF-16 which inserts null bytes for ASCII characters, breaking C string functions.

This compatibility is why UTF-8 was able to displace older encodings without requiring all software to be rewritten simultaneously.
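The superset relationship is easy to verify directly, for every 7-bit byte value at once:

```python
ascii_bytes = bytes(range(128))            # every 7-bit ASCII byte value
# Decoding the same bytes as ASCII and as UTF-8 yields identical text.
assert ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")
# And ASCII-range text round-trips to the same bytes under both encodings.
assert "Hello".encode("utf-8") == "Hello".encode("ascii")
```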

The Byte Order Mark (BOM)

UTF-8 has no endianness — bytes are always in the same order — so strictly speaking, a BOM is unnecessary. However, some software (particularly on Windows, and especially older versions of Excel and Visual Studio) prepends the three-byte sequence EF BB BF to UTF-8 files as a "UTF-8 BOM" or "signature".

This is generally harmful and should be avoided:

# Writing a file WITHOUT BOM (preferred)
with open("file.txt", "w", encoding="utf-8") as f:
    f.write("Hello")

# Writing WITH BOM (avoid unless specifically required)
with open("file.txt", "w", encoding="utf-8-sig") as f:
    f.write("Hello")

# Detecting a BOM when reading
with open("file.txt", "rb") as f:
    raw = f.read(3)
    has_bom = raw.startswith(b"\xef\xbb\xbf")

The BOM causes problems in shell scripts (#!/usr/bin/env python fails if there's a BOM before it), in HTTP headers (they become invalid), and in concatenated files (only the first file's BOM is correct; subsequent BOMs appear as the character U+FEFF, ZERO WIDTH NO-BREAK SPACE, in the middle of content).
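The concatenation problem is easy to reproduce with Python's utf-8-sig codec, which prepends a BOM on encode but strips one only at the very start on decode:

```python
part1 = "Hello ".encode("utf-8-sig")   # b'\xef\xbb\xbfHello '
part2 = "World".encode("utf-8-sig")    # b'\xef\xbb\xbfWorld'
combined = (part1 + part2).decode("utf-8-sig")
# Only the leading BOM is stripped; the second survives as U+FEFF mid-string.
assert combined == "Hello \ufeffWorld"
```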

Why UTF-8 Won the Web

UTF-8 beat UTF-16 and UTF-32 for web use because of a combination of technical and practical advantages:

  1. Space efficiency for ASCII-heavy text: HTML, JSON, XML, and most source code are predominantly ASCII. UTF-8 encodes these at 1 byte/character; UTF-16 would double the size.

  2. No null bytes: HTTP, many databases, and C string functions treat 0x00 as a string terminator. UTF-16 encodes ASCII characters with null bytes (e.g., 'A' = 0x00 0x41), breaking all of these. UTF-8 never produces a null byte except for the actual null character.

  3. No endianness issues: UTF-16 and UTF-32 require a byte order mark or agreement on byte order. UTF-8 has none of this complexity.

  4. ASCII compatibility: Existing ASCII tools, parsers, and data could be adopted incrementally.

  5. Self-synchronization: Error recovery is straightforward — continuation bytes are distinguishable from lead bytes.
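Point 2 can be checked in one line (utf-16-be is big-endian UTF-16 without a BOM):

```python
print("A".encode("utf-16-be"))   # b'\x00A': a null byte precedes every ASCII letter
print("A".encode("utf-8"))       # b'A': no null bytes at all
```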

The W3C's "Character Model for the World Wide Web" recommends Unicode encodings for all new protocols and formats, and HTML5 (a W3C Recommendation since 2014) made UTF-8 the default document encoding. Today UTF-8 is also mandated for JSON exchanged between systems (RFC 8259) and is the default in most modern programming languages.

Decoding Errors

Not every byte sequence is valid UTF-8. When a decoder encounters an invalid sequence, it has several options controlled by an errors parameter:

bad_bytes = b"Hello \xff World"   # 0xFF is never valid in UTF-8

# strict (default): raises UnicodeDecodeError
try:
    bad_bytes.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xff in position 6

# replace: substitutes U+FFFD REPLACEMENT CHARACTER (often shown as �)
print(bad_bytes.decode("utf-8", errors="replace"))  # Hello � World

# ignore: silently drops invalid bytes
print(bad_bytes.decode("utf-8", errors="ignore"))   # Hello  World

# backslashreplace: uses \xNN escape sequences
print(bad_bytes.decode("utf-8", errors="backslashreplace"))  # Hello \xff World

For security-sensitive code, always use strict mode and handle the error explicitly — silent replacement or ignoring can mask data corruption or injection attacks.
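One possible shape for that explicit handling, sketched below (the name decode_or_reject and the choice to raise ValueError are ours, not a standard API):

```python
def decode_or_reject(data: bytes) -> str:
    """Strict decoding: reject malformed input instead of silently mangling it."""
    try:
        return data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError as exc:
        # Surface the offending offset so callers can log or reject the input.
        raise ValueError(f"malformed UTF-8 at byte offset {exc.start}") from exc
```

Whether to raise, log, or drop the request is an application decision; the point is that the failure is made visible rather than papered over.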

Key Takeaways

  • UTF-8 encodes each Unicode code point as 1–4 bytes depending on the code point value.
  • Code points U+0000–U+007F encode identically to ASCII — UTF-8 is a strict superset of ASCII.
  • Lead bytes start with 0, 110, 1110, or 11110; continuation bytes start with 10.
  • This design makes UTF-8 self-synchronizing — you can always find the start of a character.
  • The BOM (EF BB BF) is unnecessary in UTF-8 and should be avoided.
  • UTF-8 won the web because of ASCII compatibility, no null bytes, and space efficiency for Latin-script content.
  • Always handle decode errors explicitly in production code — use strict mode and catch UnicodeDecodeError.
