UTF-8 Encoding Explained
UTF-8 is the dominant text encoding on the internet — as of 2024, over 98% of web pages declare it — yet most developers treat it as a black box. You know to use it, you know what it is called, but do you know why a simple ASCII letter takes one byte while a snowman (☃) takes three? This guide opens the box and shows you exactly how UTF-8 works, byte by byte.
What UTF-8 Actually Is
UTF-8 is a variable-width character encoding capable of representing every Unicode code point (U+0000 through U+10FFFF). It encodes each code point using between one and four bytes, choosing the minimum number of bytes needed for each character.
It was designed by Ken Thompson and Rob Pike on a restaurant placemat in September 1992, and first deployed in Plan 9 from Bell Labs. Its design goals were clear:
- Be backwards-compatible with 7-bit ASCII.
- Use as few bytes as possible for common characters.
- Be self-synchronizing — you can find the start of any character without scanning from the beginning of the string.
- Be unambiguous — no byte sequence is a prefix of another valid sequence.
The Byte Structure
The encoding scheme assigns code points to byte ranges as follows:
| Code Point Range | Byte Count | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|
| U+0000–U+007F | 1 byte | 0xxxxxxx | — | — | — |
| U+0080–U+07FF | 2 bytes | 110xxxxx | 10xxxxxx | — | — |
| U+0800–U+FFFF | 3 bytes | 1110xxxx | 10xxxxxx | 10xxxxxx | — |
| U+10000–U+10FFFF | 4 bytes | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
The x bits are filled with the binary representation of the code point value.
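This fill-in procedure can be sketched as a minimal Python encoder for a single code point. This is illustrative only: unlike a production encoder, it does not reject the surrogate range U+D800–U+DFFF, which is invalid in UTF-8.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point per the table above (simplified:
    surrogates U+D800-U+DFFF are not rejected)."""
    if cp <= 0x7F:
        return bytes([cp])                            # 0xxxxxxx
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6),               # 110xxxxx
                      0x80 | (cp & 0x3F)])            # 10xxxxxx
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),              # 1110xxxx
                      0x80 | ((cp >> 6) & 0x3F),      # 10xxxxxx
                      0x80 | (cp & 0x3F)])            # 10xxxxxx
    if cp <= 0x10FFFF:
        return bytes([0xF0 | (cp >> 18),              # 11110xxx
                      0x80 | ((cp >> 12) & 0x3F),     # 10xxxxxx
                      0x80 | ((cp >> 6) & 0x3F),      # 10xxxxxx
                      0x80 | (cp & 0x3F)])            # 10xxxxxx
    raise ValueError("code point out of range")

# Agrees with Python's built-in encoder:
assert utf8_encode(0x41) == "A".encode("utf-8")
assert utf8_encode(0x2603) == "☃".encode("utf-8")
```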
The leading byte tells you — and any parser — exactly how many bytes to read:
- Starts with 0: single-byte (ASCII range).
- Starts with 110: two-byte sequence.
- Starts with 1110: three-byte sequence.
- Starts with 11110: four-byte sequence.
- Starts with 10: continuation byte (not a valid start; part of a multi-byte sequence).
This is the "self-synchronizing" property. If you're dropped into the middle of a UTF-8 stream,
you can skip forward until you see a byte that doesn't start with 10, and you've found the
start of a character.
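The resynchronization step is only a few lines of Python. The function name here is illustrative, not a standard-library API; it backs up from an arbitrary byte offset to the start of the character containing it.

```python
def char_start(buf: bytes, i: int) -> int:
    """Back up from byte offset i to the start of the UTF-8
    character containing that offset."""
    # 10xxxxxx (top two bits == 10) marks a continuation byte.
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i

data = "a☃b".encode("utf-8")   # bytes: 61 E2 98 83 62
# Offset 2 lands mid-snowman (0x98); its character starts at offset 1.
assert char_start(data, 2) == 1
# Offset 4 is the ASCII 'b', already a character start.
assert char_start(data, 4) == 4
```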
Walking Through Examples
Example 1: ASCII letter "A" (U+0041)
Code point U+0041 = decimal 65 = binary 0100 0001.
It falls in the range U+0000–U+007F, so it uses one byte:
Binary: 0 1000001
Byte: 0x41
This is identical to ASCII. A pure ASCII file is, byte-for-byte, valid UTF-8.
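Python confirms the identity directly:

```python
# "A" encodes to the single byte 0x41 under both encodings.
encoded = "A".encode("utf-8")
print(encoded)                          # b'A'
print(encoded == "A".encode("ascii"))   # True
print(encoded[0] == 0x41)               # True
```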
Example 2: "é" — LATIN SMALL LETTER E WITH ACUTE (U+00E9)
Code point U+00E9 = decimal 233 = binary 1110 1001.
It falls in U+0080–U+07FF, so two bytes are needed. The bit pattern is 110xxxxx 10xxxxxx.
With 11 available x bits for the 8-bit code point value:
Code point: 000 11101001 (11 bits, zero-padded)
Split: 00011 101001
Template: 110xxxxx 10xxxxxx
Filled: 11000011 10101001
Hex: 0xC3 0xA9
In Python:
>>> "é".encode("utf-8")
b'\xc3\xa9'
>>> len("é".encode("utf-8"))
2
Example 3: "☃" — SNOWMAN (U+2603)
Code point U+2603 = decimal 9731 = binary 0010 0110 0000 0011.
It falls in U+0800–U+FFFF, so three bytes are needed. Template: 1110xxxx 10xxxxxx 10xxxxxx
(16 available x bits):
Code point: 0010 011000 000011 (16 bits)
Split: 0010 011000 000011
Template: 1110xxxx 10xxxxxx 10xxxxxx
Filled: 11100010 10011000 10000011
Hex: 0xE2 0x98 0x83
In Python:
>>> "☃".encode("utf-8")
b'\xe2\x98\x83'
>>> len("☃".encode("utf-8"))
3
Example 4: "😀" — GRINNING FACE (U+1F600)
Code point U+1F600 = decimal 128512 = binary 0001 1111 0110 0000 0000.
It falls in U+10000–U+10FFFF, so four bytes are needed. Template: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
(21 available x bits):
Code point: 000 011111 011000 000000 (21 bits)
Split: 000 011111 011000 000000
Template: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Filled: 11110000 10011111 10011000 10000000
Hex: 0xF0 0x9F 0x98 0x80
In Python:
>>> "😀".encode("utf-8")
b'\xf0\x9f\x98\x80'
>>> len("😀".encode("utf-8"))
4
ASCII Compatibility
The single most important design decision in UTF-8 is its backwards compatibility with ASCII. Every byte in the range 0x00–0x7F represents the same character in both ASCII and UTF-8. This means:
- Any valid ASCII file is simultaneously valid UTF-8.
- Any tool or protocol that processes ASCII text (HTTP headers, JSON keys, HTML tags, most source code) works transparently with UTF-8 without modification.
- No null bytes appear except for the actual U+0000 NULL character — unlike UTF-16 which inserts null bytes for ASCII characters, breaking C string functions.
This compatibility is why UTF-8 was able to displace older encodings without requiring all software to be rewritten simultaneously.
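A quick check of this equivalence, using nothing beyond the standard codecs:

```python
ascii_bytes = b"GET /index.html HTTP/1.1"
# The same bytes decode identically under both encodings:
assert ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")
# ASCII text round-trips through UTF-8 byte-for-byte unchanged:
assert "plain ascii".encode("utf-8") == b"plain ascii"
# And no byte of an ASCII string's UTF-8 encoding is a null:
assert 0x00 not in "plain ascii".encode("utf-8")
```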
The Byte Order Mark (BOM)
UTF-8 has no endianness — bytes are always in the same order — so strictly speaking, a BOM is
unnecessary. However, some software (particularly on Windows, and especially older versions of
Excel and Visual Studio) prepends the three-byte sequence EF BB BF to UTF-8 files as a "UTF-8
BOM" or "signature".
This is generally harmful and should be avoided:
# Writing a file WITHOUT BOM (preferred)
with open("file.txt", "w", encoding="utf-8") as f:
    f.write("Hello")

# Writing WITH BOM (avoid unless specifically required)
with open("file.txt", "w", encoding="utf-8-sig") as f:
    f.write("Hello")

# Detecting a BOM when reading
with open("file.txt", "rb") as f:
    raw = f.read(3)
has_bom = raw.startswith(b"\xef\xbb\xbf")
The BOM causes problems in shell scripts (#!/usr/bin/env python fails if there's a BOM before
it), in HTTP headers (they become invalid), and in concatenated files (only the first file's BOM
is correct; subsequent BOMs appear as the character U+FEFF, ZERO WIDTH NO-BREAK SPACE, in the
middle of content).
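On the read side, Python's utf-8-sig codec strips a leading BOM transparently, while plain utf-8 surfaces it as a U+FEFF character:

```python
bom_bytes = b"\xef\xbb\xbfHello"
# Plain utf-8 keeps the BOM as U+FEFF at the start of the string:
assert bom_bytes.decode("utf-8") == "\ufeffHello"
# utf-8-sig strips a leading BOM if one is present...
assert bom_bytes.decode("utf-8-sig") == "Hello"
# ...and is harmless when there is no BOM:
assert b"Hello".decode("utf-8-sig") == "Hello"
```

Reading input with utf-8-sig is a pragmatic default when files may come from BOM-inserting tools.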
Why UTF-8 Won the Web
UTF-8 beat UTF-16 and UTF-32 for web use because of a combination of technical and practical advantages:
- Space efficiency for ASCII-heavy text: HTML, JSON, XML, and most source code are predominantly ASCII. UTF-8 encodes these at 1 byte/character; UTF-16 would double the size.
- No null bytes: HTTP, many databases, and C string functions treat 0x00 as a string terminator. UTF-16 encodes ASCII characters with null bytes (e.g., 'A' = 0x00 0x41), breaking all of these. UTF-8 never produces a null byte except for the actual null character.
- No endianness issues: UTF-16 and UTF-32 require a byte order mark or agreement on byte order. UTF-8 has none of this complexity.
- ASCII compatibility: Existing ASCII tools, parsers, and data kept working, so UTF-8 could be adopted incrementally.
- Self-synchronization: Error recovery is straightforward — continuation bytes are distinguishable from lead bytes.
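The size and null-byte points are easy to verify. The little-endian codec variants are used here so Python does not prepend a BOM to the output:

```python
text = "Hello, world!"  # pure ASCII, 13 characters
utf8_len = len(text.encode("utf-8"))       # 13: one byte per character
utf16_len = len(text.encode("utf-16-le"))  # 26: two bytes per character
utf32_len = len(text.encode("utf-32-le"))  # 52: four bytes per character
print(utf8_len, utf16_len, utf32_len)      # 13 26 52

# UTF-16 interleaves null bytes into ASCII text:
print(text.encode("utf-16-le")[:6])        # b'H\x00e\x00l\x00'
```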
The W3C recommends UTF-8 as the character encoding for all content and new protocols. Today it is the default encoding for HTML5, required for JSON exchanged between systems (RFC 8259), and assumed by most modern programming languages.
Decoding Errors
Not every byte sequence is valid UTF-8. When a decoder encounters an invalid sequence, it has
several options controlled by an errors parameter:
bad_bytes = b"Hello \xff World"  # 0xFF is never valid in UTF-8

# strict (default): raises UnicodeDecodeError
try:
    bad_bytes.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xff in position 6: invalid start byte

# replace: substitutes U+FFFD REPLACEMENT CHARACTER (often shown as �)
print(bad_bytes.decode("utf-8", errors="replace"))  # Hello � World

# ignore: silently drops invalid bytes
print(bad_bytes.decode("utf-8", errors="ignore"))  # Hello  World (note the doubled space)

# backslashreplace: uses \xNN escape sequences
print(bad_bytes.decode("utf-8", errors="backslashreplace"))  # Hello \xff World
For security-sensitive code, always use strict mode and handle the error explicitly — silent
replacement or ignoring can mask data corruption or injection attacks.
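One defensive pattern is a thin wrapper that decodes strictly and converts failures into an application-level error. The function name and error message here are illustrative, not a standard API:

```python
def decode_strict(raw: bytes) -> str:
    """Decode UTF-8 strictly; surface failures instead of masking them."""
    try:
        return raw.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError as exc:
        # Include the offending offset so corrupt input is easy to locate.
        raise ValueError(f"invalid UTF-8 at byte {exc.start}") from exc

assert decode_strict(b"Hello") == "Hello"
try:
    decode_strict(b"Hello \xff World")
except ValueError as e:
    print(e)  # invalid UTF-8 at byte 6
```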
Key Takeaways
- UTF-8 encodes each Unicode code point as 1–4 bytes depending on the code point value.
- Code points U+0000–U+007F encode identically to ASCII — UTF-8 is a strict superset of ASCII.
- Lead bytes start with 0, 110, 1110, or 11110; continuation bytes start with 10.
- This design makes UTF-8 self-synchronizing — you can always find the start of a character.
- The BOM (EF BB BF) is unnecessary in UTF-8 and should be avoided.
- UTF-8 won the web because of ASCII compatibility, no null bytes, and space efficiency for Latin-script content.
- Always handle decode errors explicitly in production code — use strict mode and catch UnicodeDecodeError.