UTF-8
每字符使用1至4字节的可变长度Unicode编码,是Web的主流编码(超过98%的网站),并与ASCII完全向后兼容。
What is UTF-8?
UTF-8 (Unicode Transformation Format — 8-bit) is a variable-length character encoding for Unicode. It represents each Unicode code point using 1 to 4 bytes, with a clever design that makes it fully backward-compatible with ASCII and self-synchronizing. As of 2024, over 98% of websites use UTF-8, making it the universal default for text on the internet.
UTF-8 was designed by Ken Thompson and Rob Pike in September 1992. The design goals were ambitious: encode all Unicode code points, maintain ASCII backward compatibility, be self-synchronizing (so you can determine character boundaries without reading from the start), and be space-efficient for Latin-script text.
How UTF-8 Works
The encoding uses a variable number of bytes based on the code point value:
| Code Point Range | Bytes | Bit Pattern |
|---|---|---|
| U+0000–U+007F | 1 | 0xxxxxxx |
| U+0080–U+07FF | 2 | 110xxxxx 10xxxxxx |
| U+0800–U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx |
| U+10000–U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
The leading bits of the first byte encode the byte count: 0 means 1 byte, 110 means 2 bytes, 1110 means 3 bytes, 11110 means 4 bytes. Continuation bytes always start with 10, making them immediately distinguishable from start bytes.
Example: encoding U+00E9 (é, Latin small letter e with acute)
U+00E9 = 0xE9 = 233, which falls in the U+0080–U+07FF range (2 bytes).
Binary of 0xE9: 11101001
Split into 5+6 bits: 00011 | 101001
Apply pattern 110xxxxx 10xxxxxx: 11000011 10101001 = 0xC3 0xA9
>>> 'é'.encode('utf-8')
b'\xc3\xa9'
>>> b'\xc3\xa9'.decode('utf-8')
'é'
Self-Synchronization
One of UTF-8's most important properties is that you can determine character boundaries without reading from the start of the stream. Any byte starting with 10xxxxxx is a continuation byte; any other byte begins a new character. If you're dropped into the middle of a UTF-8 stream, you can scan forward until you find a non-continuation byte and know you've found a character boundary.
This property also enables robust error recovery: if a byte is corrupted, the damage is local to that character, not propagated through the rest of the stream.
Code Examples
# Python 3 — strings are Unicode by default
text = 'Hello, 世界!'
encoded = text.encode('utf-8')
print(encoded)
# b'Hello, \xe4\xb8\x96\xe7\x95\x8c!'
# ASCII characters stay as single bytes
print(len('A'.encode('utf-8'))) # 1
print(len('é'.encode('utf-8'))) # 2
print(len('中'.encode('utf-8'))) # 3
print(len('𠀀'.encode('utf-8'))) # 4 (rare CJK extension)
# Reading files: always declare encoding
with open('data.txt', 'r', encoding='utf-8') as f:
content = f.read()
// Node.js: Buffer handles UTF-8 by default
const buf = Buffer.from('Hello, 世界!', 'utf-8');
console.log(buf.byteLength); // 14 (7 ASCII + 6 CJK bytes + 1 ! byte)
// TextEncoder/TextDecoder in browser and Node.js
const encoder = new TextEncoder(); // always UTF-8
const bytes = encoder.encode('é');
console.log(bytes); // Uint8Array [195, 169]
Quick Facts
| Property | Value |
|---|---|
| Full Name | Unicode Transformation Format — 8-bit |
| Designed by | Ken Thompson, Rob Pike (1992) |
| Bytes per character | 1–4 |
| ASCII compatible | Yes (U+0000–U+007F identical) |
| Web adoption | ~98% of websites (2024) |
| BOM | Optional (U+FEFF = EF BB BF), not recommended |
| Self-synchronizing | Yes |
| Standard | RFC 3629, Unicode Standard |
Common Pitfalls
Confusing byte length with character length. In Python 3, len('中文') returns 2 (characters), but len('中文'.encode('utf-8')) returns 6 (bytes). Always know whether you're counting characters or bytes.
Assuming one character = one code point. Some characters are made of multiple code points combined (e.g., a base letter + combining diacritical mark). len('é') can be 1 or 2 depending on whether it's NFC or NFD normalized. This is a Unicode normalization issue, not a UTF-8 issue per se.
Opening UTF-8 files without specifying encoding. On Windows, open('file.txt', 'r') defaults to the system code page (often Windows-1252). Always pass encoding='utf-8' explicitly.
The "UTF-8 with BOM" variant. Windows Notepad historically saved UTF-8 files with a 3-byte BOM (EF BB BF). Many parsers fail on this. Prefer UTF-8 without BOM for interchange.
相关术语
编码 中的更多内容
美国信息交换标准代码。7位编码,涵盖128个字符(0–127),包括控制字符、数字、拉丁字母和基本符号。
Visual art created from text characters, originally limited to the 95 printable …
Binary-to-text encoding that represents binary data using 64 ASCII characters (A–Z, a–z, …
主要在台湾和香港使用的繁体中文字符编码,收录约13,000个CJK字符。
扩展二进制编码十进制交换码。IBM大型机编码,字母范围不连续,至今仍用于银行和企业大型机。
基于KS X 1001的韩语字符编码,将韩文音节和汉字映射为双字节序列。
简体中文字符编码系列:GB2312(6,763字)经GBK演化为GB18030,成为与Unicode兼容的中国强制性国家标准。
由IANA维护的字符编码名称官方注册表,用于HTTP Content-Type头和MIME(如charset=utf-8)。
针对不同语言组的8位单字节编码系列,ISO 8859-1(Latin-1)是Unicode前256个码位的基础。
将单字节ASCII/JIS罗马字与双字节JIS X 0208汉字相结合的日语字符编码,仍在传统日语系统中使用。