UTF-8
A variable-length Unicode encoding that uses 1–4 bytes per character. It is the most widely used encoding on the web (over 98% of websites) and is fully backward-compatible with ASCII.
What is UTF-8?
UTF-8 (Unicode Transformation Format — 8-bit) is a variable-length character encoding for Unicode. It represents each Unicode code point using 1 to 4 bytes, with a clever design that makes it fully backward-compatible with ASCII and self-synchronizing. As of 2024, over 98% of websites use UTF-8, making it the universal default for text on the internet.
UTF-8 was designed by Ken Thompson and Rob Pike in September 1992. The design goals were ambitious: encode all Unicode code points, maintain ASCII backward compatibility, be self-synchronizing (so you can determine character boundaries without reading from the start), and be space-efficient for Latin-script text.
How UTF-8 Works
The encoding uses a variable number of bytes based on the code point value:
| Code Point Range | Bytes | Bit Pattern |
|---|---|---|
| U+0000–U+007F | 1 | 0xxxxxxx |
| U+0080–U+07FF | 2 | 110xxxxx 10xxxxxx |
| U+0800–U+FFFF (excluding surrogates U+D800–U+DFFF) | 3 | 1110xxxx 10xxxxxx 10xxxxxx |
| U+10000–U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
The leading bits of the first byte encode the byte count: 0 means 1 byte, 110 means 2 bytes, 1110 means 3 bytes, 11110 means 4 bytes. Continuation bytes always start with 10, making them immediately distinguishable from start bytes.
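The leading-bit rule can be sketched as two small helpers. The names `sequence_length` and `is_continuation` are illustrative, not part of any library, and the sketch only inspects bit patterns: it does not reject start bytes such as 0xC0, 0xC1, or 0xF5 and above, which match a pattern but never occur in valid UTF-8.

```python
def sequence_length(first_byte: int) -> int:
    """Return how many bytes the sequence starting with first_byte occupies.

    Only the leading bits are inspected. Raises ValueError when the byte
    matches none of the four start-byte patterns (for example, a
    continuation byte).
    """
    if (first_byte & 0b1000_0000) == 0b0000_0000:  # 0xxxxxxx
        return 1
    if (first_byte & 0b1110_0000) == 0b1100_0000:  # 110xxxxx
        return 2
    if (first_byte & 0b1111_0000) == 0b1110_0000:  # 1110xxxx
        return 3
    if (first_byte & 0b1111_1000) == 0b1111_0000:  # 11110xxx
        return 4
    raise ValueError(f"0x{first_byte:02X} does not match a start-byte pattern")


def is_continuation(byte: int) -> bool:
    """True when the byte has the form 10xxxxxx."""
    return (byte & 0b1100_0000) == 0b1000_0000


data = "aé中𠀀".encode("utf-8")
lengths = [sequence_length(b) for b in data if not is_continuation(b)]
print(lengths)  # [1, 2, 3, 4]
```

Skipping continuation bytes and reading only the start bytes is enough to recover the length of every character in the stream.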
Example: encoding U+00E9 (é, Latin small letter e with acute)
U+00E9 = 0xE9 = 233, which falls in the U+0080–U+07FF range (2 bytes).
Binary of 0xE9: 11101001
Split into 5+6 bits: 00011 | 101001
Apply pattern 110xxxxx 10xxxxxx: 11000011 10101001 = 0xC3 0xA9
>>> 'é'.encode('utf-8')
b'\xc3\xa9'
>>> b'\xc3\xa9'.decode('utf-8')
'é'
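The manual steps in the worked example generalize to all four ranges in the table. The following is a minimal sketch of an encoder for a single code point. `encode_code_point` is a hypothetical helper written for illustration, and in real code `str.encode('utf-8')` is the right tool.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 using the bit patterns above."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")

    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([
            0b1100_0000 | (cp >> 6),
            0b1000_0000 | (cp & 0b11_1111),
        ])
    if cp <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b1110_0000 | (cp >> 12),
            0b1000_0000 | ((cp >> 6) & 0b11_1111),
            0b1000_0000 | (cp & 0b11_1111),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0b1111_0000 | (cp >> 18),
        0b1000_0000 | ((cp >> 12) & 0b11_1111),
        0b1000_0000 | ((cp >> 6) & 0b11_1111),
        0b1000_0000 | (cp & 0b11_1111),
    ])


print(encode_code_point(0x00E9))  # b'\xc3\xa9'

# Cross-check against Python's built-in encoder for one character per range.
for ch in "Aé中𠀀":
    assert encode_code_point(ord(ch)) == ch.encode("utf-8")
```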
Self-Synchronization
One of UTF-8's most important properties is that you can determine character boundaries without reading from the start of the stream. Any byte starting with 10xxxxxx is a continuation byte; any other byte begins a new character. If you're dropped into the middle of a UTF-8 stream, you can scan forward until you find a non-continuation byte and know you've found a character boundary.
This property also enables robust error recovery: if a byte is corrupted, the damage is confined to the affected character and, at worst, its immediate neighbors, rather than propagating through the rest of the stream.
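Both ideas can be sketched in a few lines. `next_boundary` is an illustrative helper name, not a standard function. The second half corrupts one byte and relies on Python's `errors="replace"` handler, which substitutes U+FFFD for bytes it cannot decode.

```python
def next_boundary(data: bytes, pos: int) -> int:
    """Return the index of the first non-continuation byte at or after pos."""
    while pos < len(data) and (data[pos] & 0b1100_0000) == 0b1000_0000:
        pos += 1
    return pos


# Resynchronizing: start in the middle of the first character.
stream = "世界!".encode("utf-8")        # e4 b8 96 | e7 95 8c | 21
start = next_boundary(stream, 1)        # index 1 is a continuation byte
print(start)                            # 3
print(stream[start:].decode("utf-8"))   # 界!

# Local damage: overwrite the continuation byte of 'é' with ASCII 'A'.
damaged = bytearray("aé中b".encode("utf-8"))  # 61 | c3 a9 | e4 b8 ad | 62
damaged[2] = 0x41
recovered = bytes(damaged).decode("utf-8", errors="replace")
print(recovered.endswith("中b"))        # True: later characters are intact
```

Only the damaged character is lost. Everything after the next start byte decodes normally.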
Code Examples
# Python 3 — strings are Unicode by default
text = 'Hello, 世界!'
encoded = text.encode('utf-8')
print(encoded)
# b'Hello, \xe4\xb8\x96\xe7\x95\x8c!'
# ASCII characters stay as single bytes
print(len('A'.encode('utf-8'))) # 1
print(len('é'.encode('utf-8'))) # 2
print(len('中'.encode('utf-8'))) # 3
print(len('𠀀'.encode('utf-8'))) # 4 (rare CJK extension)
# Reading files: always declare encoding
with open('data.txt', 'r', encoding='utf-8') as f:
    content = f.read()
// Node.js: Buffer handles UTF-8 by default
const buf = Buffer.from('Hello, 世界!', 'utf-8');
console.log(buf.byteLength); // 14 (7 ASCII + 6 CJK bytes + 1 ! byte)
// TextEncoder/TextDecoder in browser and Node.js
const encoder = new TextEncoder(); // always UTF-8
const bytes = encoder.encode('é');
console.log(bytes); // Uint8Array [195, 169]
Quick Facts
| Property | Value |
|---|---|
| Full Name | Unicode Transformation Format — 8-bit |
| Designed by | Ken Thompson, Rob Pike (1992) |
| Bytes per character | 1–4 |
| ASCII compatible | Yes (U+0000–U+007F identical) |
| Web adoption | ~98% of websites (2024) |
| BOM | Optional (U+FEFF = EF BB BF), not recommended |
| Self-synchronizing | Yes |
| Standard | RFC 3629, Unicode Standard |
Common Pitfalls
Confusing byte length with character length. In Python 3, len('中文') returns 2 (characters), but len('中文'.encode('utf-8')) returns 6 (bytes). Always know whether you're counting characters or bytes.
Assuming one character = one code point. Some characters are made of multiple code points combined (e.g., a base letter + combining diacritical mark). len('é') can be 1 or 2 depending on whether it's NFC or NFD normalized. This is a Unicode normalization issue, not a UTF-8 issue per se.
Opening UTF-8 files without specifying encoding. In Python on Windows, open('file.txt', 'r') falls back to the locale's preferred encoding unless UTF-8 mode is enabled, and that encoding is often a legacy code page such as Windows-1252. Always pass encoding='utf-8' explicitly.
The "UTF-8 with BOM" variant. Windows Notepad historically saved UTF-8 files with a 3-byte BOM (EF BB BF). Many parsers fail on this. Prefer UTF-8 without BOM for interchange.
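The counting and BOM pitfalls above can be reproduced with the standard library alone. This is a small sketch with made-up sample strings. It uses `unicodedata.normalize` for the NFC/NFD comparison and Python's `utf-8-sig` codec, which strips a leading BOM if one is present.

```python
import unicodedata

# Characters (code points) versus bytes.
word = "中文"
print(len(word))                   # 2
print(len(word.encode("utf-8")))   # 6

# The same visible "é" as one code point (NFC) or two (NFD).
nfc = unicodedata.normalize("NFC", "e\u0301")
nfd = unicodedata.normalize("NFD", "\u00e9")
print(len(nfc), len(nfd))                                  # 1 2
print(len(nfc.encode("utf-8")), len(nfd.encode("utf-8")))  # 2 3

# A BOM survives a plain 'utf-8' decode as U+FEFF; 'utf-8-sig' removes it.
with_bom = b"\xef\xbb\xbfhello"
plain = with_bom.decode("utf-8")
stripped = with_bom.decode("utf-8-sig")
print(repr(plain))     # '\ufeffhello'
print(repr(stripped))  # 'hello'
```

Decoding with `utf-8-sig` is also safe for input that has no BOM, so it is a reasonable choice when reading files of unknown origin.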