UTF-8 vs UTF-16 vs UTF-32: When to Use Each
UTF-8, UTF-16, and UTF-32 are three encodings of Unicode, each with different trade-offs in size, speed, and compatibility. Learn the key differences and when to choose each encoding for your application.
Unicode defines what characters exist and what numbers (code points) they have. But storing those numbers as bytes requires an encoding form. The Unicode Standard defines three: UTF-8, UTF-16, and UTF-32. Each makes different trade-offs between space efficiency, simplicity, and compatibility. Choosing the wrong one can waste memory, break APIs, or introduce subtle security bugs. This guide explains how each encoding works and when to reach for each one.
The Core Difference at a Glance
| Property | UTF-8 | UTF-16 | UTF-32 |
|---|---|---|---|
| Unit size | 8 bits (1 byte) | 16 bits (2 bytes) | 32 bits (4 bytes) |
| Code units per character | 1–4 | 1–2 | Always 1 |
| BMP characters | 1–3 bytes | 2 bytes | 4 bytes |
| Supplementary (emoji, rare) | 4 bytes | 4 bytes (surrogate pair) | 4 bytes |
| ASCII size | 1 byte | 2 bytes | 4 bytes |
| ASCII compatible | Yes | No | No |
| Endianness issues | None | Yes (LE/BE) | Yes (LE/BE) |
| Null bytes for ASCII | None | Yes | Yes |
| Self-synchronizing | Yes | Partial | Yes |
| BOM required | No | Recommended | Recommended |
UTF-8: Variable-Width, ASCII-Compatible
UTF-8 uses 1 to 4 bytes per code point. Code points U+0000–U+007F (ASCII) use exactly 1 byte, making any ASCII file simultaneously valid UTF-8. Higher code points use 2, 3, or 4 bytes.
"A".encode("utf-8") # b'\x41' — 1 byte
"é".encode("utf-8") # b'\xc3\xa9' — 2 bytes
"☃".encode("utf-8") # b'\xe2\x98\x83' — 3 bytes
"😀".encode("utf-8") # b'\xf0\x9f\x98\x80' — 4 bytes
Best for: Files, HTTP, databases, APIs, source code, anything involving I/O or network transmission — which is most things.
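To make the 1-to-4-byte rule concrete, here is a minimal sketch of a hand-written encoder for a single code point. The function name `utf8_encode` is hypothetical and exists only for illustration; in real code, `str.encode("utf-8")` is the right tool.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 by hand (illustrative sketch only)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:      # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Cross-check the sketch against Python's built-in encoder.
for ch in "Aé☃😀":
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
```

The lead byte announces the sequence length, and every following byte starts with the bits 10, which is what lets a decoder find character boundaries from any position.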
UTF-16: Variable-Width, Surrogate Pairs
UTF-16 uses one or two 16-bit code units (2 or 4 bytes) per code point:
- Code points U+0000–U+FFFF (BMP): one 16-bit unit (2 bytes).
- Code points U+10000–U+10FFFF (supplementary): a surrogate pair of two 16-bit units (4 bytes total).
Surrogate pairs work by subtracting 0x10000 from the supplementary code point and splitting the resulting 20-bit offset into two 10-bit halves, encoding each in a reserved range:
- High surrogate: U+D800–U+DBFF
- Low surrogate: U+DC00–U+DFFF
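The split described above can be sketched in a few lines. The helper names `to_surrogates` and `from_surrogates` are hypothetical, chosen for this illustration.

```python
def to_surrogates(cp: int):
    """Split a supplementary code point into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points use surrogate pairs")
    offset = cp - 0x10000              # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1F600)     # 😀
print(hex(high), hex(low))             # 0xd83d 0xde00
assert from_surrogates(high, low) == 0x1F600
```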
"A".encode("utf-16-le") # b'\x41\x00' — 2 bytes
"é".encode("utf-16-le") # b'\xe9\x00' — 2 bytes
"😀".encode("utf-16-le") # b'\x3d\xd8\x00\xde' — 4 bytes (surrogate pair)
Endianness is a critical issue with UTF-16. The same string can be stored as UTF-16LE
(little-endian, most common on Windows) or UTF-16BE (big-endian). A Byte Order Mark
(U+FEFF, FF FE for LE, FE FF for BE) at the start of a file signals which variant is in use.
Without a BOM, the receiver must be told the byte order out-of-band.
Best for: Windows API calls, Java String, JavaScript engine internals, and file formats that
originated on Windows. Check each format rather than assuming: the XML inside a .docx file is UTF-8, even though the Win32 APIs are UTF-16.
The Surrogate Pair Gotcha in JavaScript
JavaScript strings are UTF-16. Characters outside the BMP consume two code units, which means
.length counts code units, not characters:
const snowman = "☃"; // U+2603 — BMP
const emoji = "😀"; // U+1F600 — supplementary
console.log(snowman.length); // 1 — one code unit
console.log(emoji.length); // 2 — surrogate pair
// Correct iteration: use spread or for...of (iterates by code point)
console.log([...emoji].length); // 1
for (const char of emoji) {
  console.log(char); // "😀" as a single character
}
// String.prototype.codePointAt vs charCodeAt
console.log(emoji.charCodeAt(0)); // 55357 — high surrogate (wrong)
console.log(emoji.codePointAt(0)); // 128512 — U+1F600 (correct)
This is a frequent source of bugs in Node.js applications that process user-supplied text containing emoji.
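Code in other languages sometimes has to agree with JavaScript's count, for example when a backend enforces the same length limit as a browser form. A minimal Python sketch, assuming the input contains no lone surrogates (those cannot be encoded and would raise an error); `utf16_length` is a hypothetical helper name:

```python
def utf16_length(text: str) -> int:
    """Number of UTF-16 code units, i.e. what JavaScript's .length reports."""
    return len(text.encode("utf-16-le")) // 2


print(len("☃"), utf16_length("☃"))     # 1 1
print(len("😀"), utf16_length("😀"))   # 1 2
```

Python's `len()` counts code points, so the two numbers diverge exactly when the string contains supplementary characters.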
UTF-32: Fixed-Width, Simple but Bulky
UTF-32 uses exactly 4 bytes for every code point, regardless of value. This makes it the only
one of the three where random-access indexing by code point is O(1): the code point at position n is always at byte
offset n × 4.
"A".encode("utf-32-le") # b'\x41\x00\x00\x00' — 4 bytes
"é".encode("utf-32-le") # b'\xe9\x00\x00\x00' — 4 bytes
"😀".encode("utf-32-le") # b'\x00\xf6\x01\x00' — 4 bytes
The trade-off is space: an ASCII string in UTF-32 consumes 4× the memory compared to UTF-8. A 1 MB English text file becomes a 4 MB UTF-32 file.
Like UTF-16, UTF-32 has byte order variants (UTF-32LE, UTF-32BE) and typically uses a BOM.
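The constant-time indexing described above can be sketched as a direct slice into the buffer. This assumes a BOM-less UTF-32LE byte string; `utf32_char_at` is a hypothetical name used only here.

```python
def utf32_char_at(buf: bytes, n: int) -> str:
    """Return the code point at index n of a BOM-less UTF-32LE buffer."""
    start = n * 4
    if n < 0 or start + 4 > len(buf):
        raise IndexError("code point index out of range")
    # Each code point is one 32-bit little-endian integer.
    return chr(int.from_bytes(buf[start:start + 4], "little"))


buf = "A☃😀".encode("utf-32-le")
print(len(buf))               # 12
print(utf32_char_at(buf, 2))  # 😀
```

No scanning is needed: the offset is pure arithmetic, which is the whole appeal of the format.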
Best for: Internal string processing in C/C++ code that needs O(1) character indexing and can
afford the memory cost. Python's str object uses a variable internal representation
(Latin-1, UCS-2, or UCS-4 per PEP 393) based on the highest code point in the string, which is
effectively a per-string optimized version of UTF-32.
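The selection rule can be modelled in a few lines. This is only a sketch of the rule as described above, not CPython's actual implementation, and `storage_width` is a hypothetical name.

```python
def storage_width(text: str) -> int:
    """Bytes per code point a PEP 393-style layout would choose (1, 2, or 4)."""
    highest = max(map(ord, text), default=0)
    if highest <= 0xFF:      # fits in Latin-1
        return 1
    if highest <= 0xFFFF:    # fits in UCS-2 (BMP only)
        return 2
    return 4                 # needs UCS-4


print(storage_width("hello"))    # 1
print(storage_width("世界"))      # 2
print(storage_width("hi 😀"))    # 4
```

A single supplementary character is enough to move the whole string to 4 bytes per code point, which is worth knowing when holding many large strings in memory.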
Side-by-Side Encoding of "Hello, 世界!"
text = "Hello, 世界!"
for enc in ["utf-8", "utf-16-le", "utf-32-le"]:
    b = text.encode(enc)
    print(f"{enc:12s}: {len(b):3d} bytes {b.hex()}")
Output:
utf-8       :  14 bytes 48656c6c6f2c20e4b896e7958c21
utf-16-le   :  20 bytes 480065006c006c006f002c002000164e4c752100
utf-32-le   :  40 bytes 48000000650000006c000000...
The two CJK characters (U+4E16 世, U+754C 界) are 3 bytes each in UTF-8 but only 2 bytes each in UTF-16 — because they're BMP characters. UTF-8 is more efficient for ASCII; UTF-16 is more efficient for CJK in the BMP.
Performance Trade-offs
String Length
- UTF-32: O(1). Length in code points = byte length / 4.
- UTF-8 / UTF-16: O(n). Must scan to count code points (bytes/units don't map 1:1).
Most languages store a string's length alongside its data (usually counted in code units), so this matters mainly when constructing strings from raw bytes or when you need an exact code point count.
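For UTF-8, the O(n) scan is simple because continuation bytes are recognizable on their own. A minimal sketch, assuming the input is already well-formed UTF-8 (it does no validation); `utf8_code_point_count` is a hypothetical name.

```python
def utf8_code_point_count(data: bytes) -> int:
    """Count code points in well-formed UTF-8 by skipping continuation bytes."""
    # Continuation bytes look like 10xxxxxx; every other byte starts a code point.
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


data = "Hello, 世界!".encode("utf-8")
print(len(data))                     # 14 bytes
print(utf8_code_point_count(data))   # 10 code points
```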
Random Access by Code Point Index
- UTF-32: O(1). Byte offset = index × 4.
- UTF-8 / UTF-16: O(n). Must scan from the start (or from a cached checkpoint).
This is why text editors that do a lot of cursor movement sometimes maintain a gap buffer or piece table in UTF-32 internally, even while storing files as UTF-8.
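To see where the O(n) cost comes from, here is a sketch of translating a code point index into a byte offset in UTF-8. It assumes well-formed input and does no caching; `utf8_offset` is a hypothetical name.

```python
def utf8_offset(data: bytes, index: int) -> int:
    """Byte offset at which code point number `index` starts (well-formed UTF-8)."""
    seen = 0
    for offset, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:   # not a continuation byte: a code point starts here
            if seen == index:
                return offset
            seen += 1
    raise IndexError("code point index out of range")


data = "Hello, 世界!".encode("utf-8")
print(utf8_offset(data, 7))   # 7  (世 starts right after the 7 ASCII bytes)
print(utf8_offset(data, 8))   # 10 (界 starts 3 bytes later)
```

A cached checkpoint, as mentioned above, would simply let the loop start from a known (index, offset) pair instead of from zero.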
Comparison and Search
All three encodings support efficient byte-level search for fixed patterns (e.g., Boyer–Moore), but UTF-16 and UTF-32 require care: a byte-level match must begin on a code unit boundary, or it may straddle two adjacent code units. Well-formed UTF-8 is immune to false matches between byte sequences representing different characters, because lead bytes and continuation bytes occupy disjoint ranges, so the encoding of one character can never begin in the middle of another.
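A small experiment illustrates the difference. The haystack below contains no letter "A", yet a naive byte search over its UTF-16LE form reports a hit at an odd offset. The helper `find_aligned` is a hypothetical fix-up that only accepts matches on a code unit boundary.

```python
def find_aligned(data: bytes, pattern: bytes, unit: int = 2) -> int:
    """Like bytes.find, but only accept matches that start on a code unit boundary."""
    pos = data.find(pattern)
    while pos != -1 and pos % unit != 0:
        pos = data.find(pattern, pos + 1)
    return pos


haystack = "\u4100\u0100"   # 䄀Ā: two characters, neither of them "A"
needle = "A"

h16, n16 = haystack.encode("utf-16-le"), needle.encode("utf-16-le")
print(h16.find(n16))            # 1: false match straddling two code units
print(find_aligned(h16, n16))   # -1: no real match

h8, n8 = haystack.encode("utf-8"), needle.encode("utf-8")
print(h8.find(n8))              # -1: UTF-8 needs no alignment check
```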
When to Use Each
| Scenario | Recommended | Reason |
|---|---|---|
| Web (HTTP, JSON, HTML) | UTF-8 | ASCII compat, no null bytes, universal support |
| Files on disk | UTF-8 | Compact, readable by all modern tools |
| Databases (PostgreSQL, MySQL) | UTF-8 | UTF8 / utf8mb4 are standard |
| Windows API calls | UTF-16 | Win32 uses WCHAR (UTF-16LE) natively |
| Java String internal | UTF-16 | JVM internal representation |
| JavaScript string processing | UTF-16 aware | V8/SpiderMonkey are UTF-16 internally |
| C/C++ text processing kernel | UTF-32 | O(1) indexing, simplicity |
| Network protocols | UTF-8 | Null safety, ASCII compat, no endian ambiguity |
| Python 3 str | Automatic | CPython auto-selects Latin-1/UCS-2/UCS-4 per string |
Endianness in Practice
UTF-16 and UTF-32 are affected by the CPU's byte order. Intel x86/x64 processors are little-endian (LE); many network protocols use big-endian (BE, a.k.a. "network byte order").
The BOM (U+FEFF) at the start of a file tells the reader which byte order to expect:
UTF-16 LE BOM: FF FE
UTF-16 BE BOM: FE FF
UTF-32 LE BOM: FF FE 00 00
UTF-32 BE BOM: 00 00 FE FF
When exchanging UTF-16 data between systems with different CPU architectures, always include or agree on the byte order. In practice, UTF-16LE is the dominant variant on Windows; protocols often specify UTF-16BE.
text = "Hello"
# With explicit byte order (no BOM)
print(text.encode("utf-16-le").hex()) # 480065006c006c006f00
print(text.encode("utf-16-be").hex()) # 00480065006c006c006f
# With BOM (Python adds it when using "utf-16" without -le/-be, in the platform's byte order)
print(text.encode("utf-16").hex()) # fffe480065006c006c006f00 on a little-endian machine
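On the reading side, the BOM table above translates into a short lookup. This is a sketch rather than a complete detector: `sniff_bom` is a hypothetical name, and it covers only the four UTF-16/UTF-32 signatures listed in this section. The constants come from Python's standard `codecs` module.

```python
import codecs

# Check UTF-32 first: the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no known BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


raw = codecs.BOM_UTF16_BE + "Hello".encode("utf-16-be")
encoding, skip = sniff_bom(raw)
print(encoding, raw[skip:].decode(encoding))   # utf-16-be Hello
```

One caveat of this ordering: UTF-16LE data whose first character after the BOM is U+0000 is indistinguishable from a UTF-32LE BOM, so an out-of-band declaration remains the more reliable signal.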
Key Takeaways
- UTF-8 is the right default for nearly everything: files, APIs, databases, web. It's ASCII compatible, null-byte safe, and endianness-neutral.
- UTF-16 is primarily used in Windows APIs, Java, and JavaScript internals — you often can't avoid it in those contexts, but you shouldn't choose it for new data formats.
- UTF-32 offers O(1) character indexing at a 4× memory cost — a niche tool for internal text processing that requires constant-time access.
- Surrogate pairs in UTF-16 are a common source of bugs in JavaScript when strings contain characters outside the BMP (most emoji). Always use codePointAt() and for...of instead of charCodeAt() and index-based loops.
- When in doubt: use UTF-8.