
UTF-8 vs UTF-16 vs UTF-32: When to Use Each

UTF-8, UTF-16, and UTF-32 are three encodings of Unicode, each with different trade-offs in size, speed, and compatibility. Learn the key differences and when to choose each encoding for your application.


Unicode defines what characters exist and what numbers (code points) they have. But storing those numbers as bytes requires an encoding form. The Unicode Standard defines three: UTF-8, UTF-16, and UTF-32. Each makes different trade-offs between space efficiency, simplicity, and compatibility. Choosing the wrong one can waste memory, break APIs, or introduce subtle security bugs. This guide explains how each encoding works and when to reach for each one.

The Core Difference at a Glance

Property                     UTF-8            UTF-16                    UTF-32
Unit size                    8 bits (1 byte)  16 bits (2 bytes)         32 bits (4 bytes)
Code units per character     1–4              1–2                       Always 1
BMP characters               1–3 bytes        2 bytes                   4 bytes
Supplementary (emoji, rare)  4 bytes          4 bytes (surrogate pair)  4 bytes
ASCII size                   1 byte           2 bytes                   4 bytes
ASCII compatible             Yes              No                        No
Endianness issues            None             Yes (LE/BE)               Yes (LE/BE)
Null bytes for ASCII         None             Yes                       Yes
Self-synchronizing           Yes              Partial                   Yes
BOM required                 No               Recommended               Recommended

UTF-8: Variable-Width, ASCII-Compatible

UTF-8 uses 1 to 4 bytes per code point. Code points U+0000–U+007F (ASCII) use exactly 1 byte, making any ASCII file simultaneously valid UTF-8. Higher code points use 2, 3, or 4 bytes.

"A".encode("utf-8")        # b'\x41'         — 1 byte
"é".encode("utf-8")        # b'\xc3\xa9'    — 2 bytes
"☃".encode("utf-8")        # b'\xe2\x98\x83' — 3 bytes
"😀".encode("utf-8")       # b'\xf0\x9f\x98\x80' — 4 bytes

Best for: Files, HTTP, databases, APIs, source code, anything involving I/O or network transmission — which is most things.
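
The byte counts above fall out of UTF-8's bit layout: a lead byte encodes the sequence length in its high bits, and each continuation byte carries 6 payload bits in the form 10xxxxxx. A minimal hand-rolled encoder makes the layout concrete (a sketch for illustration; real code should just call str.encode):

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point to UTF-8 by hand (illustration only)."""
    if cp < 0x80:                     # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                    # 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:                  # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                  0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

assert utf8_encode(ord("😀")) == "😀".encode("utf-8")  # b'\xf0\x9f\x98\x80'
```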

UTF-16: Variable-Width with Surrogate Pairs

UTF-16 uses one or two 16-bit code units (2 or 4 bytes) per character:

  • Code points U+0000–U+FFFF (BMP): one 16-bit unit (2 bytes).
  • Code points U+10000–U+10FFFF (supplementary): surrogate pair — two 16-bit units (4 bytes total).

Surrogate pairs work by splitting the 20-bit offset of the supplementary character (its code point minus 0x10000) into two 10-bit halves, encoding each in a reserved range:

  • High surrogate: U+D800–U+DBFF
  • Low surrogate: U+DC00–U+DFFF
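
The split can be sketched in a few lines of Python, using U+1F600 as a worked example:

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into a high/low surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    offset = cp - 0x10000             # 20-bit offset
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F600)   # 😀
print(hex(high), hex(low))               # 0xd83d 0xde00
```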

"A".encode("utf-16-le")    # b'\x41\x00'          — 2 bytes
"é".encode("utf-16-le")    # b'\xe9\x00'          — 2 bytes
"😀".encode("utf-16-le")   # b'\x3d\xd8\x00\xde' — 4 bytes (surrogate pair)

Endianness is a critical issue with UTF-16. The same string can be stored as UTF-16LE (little-endian, most common on Windows) or UTF-16BE (big-endian). A Byte Order Mark (U+FEFF, FF FE for LE, FE FF for BE) at the start of a file signals which variant is in use. Without a BOM, the receiver must be told the byte order out-of-band.

Best for: Windows API calls, Java String, JavaScript engine internals, and interop with APIs and formats that originated on Windows. Note where the boundary sits: Win32 APIs take UTF-16LE WCHAR strings natively, even though a Windows-born format like .docx stores its internal XML as UTF-8.

The Surrogate Pair Gotcha in JavaScript

JavaScript strings are UTF-16. Characters outside the BMP consume two code units, which means .length counts code units, not characters:

const snowman = "☃";       // U+2603 — BMP
const emoji   = "😀";      // U+1F600 — supplementary

console.log(snowman.length);   // 1 — one code unit
console.log(emoji.length);     // 2 — surrogate pair

// Correct iteration: use spread or for...of (iterates by code point)
console.log([...emoji].length); // 1
for (const char of emoji) {
    console.log(char);  // "😀" as a single character
}

// String.prototype.codePointAt vs charCodeAt
console.log(emoji.charCodeAt(0));    // 55357  — high surrogate (wrong)
console.log(emoji.codePointAt(0));   // 128512 — U+1F600 (correct)

This is a frequent source of bugs in Node.js applications that process user-supplied text containing emoji.

UTF-32: Fixed-Width, Simple but Bulky

UTF-32 uses exactly 4 bytes for every code point, regardless of value. This makes it the only encoding where random access by code point index is O(1) — the code point at index n is always at byte offset n × 4.

"A".encode("utf-32-le")    # b'\x41\x00\x00\x00' — 4 bytes
"é".encode("utf-32-le")    # b'\xe9\x00\x00\x00' — 4 bytes
"😀".encode("utf-32-le")   # b'\x00\xf6\x01\x00' — 4 bytes

The trade-off is space: an ASCII string in UTF-32 consumes 4× the memory compared to UTF-8. A 1 MB English text file becomes a 4 MB UTF-32 file.

Like UTF-16, UTF-32 has byte order variants (UTF-32LE, UTF-32BE) and typically uses a BOM.

Best for: Internal string processing in C/C++ code that needs O(1) code point indexing and can afford the memory cost. Python's str object uses a flexible internal representation (Latin-1, UCS-2, or UCS-4 per PEP 393), chosen from the highest code point in the string; this is effectively a per-string size-optimized UTF-32.
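
That O(1) lookup is easy to demonstrate by slicing raw UTF-32-LE bytes (a sketch; in practice you would index the decoded str directly):

```python
import struct

def code_point_at(utf32_le: bytes, index: int) -> str:
    """O(1) lookup: the code point at `index` lives at byte offset index * 4."""
    (cp,) = struct.unpack_from("<I", utf32_le, index * 4)
    return chr(cp)

data = "Hello, 世界!".encode("utf-32-le")
print(code_point_at(data, 7))   # 世
print(code_point_at(data, 8))   # 界
```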

Side-by-Side Encoding of "Hello, 世界!"

import sys

text = "Hello, 世界!"

for enc in ["utf-8", "utf-16-le", "utf-32-le"]:
    b = text.encode(enc)
    print(f"{enc:12s}: {len(b):3d} bytes  {b.hex()}")

Output:

utf-8       :  14 bytes  48656c6c6f2c20e4b896e7958c21
utf-16-le   :  20 bytes  480065006c006c006f002c002000164e4c752100
utf-32-le   :  40 bytes  48000000650000006c000000...

The two CJK characters (U+4E16 世, U+754C 界) are 3 bytes each in UTF-8 but only 2 bytes each in UTF-16 — because they're BMP characters. UTF-8 is more efficient for ASCII; UTF-16 is more efficient for CJK in the BMP.

Performance Trade-offs

String Length

  • UTF-32: O(1) — length in code points = byte length / 4.
  • UTF-8 / UTF-16: O(n) — must scan to count code points (bytes/units don't map 1:1).

Most languages cache string length, so this matters mainly when constructing strings from raw bytes.
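
In UTF-8, the O(n) count reduces to a single pass that skips continuation bytes (10xxxxxx); a sketch:

```python
def count_code_points(utf8: bytes) -> int:
    """O(n) scan: count every byte that is NOT a continuation byte (10xxxxxx)."""
    return sum(1 for b in utf8 if b & 0xC0 != 0x80)

s = "Hello, 世界!"
print(count_code_points(s.encode("utf-8")))   # 10, matching len(s)
```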

Random Access by Code Point Index

  • UTF-32: O(1) — byte offset = index × 4.
  • UTF-8 / UTF-16: O(n) — must scan from start (or from a cached checkpoint).

This is why text editors that do a lot of cursor movement sometimes maintain a gap buffer or piece table in UTF-32 internally, even while storing files as UTF-8.

All three encodings support efficient byte-level search for fixed patterns (e.g., Boyer–Moore), but alignment matters. UTF-8 is safe: lead bytes and continuation bytes occupy disjoint ranges, so the complete encoding of one character can never match inside the encoding of another. UTF-16 byte streams carry no such guarantee: a pattern can match at an odd byte offset, spanning two unrelated code units, so searches must stay aligned to 16-bit boundaries.
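
A concrete demonstration of the partial-match hazard: the characters U+4100 (䄀) and U+0100 (Ā) are chosen here because their UTF-16-LE bytes happen to contain the byte pattern of "A" at an odd offset:

```python
pattern_16 = "A".encode("utf-16-le")               # b'\x41\x00'
haystack_16 = "\u4100\u0100".encode("utf-16-le")   # b'\x00\x41\x00\x01'
print(haystack_16.find(pattern_16))   # 1 (false match at an odd byte offset)

pattern_8 = "A".encode("utf-8")
haystack_8 = "\u4100\u0100".encode("utf-8")
print(haystack_8.find(pattern_8))     # -1 (UTF-8 cannot false-match)
```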

When to Use Each

Scenario                       Recommended   Reason
Web (HTTP, JSON, HTML)         UTF-8         ASCII compat, no null bytes, universal support
Files on disk                  UTF-8         Compact, readable by all modern tools
Databases (PostgreSQL, MySQL)  UTF-8         UTF8 / utf8mb4 are standard
Windows API calls              UTF-16        Win32 uses WCHAR (UTF-16LE) natively
Java String internal           UTF-16        JVM internal representation
JavaScript string processing   UTF-16 aware  V8/SpiderMonkey are UTF-16 internally
C/C++ text processing kernel   UTF-32        O(1) indexing, simplicity
Network protocols              UTF-8         Null safety, ASCII compat, no endian ambiguity
Python 3 str                   Automatic     CPython auto-selects Latin-1/UCS-2/UCS-4 per string

Endianness in Practice

UTF-16 and UTF-32 are affected by the CPU's byte order. Intel x86/x64 processors are little-endian (LE); many network protocols use big-endian (BE, a.k.a. "network byte order").

The BOM (U+FEFF) at the start of a file tells the reader which byte order to expect:

UTF-16 LE BOM: FF FE
UTF-16 BE BOM: FE FF
UTF-32 LE BOM: FF FE 00 00
UTF-32 BE BOM: 00 00 FE FF

When exchanging UTF-16 data between systems with different CPU architectures, always include or agree on the byte order. In practice, UTF-16LE is the dominant variant on Windows; protocols often specify UTF-16BE.

text = "Hello"

# With explicit byte order (no BOM)
print(text.encode("utf-16-le").hex())   # 480065006c006c006f00
print(text.encode("utf-16-be").hex())   # 00480065006c006c006f

# With BOM (plain "utf-16" uses the platform's native byte order and prepends a BOM; LE shown, as on x86)
print(text.encode("utf-16").hex())      # fffe480065006c006c006f00
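
Going the other direction, a reader can sniff a leading BOM to select the decoder. A minimal sketch using the stdlib codecs constants (note the check order: the UTF-32-LE BOM begins with the same two bytes as the UTF-16-LE BOM, so it must be tested first):

```python
import codecs

def sniff_encoding(data: bytes) -> str:
    """Guess the encoding from a leading BOM; default to UTF-8."""
    if data.startswith(codecs.BOM_UTF32_LE):   # FF FE 00 00 (check before UTF-16!)
        return "utf-32-le"
    if data.startswith(codecs.BOM_UTF32_BE):   # 00 00 FE FF
        return "utf-32-be"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be"
    return "utf-8"

print(sniff_encoding(codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")))  # utf-16-le
```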

Key Takeaways

  • UTF-8 is the right default for nearly everything: files, APIs, databases, web. It's ASCII compatible, null-byte safe, and endianness-neutral.
  • UTF-16 is primarily used in Windows APIs, Java, and JavaScript internals — you often can't avoid it in those contexts, but you shouldn't choose it for new data formats.
  • UTF-32 offers O(1) character indexing at a 4× memory cost — a niche tool for internal text processing that requires constant-time access.
  • Surrogate pairs in UTF-16 are a common source of bugs in JavaScript when strings contain characters outside the BMP (most emoji). Always use codePointAt() and for...of instead of charCodeAt() and index-based loops.
  • When in doubt: use UTF-8.
