Unicode in JavaScript
JavaScript uses UTF-16 internally, which means characters outside the Basic Multilingual Plane — including most emoji — are stored as surrogate pairs and cause unexpected string length and indexing behavior. This guide explains Unicode in JavaScript, covering ES6 improvements, the u flag in regex, and Intl APIs.
JavaScript strings are sequences of UTF-16 code units, not Unicode code points. For the vast majority of text this is invisible — every ASCII character and every character in the Basic Multilingual Plane (BMP, U+0000–U+FFFF) fits in exactly one code unit. The complexity begins with emoji, rare CJK extension characters, and mathematical symbols that live above U+FFFF. This guide explains how JavaScript encodes text, how modern APIs bridge the gap, and how to avoid the classic pitfalls.
Strings as UTF-16 Code Unit Sequences
Each string value in JavaScript is backed by a sequence of 16-bit code units
(the internal format defined in the ECMAScript spec). Characters in the BMP map
to one code unit; characters above U+FFFF (the "supplementary planes") use a
surrogate pair of two code units.
const arrow = "→"; // U+2192, BMP — 1 code unit
const snake = "🐍"; // U+1F40D, supplementary — 2 code units
console.log(arrow.length); // 1
console.log(snake.length); // 2 ← surprising for "one character"
// Code unit values
console.log(arrow.charCodeAt(0).toString(16)); // "2192"
console.log(snake.charCodeAt(0).toString(16)); // "d83d" (high surrogate)
console.log(snake.charCodeAt(1).toString(16)); // "dc0d" (low surrogate)
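The two surrogate values above follow from simple arithmetic on the code point. A minimal sketch of that calculation, where `toSurrogatePair` is a hypothetical helper name used only for illustration:

```javascript
// Hypothetical helper: split a supplementary code point into its surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("Expected a code point in U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xd800 + (offset >> 10);   // top 10 bits select the high surrogate
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits select the low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f40d);
console.log(high.toString(16), low.toString(16)); // d83d dc0d
console.log(String.fromCharCode(high, low) === "🐍"); // true
```

In practice String.fromCodePoint does this for you; the sketch is only meant to show where the two code units come from.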
codePointAt and String.fromCodePoint
ES2015 added code-point–aware alternatives to the older charCodeAt /
fromCharCode:
// Code point (not code unit) from a position
console.log("🐍".codePointAt(0)); // 128013
console.log("🐍".codePointAt(0).toString(16)); // "1f40d"
// Round-trip: code point integer → character
const snake = String.fromCodePoint(0x1f40d); // "🐍"
const arrow = String.fromCodePoint(0x2192); // "→"
// Multiple characters at once
const text = String.fromCodePoint(0x48, 0x65, 0x6c, 0x6c, 0x6f); // "Hello"
Always prefer codePointAt / String.fromCodePoint over the older
charCodeAt / fromCharCode when working with supplementary characters.
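To illustrate why the older pair is risky, the short sketch below contrasts the two families on a supplementary character. The variable names are illustrative only.

```javascript
// fromCharCode keeps only the low 16 bits, so U+1F40D silently becomes U+F40D
const truncated = String.fromCharCode(0x1f40d);
const correct = String.fromCodePoint(0x1f40d);

console.log(truncated === "\uf40d"); // true
console.log(correct === "🐍"); // true

// charCodeAt sees only one half of the pair; codePointAt reads the whole pair
console.log("🐍".charCodeAt(0).toString(16)); // d83d
console.log("🐍".codePointAt(0).toString(16)); // 1f40d

// codePointAt still takes a code unit index: index 1 lands on the low surrogate
console.log("🐍".codePointAt(1).toString(16)); // dc0d
```

Note that codePointAt is indexed by code unit, so it is usually combined with code-point iteration rather than a plain counter.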
Iterating Over Characters Correctly
The for...of loop and the spread operator [...str] iterate over code
points, not code units. This is the idiomatic way to handle emoji and
supplementary characters:
const text = "A🐍B";
// Wrong: iterates over code units, splits the surrogate pair
for (let i = 0; i < text.length; i++) {
  console.log(text[i]); // "A", "\ud83d", "\udc0d", "B"
}
// Correct: iterates over code points
for (const char of text) {
  console.log(char); // "A", "🐍", "B"
}
// Spread to array of code points
const chars = [...text]; // ["A", "🐍", "B"]
console.log(chars.length); // 3 ← correct "character count"
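String reversal is a common place where the difference shows up. The sketch below compares a code-unit reversal with a code-point reversal; the variable names are illustrative.

```javascript
const word = "A🐍B";

// Naive reversal works on code units and swaps the surrogate halves
const naive = word.split("").reverse().join("");

// Code-point reversal keeps each surrogate pair intact
const byCodePoint = [...word].reverse().join("");

console.log(byCodePoint); // B🐍A
console.log(naive === "B\udc0d\ud83dA"); // true: the pair is now two lone surrogates
console.log(naive === byCodePoint); // false
```

Code-point reversal still reorders combining marks and joined emoji sequences, so it is a partial fix rather than a complete one.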
Grapheme Clusters
Even code-point iteration is not always enough. A flag emoji like 🇺🇸 is two
code points (U+1F1FA + U+1F1F8), a skin-tone emoji is a base emoji followed by
a modifier code point (U+1F3FB–U+1F3FF), and family emoji are several code
points joined by a zero-width joiner (U+200D). The Intl.Segmenter API
handles grapheme cluster boundaries correctly:
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const family = "👨\u200D👩\u200D👧"; // man + ZWJ + woman + ZWJ + girl
const graphemes = [...segmenter.segment(family)];
console.log(graphemes.length); // 1: one grapheme cluster
console.log([...family].length); // 5: three emoji plus two ZWJs
console.log(family.length); // 8: UTF-16 code units
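Grapheme segmentation is most useful wrapped in small helpers. The following is a sketch under the assumption that Intl.Segmenter is available in the target runtime; `countGraphemes` and `truncateGraphemes` are hypothetical names, not built-ins.

```javascript
const graphemeSegmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

// Hypothetical helper: number of user-perceived characters
function countGraphemes(str) {
  let count = 0;
  for (const _ of graphemeSegmenter.segment(str)) count++;
  return count;
}

// Hypothetical helper: keep at most `max` grapheme clusters, never cutting inside one
function truncateGraphemes(str, max) {
  let out = "";
  let count = 0;
  for (const { segment } of graphemeSegmenter.segment(str)) {
    if (count === max) break;
    out += segment;
    count++;
  }
  return out;
}

const flag = "\u{1F1FA}\u{1F1F8}"; // 🇺🇸 as two regional indicators
const accented = "e\u0301";        // "e" + combining acute accent

console.log(flag.length, [...flag].length, countGraphemes(flag)); // 4 2 1
console.log(countGraphemes(accented)); // 1
console.log(truncateGraphemes(accented + flag + "!", 2) === accented + flag); // true
```

A single shared segmenter instance is reused here because constructing one per call is comparatively expensive.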
Escape Sequences
JavaScript supports several Unicode escape syntaxes in string literals and regular expressions (in a regex, the \u{...} form requires the u or v flag):
// \uXXXX: BMP only (exactly 4 hex digits)
const euro = "\u20AC"; // "€"
const arrow = "\u2192"; // "→"
// \u{X...}: ES2015+, any code point up to U+10FFFF (typically 1–6 hex digits)
const snake = "\u{1F40D}"; // "🐍"
const text = "\u{48}\u{65}\u{6C}\u{6C}\u{6F}"; // "Hello"
// Legacy: surrogate pair escape (still valid, but harder to read; avoid in new code)
const snake2 = "\uD83D\uDC0D"; // "🐍" via surrogate pair
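Going the other way, from text to escapes, can help when debugging or when a file must stay ASCII-only. A minimal sketch, where `toCodePointEscapes` is a hypothetical helper name:

```javascript
// Hypothetical helper: render every non-ASCII code point as a \u{...} escape.
function toCodePointEscapes(str) {
  let out = "";
  for (const ch of str) {            // for...of walks code points, not code units
    const cp = ch.codePointAt(0);
    out += cp < 0x80 ? ch : "\\u{" + cp.toString(16).toUpperCase() + "}";
  }
  return out;
}

console.log(toCodePointEscapes("café 🐍")); // caf\u{E9} \u{1F40D}

// The escape forms are only notation: both spellings produce the same string
console.log("\u{1F40D}" === "\uD83D\uDC0D"); // true
```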
TextEncoder and TextDecoder
When you need raw bytes — for example, to send data over a network, write to
a file via the File API, or compute a hash — use TextEncoder and
TextDecoder:
// str → Uint8Array (UTF-8 bytes)
const encoder = new TextEncoder(); // always UTF-8
const bytes = encoder.encode("café");
console.log(bytes); // Uint8Array [99, 97, 102, 195, 169]
// Uint8Array → str
const decoder = new TextDecoder("utf-8"); // or "utf-16le", "latin1" (decoded as windows-1252), etc.
const text = decoder.decode(bytes); // "café"
TextEncoder always outputs UTF-8. TextDecoder accepts a wide range of
encodings (all WHATWG Encoding Standard labels).
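Two practical consequences are worth seeing in code: UTF-8 byte length differs from string length, and malformed input is replaced silently unless the decoder is created with the fatal option. The sketch below uses illustrative variable names.

```javascript
const utf8 = new TextEncoder();

// UTF-8 uses 1 to 4 bytes per code point
console.log(utf8.encode("A").length);  // 1
console.log(utf8.encode("é").length);  // 2
console.log(utf8.encode("→").length);  // 3
console.log(utf8.encode("🐍").length); // 4

// "caf" followed by the first byte of a two-byte sequence, then nothing
const invalid = new Uint8Array([0x63, 0x61, 0x66, 0xc3]);

// Default mode substitutes U+FFFD for the malformed sequence
console.log(new TextDecoder().decode(invalid) === "caf\uFFFD"); // true

// Fatal mode throws instead of substituting
const strict = new TextDecoder("utf-8", { fatal: true });
let rejected = false;
try {
  strict.decode(invalid);
} catch (err) {
  rejected = true;
}
console.log(rejected); // true
```

Fatal mode tends to suit validation of untrusted input, while the default suits display paths where a replacement character is acceptable.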
The Intl APIs
Modern JavaScript ships a comprehensive internationalisation library under
Intl. Key objects:
| API | Purpose | Example |
|---|---|---|
| Intl.Collator | Locale-aware string comparison and sorting | German ä after a |
| Intl.Segmenter | Split strings into graphemes, words, sentences | Count real characters |
| Intl.ListFormat | Locale-aware list formatting | "a, b, and c" |
| Intl.NumberFormat | Format numbers (currencies, percentages) | "42,00 €" in German |
// Sorting German strings correctly
const words = ["Äpfel", "Orangen", "Bananen"];
words.sort(new Intl.Collator("de").compare);
// ["Äpfel", "Bananen", "Orangen"] — Ä treated like A in German
// Word segmentation
const seg = new Intl.Segmenter("en", { granularity: "word" });
const words2 = [...seg.segment("Hello, world!")]
  .filter(s => s.isWordLike)
  .map(s => s.segment);
// ["Hello", "world"]
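The remaining objects from the table follow the same construct-then-call pattern. The sketch below assumes a runtime with full locale data (the default in current browsers and Node.js builds); the sample values are illustrative.

```javascript
// Human-readable lists
const list = new Intl.ListFormat("en", { style: "long", type: "conjunction" });
console.log(list.format(["a", "b", "c"])); // a, b, and c

// Numeric collation compares digit runs by value rather than character by character
const files = ["item10", "item2", "item1"];
files.sort(new Intl.Collator("en", { numeric: true }).compare);
console.log(files); // ["item1", "item2", "item10"]

// Base sensitivity ignores case and accent differences
const loose = new Intl.Collator("de", { sensitivity: "base" });
console.log(loose.compare("apfel", "Äpfel")); // 0

// Currency formatting; the separator before the sign is a non-breaking space
const euros = new Intl.NumberFormat("de-DE", { style: "currency", currency: "EUR" });
console.log(euros.format(42)); // 42,00 €
```

Because formatted output can contain non-breaking spaces, compare it against expected strings only after normalizing whitespace.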
Surrogate Pairs — Deep Dive
Surrogates are the source of many subtle JavaScript bugs:
const snake = "🐍";
// slice() works on code units, so it can split a surrogate pair
snake.slice(0, 1); // "\uD83D" ← broken half of surrogate pair
// substring(): same issue
snake.substring(0, 1); // "\uD83D"
// Safe: use Array.from or spread
Array.from(snake).slice(0, 1); // ["🐍"] ← correct
[...snake][0]; // "🐍"
// Checking for lone surrogates: the u flag is required, otherwise the
// class also matches the two halves of a valid pair
function hasLoneSurrogate(str) {
  return /[\uD800-\uDFFF]/u.test(str);
}
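A surrogate character class behaves differently with and without the u flag. Without it, the class matches the halves of valid pairs too; in Unicode mode a valid pair is read as a single code point, so only unpaired surrogates match. The sketch below uses that to detect and repair lone surrogates with plain regular expressions. `findsAnySurrogate`, `findsLoneSurrogate`, and `replaceLoneSurrogates` are hypothetical names.

```javascript
// Code-unit mode: matches every surrogate, paired or not
function findsAnySurrogate(str) {
  return /[\uD800-\uDFFF]/.test(str);
}

// Unicode mode: matches only surrogates that are not part of a valid pair
function findsLoneSurrogate(str) {
  return /[\uD800-\uDFFF]/u.test(str);
}

// Replace each lone surrogate with U+FFFD REPLACEMENT CHARACTER
function replaceLoneSurrogates(str) {
  return str.replace(/[\uD800-\uDFFF]/gu, "\uFFFD");
}

console.log(findsAnySurrogate("🐍"));   // true: a false positive for this purpose
console.log(findsLoneSurrogate("🐍"));  // false
console.log(findsLoneSurrogate("a\uD83Db")); // true
console.log(replaceLoneSurrogates("a\uD83Db") === "a\uFFFDb"); // true
console.log(replaceLoneSurrogates("🐍") === "🐍"); // true: valid pairs are untouched
```

Lone surrogates matter mostly at boundaries, for example when a string is encoded to UTF-8 or passed to an API that rejects ill-formed input.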
ES2024 introduced String.prototype.isWellFormed() and toWellFormed() to
detect and fix lone surrogates:
"\uD800".isWellFormed(); // false: lone high surrogate
"\uD800".toWellFormed(); // "\uFFFD": replaced with REPLACEMENT CHARACTER
"Hello 🐍".isWellFormed(); // true
Regular Expressions and the /u Flag
Without the u flag, regex patterns match UTF-16 code units. With /u, they
match Unicode code points. This matters for characters above U+FFFF:
// Without /u: treats surrogate pair as two separate characters
/^.$/.test("🐍"); // false (emoji has .length === 2)
// With /u: treats emoji as one character
/^.$/u.test("🐍"); // true
// \p{} Unicode property escapes require /u or /v
/\p{Emoji}/u.test("🐍"); // true
/\p{Script=Latin}/u.test("A"); // true
// /v flag (ES2024): superset of /u with set notation
/[\p{Letter}&&\p{ASCII}]/v.test("A"); // true (intersection)
Always use the u flag (or v in ES2024+) in new regular expressions.
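As a brief illustration of what the flag buys you in everyday matching, the sketch below extracts words and pictographs from mixed text. The sample string is illustrative.

```javascript
const sample = "café 🐍 naïve";

// Letters in any script, including accented ones
console.log(sample.match(/\p{Letter}+/gu)); // ["café", "naïve"]

// Pictographic symbols such as emoji
console.log(sample.match(/\p{Extended_Pictographic}/gu)); // ["🐍"]

// The dot consumes a whole code point only in Unicode mode
console.log("🐍".match(/./g).length);  // 2
console.log("🐍".match(/./gu).length); // 1
```

Extended_Pictographic is used here instead of Emoji because the Emoji property also matches ASCII digits and a few symbols such as #.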
Common Pitfalls Summary
| Pitfall | Problem | Fix |
|---|---|---|
| str.length for character count | Returns code unit count | [...str].length or Intl.Segmenter |
| str[i] indexing emoji | Returns lone surrogate | [...str][i] |
| Regex without /u | Treats emoji as two chars | Add /u flag |
| charCodeAt for all chars | Returns only a surrogate half above U+FFFF | Use codePointAt |
| String.fromCharCode for emoji | Needs surrogate pair | Use String.fromCodePoint |
| Sorting with < / > | Code unit order, not locale | Intl.Collator |
Quick Reference
// Code point of a character
"→".codePointAt(0) // 8594
"→".codePointAt(0).toString(16) // "2192"
// Character from code point
String.fromCodePoint(0x2192) // "→"
String.fromCodePoint(0x1F40D) // "🐍"
// True character count
[..."café🐍"].length // 5
// Iterate safely
for (const ch of "café🐍") { ... }
// Encode to bytes
new TextEncoder().encode("café")
// Decode from bytes
new TextDecoder().decode(uint8array)
// Locale-aware sort
arr.sort(new Intl.Collator("en").compare)
// Regex with Unicode support
/\p{Letter}+/u.exec("café")
Understanding the UTF-16 internals of JavaScript strings is essential for
writing robust internationalized applications. The modern APIs — codePointAt,
String.fromCodePoint, for...of, Intl.Segmenter, and the /u regex flag
— give you the tools to handle all of Unicode correctly.