Unicode for the Modern Web · Chapter 3
JavaScript Strings: The UTF-16 Legacy
JavaScript strings are UTF-16 under the hood, leading to surprising behavior with emoji and supplementary characters. This chapter covers String.fromCodePoint, for...of iteration, TextEncoder/TextDecoder, and the Intl APIs.
JavaScript strings have a dirty secret: they are encoded in UTF-16. This was a reasonable choice in 1995, when Unicode was expected to stay within 65,536 code points — JavaScript inherited the UCS-2 encoding model from Java, and was designed before Unicode grew beyond the Basic Multilingual Plane. Today, with emoji living in the supplementary planes (U+10000 and above), every JavaScript developer needs to understand surrogate pairs to write correct string-handling code.
Why JavaScript Strings Are UTF-16
JavaScript string literals are sequences of 16-bit code units, not Unicode code points. For characters in the Basic Multilingual Plane (U+0000–U+FFFF), one code unit equals one code point. For characters above U+FFFF — all emoji, and thousands of CJK Extension B characters — one code point requires two code units: a surrogate pair.
The surrogate pair encoding works as follows: the high surrogate occupies U+D800–U+DBFF and the low surrogate occupies U+DC00–U+DFFF. Together they encode a supplementary code point:
Code point: U+1F600 (😀)
High surrogate: 0xD83D
Low surrogate: 0xDE00
JS string: "\uD83D\uDE00" (2 code units, length = 2)
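The arithmetic behind this mapping can be written out directly: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. A minimal sketch (the helper names are illustrative):

```javascript
// Encode a supplementary code point (>= 0x10000) as a surrogate pair.
function toSurrogatePair(cp) {
  const offset = cp - 0x10000;           // 20-bit value
  const high = 0xD800 + (offset >> 10);  // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

// Decode a surrogate pair back to the code point.
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

toSurrogatePair(0x1F600);          // [0xD83D, 0xDE00]
fromSurrogatePair(0xD83D, 0xDE00); // 0x1F600
```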
You can verify this in any browser console:
const emoji = "😀";
console.log(emoji.length); // 2 — two code units!
console.log(emoji.charCodeAt(0)); // 55357 (0xD83D) — high surrogate
console.log(emoji.charCodeAt(1)); // 56832 (0xDE00) — low surrogate
console.log(emoji.codePointAt(0)); // 128512 (0x1F600) — correct!
The .length Trap
The .length property counts code units, not code points, not characters, not grapheme clusters. This causes systematic bugs:
"hello".length // 5 ✓
"café".length // 4 or 5? → 4 if é is U+00E9 (precomposed), 5 if e + combining accent
"😀".length // 2 ✗ (expected 1)
"👨‍👩‍👧‍👦".length // 11 ✗ (family emoji: a ZWJ sequence of 4 emoji + 3 zero-width joiners)
"नमस्ते".length // 6 ✗ (Devanagari with combining vowel signs)
There is no single .length-equivalent that gives "the number of visible characters." What you probably want depends on context:
- Code points: use [...str].length or Array.from(str).length
- Grapheme clusters (visible characters): use Intl.Segmenter
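The three counts can be compared side by side. A sketch, assuming an engine with Intl.Segmenter support (the counts helper is illustrative):

```javascript
const graphemeSeg = new Intl.Segmenter("en", { granularity: "grapheme" });

function counts(str) {
  return {
    codeUnits: str.length,                            // UTF-16 code units
    codePoints: [...str].length,                      // iterator walks code points
    graphemes: [...graphemeSeg.segment(str)].length,  // user-perceived characters
  };
}

counts("😀");      // { codeUnits: 2, codePoints: 1, graphemes: 1 }
counts("e\u0301"); // { codeUnits: 2, codePoints: 2, graphemes: 1 }
```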
codePointAt() vs charCodeAt()
const str = "😀 hello";
// charCodeAt — returns the 16-bit code unit at index i
str.charCodeAt(0); // 55357 (high surrogate — meaningless alone)
str.charCodeAt(1); // 56832 (low surrogate — meaningless alone)
// codePointAt — returns the full code point at index i
str.codePointAt(0); // 128512 (U+1F600) — correct!
str.codePointAt(1); // 56832 — still the low surrogate (awkward but by spec)
codePointAt() is "aware" of surrogate pairs: when called on a high surrogate, it returns the combined code point. When called on a low surrogate, it returns just that surrogate's value. This means you cannot safely iterate by index; you must skip by the number of code units consumed.
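If you do need index-based iteration — say, to track UTF-16 offsets — advance by the number of code units each code point consumed: one for BMP, two for supplementary. A sketch with an illustrative generator:

```javascript
// Yield each code point together with its starting code-unit index.
function* codePointsWithIndex(str) {
  let i = 0;
  while (i < str.length) {
    const cp = str.codePointAt(i);
    yield { index: i, codePoint: cp };
    i += cp > 0xFFFF ? 2 : 1; // supplementary code points occupy two units
  }
}

[...codePointsWithIndex("😀hi")];
// [{ index: 0, codePoint: 128512 }, { index: 2, codePoint: 104 }, { index: 3, codePoint: 105 }]
```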
Proper String Iteration
The ES6 string iterator (used by for...of, spread, Array.from) correctly iterates by code point:
const str = "😀 hi";
// Correct — iterates by code point
for (const char of str) {
  console.log(char, char.codePointAt(0).toString(16));
}
// 😀 1f600
// 20
// h 68
// i 69
// Array of code points
const codePoints = [...str];
console.log(codePoints.length); // 4 (not 5)
// But still wrong for grapheme clusters:
const flag = "🇺🇸"; // Regional indicators U+1F1FA + U+1F1F8
console.log([...flag].length); // 2 — two code points, one visible flag
Intl.Segmenter for Grapheme Clusters
For the correct "user-perceived character" count, use Intl.Segmenter:
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
function graphemeCount(str) {
  return [...segmenter.segment(str)].length;
}
graphemeCount("hello"); // 5
graphemeCount("😀"); // 1
graphemeCount("👨‍👩‍👧‍👦"); // 1 (family ZWJ sequence)
graphemeCount("नमस्ते"); // 4 — grapheme clusters; the exact count for Indic conjuncts varies by Unicode version
graphemeCount("🇺🇸"); // 1 (flag = 2 regional indicators = 1 grapheme)
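One practical payoff is truncation that never slices an emoji or accented character in half, where a naive slice() would leave a lone surrogate. A sketch (truncateGraphemes is a hypothetical helper):

```javascript
const seg = new Intl.Segmenter("en", { granularity: "grapheme" });

// Keep at most `max` user-perceived characters.
function truncateGraphemes(str, max) {
  let out = "";
  let n = 0;
  for (const { segment } of seg.segment(str)) {
    if (n++ === max) break;
    out += segment;
  }
  return out;
}

truncateGraphemes("hi😀!", 3); // "hi😀" — the emoji survives intact
// Compare: "hi😀!".slice(0, 3) ends in a lone high surrogate
```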
Intl.Segmenter also supports word and sentence segmentation, with locale-specific rules:
const wordSegmenter = new Intl.Segmenter('ja', { granularity: 'word' });
const text = "日本語のテキスト";
const words = [...wordSegmenter.segment(text)]
  .filter(s => s.isWordLike)
  .map(s => s.segment);
// Japanese word segmentation without spaces
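Sentence granularity works the same way. A brief sketch; segments always concatenate back to the original string:

```javascript
const sentenceSeg = new Intl.Segmenter("en", { granularity: "sentence" });

const parts = [...sentenceSeg.segment("Hi there. How are you?")]
  .map(s => s.segment);
// two segments, split after "Hi there."
```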
String.fromCodePoint() and String.fromCharCode()
// fromCharCode — only handles BMP (U+0000–U+FFFF)
String.fromCharCode(65); // 'A'
String.fromCharCode(0x1F600); // '\uF600' (wrong! the value wraps modulo 2^16 to U+F600, a private-use character)
// fromCodePoint — handles full range including supplementary
String.fromCodePoint(65); // 'A'
String.fromCodePoint(0x1F600); // '😀' ✓
String.fromCodePoint(0x1F600, 0x20, 0x1F44D); // '😀 👍'
Always prefer String.fromCodePoint() over String.fromCharCode() in new code.
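The TextEncoder/TextDecoder APIs mentioned in the chapter intro convert between strings and UTF-8 bytes — a third notion of "length," and the one that matters for network payloads and storage quotas. A quick sketch:

```javascript
const encoder = new TextEncoder(); // always UTF-8
const decoder = new TextDecoder(); // UTF-8 by default

const bytes = encoder.encode("😀"); // Uint8Array [0xF0, 0x9F, 0x98, 0x80]
bytes.length;          // 4 — UTF-8 bytes (vs. 2 code units, 1 code point)
decoder.decode(bytes); // "😀" — round-trips cleanly
```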
Unicode Normalization: normalize()
The same visual character can be encoded multiple ways. The letter é can be U+00E9 (precomposed: LATIN SMALL LETTER E WITH ACUTE) or U+0065 + U+0301 (decomposed: e + combining acute accent). These compare unequal with === despite looking identical:
const a = "\u00E9"; // é precomposed
const b = "e\u0301"; // e + combining accent
console.log(a === b); // false!
console.log(a.length); // 1
console.log(b.length); // 2
// Normalize before comparing
a.normalize('NFC') === b.normalize('NFC'); // true — both become U+00E9
a.normalize('NFD') === b.normalize('NFD'); // true — both become e + accent
The four forms:
- NFC — canonical decomposition then composition (preferred for storage/display)
- NFD — canonical decomposition (useful for accent stripping)
- NFKC — compatibility decomposition then composition (collapses the ﬁ ligature U+FB01 → "fi")
- NFKD — compatibility decomposition
Always normalize strings before comparison, hashing, or database storage. NFC is the right default.
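A common application of NFD: strip accents for search or slugs by decomposing and then deleting the combining marks. A sketch (note that removing all marks is too aggressive for scripts like Devanagari, where vowel signs are marks):

```javascript
// Decompose, then remove combining marks (\p{Mark} requires the u flag).
function stripAccents(str) {
  return str.normalize("NFD").replace(/\p{Mark}/gu, "");
}

stripAccents("café");     // "cafe"
stripAccents("Ångström"); // "Angstrom"
```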
localeCompare() for Sorting
Alphabetical sort differs by language. JavaScript's default Array.sort() compares UTF-16 code units — it puts Z before a and has no notion of accented character ordering:
// Wrong for human-readable sorting
["Ångström", "Äpfel", "Zebra", "apple"].sort();
// ['Zebra', 'apple', 'Äpfel', 'Ångström'] — code unit order, not linguistic
// Correct: locale-aware comparison
["Ångström", "Äpfel", "Zebra", "apple"].sort((a, b) =>
a.localeCompare(b, 'sv', { sensitivity: 'base' })
);
// Swedish: ['apple', 'Zebra', 'Ångström', 'Äpfel'] — Å and Ä sort after Z
For performance with large arrays, use Intl.Collator to create a reusable comparator:
const collator = new Intl.Collator('de', { sensitivity: 'accent' });
largeArray.sort(collator.compare);
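Intl.Collator has other useful options; numeric: true, for example, compares runs of digits as numbers, giving a "natural" sort:

```javascript
// Natural sort: digit runs compared numerically, not lexicographically.
const natural = new Intl.Collator("en", { numeric: true });

["file10", "file2", "file1"].sort(natural.compare);
// ["file1", "file2", "file10"]
```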
Regex u and v Flags
The u flag (ES6) enables Unicode mode in regular expressions, making the regex engine work with code points rather than code units:
// Without u flag — matches only BMP characters
/./ .test("😀"); // true (matches the high surrogate)
/^.$/ .test("😀"); // false (two code units, not one)
// With u flag — code-point aware
/./u .test("😀"); // true
/^.$/u .test("😀"); // true ✓
// Unicode property escapes (require u flag)
/\p{Emoji}/u .test("😀"); // true
/\p{Script=Latin}/u.test("A"); // true
/\p{Letter}/u .test("中"); // true (CJK ideographs are Letter)
/\p{Number}/u .test("③"); // true (enclosed number)
The v flag (ES2024) extends u with set notation and string properties:
// v flag: set intersection, subtraction, nested classes
/[\p{Letter}&&\p{Script=Greek}]/v.test("α"); // true — Greek letters only
/[\p{ASCII}--\p{Number}]/v.test("A"); // true — ASCII non-numbers
For any regex that handles user input or multilingual text, use the u or v flag.