Unicode for the Modern Web · Chapter 3

JavaScript Strings: The UTF-16 Legacy

JavaScript strings are UTF-16 under the hood, leading to surprising behavior with emoji and supplementary characters. This chapter covers String.fromCodePoint, for...of iteration, TextEncoder/TextDecoder, and the Intl APIs.

~4,000 words · ~16 min read

JavaScript strings have a dirty secret: they are encoded in UTF-16. This was a reasonable choice in 1995, when Unicode was expected to stay within 65,536 code points — JavaScript inherited the UCS-2 encoding model from Java, and was designed before Unicode grew beyond the Basic Multilingual Plane. Today, with emoji living in the supplementary planes (U+10000 and above), every JavaScript developer needs to understand surrogate pairs to write correct string-handling code.

Why JavaScript Strings Are UTF-16

JavaScript string literals are sequences of 16-bit code units, not Unicode code points. For characters in the Basic Multilingual Plane (U+0000–U+FFFF), one code unit equals one code point. For characters above U+FFFF — all emoji, and thousands of CJK Extension B characters — one code point requires two code units: a surrogate pair.

The surrogate pair encoding works as follows: the high surrogate occupies U+D800–U+DBFF and the low surrogate occupies U+DC00–U+DFFF. Together they encode a supplementary code point:

Code point: U+1F600 (😀)
  High surrogate: 0xD83D
  Low surrogate:  0xDE00
JS string:  "\uD83D\uDE00"  (2 code units, length = 2)

You can verify this in any browser console:

const emoji = "😀";
console.log(emoji.length);        // 2 — two code units!
console.log(emoji.charCodeAt(0)); // 55357 (0xD83D) — high surrogate
console.log(emoji.charCodeAt(1)); // 56832 (0xDE00) — low surrogate
console.log(emoji.codePointAt(0)); // 128512 (0x1F600) — correct!

The .length Trap

The .length property counts code units, not code points, not characters, not grapheme clusters. This causes systematic bugs:

"hello".length    // 5 ✓
"café".length     // 4 or 5? → 4 if é is U+00E9 (precomposed), 5 if e + combining accent
"😀".length       // 2 ✗ (expected 1)
"👨‍👩‍👧‍👦".length  // 11 ✗ (family emoji is a ZWJ sequence)
"नमस्ते".length  // 6 ✗ (Devanagari with combining vowel signs)

There is no single .length-equivalent that gives "the number of visible characters." What you probably want depends on context:

- Code points: use [...str].length or Array.from(str).length
- Grapheme clusters (visible characters): use Intl.Segmenter
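The three counts can be compared side by side. A minimal sketch (the grapheme counter assumes Intl.Segmenter is available, i.e. modern browsers and Node 16+):

```javascript
const codeUnitCount  = (s) => s.length;      // UTF-16 code units
const codePointCount = (s) => [...s].length; // Unicode code points

const graphemeCount = (s) => {
  const seg = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return [...seg.segment(s)].length;         // user-perceived characters
};

const family = "👨‍👩‍👧‍👦";
console.log(codeUnitCount(family));  // 11
console.log(codePointCount(family)); // 7 (4 emoji + 3 ZWJs)
console.log(graphemeCount(family));  // 1
```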

codePointAt() vs charCodeAt()

const str = "😀 hello";

// charCodeAt — returns the 16-bit code unit at index i
str.charCodeAt(0);  // 55357 (high surrogate — meaningless alone)
str.charCodeAt(1);  // 56832 (low surrogate — meaningless alone)

// codePointAt — returns the full code point at index i
str.codePointAt(0); // 128512 (U+1F600) — correct!
str.codePointAt(1); // 56832 — still the low surrogate (awkward but by spec)

codePointAt() is "aware" of surrogate pairs: when called on a high surrogate, it returns the combined code point. When called on a low surrogate, it returns just that surrogate's value. This means you cannot safely iterate by index; you must skip by the number of code units consumed.
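That skip-by-consumed-units rule can be written out as a generator (a sketch; ES6's built-in string iterator, covered next, does the same thing):

```javascript
// Walk a string code point by code point using explicit indices.
function* codePoints(str) {
  let i = 0;
  while (i < str.length) {
    const cp = str.codePointAt(i);
    yield cp;
    i += cp > 0xFFFF ? 2 : 1; // supplementary chars consume two code units
  }
}

console.log([...codePoints("😀hi")]); // [128512, 104, 105]
```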

Proper String Iteration

The ES6 string iterator (used by for...of, spread, Array.from) correctly iterates by code point:

const str = "😀 hi";

// Correct — iterates by code point
for (const char of str) {
  console.log(char, char.codePointAt(0).toString(16));
}
// 😀  1f600
//     20
// h   68
// i   69

// Array of code points
const codePoints = [...str];
console.log(codePoints.length); // 4 (not 5)

// But still wrong for grapheme clusters:
const flag = "🇺🇸"; // Regional indicators U+1F1FA + U+1F1F8
console.log([...flag].length); // 2 — two code points, one visible flag

Intl.Segmenter for Grapheme Clusters

For the correct "user-perceived character" count, use Intl.Segmenter:

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

function graphemeCount(str) {
  return [...segmenter.segment(str)].length;
}

graphemeCount("hello");       // 5
graphemeCount("😀");          // 1
graphemeCount("👨‍👩‍👧‍👦");      // 1 (family ZWJ sequence)
graphemeCount("नमस्ते");     // 4 (4 aksharas in Devanagari)
graphemeCount("🇺🇸");         // 1 (flag = 2 regional indicators = 1 grapheme)
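One practical application is truncating user input without splitting an emoji or combining sequence in half. A sketch (`truncateGraphemes` is our own helper name, not a standard API):

```javascript
const graphemeSeg = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

// Keep at most `max` user-perceived characters — never a lone surrogate.
function truncateGraphemes(str, max) {
  const graphemes = [...graphemeSeg.segment(str)].map(s => s.segment);
  return graphemes.slice(0, max).join('');
}

console.log(truncateGraphemes("hi😀👍", 3)); // "hi😀"
```

Contrast this with `str.slice(0, n)`, which counts code units and can cut a surrogate pair in two.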

Intl.Segmenter also supports word and sentence segmentation, with locale-specific rules:

const wordSegmenter = new Intl.Segmenter('ja', { granularity: 'word' });
const text = "日本語のテキスト";
const words = [...wordSegmenter.segment(text)]
  .filter(s => s.isWordLike)
  .map(s => s.segment);
// Japanese word segmentation without spaces

String.fromCodePoint() and String.fromCharCode()

// fromCharCode — only handles BMP (U+0000–U+FFFF)
String.fromCharCode(65);       // 'A'
String.fromCharCode(0x1F600);  // '\uF600' (wrong! truncated to 16 bits — a Private Use Area character)

// fromCodePoint — handles full range including supplementary
String.fromCodePoint(65);       // 'A'
String.fromCodePoint(0x1F600);  // '😀' ✓
String.fromCodePoint(0x1F600, 0x20, 0x1F44D); // '😀 👍'

Always prefer String.fromCodePoint() over String.fromCharCode() in new code.
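Paired with codePointAt(), this gives a lossless round trip between strings and arrays of code points (helper names here are ours):

```javascript
const toCodePoints   = (s)   => [...s].map(c => c.codePointAt(0));
const fromCodePoints = (cps) => String.fromCodePoint(...cps);

console.log(toCodePoints("a😀"));          // [97, 128512]
console.log(fromCodePoints([97, 128512])); // "a😀"
```

Note that spreading a very large array into fromCodePoint can exceed the engine's argument limit; for big inputs, build the string in chunks.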

Unicode Normalization: normalize()

The same visual character can be encoded multiple ways. The letter é can be U+00E9 (precomposed: LATIN SMALL LETTER E WITH ACUTE) or U+0065 + U+0301 (decomposed: e + combining acute accent). These compare unequal with === despite looking identical:

const a = "\u00E9";        // é precomposed
const b = "e\u0301";       // e + combining accent

console.log(a === b);      // false!
console.log(a.length);     // 1
console.log(b.length);     // 2

// Normalize before comparing
a.normalize('NFC') === b.normalize('NFC'); // true — both become U+00E9
a.normalize('NFD') === b.normalize('NFD'); // true — both become e + accent

The four normalization forms:

- NFC — canonical decomposition, then composition (preferred for storage/display)
- NFD — canonical decomposition (useful for accent stripping)
- NFKC — compatibility decomposition, then composition (collapses the ﬁ ligature U+FB01 → fi)
- NFKD — compatibility decomposition

Always normalize strings before comparison, hashing, or database storage. NFC is the right default.
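A common application of NFD is accent stripping: decompose, then delete the combining marks (the \p{Mark} property escape requires the u flag):

```javascript
// Decompose accented characters, then remove combining marks (Mn/Mc/Me).
function stripAccents(str) {
  return str.normalize('NFD').replace(/\p{Mark}/gu, '');
}

console.log(stripAccents("café"));     // "cafe"
console.log(stripAccents("Ångström")); // "Angstrom"
```

This is lossy by design — use it for search keys and slugs, not for display text.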

localeCompare() for Sorting

Alphabetical sort differs by language. JavaScript's default Array.sort() compares UTF-16 code units — it puts Z before a and has no notion of accented character ordering:

// Wrong for human-readable sorting
["Ångström", "Äpfel", "Zebra", "apple"].sort();
// ['Zebra', 'apple', 'Äpfel', 'Ångström'] — code unit order, not linguistic

// Correct: locale-aware comparison
["Ångström", "Äpfel", "Zebra", "apple"].sort((a, b) =>
  a.localeCompare(b, 'sv', { sensitivity: 'base' })
);
// Swedish: ['apple', 'Zebra', 'Ångström', 'Äpfel'] (Å and Ä sort after Z)

For performance with large arrays, use Intl.Collator to create a reusable comparator:

const collator = new Intl.Collator('de', { sensitivity: 'accent' });
largeArray.sort(collator.compare);
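Intl.Collator also solves the classic "file10 before file2" problem via its numeric option, which plain code-unit sorting gets wrong:

```javascript
const files = ["file10.txt", "file2.txt", "file1.txt"];

files.sort();
// ["file1.txt", "file10.txt", "file2.txt"] — "1" < "2" as code units

const numeric = new Intl.Collator('en', { numeric: true });
files.sort(numeric.compare);
// ["file1.txt", "file2.txt", "file10.txt"]
```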

Regex u and v Flags

The u flag (ES6) enables Unicode mode in regular expressions, making the regex engine work with code points rather than code units:

// Without u flag — matches only BMP characters
/./                .test("😀"); // true (matches the high surrogate)
/^.$/              .test("😀"); // false (two code units, not one)

// With u flag — code-point aware
/./u               .test("😀"); // true
/^.$/u             .test("😀"); // true ✓

// Unicode property escapes (require u flag)
/\p{Emoji}/u       .test("😀"); // true
/\p{Script=Latin}/u.test("A");  // true
/\p{Letter}/u      .test("中"); // true (CJK is Letter)
/\p{Number}/u      .test("③");  // true (enclosed number)

The v flag (ES2024) extends u with set notation and string properties:

// v flag: set intersection, subtraction, nested classes
/[\p{Letter}&&\p{Script=Greek}]/v.test("α"); // true — Greek letters only
/[\p{ASCII}--\p{Number}]/v.test("A");        // true — ASCII non-numbers

For any regex that handles user input or multilingual text, use the u or v flag.
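As a closing example, property escapes make a rough emoji detector a one-liner. A sketch: \p{Extended_Pictographic} matches each pictographic code point individually, so a ZWJ sequence like the family emoji counts once per component, not once per visible glyph.

```javascript
// Count pictographic code points in a string (u flag required).
const pictographs = (s) =>
  [...s.matchAll(/\p{Extended_Pictographic}/gu)].length;

console.log(pictographs("hello 😀 world 👍")); // 2
```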