Unicode in JavaScript

JavaScript uses UTF-16 internally, which means characters outside the Basic Multilingual Plane — including most emoji — are stored as surrogate pairs and cause unexpected string length and indexing behavior. This guide explains Unicode in JavaScript, covering ES6 improvements, the u flag in regex, and Intl APIs.

JavaScript strings are sequences of UTF-16 code units, not Unicode code points. For the vast majority of text this is invisible — every ASCII character and every character in the Basic Multilingual Plane (BMP, U+0000–U+FFFF) fits in exactly one code unit. The complexity begins with emoji, rare CJK extension characters, and mathematical symbols that live above U+FFFF. This guide explains how JavaScript encodes text, how modern APIs bridge the gap, and how to avoid the classic pitfalls.

Strings as UTF-16 Code Unit Sequences

Each string value in JavaScript is a sequence of 16-bit code units. The ECMAScript specification defines strings this way, although engines are free to store them more compactly internally. Characters in the BMP map to one code unit; characters above U+FFFF (the "supplementary planes") use a surrogate pair of two code units.

const arrow  = "→";          // U+2192, BMP — 1 code unit
const snake  = "🐍";          // U+1F40D, supplementary — 2 code units

console.log(arrow.length);   // 1
console.log(snake.length);   // 2  ← surprising for "one character"

// Code unit values
console.log(arrow.charCodeAt(0).toString(16));  // "2192"
console.log(snake.charCodeAt(0).toString(16));  // "d83d"  (high surrogate)
console.log(snake.charCodeAt(1).toString(16));  // "dc0d"  (low surrogate)
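
The two surrogate values above follow from simple arithmetic on the code point. As a minimal sketch (the helper name toSurrogatePair is hypothetical, not a built-in), UTF-16 subtracts 0x10000 and splits the remaining 20 bits into two 10-bit halves:

```javascript
// Hypothetical helper: compute the UTF-16 surrogate pair for a supplementary code point.
function toSurrogatePair(codePoint) {
    if (!Number.isInteger(codePoint) || codePoint < 0x10000 || codePoint > 0x10FFFF) {
        throw new RangeError("expected a code point from U+10000 to U+10FFFF");
    }
    const offset = codePoint - 0x10000;       // 20-bit value
    const high = 0xD800 + (offset >> 10);     // top 10 bits
    const low  = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
    return [high, low];
}

const [high, low] = toSurrogatePair(0x1F40D);
console.log(high.toString(16), low.toString(16));  // d83d dc0d
```

Passing both values to String.fromCharCode reassembles the original character, which is what String.fromCodePoint does for you.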

codePointAt and String.fromCodePoint

ES2015 added code-point–aware alternatives to the older charCodeAt / fromCharCode:

// Code point (not code unit) from a position
console.log("🐍".codePointAt(0));              // 128013
console.log("🐍".codePointAt(0).toString(16)); // "1f40d"

// Round-trip: code point integer → character
const snake = String.fromCodePoint(0x1f40d);   // "🐍"
const arrow  = String.fromCodePoint(0x2192);   // "→"

// Multiple characters at once
const text = String.fromCodePoint(0x48, 0x65, 0x6c, 0x6c, 0x6f); // "Hello"

Always prefer codePointAt / String.fromCodePoint over the older charCodeAt / fromCharCode when working with supplementary characters.
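
One detail is worth illustrating: the argument to codePointAt is still a code unit index, so pointing it at the second half of a pair returns the low surrogate on its own. The sketch below (codePointsOf is a made-up helper name) shows the trap and one way around it, by letting the string iterator find the boundaries:

```javascript
const sample = "A🐍B";

// Index 1 is the start of the pair; index 2 lands in the middle of it.
console.log(sample.codePointAt(1).toString(16));  // "1f40d"
console.log(sample.codePointAt(2).toString(16));  // "dc0d" (lone low surrogate)

// Hypothetical helper: list every code point, using the code-point-aware iterator.
function codePointsOf(str) {
    return Array.from(str, (ch) => ch.codePointAt(0));
}

console.log(codePointsOf(sample));  // [65, 128013, 66]
```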

Iterating Over Characters Correctly

The for...of loop and the spread operator [...str] iterate over code points, not code units. This is the idiomatic way to handle emoji and supplementary characters:

const text = "A🐍B";

// Wrong — iterates over code units, splits the surrogate pair
for (let i = 0; i < text.length; i++) {
    console.log(text[i]);   // "A", "\ud83d", "\udc0d", "B"
}

// Correct — iterates over code points
for (const char of text) {
    console.log(char);      // "A", "🐍", "B"
}

// Spread to array of code points
const chars = [...text];    // ["A", "🐍", "B"]
console.log(chars.length);  // 3  ← correct "character count"
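
The same idea applies to truncation. A minimal sketch, assuming a hypothetical truncateCodePoints helper, that shortens a string without cutting a surrogate pair in half (it can still split a grapheme made of several code points, such as a flag):

```javascript
// Hypothetical helper: keep at most maxCodePoints code points.
function truncateCodePoints(str, maxCodePoints) {
    return [...str].slice(0, maxCodePoints).join("");
}

console.log("A🐍B".slice(0, 2));             // "A\ud83d", ends in a lone surrogate
console.log(truncateCodePoints("A🐍B", 2));  // "A🐍"
```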

Grapheme Clusters

Even code-point iteration is not always enough. A flag emoji like 🇺🇸 is two code points (U+1F1FA + U+1F1F8), a skin-tone emoji is a base emoji followed by a modifier code point (U+1F3FB to U+1F3FF), and sequences such as family emoji are several code points joined by zero-width joiners (U+200D). The Intl.Segmenter API handles grapheme cluster boundaries correctly:

const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemes = [...segmenter.segment("👨‍👩‍👧")];
console.log(graphemes.length);      // 1 — one grapheme cluster
console.log([..."👨‍👩‍👧"].length); // 5: five code points (three emoji + two ZWJs)
console.log("👨‍👩‍👧".length);      // 8: eight UTF-16 code units
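
As a sketch of how this might be wrapped for reuse, the helper below counts user-perceived characters. The name countGraphemes is hypothetical, and the code assumes a runtime that provides Intl.Segmenter:

```javascript
// Hypothetical helper: count user-perceived characters (grapheme clusters).
// Assumes a runtime that provides Intl.Segmenter.
function countGraphemes(str, locale = "en") {
    const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
    let count = 0;
    for (const _segment of segmenter.segment(str)) count++;
    return count;
}

const flag = "\u{1F1FA}\u{1F1F8}";       // 🇺🇸: two regional indicator code points
const thumbsUp = "\u{1F44D}\u{1F3FD}";   // 👍🏽: base emoji + skin-tone modifier

console.log([...flag].length, countGraphemes(flag));          // 2 1
console.log([...thumbsUp].length, countGraphemes(thumbsUp));  // 2 1
```

Counting in a loop avoids building an intermediate array, which can matter for long strings.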

Escape Sequences

JavaScript supports several Unicode escape syntaxes in string literals and regular expressions:

// \uXXXX: BMP only (exactly 4 hex digits)
const euro  = "\u20AC";   // "€"
const arrow = "\u2192";   // "→"

// \u{XXXXX}: ES2015+, any code point (1–6 hex digits)
const snake  = "\u{1F40D}";   // "🐍"
const text   = "\u{48}\u{65}\u{6C}\u{6C}\u{6F}";  // "Hello"

// Legacy: explicit surrogate pair escape (still valid, but avoid in new code)
const snake2 = "\uD83D\uDC0D";  // "🐍" via surrogate pair, harder to read
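
Going the other way, from text to escapes, is handy when non-ASCII characters must survive an ASCII-only channel. A minimal sketch, where toUnicodeEscapes is a hypothetical helper that leaves ASCII untouched (including any backslashes already in the input) and rewrites everything else in the \u{...} form:

```javascript
// Hypothetical helper: rewrite every non-ASCII code point as a \u{...} escape.
function toUnicodeEscapes(str) {
    return Array.from(str, (ch) => {
        const codePoint = ch.codePointAt(0);
        return codePoint < 0x80
            ? ch
            : "\\u{" + codePoint.toString(16).toUpperCase() + "}";
    }).join("");
}

console.log(toUnicodeEscapes("a€🐍"));        // a\u{20AC}\u{1F40D}
console.log("\u{1F40D}" === "\uD83D\uDC0D");  // true, both spell the same string
```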

TextEncoder and TextDecoder

When you need raw bytes — for example, to send data over a network, write to a file via the File API, or compute a hash — use TextEncoder and TextDecoder:

// str → Uint8Array (UTF-8 bytes)
const encoder = new TextEncoder();  // always UTF-8
const bytes   = encoder.encode("café");
console.log(bytes);  // Uint8Array [99, 97, 102, 195, 169]

// Uint8Array → str
const decoder = new TextDecoder("utf-8");  // or "utf-16le", "latin1", etc.
const text    = decoder.decode(bytes);     // "café"

TextEncoder always outputs UTF-8. TextDecoder accepts a wide range of encodings (all WHATWG Encoding Standard labels).
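
Two practical uses, sketched with hypothetical helper names (utf8ByteLength and decodeUtf8Strict): measuring how many bytes a string occupies once encoded, and rejecting malformed input instead of silently repairing it. The fatal option is part of the TextDecoder constructor; by default, invalid bytes are replaced with U+FFFD.

```javascript
// Hypothetical helper: size of a string once encoded as UTF-8.
function utf8ByteLength(str) {
    return new TextEncoder().encode(str).length;
}

// Hypothetical helper: decode UTF-8 and throw on malformed input.
function decodeUtf8Strict(bytes) {
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
}

console.log("caf\u00E9".length, utf8ByteLength("caf\u00E9"));  // 4 5
console.log("🐍".length, utf8ByteLength("🐍"));                // 2 4

const malformed = new Uint8Array([0x63, 0xff]);    // 0xff never appears in valid UTF-8
console.log(new TextDecoder().decode(malformed));  // "c\uFFFD" (bad byte replaced)
```

Calling decodeUtf8Strict(malformed) throws a TypeError rather than returning a repaired string.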

The Intl APIs

Modern JavaScript ships a comprehensive internationalisation library under Intl. Key objects:

| API | Purpose | Example |
| --- | --- | --- |
| Intl.Collator | Locale-aware string comparison and sorting | In German, ä sorts right after a, not after z |
| Intl.Segmenter | Split strings into graphemes, words, sentences | Count real characters |
| Intl.ListFormat | Locale-aware list formatting | "a, b, and c" |
| Intl.NumberFormat | Format numbers (currencies, percentages) | "42,00 €" in German |

// Sorting German strings correctly
const words = ["Äpfel", "Orangen", "Bananen"];
words.sort(new Intl.Collator("de").compare);
// ["Äpfel", "Bananen", "Orangen"]  — Ä treated like A in German

// Word segmentation
const seg = new Intl.Segmenter("en", { granularity: "word" });
const words2 = [...seg.segment("Hello, world!")].filter(s => s.isWordLike);
// words2.map(s => s.segment) → ["Hello", "world"]
// (each entry also carries index, input, and isWordLike)
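
The remaining two objects from the table, plus a Collator option not shown above, can be sketched as follows. The variable names are illustrative, and exact output (for instance the kind of space before the euro sign) depends on the locale data shipped with the runtime:

```javascript
// Locale-aware list formatting
const list = new Intl.ListFormat("en", { style: "long", type: "conjunction" });
console.log(list.format(["a", "b", "c"]));   // "a, b, and c"

// Currency formatting; spacing characters come from the runtime's locale data
const euros = new Intl.NumberFormat("de-DE", { style: "currency", currency: "EUR" });
console.log(euros.format(42));

// Accent- and case-insensitive comparison via sensitivity: "base"
const baseCollator = new Intl.Collator("en", { sensitivity: "base" });
console.log(baseCollator.compare("resume", "R\u00E9sum\u00E9"));   // 0 (treated as equal)
```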

Surrogate Pairs — Deep Dive

Surrogates are the source of many subtle JavaScript bugs:

const snake = "🐍";

// slice() works on code units, so it can split a surrogate pair
snake.slice(0, 1);   // "\uD83D"  ← broken half of surrogate pair

// substring(): same issue
snake.substring(0, 1);   // "\uD83D"

// Safe: use Array.from or spread
Array.from(snake).slice(0, 1);   // ["🐍"]  ← correct
[...snake][0];                    // "🐍"

// Checking for lone surrogates
// The u flag is required: without it the class also matches each half of a valid pair
function hasLoneSurrogate(str) {
    return /[\uD800-\uDFFF]/u.test(str);
}
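
Lone surrogates are more than a cosmetic problem: encodeURIComponent, for one, throws a URIError when it meets one. A minimal sketch of a defensive wrapper follows, where replaceLoneSurrogates and safeEncodeURIComponent are hypothetical names. The regex carries the u flag so that it matches only surrogates that are not part of a valid pair.

```javascript
// Hypothetical helper: swap each lone surrogate for U+FFFD (REPLACEMENT CHARACTER).
function replaceLoneSurrogates(str) {
    // With the u flag the pattern is applied per code point, so valid pairs are skipped.
    return str.replace(/[\uD800-\uDFFF]/gu, "\uFFFD");
}

// Hypothetical wrapper: avoids the URIError that lone surrogates would trigger.
function safeEncodeURIComponent(str) {
    return encodeURIComponent(replaceLoneSurrogates(str));
}

console.log(replaceLoneSurrogates("ab\uD800") === "ab\uFFFD");   // true
console.log(replaceLoneSurrogates("Hello 🐍") === "Hello 🐍");   // true
console.log(safeEncodeURIComponent("\uD800"));                    // "%EF%BF%BD"
```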

ES2024 introduced String.prototype.isWellFormed() and toWellFormed() to detect and fix lone surrogates:

"\uD800".isWellFormed();     // false: lone high surrogate
"\uD800".toWellFormed();     // "\uFFFD", replaced with REPLACEMENT CHARACTER
"Hello 🐍".isWellFormed();  // true

Regular Expressions and the /u Flag

Without the u flag, regex patterns match UTF-16 code units. With /u, they match Unicode code points. This matters for characters above U+FFFF:

// Without /u — treats surrogate pair as two separate characters
/^.$/.test("🐍");     // false  (emoji has .length === 2)

// With /u — treats emoji as one character
/^.$/u.test("🐍");    // true

// \p{} Unicode property escapes require /u or /v
/\p{Emoji}/u.test("🐍");       // true
/\p{Script=Latin}/u.test("A"); // true

// /v flag (ES2024): superset of /u with set notation
/[\p{Letter}&&\p{ASCII}]/v.test("A");  // true (intersection)

Always use the u flag (or v in ES2024+) in new regular expressions; the two flags cannot be combined on the same pattern.
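
To make the difference concrete, here is a small sketch built on property escapes. The extractWords helper is hypothetical; it treats a "word" as a run of letters and combining marks in any script, which is a simplification compared with Intl.Segmenter:

```javascript
// Hypothetical helper: pull out runs of letters (plus combining marks) in any script.
function extractWords(text) {
    return text.match(/[\p{L}\p{M}]+/gu) ?? [];
}

console.log(extractWords("café, naïve, 日本語 🐍!"));  // ["café", "naïve", "日本語"]

// Without the u flag, "." consumes one code unit, so each emoji counts twice.
console.log("🐍🐍".match(/./g).length);    // 4
console.log("🐍🐍".match(/./gu).length);   // 2
```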

Common Pitfalls Summary

| Pitfall | Problem | Fix |
| --- | --- | --- |
| str.length for character count | Returns code unit count | [...str].length or Intl.Segmenter |
| str[i] indexing emoji | Returns lone surrogate | [...str][i] |
| Regex without /u | Treats emoji as two chars | Add /u flag |
| charCodeAt for all chars | Returns only a surrogate above U+FFFF | Use codePointAt |
| String.fromCharCode for emoji | Needs surrogate pair | Use String.fromCodePoint |
| Sorting with < / > | UTF-16 code unit order, not locale order | Intl.Collator |

Quick Reference

// Code point of a character
"→".codePointAt(0)                // 8594
"→".codePointAt(0).toString(16)   // "2192"

// Character from code point
String.fromCodePoint(0x2192)      // "→"
String.fromCodePoint(0x1F40D)     // "🐍"

// True character count
[..."café🐍"].length               // 5

// Iterate safely
for (const ch of "café🐍") { /* use ch */ }

// Encode to bytes
new TextEncoder().encode("café")

// Decode from bytes
new TextDecoder().decode(uint8array)

// Locale-aware sort
arr.sort(new Intl.Collator("en").compare)

// Regex with Unicode support
/\p{Letter}+/u.exec("café")

Understanding the UTF-16 internals of JavaScript strings is essential for writing robust internationalized applications. The modern APIs — codePointAt, String.fromCodePoint, for...of, Intl.Segmenter, and the /u regex flag — give you the tools to handle all of Unicode correctly.
