
Unicode in JavaScript

JavaScript uses UTF-16 internally, which means characters outside the Basic Multilingual Plane — including most emoji — are stored as surrogate pairs and cause unexpected string length and indexing behavior. This guide explains Unicode in JavaScript, covering ES6 improvements, the u flag in regex, and Intl APIs.

Published 2022-02-03 · Updated 2024-11-14

JavaScript strings are sequences of UTF-16 code units, not Unicode code points. For the vast majority of text this is invisible — every ASCII character and every character in the Basic Multilingual Plane (BMP, U+0000–U+FFFF) fits in exactly one code unit. The complexity begins with emoji, rare CJK extension characters, and mathematical symbols that live above U+FFFF. This guide explains how JavaScript encodes text, how modern APIs bridge the gap, and how to avoid the classic pitfalls.

Strings as UTF-16 Code Unit Sequences

Each string value in JavaScript is backed by a sequence of 16-bit code units (the internal format defined in the ECMAScript spec). Characters in the BMP map to one code unit; characters above U+FFFF (the "supplementary planes") use a surrogate pair of two code units.

const arrow  = "→";          // U+2192, BMP — 1 code unit
const snake  = "🐍";          // U+1F40D, supplementary — 2 code units

console.log(arrow.length);   // 1
console.log(snake.length);   // 2  ← surprising for "one character"

// Code unit values
console.log(arrow.charCodeAt(0).toString(16));  // "2192"
console.log(snake.charCodeAt(0).toString(16));  // "d83d"  (high surrogate)
console.log(snake.charCodeAt(1).toString(16));  // "dc0d"  (low surrogate)
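
The surrogate values above follow from simple arithmetic on the code point. The following is a minimal sketch of that calculation; `toSurrogatePair` is a hypothetical helper name used for illustration, not a built-in:

```javascript
// Sketch: split a supplementary code point into its UTF-16 surrogate pair.
// toSurrogatePair is an illustrative helper, not part of the language.
function toSurrogatePair(codePoint) {
    if (!Number.isInteger(codePoint) || codePoint < 0x10000 || codePoint > 0x10ffff) {
        throw new RangeError("expected a supplementary code point (U+10000 to U+10FFFF)");
    }
    const offset = codePoint - 0x10000;      // 20-bit value
    const high = 0xd800 + (offset >> 10);    // top 10 bits
    const low = 0xdc00 + (offset & 0x3ff);   // bottom 10 bits
    return [high, low];
}

const [high, low] = toSurrogatePair(0x1f40d);
console.log(high.toString(16));                         // "d83d"
console.log(low.toString(16));                          // "dc0d"
console.log(String.fromCharCode(high, low) === "🐍");   // true
```

In practice you rarely need to do this by hand, but it explains why the two code units of "🐍" look unrelated to U+1F40D.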

`codePointAt` and `String.fromCodePoint`

ES2015 added code-point–aware alternatives to the older charCodeAt / fromCharCode:

// Code point (not code unit) from a position
console.log("🐍".codePointAt(0));              // 128013
console.log("🐍".codePointAt(0).toString(16)); // "1f40d"

// Round-trip: code point integer → character
const snake = String.fromCodePoint(0x1f40d);   // "🐍"
const arrow  = String.fromCodePoint(0x2192);   // "→"

// Multiple characters at once
const text = String.fromCodePoint(0x48, 0x65, 0x6c, 0x6c, 0x6f); // "Hello"

Always prefer codePointAt / String.fromCodePoint over the older charCodeAt / fromCharCode when working with supplementary characters.
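
One detail worth keeping in mind: the argument to codePointAt is still a code unit index, so pointing it at the second half of a surrogate pair returns the low surrogate. A small sketch (the helper name `listCodePoints` is an assumption made for this example):

```javascript
// Sketch: list the code points of a string in U+XXXX notation.
// listCodePoints is a hypothetical helper for illustration.
function listCodePoints(str) {
    // Array.from iterates by code point, so each `ch` holds one whole code point.
    return Array.from(str, (ch) =>
        "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
    );
}

console.log(listCodePoints("A→🐍"));   // ["U+0041", "U+2192", "U+1F40D"]

// Pitfall: index 1 is the middle of the surrogate pair.
console.log("🐍".codePointAt(1).toString(16));   // "dc0d" (low surrogate only)
```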

Iterating Over Characters Correctly

The for...of loop and the spread operator [...str] iterate over code points, not code units. This is the idiomatic way to handle emoji and supplementary characters:

const text = "A🐍B";

// Wrong — iterates over code units, splits the surrogate pair
for (let i = 0; i < text.length; i++) {
    console.log(text[i]);   // "A", "\ud83d", "\udc0d", "B"
}

// Correct — iterates over code points
for (const char of text) {
    console.log(char);      // "A", "🐍", "B"
}

// Spread to array of code points
const chars = [...text];    // ["A", "🐍", "B"]
console.log(chars.length);  // 3  ← correct "character count"
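
The same iteration can be used to shorten a string without cutting a surrogate pair in half. This is a sketch only; `truncateCodePoints` is a hypothetical name, and it counts code points, not user-perceived characters:

```javascript
// Sketch: keep at most `maxCodePoints` code points from the start of a string.
// truncateCodePoints is an illustrative helper, not a built-in.
function truncateCodePoints(str, maxCodePoints) {
    let result = "";
    let count = 0;
    for (const ch of str) {          // for...of yields whole code points
        if (count >= maxCodePoints) break;
        result += ch;
        count++;
    }
    return result;
}

console.log(truncateCodePoints("A🐍B", 2));        // "A🐍"
console.log("A🐍B".slice(0, 2) === "A\ud83d");     // true (slice cut the pair)
```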

Grapheme Clusters

Even code-point iteration is not always enough. A flag emoji like 🇺🇸 is two code points (U+1F1FA + U+1F1F8), skin-tone emoji append a modifier code point (U+1F3FB–U+1F3FF) to a base emoji, and family or profession emoji are several code points joined by a zero-width joiner (U+200D). The Intl.Segmenter API handles grapheme cluster boundaries correctly:

const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemes = [...segmenter.segment("👨‍👩‍👧")];
console.log(graphemes.length);      // 1 — one grapheme cluster
console.log([..."👨‍👩‍👧"].length); // 5 code points (3 emoji + 2 ZWJs); .length would be 8 code units
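
A sketch of a reusable grapheme counter follows. The name `countGraphemes` and the fallback to code-point counting when Intl.Segmenter is unavailable are assumptions made for this example; the fallback is only an approximation:

```javascript
// Sketch: count user-perceived characters (grapheme clusters).
// countGraphemes is a hypothetical helper; the fallback branch is an
// approximation for runtimes that lack Intl.Segmenter.
function countGraphemes(str, locale = "en") {
    if (typeof Intl === "undefined" || typeof Intl.Segmenter !== "function") {
        return [...str].length;   // code points, not graphemes
    }
    const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
    let count = 0;
    for (const _segment of segmenter.segment(str)) count++;
    return count;
}

// Escapes are used so the invisible joiners are explicit.
const flag = "\u{1F1FA}\u{1F1F8}";                          // 🇺🇸
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";   // 👨‍👩‍👧
const accented = "e\u0301";                                 // "e" + combining acute

console.log(countGraphemes(flag));       // 1
console.log(countGraphemes(family));     // 1
console.log(countGraphemes(accented));   // 1
console.log([...family].length);         // 5 code points
console.log(family.length);              // 8 code units
```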

Escape Sequences

JavaScript supports several Unicode escape syntaxes in string literals and regular expressions:

// \uXXXX: BMP only (exactly 4 hex digits)
const euro  = "\u20AC";   // "€"
const arrow = "\u2192";   // "→"

// \u{XXXXX}: ES2015+, any code point (1–6 hex digits)
const snake  = "\u{1F40D}";   // "🐍"
const text   = "\u{48}\u{65}\u{6C}\u{6C}\u{6F}";  // "Hello"

// Legacy: surrogate pair escape (still valid, but avoid in new code)
const snake2 = "\uD83D\uDC0D";  // "🐍" via surrogate pair, harder to read
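
Going the other way, from text to escapes, is handy for logging or for generating ASCII-only source. A minimal sketch, assuming you only want non-ASCII code points escaped (`escapeNonAscii` is a hypothetical helper; it does not escape quotes, backslashes, or control characters):

```javascript
// Sketch: rewrite every non-ASCII code point as a \u{...} escape.
// escapeNonAscii is an illustrative helper, not a built-in.
function escapeNonAscii(str) {
    let out = "";
    for (const ch of str) {                       // one code point per step
        const cp = ch.codePointAt(0);
        out += cp < 0x80 ? ch : "\\u{" + cp.toString(16).toUpperCase() + "}";
    }
    return out;
}

console.log(escapeNonAscii("caf\u00E9\u{1F40D}"));   // caf\u{E9}\u{1F40D}
```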

`TextEncoder` and `TextDecoder`

When you need raw bytes — for example, to send data over a network, write to a file via the File API, or compute a hash — use TextEncoder and TextDecoder:

// str → Uint8Array (UTF-8 bytes)
const encoder = new TextEncoder();  // always UTF-8
const bytes   = encoder.encode("café");
console.log(bytes);  // Uint8Array [99, 97, 102, 195, 169]

// Uint8Array → str
const decoder = new TextDecoder("utf-8");  // or "utf-16le", "latin1", etc.
const text    = decoder.decode(bytes);     // "café"

TextEncoder always outputs UTF-8. TextDecoder accepts a wide range of encodings (all WHATWG Encoding Standard labels).
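
Two common uses are sketched below: measuring the UTF-8 size of a string, and decoding strictly so that malformed input is rejected. The helper names `utf8ByteLength` and `decodeUtf8Strict` are assumptions for this example; the `fatal` option itself is part of the TextDecoder API:

```javascript
// Sketch: UTF-8 byte length of a string (differs from .length).
function utf8ByteLength(str) {
    return new TextEncoder().encode(str).length;
}

console.log("🐍".length);               // 2 (UTF-16 code units)
console.log(utf8ByteLength("🐍"));      // 4 (UTF-8 bytes)
console.log(utf8ByteLength("café"));    // 5

// Sketch: strict decoding. With { fatal: true } malformed bytes throw a
// TypeError; without it they are replaced by U+FFFD.
function decodeUtf8Strict(bytes) {
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
}

const invalid = new Uint8Array([0x63, 0xff]);   // 0xFF never occurs in UTF-8
let strictError = null;
try {
    decodeUtf8Strict(invalid);
} catch (err) {
    strictError = err;
}
console.log(strictError instanceof TypeError);               // true
console.log(new TextDecoder().decode(invalid) === "c\uFFFD"); // true
```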

The `Intl` APIs

Modern JavaScript ships a comprehensive internationalisation library under Intl. Key objects:

API	Purpose	Example
`Intl.Collator`	Locale-aware string comparison and sorting	German `ä` after `a`
`Intl.Segmenter`	Split strings into graphemes, words, sentences	Count real characters
`Intl.ListFormat`	Locale-aware list formatting	"a, b, and c"
`Intl.NumberFormat`	Format numbers (currencies, percentages)	"42,00 €" in German

// Sorting German strings correctly
const words = ["Äpfel", "Orangen", "Bananen"];
words.sort(new Intl.Collator("de").compare);
// ["Äpfel", "Bananen", "Orangen"]  — Ä treated like A in German

// Word segmentation
const seg = new Intl.Segmenter("en", { granularity: "word" });
const words2 = [...seg.segment("Hello, world!")].filter(s => s.isWordLike);
// [{segment: "Hello", ...}, {segment: "world", ...}]  (each item also has index, input, isWordLike)
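
Intl.Collator also accepts an options object. The sketch below shows two options that come up often, `sensitivity` and `numeric`; the sample data is made up for illustration:

```javascript
// Sketch: accent- and case-insensitive comparison.
// sensitivity: "base" treats a, A, and á as equal.
const baseCollator = new Intl.Collator("en", { sensitivity: "base" });
console.log(baseCollator.compare("resume", "R\u00E9sum\u00E9") === 0);   // true

// Sketch: natural ordering of embedded numbers.
const numericCollator = new Intl.Collator("en", { numeric: true });
const files = ["item10", "item2", "item1"];   // made-up sample data

const naturalOrder = [...files].sort(numericCollator.compare);
const defaultOrder = [...files].sort();       // plain code unit comparison

console.log(naturalOrder);   // ["item1", "item2", "item10"]
console.log(defaultOrder);   // ["item1", "item10", "item2"]
```

Creating the collator once and reusing its compare function avoids rebuilding locale data on every comparison.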

Surrogate Pairs — Deep Dive

Surrogates are the source of many subtle JavaScript bugs:

const snake = "🐍";

// slice() works on code units — can split a surrogate pair
snake.slice(0, 1);   // "\uD83D"  ← broken half of surrogate pair

// substring() — same issue
snake.substring(0, 1);   // "\uD83D"

// Safe: use Array.from or spread
Array.from(snake).slice(0, 1);   // ["🐍"]  ← correct
[...snake][0];                    // "🐍"

// Checking for lone surrogates: the u flag is required, otherwise the
// class also matches both halves of a valid surrogate pair
function hasLoneSurrogate(str) {
    return /[\uD800-\uDFFF]/u.test(str);
}
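
Lone surrogates matter in practice because some built-ins reject them: encodeURIComponent throws a URIError when it meets one. A sketch of a guard that replaces lone surrogates with U+FFFD before encoding (`safeEncodeURIComponent` is a hypothetical wrapper name):

```javascript
// Sketch: replace lone surrogates with U+FFFD, then percent-encode.
// safeEncodeURIComponent is an illustrative wrapper, not a built-in.
function safeEncodeURIComponent(str) {
    // With the u flag, this class matches only unpaired surrogates;
    // valid pairs are seen as a single code point above U+FFFF.
    const cleaned = str.replace(/[\uD800-\uDFFF]/gu, "\uFFFD");
    return encodeURIComponent(cleaned);
}

let rawError = null;
try {
    encodeURIComponent("a\uD800");
} catch (err) {
    rawError = err;
}
console.log(rawError instanceof URIError);          // true
console.log(safeEncodeURIComponent("a\uD800"));     // "a%EF%BF%BD"
console.log(safeEncodeURIComponent("🐍"));          // "%F0%9F%90%8D"
```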

ES2024 introduced String.prototype.isWellFormed() and toWellFormed() to detect and fix lone surrogates:

"\uD800".isWellFormed();     // false (lone high surrogate)
"\uD800".toWellFormed();     // "\uFFFD" (replaced with REPLACEMENT CHARACTER)
"Hello 🐍".isWellFormed();  // true

Regular Expressions and the `/u` Flag

Without the u flag, regex patterns match UTF-16 code units. With /u, they match Unicode code points. This matters for characters above U+FFFF:

// Without /u — treats surrogate pair as two separate characters
/^.$/.test("🐍");     // false  (emoji has .length === 2)

// With /u — treats emoji as one character
/^.$/u.test("🐍");    // true

// \p{} Unicode property escapes require /u or /v
/\p{Emoji}/u.test("🐍");       // true
/\p{Emoji}/u.test("7");        // true: digits, # and * also have the Emoji property
/\p{Script=Latin}/u.test("A"); // true

// /v flag (ES2024): superset of /u with set notation
/[\p{Letter}&&\p{ASCII}]/v.test("A");  // true (intersection)

Always use the u flag (or v in ES2024+) in new regular expressions.
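
A short sketch of the u flag on realistic input follows. The sample sentence is made up for illustration, and including \p{Mark} in the class is a choice made here so that decomposed accents stay attached to their base letters:

```javascript
// Sketch: extract words from mixed-script text.
const sample = "naïve café Ελληνικά";   // made-up sample text
const words = sample.match(/[\p{Letter}\p{Mark}]+/gu);
console.log(words.length);   // 3

// Script checks work on whole strings too.
console.log(/^\p{Script=Greek}+$/u.test(words[2]));   // true
console.log(/^\p{Script=Greek}+$/u.test(words[0]));   // false

// The dot matches one code point with /u, one code unit without it.
console.log("A🐍B".match(/./gu).length);   // 3
console.log("A🐍B".match(/./g).length);    // 4
```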

Common Pitfalls Summary

Pitfall	Problem	Fix
`str.length` for character count	Returns code unit count	`[...str].length` or `Intl.Segmenter`
`str[i]` indexing emoji	Returns lone surrogate	`[...str][i]`
Regex without `/u`	Treats emoji as two chars	Add `/u` flag
`charCodeAt` for all chars	Returns only a surrogate half above U+FFFF	Use `codePointAt`
`String.fromCharCode` for emoji	Needs surrogate pair	Use `String.fromCodePoint`
Sorting with `<` / `>`	UTF-16 code unit order, not locale	`Intl.Collator`

Quick Reference

// Code point of a character
"→".codePointAt(0)                // 8594
"→".codePointAt(0).toString(16)   // "2192"

// Character from code point
String.fromCodePoint(0x2192)      // "→"
String.fromCodePoint(0x1F40D)     // "🐍"

// Code point count (use Intl.Segmenter for grapheme clusters)
[..."café🐍"].length               // 5

// Iterate safely
for (const ch of "café🐍") { /* ... */ }

// Encode to bytes
new TextEncoder().encode("café")

// Decode from bytes
new TextDecoder().decode(uint8array)

// Locale-aware sort
arr.sort(new Intl.Collator("en").compare)

// Regex with Unicode support
/\p{Letter}+/u.exec("café")

Understanding the UTF-16 internals of JavaScript strings is essential for writing robust internationalized applications. The modern APIs — codePointAt, String.fromCodePoint, for...of, Intl.Segmenter, and the /u regex flag — give you the tools to handle all of Unicode correctly.
