The Unicode Odyssey · Chapter 9
Unicode on the Web: HTML, CSS, and Beyond
The web is built on Unicode — from HTML entities to CSS content properties to web fonts. This chapter covers everything you need to know about using Unicode correctly in web development.
The web is the largest deployment of Unicode in human history. Every HTTP request, every HTML page, every CSS rule, every URL, every JavaScript string — all of them operate within a Unicode-aware infrastructure that took decades to build and that still has rough edges. Understanding how Unicode flows through the web stack, from server bytes to browser pixels, is essential knowledge for any web developer working across language boundaries.
The Character Encoding Declaration
Before a browser can interpret any character in an HTML document, it needs to know the encoding. HTML5 provides two mechanisms, and the rules for which one wins are precisely specified.
HTTP Content-Type header (highest priority for external resources):
Content-Type: text/html; charset=utf-8
Meta charset tag (detected during the browser's prescan of the document):
<meta charset="UTF-8">
The older HTTP-Equiv form still works:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
HTML5 defines a careful sniffing algorithm: before decoding begins, the browser prescans the first 1,024 bytes of the document looking for a BOM or a meta charset declaration. This is why the <meta charset> tag must appear within the first 1,024 bytes, ideally as the very first element inside <head>.
If no encoding is declared, browsers fall back to a locale-dependent default (windows-1252 for most Western locales), in some cases preceded by a detection heuristic based on byte frequency analysis — a technique that works reasonably well for major encodings but can produce incorrect results for ambiguous byte sequences. This unpredictable fallback is exactly why the strong recommendation is to always declare UTF-8 explicitly.
HTML Entities: Two Ways to Write One Character
HTML provides character references as an alternative to writing non-ASCII characters directly in the source. They come in three forms:
Named character references: &amp; (&), &lt; (<), &gt; (>), &nbsp; (non-breaking space), &eacute; (é), &copy; (©)
Decimal numeric character references: &#233; (decimal 233 = U+00E9, é)
Hexadecimal numeric character references: &#xE9; (hex E9 = U+00E9, é)
In a UTF-8 document, there is no technical reason to use character references for characters like é, ™, or €. Write them directly. Character references are necessary only for the five characters that have special meaning in HTML syntax:
| Character | Reference | Reason |
|---|---|---|
| & | &amp; | Starts all character references |
| < | &lt; | Starts tag syntax |
| > | &gt; | Ends tag syntax |
| " | &quot; | Attribute value delimiter |
| ' | &apos; | Attribute value delimiter (HTML5) |
The old practice of encoding every non-ASCII character as an entity (writing &eacute; instead of é) was a workaround for unreliable encoding handling in early browsers. In modern UTF-8 HTML it is unnecessary and only makes source files harder to read.
URL Encoding: Percent-Encoding
URLs are fundamentally ASCII — the RFC 3986 URI specification defines a syntax using only ASCII characters. Non-ASCII characters in URLs are represented using percent-encoding (also called URL encoding): each byte of the UTF-8 encoding is written as %XX where XX is the hex byte value.
For "café" in a URL:
c → %63 (or c, since it's unreserved)
a → %61 (or a)
f → %66 (or f)
é → UTF-8: C3 A9 → %C3%A9
Result, fully percent-encoded: /restaurants/%63%61%66%C3%A9 — or, with unreserved characters left as-is: /restaurants/caf%C3%A9
The full process: take the character → encode as UTF-8 → percent-encode each byte that isn't an unreserved character (A-Z, a-z, 0-9, -, ., _, ~).
Internationalized Resource Identifiers (IRIs) (RFC 3987) extend this to allow Unicode characters directly in the URI, with percent-encoding only for characters outside the IRI allowed set. Modern browsers display IRIs in their Unicode form in the address bar (for trusted scripts) while transmitting the percent-encoded form in HTTP requests.
IRI displayed: https://例え.jp/パス
HTTP request sends: https://xn--r8jz45g.jp/%E3%83%91%E3%82%B9 (the hostname is converted with Punycode per IDNA; the path uses percent-encoded UTF-8)
In JavaScript:
// Encoding
encodeURIComponent("café") // "caf%C3%A9" (encodes all non-URI chars)
encodeURI("https://example.com/café") // preserves URI structure chars
// Decoding
decodeURIComponent("caf%C3%A9") // "café"
CSS and Unicode
CSS interacts with Unicode in several important ways.
Unicode-Range Descriptor
The @font-face rule's unicode-range descriptor specifies which codepoints a particular font file covers. This allows loading multiple font files — one for Latin, one for CJK, one for Arabic — and having the browser select the appropriate one based on the characters in the text:
@font-face {
font-family: 'MultiScript';
src: url('latin.woff2');
unicode-range: U+0000-00FF, U+0131, U+0152-0153, U+02BB-02BC, U+02C6;
}
@font-face {
font-family: 'MultiScript';
src: url('japanese.woff2');
unicode-range: U+3000-9FFF, U+FF00-FFEF;
}
The browser only downloads the Japanese font file when Japanese characters are actually present on the page — a significant performance optimization for multilingual sites.
Font Fallback and Coverage
No single font covers all of Unicode's 155,000+ characters. Browsers implement font fallback: when the primary font lacks a glyph for a character, the browser checks the next font in the font-family stack, then the system's default fallback fonts. This is why text in rare scripts or uncommon emoji usually renders correctly even without explicitly setting a font — the OS provides fallback fonts.
The font-family: 'Noto Sans', sans-serif pattern leverages Google's Noto font project, which provides fonts covering virtually all Unicode scripts; the name is short for "No Tofu" — a reference to the tofu boxes (empty rectangles) that appear when a glyph is missing.
CSS Content and Unicode Escapes
In CSS, Unicode characters can be written directly or using CSS escape syntax:
/* Direct Unicode character */
.icon::before { content: "→"; }
/* CSS Unicode escape (backslash + hex codepoint) */
.icon::before { content: "\2192"; } /* → RIGHTWARDS ARROW */
/* Up to six hex digits; no curly braces (unlike JavaScript's \u{…} escapes) */
.icon::before { content: "\01F600"; } /* 😀 GRINNING FACE */
CSS also provides the quotes property for specifying quotation mark characters appropriate to a document's language:
:lang(fr) { quotes: "«" "»" "‹" "›"; }
:lang(de) { quotes: "\201E" "\201C" "\201A" "\2018"; } /* „ “ ‚ ‘ */
:lang(ja) { quotes: "「" "」" "『" "』"; }
JavaScript String Handling in the DOM
JavaScript's interaction with the DOM introduces another layer of Unicode complexity. DOM text nodes, attribute values, and innerHTML content are all sequences of UTF-16 code units internally.
When you set element.textContent = str, the browser inserts the string's text content as a text node, correctly handling all Unicode including surrogates. When you read element.textContent, you get the text content back as a JavaScript string.
The critical distinction is between textContent and innerHTML:
// SAFE: textContent inserts the string as inert text — it is never parsed as HTML
element.textContent = "<script>alert(1)</script>";
// → displays the literal string, no XSS
// DANGEROUS: innerHTML parses HTML
element.innerHTML = userInput; // XSS vulnerability if input not sanitized
For Unicode, innerHTML has an additional subtlety: HTML entities in the string are decoded. element.innerHTML = "&lt;b&gt;" renders as the text <b>, not as the literal string &lt;b&gt;.
Form Submission Encoding
HTML forms submit data in the form's declared charset, which defaults to the document charset (UTF-8 in modern pages). The accept-charset attribute on <form> can specify an alternative encoding, but this is rarely needed since UTF-8 handles all characters.
Form data is percent-encoded for application/x-www-form-urlencoded (the default) or transmitted as-is for multipart/form-data. When a server receives form data, it should:
- Know the content type and charset from the Content-Type header
- Decode percent-encoding (for urlencoded forms)
- Interpret the resulting bytes as the declared charset
A common bug: servers that assume form data is ASCII or Latin-1 and corrupt non-ASCII input as a result. In modern web applications, always decode form data as UTF-8.
Server-Side: Content-Type Headers Matter
Servers must send correct Content-Type headers. A mismatch between the declared charset and the actual file encoding corrupts all non-ASCII characters for every reader.
# Correct for UTF-8 HTML
Content-Type: text/html; charset=utf-8
# Correct for UTF-8 JSON (RFC 8259 mandates UTF-8; charset is informational)
Content-Type: application/json; charset=utf-8
# Correct for UTF-8 XML
Content-Type: application/xml; charset=utf-8
For JSON specifically, RFC 8259 (the current JSON standard) requires UTF-8 encoding. The charset parameter is technically unnecessary but widely included for clarity.
Web Fonts and Unicode Coverage
For production multilingual web applications, web font selection and loading strategy significantly impact both rendering quality and performance:
- System fonts provide the broadest Unicode coverage for free but vary across OS and look inconsistent
- Google Fonts provides good coverage for many scripts with CDN delivery and subsetting
- Noto fonts provide complete Unicode coverage but are large; use unicode-range subsetting
- Variable fonts can cover multiple weights/styles in one file, reducing requests
The font loading strategy matters for perceived performance: use font-display: swap to show text in a fallback font while the custom font loads, preventing invisible text during font download (FOIT — Flash of Invisible Text).
The lang Attribute and Unicode
The HTML lang attribute doesn't just help search engines and accessibility tools — it affects Unicode-aware features of text rendering. Browsers and operating systems use the lang attribute to:
- Select appropriate font fallbacks for ambiguous Unicode characters (CJK characters used in Japanese vs. Chinese have different preferred glyphs)
- Apply correct hyphenation rules
- Determine line-breaking behavior (Thai and Chinese have different word-boundary rules)
- Choose appropriate quote characters when using CSS quotes: auto
<html lang="zh-Hans"><!-- Simplified Chinese — selects simplified CJK glyphs -->
<html lang="ja"><!-- Japanese — selects Japanese glyph preferences -->
<p lang="ar" dir="rtl"><!-- Arabic — lang selects Arabic typography; note that paragraph direction comes from dir="rtl", not from lang -->
The web's Unicode infrastructure is mature and robust when used correctly. The key disciplines are: declare UTF-8 everywhere, normalize input at system boundaries, use textContent over innerHTML, percent-encode URLs, and specify lang attributes for multilingual content. These practices, applied consistently, produce web experiences that work correctly for users in every language.