CSS and Unicode: Beyond content: ""

CSS touches Unicode in more ways than most developers realize. Beyond the obvious content property, there are font subsetting directives, bidirectional text controls, generated quotation marks that vary by language, and text-transform rules that must understand Unicode case mapping to work correctly. This chapter covers the full spectrum of Unicode-aware CSS, from the practical to the obscure.

CSS Unicode Escapes in the `content` Property

The CSS content property inserts generated text into the document via ::before and ::after pseudo-elements. To insert a character by code point, use a backslash followed by the hexadecimal code point:

/* Snowman U+2603 */
.weather-icon::before {
  content: "\2603";
}

/* Checkmark U+2713; the trailing space only terminates the escape */
.done::before {
  content: "\2713 ";
  color: green;
}

/* Emoji: thumbs up U+1F44D */
.like::before {
  content: "\1F44D";
}

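The CSS escape syntax is a backslash followed by one to six hexadecimal digits, optionally followed by a single whitespace character that is consumed as part of the escape and not output. So "\2603 " (with a trailing space) inserts just a snowman: the space terminates the escape and is not part of the output. The terminator matters when the next character is itself a hex digit, because the parser would otherwise read that digit as part of the code point.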
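The terminator rule is easiest to see when the text after the escape begins with a hex digit. The sketch below uses hypothetical class names; the results described in the comments follow from the escape rule above rather than from any particular browser.

```css
/* Hypothetical class names, for illustration only. */

/* 'd' is a hex digit, so this parses as U+2713D followed by "one",
   not as a checkmark followed by "done". */
.wrong::before {
  content: "\2713done";
}

/* The single space ends the escape and is consumed: renders "✓done". */
.terminated::before {
  content: "\2713 done";
}

/* Two spaces: the first ends the escape, the second is rendered: "✓ done". */
.spaced::before {
  content: "\2713  done";
}

/* Padding to six hex digits ends the escape without a space: "✓done". */
.padded::before {
  content: "\002713done";
}
```

The six-digit form is the most robust choice for generated CSS, since it does not depend on what character happens to follow.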

You can also include literal Unicode characters directly in your CSS file (saved as UTF-8):

.weather-icon::before {
  content: "☃";  /* literal snowman */
}

Both forms produce the same result, provided the stylesheet is actually decoded as UTF-8. Prefer literals for readability; use escapes when your tooling strips non-ASCII characters or when you need exact control over invisible characters.
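Invisible characters are where escapes are most useful, because a literal no-break space or zero-width space is indistinguishable from nothing in most editors. A minimal sketch, with hypothetical class names:

```css
/* Hypothetical class names, for illustration only. */

/* No-break space (U+00A0) before a unit label. */
.unit::before {
  content: "\A0";
}

/* Line feed (U+000A); white-space: pre keeps it as a visible line break. */
.line-break::after {
  content: "\A";
  white-space: pre;
}

/* Zero-width space (U+200B): an invisible line-break opportunity. */
.soft-break::after {
  content: "\200B";
}
```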

`unicode-range` in @font-face for Subsetting

Loading an entire font file for a page that uses only Latin characters is wasteful. The unicode-range descriptor in @font-face tells the browser to download a font file only when the page actually contains characters in the specified range:

/* Load the CJK face only when CJK ideographs appear */
@font-face {
  font-family: 'NotoSansCJK';
  src: url('/fonts/NotoSansCJK-Regular.woff2') format('woff2');
  unicode-range: U+4E00-9FFF,  /* CJK Unified Ideographs */
                 U+3400-4DBF,  /* CJK Extension A */
                 U+20000-2A6DF; /* CJK Extension B */
}

/* Load a lightweight Latin font for the rest. The family name is the same
   on purpose: the browser treats both faces as one composite family. */
@font-face {
  font-family: 'NotoSansCJK';
  src: url('/fonts/NotoSans-Regular.woff2') format('woff2');
  unicode-range: U+0000-00FF,  /* Basic Latin + Latin-1 Supplement */
                 U+0100-017F;  /* Latin Extended-A */
}

The browser checks whether the text that uses this font family contains any character in a face's unicode-range. If not, that font file is never downloaded. Google Fonts uses this technique extensively, which is why a single font-family request on Google Fonts results in several small downloads instead of one large file.

The range syntax supports:

- Single code points: U+26
- Ranges: U+0025-00FF
- Wildcard digits: U+004? (matches U+0040 through U+004F)
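All three forms can be mixed in one comma-separated descriptor. The following sketch assumes a hypothetical family name and file path; only the unicode-range syntax is the point.

```css
/* Hypothetical family name and file path. */
@font-face {
  font-family: "Symbols Subset";
  src: url("/fonts/symbols-subset.woff2") format("woff2");
  unicode-range: U+26,         /* single code point: ampersand */
                 U+2190-21FF,  /* range: the Arrows block */
                 U+26??;       /* wildcard: U+2600 through U+26FF */
}
```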

Quotation Marks: The `quotes` Property

Different languages use different quotation mark conventions. English uses “...” and ‘...’, French uses «...» with non-breaking spaces, German uses „...“, Japanese uses 「...」. CSS can handle all of these:

:lang(en) { quotes: "\201C" "\201D" "\2018" "\2019"; }  /* “...” ‘...’ */
:lang(fr) { quotes: "\AB\A0" "\A0\BB" "\2039\A0" "\A0\203A"; }  /* «...» ‹...› */
:lang(de) { quotes: "\201E" "\201C" "\201A" "\2018"; }  /* „...“ ‚...‘ */
:lang(ja) { quotes: "\300C" "\300D" "\300E" "\300F"; }  /* 「...」『...』 */

Then in your HTML, use the <q> element and CSS injects the correct glyphs automatically:

<p lang="fr">Il a dit <q>bonjour</q> à tout le monde.</p>

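Output: Il a dit « bonjour » à tout le monde. (The spaces inside the guillemets are the U+00A0 non-breaking spaces from the French quotes rule.)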

The four values in quotes are, in order: outer-open, outer-close, inner-open, inner-close. The inner pair is used for a <q> element nested inside another <q>.
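The same quote pairs are available to any element through the open-quote and close-quote keywords of the content property. The sketch below assumes a hypothetical .pull-quote class; the nested-quote markup in the comment is likewise only an illustration.

```css
/* With the :lang(en) rule shown earlier, nested <q> elements use the
   inner pair:
   <p lang="en"><q>She said <q>hello</q> and left.</q></p>
   renders as: “She said ‘hello’ and left.” */

/* .pull-quote is a hypothetical class. open-quote and close-quote insert
   the pair for the current nesting depth and keep that depth in sync. */
.pull-quote {
  quotes: "\201C" "\201D" "\2018" "\2019";
}

.pull-quote::before {
  content: open-quote;
}

.pull-quote::after {
  content: close-quote;
}
```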

CSS Generated Content with Emoji

Emoji work in content but require care. Color emoji are usually rendered by a system font (Apple Color Emoji, Noto Color Emoji, Segoe UI Emoji), not your web font. A variation selector appended to the character requests which presentation is used:

/* Request emoji presentation (color) */
.heart::before {
  content: "\2764\FE0F";  /* ❤ + VS16 emoji variation selector */
}

/* Request text presentation (monochrome) */
.heart-text::before {
  content: "\2764\FE0E";  /* ❤ + VS15 text variation selector */
}

Many characters have both text and emoji presentations. Variation Selector 15 (U+FE0E) requests text; Variation Selector 16 (U+FE0F) requests emoji. Without a selector, the result depends on the character's default presentation and on which fonts the platform makes available.
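A variation selector is only a request; the glyph still has to come from a font that contains a color version. One hedged way to improve the odds is to name the platform emoji fonts on the pseudo-element itself. The class name below is hypothetical, and the result still varies by browser and operating system.

```css
/* .heart-color is a hypothetical class name. */
.heart-color::before {
  content: "\2764\FE0F";
  /* Name the platform emoji fonts explicitly so a color glyph can be found
     even when the inherited web font has its own monochrome heart. */
  font-family: "Apple Color Emoji", "Segoe UI Emoji", "Noto Color Emoji",
    sans-serif;
}
```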

`text-transform` and Unicode Case Mapping

text-transform: uppercase and text-transform: lowercase are not simple ASCII operations. They invoke Unicode case mapping, which has language-specific rules:

/* Turkish: dotted and dotless I */
:lang(tr) {
  text-transform: uppercase;
  /* 'i' → 'İ' (U+0130), not 'I' */
  /* 'ı' → 'I' (U+0049) */
}

/* German: ß uppercases to SS */
.german-header {
  text-transform: uppercase;
  /* 'straße' → 'STRASSE' (two characters, not ẞ in most contexts) */
}

/* Greek: σ and ς both uppercase to Σ; accents are dropped in all-caps words */
:lang(el) {
  text-transform: uppercase;
  /* 'θεσσαλονίκη' → 'ΘΕΣΣΑΛΟΝΙΚΗ' (no tonos, per the CSS rule for Greek) */
}

text-transform: capitalize converts the first letter of each word to titlecase and leaves the rest unchanged. What counts as a "word" is left to the browser; the CSS specification suggests Unicode word-boundary rules (UAX #29) rather than simple space-splitting.
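The language-specific rules are selected by the content language of the element (its lang attribute), not by the selector that applies the transform. A small sketch with a hypothetical .shout class shows one declaration producing different results:

```css
/* .shout is a hypothetical class; the same declaration gives different
   results depending on the lang attribute of the content. */
.shout {
  text-transform: uppercase;
}

/* <span class="shout" lang="en">istanbul</span>  renders ISTANBUL
   <span class="shout" lang="tr">istanbul</span>  renders İSTANBUL
   <span class="shout" lang="de">straße</span>    renders STRASSE */
```

This is one more reason to declare lang accurately in the markup: without it, Turkish text gets the default mapping.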

Never use text-transform as a substitute for semantic casing. The transform changes only the rendered text; the DOM still holds the original string, and assistive technology is inconsistent about which version it announces. A heading that visually reads "WARNING" but is coded as <h2>warning</h2> with text-transform: uppercase will be read as "warning" by many screen readers.

Writing Mode for Vertical Text

Some scripts are traditionally written vertically (CJK in traditional publishing, Mongolian). CSS writing-mode controls text flow direction:

/* Vertical right-to-left (Traditional Chinese/Japanese publishing) */
.vertical-text {
  writing-mode: vertical-rl;
  text-orientation: mixed;
}

/* Vertical left-to-right (Mongolian) */
.mongolian {
  writing-mode: vertical-lr;
  /* Keep the default orientation: upright would break Mongolian letterforms */
  text-orientation: mixed;
}

/* Horizontal (default) */
.horizontal {
  writing-mode: horizontal-tb;
}

text-orientation controls how individual characters are oriented within vertical lines:

- mixed: Latin and numbers rotate 90° clockwise; CJK stays upright
- upright: all characters are upright (used for full CJK vertical text)
- sideways: all characters rotated 90° clockwise
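Because text-orientation is inherited, a vertical block can use one value overall and override it for a short inline run. The sketch below assumes hypothetical class names for a vertical container and a Latin acronym inside it.

```css
/* Hypothetical class names, for illustration only. */
.tategaki {
  writing-mode: vertical-rl;
  text-orientation: mixed;   /* CJK upright, Latin runs rotated */
}

/* Keep a short Latin acronym upright, with its letters stacked vertically. */
.tategaki .acronym {
  text-orientation: upright;
}
```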

`direction` and `unicode-bidi`

For bidirectional text (Arabic, Hebrew mixed with Latin), CSS provides two properties that work alongside the HTML dir attribute:

/* Override direction without changing HTML structure */
.force-rtl {
  direction: rtl;
  unicode-bidi: bidi-override;
}

/* Isolate a run of text from surrounding bidi context */
.isolate {
  unicode-bidi: isolate;
}

/* Create an embed context */
.embed-ltr {
  direction: ltr;
  unicode-bidi: embed;
}

In practice, prefer HTML attributes (dir="rtl", <bdi>, <bdo>) and the unicode-bidi: isolate CSS value over bidi-override. The bidi-override value forces a direction on every character, which can make Arabic and Hebrew unreadable if applied incorrectly.
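A common case for isolation is user-supplied text dropped into a fixed sentence. The sketch below assumes a hypothetical .user-name class; where the markup is under your control, the <bdi> element remains the preferred option, since it also detects the direction of its content.

```css
/* .user-name is a hypothetical class for inline, user-supplied text. */
.user-name {
  unicode-bidi: isolate;
}

/* <li><span class="user-name">USER NAME</span>: 3 reviews</li>
   When the name is Arabic or Hebrew and not isolated, the bidi algorithm
   can pull the following number into the right-to-left run and display it
   before the name. With isolation, the surrounding text is unaffected. */
```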

Practical Font Stack for Maximum Unicode Coverage

Combining all these techniques, here is a practical font stack that handles most writing systems:

:root {
  --font-sans: 'Inter', 'Noto Sans', system-ui,
    /* CJK */
    'Noto Sans CJK SC', 'Noto Sans CJK TC', 'Noto Sans CJK JP',
    /* Arabic, Hebrew, Devanagari */
    'Noto Sans Arabic', 'Noto Sans Hebrew', 'Noto Sans Devanagari',
    /* Emoji */
    'Apple Color Emoji', 'Noto Color Emoji', 'Segoe UI Emoji',
    sans-serif;
}

body {
  font-family: var(--font-sans);
}

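With @font-face unicode-range subsetting, browsers download only the font files they actually need for the characters on the current page. This applies to families delivered as web fonts through @font-face; names in the stack that are not declared that way resolve only if the font is installed locally. A page in English loads only the Inter/system fonts; a page with Japanese text additionally loads the CJK font, automatically.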
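To tie the stack to on-demand loading, each script-specific family can be declared with the block it covers. This is a sketch only: the file paths are hypothetical, and real subsets usually list more ranges (punctuation, presentation forms) than the single block shown for each script.

```css
/* File paths are hypothetical; the family names match the stack above. */
@font-face {
  font-family: "Noto Sans Arabic";
  src: url("/fonts/NotoSansArabic-Regular.woff2") format("woff2");
  unicode-range: U+0600-06FF;  /* Arabic block */
}

@font-face {
  font-family: "Noto Sans Hebrew";
  src: url("/fonts/NotoSansHebrew-Regular.woff2") format("woff2");
  unicode-range: U+0590-05FF;  /* Hebrew block */
}

@font-face {
  font-family: "Noto Sans Devanagari";
  src: url("/fonts/NotoSansDevanagari-Regular.woff2") format("woff2");
  unicode-range: U+0900-097F;  /* Devanagari block */
}
```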
