Unicode in HTML & CSS
HTML and CSS support Unicode characters directly and through escape sequences, allowing developers to embed any character in web pages without encoding issues. This guide covers the charset meta tag, HTML entity references, CSS unicode-range, and how to insert special characters in markup and styles.
HTML and CSS have supported Unicode from their earliest days, but the toolset
has grown considerably. Correct encoding declarations, HTML entities, CSS
content values, and @font-face unicode-range descriptors all interact
with Unicode in different ways. This guide walks through every layer, from the
<meta charset> tag to advanced CSS font subsetting.
Declaring the Character Encoding in HTML
Every HTML document should declare its character encoding in the first 1,024 bytes so that browsers can parse the document before encountering any non-ASCII characters. The standard way in HTML5 is:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>My Page</title>
</head>
The <meta charset> tag should be the first element inside <head>, ahead of
<title> and any non-ASCII content, so that it falls within those first 1,024
bytes. UTF-8 is the only encoding the WHATWG HTML standard allows for new
documents.
Servers should also send the correct Content-Type header:
Content-Type: text/html; charset=UTF-8
If both the HTTP header and the <meta> tag are present, the HTTP header takes
priority. Local files opened without a server have no header, so the meta tag
is used. For consistency, keep both in sync.
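As a rough check of the 1,024-byte rule, the sketch below scans the start of a raw HTML byte string for a UTF-8 meta declaration. The helper name declares_utf8_early and its regular expression are illustrative assumptions; the prescan a browser actually performs is more involved.

```python
import re

# Hypothetical helper: a simplified stand-in for a browser's encoding prescan.
_META_CHARSET = re.compile(rb"""<meta\s+charset\s*=\s*["']?\s*utf-8""", re.IGNORECASE)


def declares_utf8_early(raw_html: bytes, limit: int = 1024) -> bool:
    """Return True if a UTF-8 <meta charset> appears within the first `limit` bytes."""
    return _META_CHARSET.search(raw_html[:limit]) is not None


# A declaration at the top of <head> is found.
page = (b'<!DOCTYPE html><html lang="en"><head>'
        b'<meta charset="UTF-8"><title>My Page</title></head>')
print(declares_utf8_early(page))

# A declaration pushed past the limit by a long comment is missed.
late = b"<!DOCTYPE html><!--" + b" " * 2000 + b'--><meta charset="UTF-8">'
print(declares_utf8_early(late))
```

Running a check like this over generated pages can catch templates that push the declaration too far down, for example behind a long comment or inline script.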
HTML Entities
HTML entities let you include special characters using ASCII-safe syntax. They are useful when:
- You need to include characters that have special meaning in HTML (<, >, &)
- Your editor or CMS cannot easily insert certain characters
- You want to make the source human-readable
Named Entities
Named entities use a mnemonic name between & and ;:
&lt; <!-- < LESS-THAN SIGN -->
&gt; <!-- > GREATER-THAN SIGN -->
&amp; <!-- & AMPERSAND -->
&quot; <!-- " QUOTATION MARK -->
&apos; <!-- ' APOSTROPHE (HTML5) -->
&nbsp; <!-- non-breaking space U+00A0 -->
&copy; <!-- © COPYRIGHT SIGN U+00A9 -->
&reg; <!-- ® REGISTERED SIGN U+00AE -->
&trade; <!-- ™ TRADE MARK SIGN U+2122 -->
&euro; <!-- € EURO SIGN U+20AC -->
&mdash; <!-- — EM DASH U+2014 -->
&hellip; <!-- … HORIZONTAL ELLIPSIS U+2026 -->
&rarr; <!-- → RIGHTWARDS ARROW U+2192 -->
&larr; <!-- ← LEFTWARDS ARROW U+2190 -->
&hearts; <!-- ♥ BLACK HEART SUIT U+2665 -->
HTML5 defines over 2,000 named entities. A complete list is in the WHATWG named character references table.
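To see how named entities map to characters, the following sketch uses Python's standard html module, which ships a copy of the HTML5 named character reference table. The sample strings are made up for illustration.

```python
import html
from html.entities import html5  # maps names such as "copy;" to characters

# Decode named entities in a fragment of text.
decoded = html.unescape("&copy; &rarr; &hellip;")
print(decoded)

# Look up a single name; keys in this table include the trailing semicolon.
print(html5["hearts;"])

# The table holds more than 2,000 entries.
print(len(html5) > 2000)
```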
Numeric Decimal Entities
Any Unicode code point can be written as a decimal numeric entity using the
format &#NNNN;:
&#169; <!-- © U+00A9 COPYRIGHT SIGN -->
&#8594; <!-- → U+2192 RIGHTWARDS ARROW -->
&#128013; <!-- 🐍 U+1F40D SNAKE -->
The decimal value is simply the integer value of the code point.
Numeric Hexadecimal Entities
Hex entities use &#xHHHH; (case-insensitive x):
&#xA9; <!-- © U+00A9 -->
&#x2192; <!-- → U+2192 -->
&#x1F40D; <!-- 🐍 U+1F40D -->
Hexadecimal entities are more common among developers because code points are
conventionally written in hex (U+2192). The U+ prefix is not valid HTML — use
&#x instead.
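The relationship between a code point and its two numeric forms is plain integer formatting. A minimal sketch, with hypothetical helper names, that produces both forms and converts U+ notation into a hex entity:

```python
def to_decimal_reference(ch: str) -> str:
    """Format one character as a decimal numeric entity."""
    return f"&#{ord(ch)};"


def to_hex_reference(ch: str) -> str:
    """Format one character as a hexadecimal numeric entity."""
    return f"&#x{ord(ch):X};"


def from_u_notation(label: str) -> str:
    """Turn a label such as 'U+2192' into the hex entity '&#x2192;'."""
    if not label.upper().startswith("U+"):
        raise ValueError(f"expected U+XXXX notation, got {label!r}")
    return f"&#x{int(label[2:], 16):X};"


print(to_decimal_reference("\u2192"))
print(to_hex_reference("\u00a9"))
print(from_u_notation("U+1F40D"))
```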
When to Use Entities vs. Literal Characters
| Situation | Recommendation |
|---|---|
| <, >, & in text content | Always use &lt;, &gt;, &amp; |
| " inside attribute values | Use &quot; or switch to single quotes |
| Non-ASCII in UTF-8 document | Prefer literal character (simpler, more readable) |
| Non-ASCII in ASCII-encoded document | Use numeric entity |
| Control characters (U+0000–U+001F) | Avoid them; other than whitespace such as tab and line feed, control characters are invalid in HTML whether raw or written as numeric entities |
In a modern UTF-8 document you can write ©, →, and 🐍 directly in
the source without any entity syntax.
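The recommendations above can be sketched with Python's standard library: html.escape handles the syntax characters and leaves non-ASCII text literal, while the xmlcharrefreplace error handler produces decimal entities for an ASCII-only target. The three wrapper names are hypothetical.

```python
import html


def escape_text(text: str) -> str:
    """Escape <, > and & for text content; non-ASCII characters stay literal."""
    return html.escape(text, quote=False)


def escape_attribute(value: str) -> str:
    """Also escape quotation marks so the value is safe inside an attribute."""
    return html.escape(value, quote=True)


def to_ascii_html(text: str) -> str:
    """Escape syntax characters, then replace every non-ASCII character
    with a decimal numeric entity for an ASCII-encoded document."""
    escaped = html.escape(text, quote=False)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(escape_text("1 < 2 & \u00a9 \u2192"))
print(escape_attribute('say "hi"'))
print(to_ascii_html("\u00a9 \u2192 \U0001F40D"))
```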
Unicode in CSS
The content Property
The CSS content property (used with ::before and ::after) supports
Unicode via an escape sequence using a backslash followed by 1–6 hex digits:
/* Add a right arrow before each link */
a::before {
  content: "\2192 "; /* → U+2192 RIGHTWARDS ARROW */
}
/* Decorative quotation marks */
blockquote::before { content: "\201C"; } /* “ U+201C LEFT DOUBLE QUOTATION MARK */
blockquote::after { content: "\201D"; } /* ” U+201D RIGHT DOUBLE QUOTATION MARK */
/* Emoji presentation (base character plus variation selector) */
.warning::before {
  content: "\26A0\FE0F "; /* ⚠️ U+26A0 + U+FE0F variation selector */
}
Note: In CSS, the escape ends at the first character that is not a valid hex
digit, or you can terminate it with a space (which is consumed). To include a
literal space after the character, add two spaces or use \20 (the hex for
space):
/* Two ways to get "→ " (arrow + space) */
content: "\2192  "; /* first space terminates the escape, second space is literal */
content: "\2192\20"; /* explicit U+0020 SPACE */
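The terminating-space rule is easy to get wrong by hand, so a build script can generate the escapes instead. The sketch below is one possible approach, not a full CSS serializer: the hypothetical css_escape helper keeps printable ASCII as-is and writes everything else as a hex escape followed by a terminating space.

```python
def css_escape(text: str) -> str:
    """Escape a string for use inside a double-quoted CSS string.

    Printable ASCII (except the quote and backslash) is kept literally.
    Every other character becomes backslash + hex digits + one space,
    where the space only terminates the escape.
    """
    parts = []
    for ch in text:
        if ch.isascii() and ch.isprintable() and ch not in '"\\':
            parts.append(ch)
        else:
            parts.append(f"\\{ord(ch):X} ")
    return "".join(parts)


# Arrow followed by a real space: the output contains two spaces,
# one terminating the escape and one literal.
value = css_escape("\u2192 next")
print('content: "' + value + '";')
```

Always emitting the terminator costs one byte per escape but avoids the case where a following hex digit, such as the "a" in "arrow", would be read as part of the escape.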
unicode-range in @font-face
The unicode-range descriptor in @font-face tells the browser which code
points a font file should be used for. The browser only downloads the font
file if the page actually uses one of those code points, which is what makes
it practical to split a typeface or an icon font into several subset files.
/* Load a Latin supplement font only when Latin Extended characters are needed */
@font-face {
  font-family: "MyFont";
  src: url("myfont-latin-ext.woff2") format("woff2");
  unicode-range: U+0100-024F, U+0259, U+1E00-1EFF, U+2020, U+20A0-20AB,
                 U+20AD-20CF, U+2113, U+2C60-2C7F, U+A720-A7FF;
}
/* Separate file for CJK — only downloaded on Chinese/Japanese/Korean pages */
@font-face {
  font-family: "MyFont";
  src: url("myfont-cjk.woff2") format("woff2");
  unicode-range: U+3000-9FFF, U+F900-FAFF, U+FE30-FE4F;
}
Google Fonts uses exactly this technique — inspecting the CSS it delivers
reveals dozens of @font-face rules for different subsets (latin, latin-ext,
cyrillic, greek, vietnamese, etc.), each with a precise unicode-range.
Range Syntax
| Syntax | Example | Meaning |
|---|---|---|
| Single code point | U+2192 | Only U+2192 |
| Range | U+2190-21FF | U+2190 through U+21FF |
| Wildcard | U+25?? | U+2500 through U+25FF (? = any hex digit) |
| List | U+0020, U+0041 | U+0020 and U+0041 only |
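The four syntax forms in the table can be modelled with a small parser. This is a simplified sketch for reasoning about subsets, for example to predict whether a given string would trigger a font download. The function names are hypothetical and the parser does not validate every rule in the CSS specification.

```python
from typing import List, Tuple


def parse_unicode_range(descriptor: str) -> List[Tuple[int, int]]:
    """Parse a unicode-range value into inclusive (low, high) pairs."""
    ranges = []
    for token in descriptor.split(","):
        token = token.strip()
        if not token.upper().startswith("U+"):
            raise ValueError(f"not a unicode-range token: {token!r}")
        body = token[2:]
        if "?" in body:
            # Wildcard: each ? stands for any hex digit, 0 through F.
            low = int(body.replace("?", "0"), 16)
            high = int(body.replace("?", "F"), 16)
        elif "-" in body:
            start, end = body.split("-", 1)
            low, high = int(start, 16), int(end, 16)
        else:
            low = high = int(body, 16)
        ranges.append((low, high))
    return ranges


def text_needs_font(text: str, descriptor: str) -> bool:
    """Return True if any character of `text` falls inside the descriptor."""
    ranges = parse_unicode_range(descriptor)
    return any(low <= ord(ch) <= high for ch in text for low, high in ranges)


print(parse_unicode_range("U+2192, U+2190-21FF, U+25??"))
print(text_needs_font("\u0141\u00f3d\u017a", "U+0100-024F, U+0259"))
```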
Combining Range Descriptors
@font-face {
  font-family: "Icons";
  src: url("icons.woff2") format("woff2");
  /* PUA block commonly used for icon fonts */
  unicode-range: U+E000-F8FF;
}
Language and Text Direction
HTML provides attributes to assist browsers and screen readers with Unicode text:
<!-- Language tag: affects font selection, hyphenation, spell-check -->
<html lang="ja">
<!-- Inline language switch -->
<p>The Japanese word for "cat" is <span lang="ja">猫</span>.</p>
<!-- Right-to-left text (Arabic, Hebrew) -->
<p dir="rtl" lang="ar">مرحبا بالعالم</p>
<!-- Bidirectional isolation — prevents RTL bleed-through -->
<bdi>مرحبا</bdi>
<!-- Force a specific direction -->
<bdo dir="ltr">مرحبا</bdo>
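When markup is generated from user-supplied text, something has to choose the dir value. One common heuristic, sketched here under the assumption that the first strongly directional character decides, uses the bidirectional class from Python's unicodedata module. The function name is hypothetical.

```python
import unicodedata


def first_strong_direction(text: str, default: str = "ltr") -> str:
    """Guess a dir value from the first strongly directional character.

    Bidi class "L" is strongly left-to-right; "R" (for example Hebrew)
    and "AL" (Arabic letters) are strongly right-to-left. Digits,
    spaces and punctuation are not strong and are skipped.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return default


greeting = "\u0645\u0631\u062d\u0628\u0627 \u0628\u0627\u0644\u0639\u0627\u0644\u0645"
print(f'<p dir="{first_strong_direction(greeting)}" lang="ar">{greeting}</p>')
```

A heuristic like this only picks a base direction; text that mixes directions still benefits from the bdi element shown above.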
CSS Direction Properties
/* For RTL languages */
[lang="ar"], [lang="he"] {
  direction: rtl;
  unicode-bidi: embed;
}
/* Logical properties — respect writing direction */
.card {
  margin-inline-start: 1rem; /* left in LTR, right in RTL */
  padding-block-end: 0.5rem; /* bottom in horizontal writing */
}
Special Unicode Characters in HTML/CSS
Some Unicode characters have special behaviour in web contexts:
| Character | Code Point | HTML Use |
|---|---|---|
| Non-breaking space | U+00A0 | &nbsp; (prevents line break) |
| Soft hyphen | U+00AD | &shy; (invisible optional hyphen) |
| Zero-width space | U+200B | &#x200B; (allows line break without visible space) |
| Zero-width non-joiner | U+200C | Prevents ligature formation |
| Zero-width joiner | U+200D | Forces ligature / joins emoji sequences |
| Word joiner | U+2060 | Like NBSP but zero-width |
| Left-to-right mark | U+200E | Invisible strongly LTR character; steers bidi resolution in mixed text |
| Right-to-left mark | U+200F | Invisible strongly RTL character; steers bidi resolution in mixed text |
| Object replacement char | U+FFFC | Placeholder for embedded objects |
| Replacement character | U+FFFD | Used for undecodable bytes |
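Most of the characters in this table are invisible in an editor, so they are easy to paste into content by accident. The sketch below reports where they occur in a string. The helper name is hypothetical and the set is limited to the code points listed above.

```python
import unicodedata

# Code points from the table above.
SPECIAL_CODE_POINTS = {
    0x00A0, 0x00AD, 0x200B, 0x200C, 0x200D,
    0x2060, 0x200E, 0x200F, 0xFFFC, 0xFFFD,
}


def find_special_characters(text: str):
    """Return (index, 'U+XXXX', Unicode name) for each special character found."""
    hits = []
    for index, ch in enumerate(text):
        if ord(ch) in SPECIAL_CODE_POINTS:
            hits.append((index, f"U+{ord(ch):04X}", unicodedata.name(ch)))
    return hits


sample = "a\u200bb\u00a0c"
for hit in find_special_characters(sample):
    print(hit)
```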
Quick Reference
<!-- Encoding declaration -->
<meta charset="UTF-8">
<!-- Named entity -->
&copy; &rarr; &hellip;
<!-- Decimal entity -->
&#8594; (→ U+2192)
<!-- Hex entity -->
&#x2192; (→ U+2192)
/* CSS escape in content */
a::before { content: "\2192 "; }
/* Font subset via unicode-range */
@font-face {
  font-family: "F";
  src: url("f-latin.woff2") format("woff2");
  unicode-range: U+0000-00FF;
}
Always save HTML files as UTF-8, declare <meta charset="UTF-8">, and prefer
literal Unicode characters over entities for readability. Reserve &lt;,
&gt;, and &amp; for escaping HTML syntax, and use CSS unicode-range to
deliver only the font glyphs that your page actually needs.