Unicode in HTML & CSS

HTML and CSS support Unicode characters directly and through escape sequences, allowing developers to embed any character in web pages without encoding issues. This guide covers the charset meta tag, HTML entity references, CSS unicode-range, and how to insert special characters in markup and styles.

Published 2022-06-06 · Updated 2025-03-17

HTML and CSS have supported Unicode from their earliest days, but the toolset has grown considerably. Correct encoding declarations, HTML entities, CSS content values, and @font-face unicode-range descriptors all interact with Unicode in different ways. This guide walks through every layer, from the <meta charset> tag to advanced CSS font subsetting.

Declaring the Character Encoding in HTML

Every HTML document should declare its character encoding within the first 1,024 bytes, because browsers scan only that much of the file for an encoding declaration before they start parsing. The standard way in HTML5 is:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>My Page</title>
</head>

Place the <meta charset> tag before the <title> element and before any other content that may contain non-ASCII characters, so that nothing is decoded before the encoding is known. UTF-8 is the only encoding mandated by the WHATWG HTML standard for new documents.

Servers should also send the correct Content-Type header:

Content-Type: text/html; charset=UTF-8

If both the HTTP header and the <meta> tag are present, the HTTP header takes priority (except for local files opened without a server, where the meta tag is used). For consistency, keep both in sync.
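
The 1,024-byte rule can be checked mechanically. The Python sketch below is one illustrative way to do it. The function name `charset_declared_early` and the simplified regular expression are assumptions made for this example; this is not the prescan algorithm that browsers actually run.

```python
import re

# Simplified pattern for illustration: finds charset=... inside a <meta> tag.
# Real browsers use a more involved prescan; this is only an approximation.
_META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def charset_declared_early(document: bytes, limit: int = 1024):
    """Return the declared charset (lowercased) if found in the first `limit` bytes."""
    match = _META_CHARSET.search(document[:limit])
    return match.group(1).decode("ascii").lower() if match else None


page = b'<!DOCTYPE html>\n<html lang="en">\n<head>\n  <meta charset="UTF-8">\n'
print(charset_declared_early(page))  # utf-8
```

A check like this could run in a build step to flag templates where a long comment or inline script pushes the declaration past the limit.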

HTML Entities

HTML entities let you include special characters using ASCII-safe syntax. They are useful when:

You need to include characters that have special meaning in HTML (<, >, &)
Your editor or CMS cannot easily insert certain characters
You want characters that are invisible or easy to confuse (such as a non-breaking space or an em dash) to be obvious in the source

Named Entities

Named entities use a mnemonic name between & and ;:

&lt;        <!-- < LESS-THAN SIGN -->
&gt;        <!-- > GREATER-THAN SIGN -->
&amp;       <!-- & AMPERSAND -->
&quot;      <!-- " QUOTATION MARK -->
&apos;      <!-- ' APOSTROPHE (HTML5) -->
&nbsp;      <!-- non-breaking space U+00A0 -->
&copy;      <!-- © COPYRIGHT SIGN U+00A9 -->
&reg;       <!-- ® REGISTERED SIGN U+00AE -->
&trade;     <!-- ™ TRADE MARK SIGN U+2122 -->
&euro;      <!-- € EURO SIGN U+20AC -->
&mdash;     <!-- — EM DASH U+2014 -->
&hellip;    <!-- … HORIZONTAL ELLIPSIS U+2026 -->
&rarr;      <!-- → RIGHTWARDS ARROW U+2192 -->
&larr;      <!-- ← LEFTWARDS ARROW U+2190 -->
&hearts;    <!-- ♥ BLACK HEART SUIT U+2665 -->

HTML5 defines over 2,000 named entities. A complete list is in the WHATWG named character references table.
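
To explore the named references programmatically, Python's standard library ships a copy of the table. The sketch below uses `html.entities.html5` and `html.unescape`; it is only a quick way to inspect the mapping outside a browser.

```python
from html import unescape
from html.entities import html5

# html5 maps reference names (for example "copy;") to the text they stand for.
print(len(html5) > 2000)   # True
print(html5["rarr;"])      # →

# unescape() resolves named (and numeric) references in a string.
print(unescape("&copy; 2025 &rarr; &hearts;"))  # © 2025 → ♥
```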

Numeric Decimal Entities

Any Unicode code point can be written as a decimal numeric entity using the format &#NNNN;:

&#169;      <!-- © U+00A9 COPYRIGHT SIGN -->
&#8594;     <!-- → U+2192 RIGHTWARDS ARROW -->
&#128013;   <!-- 🐍 U+1F40D SNAKE -->

The decimal value is simply the integer value of the code point.
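
As a small illustration of that relationship, the Python sketch below builds the decimal entity from a character's code point. The helper name `decimal_entity` is made up for this example.

```python
def decimal_entity(char: str) -> str:
    """Return the decimal numeric entity for a single character."""
    # ord() gives the integer value of the code point.
    return f"&#{ord(char)};"


for ch in "©→🐍":
    print(ch, decimal_entity(ch))
# © &#169;
# → &#8594;
# 🐍 &#128013;
```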

Numeric Hexadecimal Entities

Hex entities use &#xHHHH; (case-insensitive x):

&#xA9;      <!-- © U+00A9 -->
&#x2192;    <!-- → U+2192 -->
&#x1F40D;   <!-- 🐍 U+1F40D -->

Hexadecimal entities are more common among developers because code points are conventionally written in hex (U+2192). The U+ prefix is not valid HTML — use &#x instead.
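
Converting from U+ notation to a hex entity is a mechanical rewrite. The Python sketch below shows one way to do it; `hex_entity_from_notation` is a hypothetical helper name, and the round trip through `html.unescape` is there only to confirm the result.

```python
from html import unescape


def hex_entity_from_notation(code_point: str) -> str:
    """Convert 'U+2192' style notation into an HTML hex entity like '&#x2192;'."""
    if not code_point.upper().startswith("U+"):
        raise ValueError(f"expected U+XXXX notation, got {code_point!r}")
    value = int(code_point[2:], 16)
    if value > 0x10FFFF:
        raise ValueError(f"{code_point} is outside the Unicode range")
    return f"&#x{value:X};"


entity = hex_entity_from_notation("U+1F40D")
print(entity)            # &#x1F40D;
print(unescape(entity))  # 🐍
```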

When to Use Entities vs. Literal Characters

Situation	Recommendation
`<`, `>`, `&` in text content	Always use `&lt;`, `&gt;`, `&amp;`
`"` inside attribute values	Use `&quot;` or switch to single quotes
Non-ASCII in UTF-8 document	Prefer literal character (simpler, more readable)
Non-ASCII in ASCII-encoded document	Use numeric entity
Control characters (U+0000–U+001F)	Avoid where possible (tab and line feed aside); raw control characters and numeric references to them are both parse errors in HTML

In a modern UTF-8 document you can write ©, →, and 🐍 directly in the source without any entity syntax.
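
When markup is generated by a program, the same rules can be applied in code. The Python sketch below uses the standard library's `html.escape` for the syntax characters and the `xmlcharrefreplace` error handler for the ASCII-only case. The sample strings are invented for illustration.

```python
from html import escape

text = 'Tom & Jerry <3 "cartoons"'

# Text content: escape &, < and > only.
print(escape(text, quote=False))  # Tom &amp; Jerry &lt;3 "cartoons"

# Attribute values: also escape quotation marks (the default).
print(escape(text))               # Tom &amp; Jerry &lt;3 &quot;cartoons&quot;

# ASCII-only output: non-ASCII characters become decimal numeric entities.
ascii_safe = "© 2025 → 🐍".encode("ascii", errors="xmlcharrefreplace").decode("ascii")
print(ascii_safe)                 # &#169; 2025 &#8594; &#128013;
```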

Unicode in CSS

The `content` Property

The CSS content property (used with ::before and ::after) supports Unicode via an escape sequence using a backslash followed by 1–6 hex digits:

/* Add a right arrow before each link */
a::before {
  content: "\2192 ";   /* → U+2192 RIGHTWARDS ARROW */
}

/* Decorative quotation marks */
blockquote::before { content: "\201C"; }  /* “ U+201C LEFT DOUBLE QUOTATION MARK */
blockquote::after  { content: "\201D"; }  /* ” U+201D RIGHT DOUBLE QUOTATION MARK */

/* Emoji presentation (base character + variation selector) */
.warning::before {
  content: "\26A0\FE0F ";  /* ⚠️ U+26A0 + U+FE0F variation selector */
}

Note: A CSS escape ends at the first character that is not a hex digit, or after six hex digits. A single whitespace character directly after the escape is consumed as a terminator and does not appear in the output. To get a literal space after the character, write two spaces or use \20 (the escape for U+0020 SPACE):

/* Two ways to get "→ " (arrow + space) */
content: "\2192  ";     /* first space terminates the escape, second is literal */
content: "\2192\20";    /* explicit U+0020 SPACE */
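
If stylesheets are generated by a script, the escape can be produced from the code point directly. The Python sketch below is a minimal illustration; `css_escape` and its `pad` parameter are hypothetical names. The short form appends the terminating space, and the padded form writes all six hex digits so that a following hex digit cannot be absorbed into the escape.

```python
def css_escape(char: str, pad: bool = False) -> str:
    """Return a CSS escape sequence for a single character."""
    code = ord(char)
    if pad:
        # Six hex digits: the escape cannot grow any longer.
        return f"\\{code:06X}"
    # Short form: the trailing space terminates the escape and is consumed by CSS.
    return f"\\{code:X} "


print(css_escape("→"))            # \2192 (followed by one terminating space)
print(css_escape("→", pad=True))  # \002192
```

The short form is easier to read, while the padded form is the safer choice when the next character in the string might itself be a hex digit, as in `"\002192A"`.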

`unicode-range` in `@font-face`

The unicode-range descriptor in @font-face tells the browser which code points a font file is meant to cover. The browser downloads the file only if the page actually uses one of those code points. This is the basis of font subsetting, both for fonts split into per-script files and for icon fonts.

/* Load a Latin supplement font only when Latin Extended characters are needed */
@font-face {
  font-family: "MyFont";
  src: url("myfont-latin-ext.woff2") format("woff2");
  unicode-range: U+0100-024F, U+0259, U+1E00-1EFF, U+2020, U+20A0-20AB,
                 U+20AD-20CF, U+2113, U+2C60-2C7F, U+A720-A7FF;
}

/* Separate file for CJK — only downloaded on Chinese/Japanese/Korean pages */
@font-face {
  font-family: "MyFont";
  src: url("myfont-cjk.woff2") format("woff2");
  unicode-range: U+3000-9FFF, U+F900-FAFF, U+FE30-FE4F;
}

Google Fonts uses exactly this technique — inspecting the CSS it delivers reveals dozens of @font-face rules for different subsets (latin, latin-ext, cyrillic, greek, vietnamese, etc.), each with a precise unicode-range.

Range Syntax

Syntax	Example	Meaning
Single code point	`U+2192`	Only U+2192
Range	`U+2190-21FF`	U+2190 through U+21FF
Wildcard	`U+25??`	U+2500 through U+25FF (`?` = any hex digit)
List	`U+0020, U+0041`	U+0020 and U+0041 only
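
The three token forms in the table can be expanded into numeric bounds with a few lines of code. The Python sketch below is an illustrative parser, not the browser's implementation; `parse_unicode_range` and `covers` are names invented for this example, and error handling is minimal.

```python
def parse_unicode_range(token: str) -> tuple[int, int]:
    """Turn one unicode-range token into an inclusive (low, high) pair."""
    body = token.strip()
    if not body.upper().startswith("U+"):
        raise ValueError(f"not a unicode-range token: {token!r}")
    body = body[2:]
    if "?" in body:
        # Wildcard: each ? stands for any hex digit, so 0 gives the low bound, F the high.
        return int(body.replace("?", "0"), 16), int(body.replace("?", "F"), 16)
    if "-" in body:
        low, high = body.split("-", 1)
        return int(low, 16), int(high, 16)
    value = int(body, 16)
    return value, value


def covers(descriptor: str, char: str) -> bool:
    """Check whether a comma-separated unicode-range value includes a character."""
    code = ord(char)
    return any(lo <= code <= hi for lo, hi in map(parse_unicode_range, descriptor.split(",")))


print(parse_unicode_range("U+25??") == (0x2500, 0x25FF))            # True
print(covers("U+3000-9FFF, U+F900-FAFF, U+FE30-FE4F", "猫"))         # True
print(covers("U+0100-024F", "猫"))                                   # False
```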

Restricting an Icon Font to the Private Use Area

@font-face {
  font-family: "Icons";
  src: url("icons.woff2") format("woff2");
  /* PUA block commonly used for icon fonts */
  unicode-range: U+E000-F8FF;
}

Language and Text Direction

HTML provides attributes to assist browsers and screen readers with Unicode text:

<!-- Language tag: affects font selection, hyphenation, spell-check -->
<html lang="ja">

<!-- Inline language switch -->
<p>The Japanese word for "cat" is <span lang="ja">猫</span>.</p>

<!-- Right-to-left text (Arabic, Hebrew) -->
<p dir="rtl" lang="ar">مرحبا بالعالم</p>

<!-- Bidirectional isolation — prevents RTL bleed-through -->
<bdi>مرحبا</bdi>

<!-- Force a specific direction -->
<bdo dir="ltr">مرحبا</bdo>
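
When text comes from users and its direction is not known in advance, a server-side template may need to choose a dir value itself. The Python sketch below applies a simple first-strong-character heuristic using `unicodedata.bidirectional`. The function name `first_strong_direction` is an assumption for this example, and the heuristic is far simpler than the full Unicode Bidirectional Algorithm.

```python
import unicodedata


def first_strong_direction(text: str, default: str = "ltr") -> str:
    """Guess 'ltr' or 'rtl' from the first strongly directional character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):   # Hebrew-type or Arabic-type letters
            return "rtl"
        if bidi == "L":           # Latin, CJK and other left-to-right letters
            return "ltr"
    # Only digits, punctuation or whitespace: fall back to the default.
    return default


print(first_strong_direction("مرحبا بالعالم"))  # rtl
print(first_strong_direction("Hello, world"))   # ltr
```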

CSS Direction Properties

/* For RTL languages */
[lang="ar"], [lang="he"] {
  direction: rtl;
  unicode-bidi: embed;
}

/* Logical properties — respect writing direction */
.card {
  margin-inline-start: 1rem;   /* left in LTR, right in RTL */
  padding-block-end: 0.5rem;  /* bottom in horizontal writing */
}

Special Unicode Characters in HTML/CSS

Some Unicode characters have special behaviour in web contexts:

Character	Code Point	HTML Use
Non-breaking space	U+00A0	`&nbsp;` (prevents a line break)
Soft hyphen	U+00AD	`&shy;` (invisible optional hyphen)
Zero-width space	U+200B	`&#x200B;` (allows a line break without a visible space)
Zero-width non-joiner	U+200C	Prevents ligature formation
Zero-width joiner	U+200D	Forces ligature / joins emoji sequences
Word joiner	U+2060	Like NBSP but zero-width
Left-to-right mark	U+200E	Invisible strong LTR character; steers bidi resolution in mixed text
Right-to-left mark	U+200F	Invisible strong RTL character; steers bidi resolution in mixed text
Object replacement char	U+FFFC	Placeholder for embedded objects
Replacement character	U+FFFD	Used for undecodable bytes
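
Because most of these characters are invisible, they are hard to spot when debugging markup. The Python sketch below makes them visible by replacing each one with its code point and official name. The set of characters and the helper name `reveal_invisibles` are choices made for this example.

```python
import unicodedata

# Invisible or easily overlooked characters from the table above.
INVISIBLE = {
    "\u00A0", "\u00AD", "\u200B", "\u200C",
    "\u200D", "\u2060", "\u200E", "\u200F",
}


def reveal_invisibles(text: str) -> str:
    """Replace invisible characters with a readable [U+XXXX NAME] marker."""
    return "".join(
        f"[U+{ord(ch):04X} {unicodedata.name(ch)}]" if ch in INVISIBLE else ch
        for ch in text
    )


print(reveal_invisibles("10\u00A0km"))  # 10[U+00A0 NO-BREAK SPACE]km
print(reveal_invisibles("a\u200Bb"))    # a[U+200B ZERO WIDTH SPACE]b
```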

Quick Reference

<!-- Encoding declaration -->
<meta charset="UTF-8">

<!-- Named entity -->
&copy;   &rarr;   &hellip;

<!-- Decimal entity -->
&#8594;   (→ U+2192)

<!-- Hex entity -->
&#x2192;  (→ U+2192)

/* CSS escape in content */
a::before { content: "\2192 "; }

/* Font subset via unicode-range */
@font-face {
  font-family: "F";
  src: url("f-latin.woff2") format("woff2");
  unicode-range: U+0000-00FF;
}

Always save HTML files as UTF-8, declare <meta charset="UTF-8">, and prefer literal Unicode characters over entities for readability. Reserve &lt;, &gt;, and &amp; for escaping HTML syntax, and use CSS unicode-range to deliver only the font glyphs that your page actually needs.
