Unicode in HTML & CSS

HTML and CSS support Unicode characters directly and through escape sequences, allowing developers to embed any character in web pages without encoding issues. This guide covers the charset meta tag, HTML entity references, CSS unicode-range, and how to insert special characters in markup and styles.

HTML and CSS have long supported Unicode, and the toolset has grown considerably. Correct encoding declarations, HTML entities, CSS content values, and @font-face unicode-range descriptors all interact with Unicode in different ways. This guide walks through every layer, from the <meta charset> tag to advanced CSS font subsetting.

Declaring the Character Encoding in HTML

Every HTML document should declare its character encoding in the first 1,024 bytes so that browsers can parse the document before encountering any non-ASCII characters. The standard way in HTML5 is:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>My Page</title>
</head>

The <meta charset> tag should come before the <title> element and any other non-ASCII content, and it must fit entirely within the first 1,024 bytes of the document. UTF-8 is the only encoding the WHATWG HTML standard allows for new documents.

Servers should also send the correct Content-Type header:

Content-Type: text/html; charset=UTF-8

If both the HTTP header and the <meta> tag are present, the HTTP header takes priority (except for local files opened without a server, where the meta tag is used). For consistency, keep both in sync.
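
The precedence rule above can be sketched in Python. This is a simplified illustration, not the encoding sniffing algorithm browsers actually run: the function name effective_charset and the regular expression are assumptions made for this example, and byte order marks are ignored.

```python
import re
from email.message import Message

# Simplified pattern for <meta charset="...">; real browsers use a stricter prescan.
META_CHARSET = re.compile(rb'<meta\s+charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)


def effective_charset(content_type, body):
    """Return the declared encoding, preferring the HTTP header over <meta>."""
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        header_charset = msg.get_content_charset()  # lowercased, or None
        if header_charset:
            return header_charset
    # Only the first 1,024 bytes are searched for the meta declaration.
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return None
```

Passing None as the header value stands in for a local file opened without a server, in which case only the meta tag is consulted.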

HTML Entities

HTML entities let you include special characters using ASCII-safe syntax. They are useful when:

  • You need to include characters that have special meaning in HTML (<, >, &)
  • Your editor or CMS cannot easily insert certain characters
  • You want invisible or easily confused characters (such as a non-breaking space) to be visible in the source

Named Entities

Named entities use a mnemonic name between & and ;:

&lt;        <!-- < LESS-THAN SIGN -->
&gt;        <!-- > GREATER-THAN SIGN -->
&amp;       <!-- & AMPERSAND -->
&quot;      <!-- " QUOTATION MARK -->
&apos;      <!-- ' APOSTROPHE (HTML5) -->
&nbsp;      <!-- non-breaking space U+00A0 -->
&copy;      <!-- © COPYRIGHT SIGN U+00A9 -->
&reg;       <!-- ® REGISTERED SIGN U+00AE -->
&trade;     <!-- ™ TRADE MARK SIGN U+2122 -->
&euro;      <!-- € EURO SIGN U+20AC -->
&mdash;     <!-- — EM DASH U+2014 -->
&hellip;    <!-- … HORIZONTAL ELLIPSIS U+2026 -->
&rarr;      <!-- → RIGHTWARDS ARROW U+2192 -->
&larr;      <!-- ← LEFTWARDS ARROW U+2190 -->
&hearts;    <!-- ♥ BLACK HEART SUIT U+2665 -->

HTML5 defines over 2,000 named entities. A complete list is in the WHATWG named character references table.
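
Outside the browser, named entities can be resolved with a standard library. The sketch below uses Python's html module; describe_entity is a hypothetical helper written for this example, and name2codepoint only covers the older HTML 4 entity names, not the full HTML5 table.

```python
import html
from html.entities import name2codepoint


def describe_entity(name):
    """Return (character, 'U+XXXX') for an HTML 4 entity name such as 'copy'."""
    code_point = name2codepoint[name]
    return chr(code_point), f"U+{code_point:04X}"


# html.unescape understands the full HTML5 named character reference table.
decoded = html.unescape("&copy; 2024 &rarr; made with &hearts;")
```

An unknown name raises KeyError in describe_entity, whereas html.unescape leaves unrecognised references untouched.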

Numeric Decimal Entities

Any Unicode code point can be written as a decimal numeric entity using the format &#NNNN;:

&#169;      <!-- © U+00A9 COPYRIGHT SIGN -->
&#8594;     <!-- → U+2192 RIGHTWARDS ARROW -->
&#128013;   <!-- 🐍 U+1F40D SNAKE -->

The decimal value is simply the integer value of the code point.

Numeric Hexadecimal Entities

Hex entities use &#xHHHH; (case-insensitive x):

&#xA9;      <!-- © U+00A9 -->
&#x2192;    <!-- → U+2192 -->
&#x1F40D;   <!-- 🐍 U+1F40D -->

Hexadecimal entities are more common among developers because code points are conventionally written in hex (U+2192). The U+ prefix is not valid HTML — use &#x instead.
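
Both numeric forms are a direct rendering of the code point's integer value, which a short Python sketch makes concrete. The two function names are assumptions chosen for this example.

```python
import html


def to_decimal_entity(char):
    """Format one character as a decimal numeric entity, e.g. &#169;."""
    return f"&#{ord(char)};"


def to_hex_entity(char):
    """Format one character as a hexadecimal numeric entity, e.g. &#xA9;."""
    return f"&#x{ord(char):X};"


# Round trip: the entity decodes back to the original character.
round_trip = html.unescape(to_hex_entity("→"))
```

Each function expects a single code point; a multi-code-point sequence such as an emoji with a variation selector would need one entity per code point.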

When to Use Entities vs. Literal Characters

Situation                             Recommendation
<, >, & in text content               Always use &lt;, &gt;, &amp;
" inside attribute values             Use &quot; or delimit the attribute with single quotes
Non-ASCII in UTF-8 document           Prefer literal character (simpler, more readable)
Non-ASCII in ASCII-encoded document   Use numeric entity
Control characters (U+0000–U+001F)    Avoid; apart from whitespace such as tab and line feed, they are invalid in HTML both raw and as numeric entities

In a modern UTF-8 document you can write ©, →, and 🐍 directly in the source without any entity syntax.
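
When markup is generated by a program, the same recommendations apply: escape only the syntax characters and leave other Unicode characters literal. A minimal sketch using Python's html.escape follows; the wrapper names escape_text and escape_attribute are assumptions for this example.

```python
import html


def escape_text(text):
    """Escape for element content: only &, < and > are replaced."""
    return html.escape(text, quote=False)


def escape_attribute(value):
    """Escape for a quoted attribute value: quotes are replaced as well."""
    return html.escape(value, quote=True)


title = escape_attribute('say "hi"')
body = escape_text("1 < 2 & © stays literal")
snippet = f'<p title="{title}">{body}</p>'
```

Non-ASCII characters such as © pass through unchanged, which is the desired behaviour when the output is served as UTF-8.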

Unicode in CSS

The content Property

The CSS content property (used with ::before and ::after) supports Unicode via an escape sequence using a backslash followed by 1–6 hex digits:

/* Add a right arrow before each link */
a::before {
  content: "\2192 ";   /* → U+2192 RIGHTWARDS ARROW */
}

/* Decorative quotation marks */
blockquote::before { content: "\201C"; }  /* “ U+201C LEFT DOUBLE QUOTATION MARK */
blockquote::after  { content: "\201D"; }  /* ” U+201D RIGHT DOUBLE QUOTATION MARK */

/* Emoji presentation via a variation selector */
.warning::before {
  content: "\26A0\FE0F ";  /* ⚠️ U+26A0 + U+FE0F variation selector */
}

Note: In CSS, the escape ends at the first character that is not a valid hex digit, or you can terminate it with a space (which is consumed). To include a literal space after the character, add two spaces or use \20 (the hex for space):

/* Two ways to get "→ " (arrow + space) */
a::before { content: "\2192  "; }   /* first space ends the escape, second space is literal */
a::before { content: "\2192\20"; }  /* explicit U+0020 SPACE */
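
The termination rule is easy to get wrong, so it can help to model it. The Python sketch below decodes hex escapes the way the note above describes: a backslash, one to six hex digits, then at most one whitespace character that is swallowed. The name decode_css_escapes is an assumption for this example, and the function deliberately ignores non-hex escapes and out-of-range values.

```python
import re

# A backslash, 1-6 hex digits, then at most one whitespace character that is consumed.
CSS_HEX_ESCAPE = re.compile(r"\\([0-9A-Fa-f]{1,6})[ \t\n]?")


def decode_css_escapes(value):
    """Decode hex escapes in the contents of a CSS string (hex escapes only)."""
    return CSS_HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), value)
```

Under this model a single space after an escape never reaches the rendered text, which is why a second space or \20 is needed to produce a visible one.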

unicode-range in @font-face

The unicode-range descriptor in @font-face tells the browser which code points a font file covers. The browser downloads the font file only if text set in that font family uses at least one of those code points. This is the basis of the subsetting used by web font services and icon fonts.

/* Load a Latin supplement font only when Latin Extended characters are needed */
@font-face {
  font-family: "MyFont";
  src: url("myfont-latin-ext.woff2") format("woff2");
  unicode-range: U+0100-024F, U+0259, U+1E00-1EFF, U+2020, U+20A0-20AB,
                 U+20AD-20CF, U+2113, U+2C60-2C7F, U+A720-A7FF;
}

/* Separate file for CJK — only downloaded on Chinese/Japanese/Korean pages */
@font-face {
  font-family: "MyFont";
  src: url("myfont-cjk.woff2") format("woff2");
  unicode-range: U+3000-9FFF, U+F900-FAFF, U+FE30-FE4F;
}

Google Fonts uses exactly this technique — inspecting the CSS it delivers reveals dozens of @font-face rules for different subsets (latin, latin-ext, cyrillic, greek, vietnamese, etc.), each with a precise unicode-range.
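
The download decision can be approximated in a few lines of Python. This is only a sketch of the idea, not how any browser is implemented: the SUBSETS table copies the ranges from the two @font-face rules above, and files_needed is a hypothetical helper that ignores font matching, fallback, and caching.

```python
# Ranges copied from the two @font-face rules above, as inclusive (start, end) pairs.
SUBSETS = {
    "myfont-latin-ext.woff2": [
        (0x0100, 0x024F), (0x0259, 0x0259), (0x1E00, 0x1EFF), (0x2020, 0x2020),
        (0x20A0, 0x20AB), (0x20AD, 0x20CF), (0x2113, 0x2113), (0x2C60, 0x2C7F),
        (0xA720, 0xA7FF),
    ],
    "myfont-cjk.woff2": [(0x3000, 0x9FFF), (0xF900, 0xFAFF), (0xFE30, 0xFE4F)],
}


def files_needed(text):
    """Return the subset files whose unicode-range covers a character in text."""
    code_points = {ord(ch) for ch in text}
    return [
        name
        for name, ranges in SUBSETS.items()
        if any(lo <= cp <= hi for cp in code_points for lo, hi in ranges)
    ]
```

Plain ASCII text matches neither subset, so in this model neither file would be fetched for it.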

Range Syntax

Syntax              Example          Meaning
Single code point   U+2192           Only U+2192
Range               U+2190-21FF      U+2190 through U+21FF
Wildcard            U+25??           U+2500 through U+25FF (? = any hex digit)
List                U+0020, U+0041   U+0020 and U+0041 only
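
The four syntaxes in the table can be turned into explicit ranges with a small parser. The Python sketch below is an illustration under stated assumptions: parse_unicode_range and covers are hypothetical names, and the parser accepts only the forms listed above without validating every rule of the CSS grammar.

```python
def parse_unicode_range(descriptor):
    """Parse a unicode-range value into a list of inclusive (start, end) pairs."""
    ranges = []
    for token in descriptor.split(","):
        token = token.strip()
        if not token.upper().startswith("U+"):
            raise ValueError(f"not a unicode-range token: {token!r}")
        body = token[2:]
        if "?" in body:
            # Wildcard: each ? stands for any hex digit, so 25?? is 2500-25FF.
            start = int(body.replace("?", "0"), 16)
            end = int(body.replace("?", "F"), 16)
        elif "-" in body:
            first, last = body.split("-", 1)
            start, end = int(first, 16), int(last, 16)
        else:
            start = end = int(body, 16)
        ranges.append((start, end))
    return ranges


def covers(descriptor, char):
    """Return True if the descriptor includes the character's code point."""
    return any(lo <= ord(char) <= hi for lo, hi in parse_unicode_range(descriptor))
```

A check such as covers("U+2190-21FF", "→") is a quick way to confirm that a hand-written range really includes the characters a page depends on.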

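Combining Range Descriptors

Several @font-face rules can share one font-family name, each with its own unicode-range, and the browser treats them as a single composite font. An icon font, for example, can be limited to the Private Use Area so that it never overrides ordinary text:
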
@font-face {
  font-family: "Icons";
  src: url("icons.woff2") format("woff2");
  /* PUA block commonly used for icon fonts */
  unicode-range: U+E000-F8FF;
}

Language and Text Direction

HTML provides attributes to assist browsers and screen readers with Unicode text:

<!-- Language tag: affects font selection, hyphenation, spell-check -->
<html lang="ja">

<!-- Inline language switch -->
<p>The Japanese word for "cat" is <span lang="ja">猫</span>.</p>

<!-- Right-to-left text (Arabic, Hebrew) -->
<p dir="rtl" lang="ar">مرحبا بالعالم</p>

<!-- Bidirectional isolation — prevents RTL bleed-through -->
<bdi>مرحبا</bdi>

<!-- Force a specific direction -->
<bdo dir="ltr">مرحبا</bdo>
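
When the text comes from users or a database, the right dir value is not known in advance. One common heuristic is to look at the first strongly directional character. The Python sketch below illustrates that idea with the standard unicodedata module; first_strong_direction is a hypothetical helper, and it is a rough approximation rather than the full Unicode bidirectional algorithm.

```python
import unicodedata


def first_strong_direction(text):
    """Return 'rtl' or 'ltr' from the first strongly directional character, else None."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):  # Hebrew-style and Arabic-style letters
            return "rtl"
        if bidi_class == "L":  # left-to-right letters
            return "ltr"
    return None
```

Digits and punctuation carry no strong direction, so a string made up only of those returns None and the surrounding direction should be kept.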

CSS Direction Properties

/* For RTL languages */
[lang="ar"], [lang="he"] {
  direction: rtl;
  unicode-bidi: embed;
}

/* Logical properties — respect writing direction */
.card {
  margin-inline-start: 1rem;   /* left in LTR, right in RTL */
  padding-block-end: 0.5rem;  /* bottom in horizontal writing */
}

Special Unicode Characters in HTML/CSS

Some Unicode characters have special behaviour in web contexts:

Character                 Code Point   HTML Use
Non-breaking space        U+00A0       &nbsp; (prevents a line break)
Soft hyphen               U+00AD       &shy; (invisible optional hyphen)
Zero-width space          U+200B       &#x200B; (allows a line break without a visible space)
Zero-width non-joiner     U+200C       Prevents ligature formation
Zero-width joiner         U+200D       Requests a ligature; joins emoji sequences
Word joiner               U+2060       Like NBSP but zero-width
Left-to-right mark        U+200E       Invisible strong left-to-right character; adjusts bidi ordering in mixed text (not an override)
Right-to-left mark        U+200F       Invisible strong right-to-left character; adjusts bidi ordering in mixed text (not an override)
Object replacement char   U+FFFC       Placeholder for embedded objects
Replacement character     U+FFFD       Used for undecodable bytes
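
Because most of these characters are invisible, they are hard to spot in templates and content. The Python sketch below makes them visible for debugging; the SPECIAL set simply lists the code points from the table above, and reveal_special is a hypothetical helper name.

```python
import unicodedata

# Code points taken from the table above.
SPECIAL = {0x00A0, 0x00AD, 0x200B, 0x200C, 0x200D, 0x2060, 0x200E, 0x200F, 0xFFFC, 0xFFFD}


def reveal_special(text):
    """Replace each special character with a visible <U+XXXX NAME> tag."""
    out = []
    for ch in text:
        if ord(ch) in SPECIAL:
            out.append(f"<U+{ord(ch):04X} {unicodedata.name(ch)}>")
        else:
            out.append(ch)
    return "".join(out)
```

Running it over copied text is a quick way to find a stray zero-width space or non-breaking space that is breaking a layout or a string comparison.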

Quick Reference

<!-- Encoding declaration -->
<meta charset="UTF-8">

<!-- Named entity -->
&copy;   &rarr;   &hellip;

<!-- Decimal entity -->
&#8594;   (→ U+2192)

<!-- Hex entity -->
&#x2192;  (→ U+2192)

/* CSS escape in content */
a::before { content: "\2192 "; }

/* Font subset via unicode-range */
@font-face {
  font-family: "F";
  src: url("f-latin.woff2") format("woff2");
  unicode-range: U+0000-00FF;
}

Always save HTML files as UTF-8, declare <meta charset="UTF-8">, and prefer literal Unicode characters over entities for readability. Reserve &lt;, &gt;, and &amp; for escaping HTML syntax, and use CSS unicode-range to deliver only the font glyphs that your page actually needs.
