Unicode for the Modern Web · Chapter 1

HTML and Unicode: Entities, Escapes, and Encoding

Getting Unicode right in HTML starts with the correct meta charset declaration. This chapter covers named vs numeric entities, when to use escapes, and how browsers decode character encoding.

~3500 words · ~14 min read

Every web page you have ever built rests on a single invisible agreement: the browser and the server must agree on which byte sequences map to which characters. When that agreement breaks down you get mojibake — the infamous strings of garbled characters that plagued the early web. Today that agreement is almost universal: UTF-8, declared in a single line of HTML. But knowing why that line matters, and understanding the full ecosystem of HTML character representations, will save you from subtle bugs that still trap developers every day.

The <meta charset="UTF-8"> Tag

The most important line in any HTML document is probably this:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>My Page</title>
</head>

The <meta charset> tag tells the browser how to decode the raw bytes that make up your HTML file. It must appear within the first 1024 bytes of the document — browsers sniff encoding before they finish parsing the <head>. If the declaration appears too late, or is missing entirely, the browser falls back to heuristics or a locale-based default, which is how ISO-8859-1 mojibake still appears on poorly-maintained sites.
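As a rough illustration of the 1024-byte rule, the following Python sketch checks whether a charset declaration finishes inside that window. The function name and the regular expression are assumptions made for this example; browsers run a more careful prescan, so treat this as a lint-style approximation rather than a model of any real parser.

```python
import re

# Approximate pattern for <meta charset=...> and the older
# <meta http-equiv="Content-Type" content="...; charset=..."> form.
META_CHARSET_RE = re.compile(rb"<meta\b[^>]*\bcharset\s*=[^>]*>", re.IGNORECASE)


def meta_charset_in_window(html_bytes: bytes, window: int = 1024) -> bool:
    """Return True if a charset declaration ends within the first `window` bytes."""
    match = META_CHARSET_RE.search(html_bytes)
    return match is not None and match.end() <= window


early = b'<!DOCTYPE html><html><head><meta charset="UTF-8"><title>x</title>'
late = b"<!DOCTYPE html><html><head><!-- " + b"x" * 2000 + b' --><meta charset="UTF-8">'

print(meta_charset_in_window(early))  # True
print(meta_charset_in_window(late))   # False: the declaration starts after byte 1024
```

A check like this fits naturally into a build step or a test suite, where a long license comment or inline script placed above the declaration would otherwise go unnoticed.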

HTML5 requires authors to encode documents as UTF-8, but that requirement binds authors, not browsers: a browser still has to decode legacy pages, so it does not simply assume UTF-8. What it does in the absence of an in-document declaration depends first on the HTTP Content-Type header.

Content-Type Header Interaction

The HTTP response header takes priority over the in-document <meta> tag:

Content-Type: text/html; charset=UTF-8

The precedence chain in HTML5 browsers, from highest to lowest:

  1. BOM at the start of the file (UTF-8 or UTF-16 LE/BE); the HTML standard treats it as more authoritative than anything else
  2. HTTP Content-Type header charset parameter
  3. <meta charset> or <meta http-equiv="Content-Type"> in the document
  4. Browser heuristics / locale default

When your server sends charset=ISO-8859-1 in the header but your file is UTF-8, the <meta charset="UTF-8"> is ignored. Configure your server to send the correct header. In Nginx:

charset utf-8;
charset_types text/html text/css application/javascript application/json;

In Django, this is handled automatically — responses are text/html; charset=utf-8 by default.
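The header-versus-meta rule can be written down as a small function. The sketch below is hypothetical: resolve_encoding, its regular expressions, and the windows-1252 fallback are assumptions for illustration, and BOM handling is deliberately left out to keep it short.

```python
import re
from typing import Optional

# Assumed, simplified patterns; real parsers are stricter about syntax.
HEADER_CHARSET_RE = re.compile(r'charset\s*=\s*"?([^\s;"]+)', re.IGNORECASE)
META_CHARSET_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def resolve_encoding(
    body: bytes,
    content_type: Optional[str] = None,
    fallback: str = "windows-1252",
) -> str:
    """Pick an encoding label: header charset first, then <meta>, then a fallback."""
    if content_type:
        header_match = HEADER_CHARSET_RE.search(content_type)
        if header_match:
            return header_match.group(1).lower()
    # Only the first 1024 bytes are searched for an in-document declaration.
    meta_match = META_CHARSET_RE.search(body[:1024])
    if meta_match:
        return meta_match.group(1).decode("ascii").lower()
    return fallback


page = b'<!DOCTYPE html><html><head><meta charset="UTF-8"></head></html>'

print(resolve_encoding(page, "text/html; charset=ISO-8859-1"))  # iso-8859-1
print(resolve_encoding(page, "text/html"))                      # utf-8
print(resolve_encoding(b"<p>hi</p>"))                           # windows-1252
```

The first call is the failure case described above: the file says UTF-8, the header says otherwise, and the header wins.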

HTML Entities: Named and Numeric

HTML provides three ways to represent a character as an escape sequence rather than a literal byte.

Named entities use a mnemonic reference:

&amp;    <!-- & (U+0026) -->
&lt;     <!-- < (U+003C) -->
&gt;     <!-- > (U+003E) -->
&quot;   <!-- " (U+0022) -->
&apos;   <!-- ' (U+0027) — HTML5 only -->
&nbsp;   <!-- non-breaking space (U+00A0) -->
&copy;   <!-- © (U+00A9) -->
&mdash;  <!-- — (U+2014) em dash -->

Decimal numeric references use the code point as a base-10 integer:

&#38;    <!-- & -->
&#169;   <!-- © -->
&#128512; <!-- 😀 U+1F600 -->

Hexadecimal numeric references use the x prefix:

&#x26;   <!-- & -->
&#xA9;   <!-- © -->
&#x1F600; <!-- 😀 -->

All three forms are semantically equivalent to inserting the literal character. The browser decodes them identically.
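The equivalence is easy to check outside a browser. This short Python sketch uses the standard library's html.unescape, which applies the HTML5 character reference rules; the decode_all helper and the sample table are just scaffolding for the example.

```python
import html

# Each literal character mapped to the references that should decode to it.
forms = {
    "&": ["&amp;", "&#38;", "&#x26;"],
    "©": ["&copy;", "&#169;", "&#xA9;"],
    "😀": ["&#128512;", "&#x1F600;"],
}


def decode_all(references):
    """Decode each character reference to the character it stands for."""
    return [html.unescape(ref) for ref in references]


for literal, references in forms.items():
    decoded = decode_all(references)
    # Every reference in a row decodes to the same single character.
    print(literal, all(d == literal for d in decoded))
```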

When to Use Entities vs Direct Unicode

The old rule — "escape everything non-ASCII" — was born from ASCII-only file systems and broken tooling. With UTF-8 throughout your stack, the modern rule is simpler:

Always escape these, regardless of encoding:

  • &amp; for & in HTML text and attribute values
  • &lt; and &gt; for < and > in text content
  • &quot; inside double-quoted attributes

Never escape these unnecessarily:

  • Emoji, CJK characters, accented Latin: use them literally
  • Symbols like © ™: use them literally in UTF-8 files

Unnecessary escaping adds noise, breaks copy-paste, and makes templates harder to read. A file full of &#x4E2D;&#x6587; instead of 中文 is hostile to human editors.
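In code, the modern rule amounts to escaping only the markup-significant characters and leaving everything else alone. A minimal Python sketch using the standard library's html.escape follows; the two wrapper names are assumptions for this example, and a real project would normally rely on its template engine's auto-escaping instead.

```python
import html


def escape_text(value: str) -> str:
    """Escape for HTML text content: replaces only &, < and >."""
    return html.escape(value, quote=False)


def escape_attr(value: str) -> str:
    """Escape for a quoted attribute value: also replaces " and '."""
    return html.escape(value, quote=True)


print(escape_text("中文 & <b>©</b> 😀"))  # 中文 &amp; &lt;b&gt;©&lt;/b&gt; 😀
print(escape_attr('Tom & "Jerry"'))      # Tom &amp; &quot;Jerry&quot;
```

Note that the CJK text, the copyright sign, and the emoji pass through untouched.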

The one remaining use case for numeric escapes is when you cannot control the file encoding — for example, injecting content into a legacy system that might re-encode your file. In that case, escaping to ASCII-safe entities guarantees survival.
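For that legacy case, Python's built-in xmlcharrefreplace error handler produces decimal numeric references for anything outside ASCII. The helper name below is an assumption for this sketch; the handler itself does not touch &, < or >, so those are escaped first.

```python
import html


def to_ascii_safe_html(text: str) -> str:
    """Return ASCII-only HTML text: markup characters escaped,
    non-ASCII characters turned into decimal numeric references."""
    escaped = html.escape(text, quote=True)
    return escaped.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


safe = to_ascii_safe_html("中文 & 😀")
print(safe)                 # &#20013;&#25991; &amp; &#128512;
print(html.unescape(safe))  # 中文 & 😀
```

Because the output is pure ASCII, it survives being re-saved in any ASCII-compatible encoding, which is the whole point of the technique.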

Form Submission Encoding

HTML forms submit data encoded according to the accept-charset attribute:

<form method="post" action="/submit" accept-charset="UTF-8">
  <input type="text" name="username">
</form>

When accept-charset is omitted, the form uses the document's encoding — which, if you declared <meta charset="UTF-8">, is UTF-8. Always declare your document encoding and you never need accept-charset.

The enctype attribute controls the wire format but not the character encoding:

  • application/x-www-form-urlencoded (default): percent-encodes non-ASCII
  • multipart/form-data: raw bytes, required for file uploads
  • text/plain: no encoding, for email forms (avoid in production)

In application/x-www-form-urlencoded, the browser percent-encodes the UTF-8 bytes of each character. The emoji 😀 (U+1F600) encodes to the four UTF-8 bytes F0 9F 98 80, which become %F0%9F%98%80. Your server must decode this percent-encoding and then interpret the result as UTF-8.
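The round trip can be reproduced with Python's urllib.parse, which percent-encodes UTF-8 bytes by default. This is a sketch of what happens on the wire, not of any particular server framework; the field name username is borrowed from the form above.

```python
from urllib.parse import parse_qs, quote, urlencode

# Percent-encode the UTF-8 bytes of a single character.
print(quote("😀"))  # %F0%9F%98%80

# Build a form body the way a browser would for a UTF-8 document.
body = urlencode({"username": "😀"})
print(body)  # username=%F0%9F%98%80

# Server side: undo the percent-encoding, then decode the bytes as UTF-8.
# errors="strict" makes invalid byte sequences raise instead of being replaced.
fields = parse_qs(body, encoding="utf-8", errors="strict")
print(fields["username"][0])  # 😀
```

Decoding with the wrong encoding in the last step is what produces mojibake in stored form data, so it is worth pinning the encoding explicitly rather than trusting a default.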

The BOM in HTML Files

The UTF-8 BOM is the three-byte sequence EF BB BF at the very start of a file. It is a zero-width no-break space character (U+FEFF) encoded in UTF-8.

In UTF-8 HTML files the BOM is unnecessary: UTF-8 has no endianness ambiguity, so the BOM carries no byte-order information, and a correct header or <meta charset> already identifies the encoding. Worse, it can cause problems:

  • PHP scripts with a BOM send output before any headers, breaking header() calls
  • Some editors display a spurious character at the top of the page
  • The XML declaration <?xml ... ?> must come first in the file; conforming XML parsers accept a BOM in front of it, but tools that do not recognize the BOM fail to parse the document

Best practice: Save HTML files as UTF-8 without BOM. Most modern editors (VS Code, Vim, Sublime) default to this. If you receive files with BOMs, strip them during your build process.
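Stripping a BOM in a build step takes only a few lines. The sketch below works on bytes and uses the codecs.BOM_UTF8 constant from the standard library; the function name is an assumption for this example.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return `data` without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + b"<!DOCTYPE html>"
print(strip_utf8_bom(with_bom))  # b'<!DOCTYPE html>'

# When decoding to text, the "utf-8-sig" codec drops a leading BOM,
# while plain "utf-8" keeps it as U+FEFF at the start of the string.
print(repr(with_bom.decode("utf-8-sig")))  # '<!DOCTYPE html>'
print(repr(with_bom.decode("utf-8")))      # '\ufeff<!DOCTYPE html>'
```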

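HTML5's UTF-8 Requirement and Practical Checklist

HTML5 requires documents to be authored in UTF-8, but it does not make browsers assume UTF-8: when no encoding is declared and sniffing is inconclusive, the fallback is implementation-defined and is typically a locale-dependent legacy encoding such as windows-1252. So always declare explicitly. Here is the full checklist for correct encoding: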

<!-- 1. Declare encoding early in <head> -->
<meta charset="UTF-8">

<!-- 2. Set the document language for rendering and accessibility -->
<html lang="en">

<!-- 3. Set direction if needed -->
<html lang="ar" dir="rtl">

On the server side:

# Nginx
charset utf-8;

# Apache .htaccess
AddDefaultCharset UTF-8

# Django (automatic) — verify with:
# response['Content-Type']  → 'text/html; charset=utf-8'
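Whatever server you use, the value to verify is the charset parameter of the Content-Type header. A minimal Python sketch for extracting it follows, using the standard library's email.message.Message, which parses this header syntax; declared_charset is a hypothetical helper name.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the lower-cased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


print(declared_charset("text/html; charset=UTF-8"))  # utf-8
print(declared_charset("text/html"))                 # None
```

The same get_content_charset() method is available on the headers object returned by urllib.request.urlopen, so a check like this can be pointed at a live response.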

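And in your database connection string, ensure the connection charset is also UTF-8 (for MySQL that means utf8mb4, since its legacy utf8 alias stores at most three bytes per character and cannot hold emoji). Otherwise characters survive the file and network layers but get mangled at the DB boundary.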
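What "mangled" looks like is easy to reproduce in memory. The sketch below does not talk to a database; it only simulates one layer writing UTF-8 bytes and another reading them as Latin-1, which is an assumed stand-in for a misconfigured connection. Both function names are made up for this example.

```python
def misdecode(text: str, wrong_encoding: str = "latin-1") -> str:
    """Encode as UTF-8, then decode with the wrong encoding, as a mismatched layer would."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair(garbled: str, wrong_encoding: str = "latin-1") -> str:
    """Reverse the mistake: recover the original bytes, then decode them as UTF-8."""
    return garbled.encode(wrong_encoding).decode("utf-8")


garbled = misdecode("é")
print(garbled)          # Ã©
print(repair(garbled))  # é
```

The repair only works while the wrong decoding is lossless, as it is with Latin-1 here; once a layer substitutes question marks or drops bytes, the original text is gone.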

Special Cases: SVG and XML Embedded in HTML

Inline SVG in HTML5 documents inherits the document encoding. Standalone .svg files served as image/svg+xml are XML and can declare their own encoding:

<?xml version="1.0" encoding="UTF-8"?>
<svg xmlns="http://www.w3.org/2000/svg" ...>

When you embed SVG data URIs in CSS or HTML attributes, you must percent-encode characters that are special in URLs. The easiest path is to base64-encode the SVG, though this prevents easy editing.
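Both options can be generated with the Python standard library. This is a sketch under the assumption that the SVG source is a UTF-8 string; the function names and the sample SVG are invented for the example.

```python
import base64
from urllib.parse import quote, unquote


def svg_data_uri_percent(svg: str) -> str:
    """Percent-encode the SVG's UTF-8 bytes; the result stays partly human-readable."""
    return "data:image/svg+xml;charset=utf-8," + quote(svg)


def svg_data_uri_base64(svg: str) -> str:
    """Base64-encode the SVG's UTF-8 bytes; opaque, but needs no URL escaping."""
    payload = base64.b64encode(svg.encode("utf-8")).decode("ascii")
    return "data:image/svg+xml;base64," + payload


svg = '<svg xmlns="http://www.w3.org/2000/svg"><text fill="#c00">中文</text></svg>'

percent_uri = svg_data_uri_percent(svg)
base64_uri = svg_data_uri_base64(svg)

# Both forms decode back to the original markup.
print(unquote(percent_uri.split(",", 1)[1]) == svg)                            # True
print(base64.b64decode(base64_uri.split(",", 1)[1]).decode("utf-8") == svg)    # True
```

The percent-encoded form matters most for characters like # and <, which would otherwise be read as a fragment marker or break the surrounding attribute.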

Understanding the encoding stack — file, HTTP header, HTML declaration, form submission, database connection — lets you reason precisely about where characters might be corrupted. In a correctly configured UTF-8 stack, the answer is: nowhere.