What is Unicode? A Complete Guide

Unicode is the universal character encoding standard that assigns a unique number to every character in every language. This complete guide explains what Unicode is, why it was created, and how it powers modern text on the internet.

Published 2021-03-15 · Updated 2024-11-08

Unicode is one of the most important standards in modern computing, and one you've probably never thought about, yet it is silently at work every time you send a message, load a web page, or write code. Before Unicode existed, computers around the world used dozens of incompatible text encodings, and moving data between systems routinely produced garbled output. This guide explains what Unicode is, why it was created, and why it matters to every developer working with text.

The Problem Unicode Solved

In the early days of computing, each manufacturer and country invented its own system for mapping bytes to characters. The American ASCII standard used 7 bits to encode 128 characters — enough for English letters, digits, and punctuation. When computers spread globally, vendors extended ASCII into hundreds of competing "code pages": Windows-1252 for Western Europe, KOI8-R for Russian, GB2312 for Simplified Chinese, Shift-JIS for Japanese.

The result was chaos. A file created on a Japanese system would display as garbage on a French system. Emails sent between countries arrived as strings of question marks and boxes. This phenomenon got a name: mojibake (文字化け) — Japanese for "character transformation" — and it was the universal nightmare of internationalized software.

The root cause was simple: there was no agreement on which number meant which character. Every encoding made up its own rules, and none of them were compatible.
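To make the disagreement concrete, here is a minimal Python sketch using codecs that ship with the standard library. The byte value 0xE9 and the three code pages are illustrative choices rather than anything singled out by the history above; the point is only that one byte decodes to a different character under each legacy encoding.

```python
# One byte, three legacy encodings, three different characters.
raw = bytes([0xE9])

decoded = {codec: raw.decode(codec) for codec in ("cp1252", "koi8_r", "cp437")}
for codec, ch in decoded.items():
    print(f"{codec:>7}: {ch!r}")

# Classic mojibake: UTF-8 bytes misread as Windows-1252.
garbled = "é".encode("utf-8").decode("cp1252")
print(garbled)  # Ã©
```

The last two lines reproduce the familiar "Ã©" artifact: the two UTF-8 bytes for é are each interpreted as a separate Windows-1252 character.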

Enter the Unicode Consortium

In 1987, engineers at Xerox and Apple began collaborating on a universal character set. The goal was radical: assign a unique number to every character in every writing system used by humans, past or present. In 1991, the Unicode Consortium was formally incorporated, and Unicode 1.0 was published.

Today the Consortium is a non-profit organization whose members include Apple, Google, Microsoft, IBM, Adobe, Meta (formerly Facebook), and dozens of universities and governments. Its primary output is the Unicode Standard, a specification updated regularly (version 15.1 was released in September 2023 and version 16.0 in September 2024) that defines:

The universal character repertoire (every assigned character)
Properties of each character (category, directionality, case, combining behavior)
Algorithms for sorting, rendering, bidirectional text, and line breaking
Encoding forms (UTF-8, UTF-16, UTF-32)

The Unicode Standard has a parallel international standard, ISO/IEC 10646, which is maintained by an ISO/IEC committee (JTC 1/SC 2) rather than by the Consortium and defines the same character repertoire. The two are kept in sync and are effectively interchangeable at the code point level.

Code Points: The Foundation

The central concept in Unicode is the code point — a unique integer assigned to each character. Code points are written in the format U+XXXX where the Xs are hexadecimal digits.

Some examples:

Character	Code Point	Name
A	U+0041	LATIN CAPITAL LETTER A
é	U+00E9	LATIN SMALL LETTER E WITH ACUTE
中	U+4E2D	CJK UNIFIED IDEOGRAPH-4E2D
😀	U+1F600	GRINNING FACE
☃	U+2603	SNOWMAN
∞	U+221E	INFINITY

Code points range from U+0000 to U+10FFFF, giving a total capacity of 1,114,112 code points. As of Unicode 15.1, 149,813 of those code points are assigned to characters.
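In Python, the built-in ord() and chr() functions convert between a character and its code point. The helper name code_point_label below is hypothetical and only illustrates the U+XXXX notation; treat this as a small sketch, not a canonical formatter.

```python
def code_point_label(ch: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    # At least four hex digits, more when the code point needs them.
    return f"U+{ord(ch):04X}"

for ch in "Aé中😀":
    print(ch, code_point_label(ch))

# chr() goes the other way: from an integer code point to a character.
snowman = chr(0x2603)

# The code space runs from U+0000 to U+10FFFF inclusive.
total_code_points = 0x10FFFF + 1
print(snowman, total_code_points)
```

Passing a value above 0x10FFFF to chr() raises ValueError, which reflects the upper limit of the code space.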

Code points are abstract numbers — they say what a character is, but not how it is stored in memory or on disk. That job belongs to the encoding forms (UTF-8, UTF-16, UTF-32), which map code points to actual bytes.

Unicode Planes

The 1,114,112 code points are divided into 17 planes, each containing 65,536 code points:

Plane	Range	Name	Notable Contents
0	U+0000–U+FFFF	Basic Multilingual Plane (BMP)	Latin, Greek, Cyrillic, CJK, most common symbols
1	U+10000–U+1FFFF	Supplementary Multilingual Plane (SMP)	Emoji, historic scripts, musical notation
2	U+20000–U+2FFFF	Supplementary Ideographic Plane (SIP)	Rare CJK ideographs
3	U+30000–U+3FFFF	Tertiary Ideographic Plane (TIP)	Very rare CJK ideographs (added Unicode 13)
4–13	U+40000–U+DFFFF	Unassigned	Reserved for future use
14	U+E0000–U+EFFFF	Supplementary Special-purpose Plane	Language tags, variation selectors
15–16	U+F0000–U+10FFFF	Private Use Area Planes	Custom characters (not interoperable)

The BMP is by far the most important plane. It covers the characters needed for virtually every modern language in daily use. The SMP is where you'll find most emoji (starting around U+1F300) and historic scripts like Linear B, Cuneiform, and Egyptian Hieroglyphs.

The range U+D800–U+DFFF (2,048 code points) is permanently reserved and will never contain characters — these are "surrogate" code points used as a technical mechanism by UTF-16 to encode characters outside the BMP.
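Because every plane holds exactly 65,536 code points, a character's plane is simply its code point divided by 0x10000. The sketch below also shows the arithmetic UTF-16 uses to split a supplementary code point into a surrogate pair. The function names plane_of and utf16_surrogates are made up for this illustration.

```python
from typing import Optional, Tuple

PLANE_SIZE = 0x10000  # 65,536 code points per plane

def plane_of(ch: str) -> int:
    """Return the plane number (0-16) of a single character."""
    return ord(ch) // PLANE_SIZE

def utf16_surrogates(ch: str) -> Optional[Tuple[int, int]]:
    """Return the (high, low) surrogate pair, or None for BMP characters."""
    cp = ord(ch)
    if cp < PLANE_SIZE:
        return None  # BMP characters fit in one UTF-16 code unit
    offset = cp - PLANE_SIZE          # a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

print(plane_of("A"), plane_of("😀"))                 # 0 1
print([hex(u) for u in utf16_surrogates("😀")])      # ['0xd83d', '0xde00']
```

The pair for U+1F600 matches the bytes Python produces with "😀".encode("utf-16-be"), which is a convenient cross-check.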

Character Properties

Every Unicode character carries a rich set of properties — metadata that tells text rendering engines and programming libraries how to handle it:

General Category: Is it a letter (L), combining mark (M), number (N), punctuation (P), symbol (S), separator (Z), or something else (C: control, format, surrogate, private-use, or unassigned)?
Name: A unique, stable uppercase string like LATIN CAPITAL LETTER A.
Bidirectional Class: Is it left-to-right (like English), right-to-left (like Arabic), or neutral?
Canonical Combining Class: For combining marks such as diacritics, a number that determines how several marks on the same base character are ordered during normalization.
Case Mapping: Uppercase, lowercase, and titlecase equivalents.
Numeric Value: For digit and number characters, the actual numeric value.
Script: Which writing system the character belongs to (Latin, Arabic, Han, etc.).
Block: The Unicode block the character is in (Basic Latin, CJK Unified Ideographs, etc.).

In Python, you can access many of these properties via the unicodedata module:

import unicodedata

char = "é"  # U+00E9

print(unicodedata.name(char))           # LATIN SMALL LETTER E WITH ACUTE
print(unicodedata.category(char))       # Ll  (letter, lowercase)
print(unicodedata.bidirectional(char))  # L   (left-to-right)
print(unicodedata.combining(char))      # 0   (not a combining mark)
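# NFD splits é into a base letter plus a combining mark; print the code points
# because the decomposed string looks identical on screen.
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", char)])  # ['U+0065', 'U+0301']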
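The same module can show how these properties vary across scripts. The following sketch collects a few of them for an Arabic letter, a combining accent, a fraction, and a currency sign. The describe helper is a hypothetical convenience wrapper, and the sample characters are arbitrary picks.

```python
import unicodedata

def describe(ch: str) -> dict:
    """Collect a few Unicode properties for one character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),
        "bidi": unicodedata.bidirectional(ch),
        "combining": unicodedata.combining(ch),
        "numeric": unicodedata.numeric(ch, None),  # None if not numeric
    }

# Arabic letter ain, combining acute accent, one half, euro sign
for ch in ("\u0639", "\u0301", "\u00BD", "\u20AC"):
    print(describe(ch))

# Case mapping is not always one-to-one: sharp s uppercases to two letters.
print("ß".upper())  # SS
```

Note that unicodedata does not expose the Script or Block properties; those require the Unicode data files or a third-party package.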

Unicode vs. Encoding: A Critical Distinction

One of the most common sources of confusion is conflating Unicode with a specific encoding. Unicode is the standard; UTF-8, UTF-16, and UTF-32 are encodings — different ways to serialize code points as bytes.

Think of it this way:

Unicode defines that the character "A" has code point U+0041 (the number 65).
UTF-8 encodes U+0041 as the single byte 0x41.
UTF-16 encodes U+0041 as the two bytes 0x00 0x41 (big-endian) or 0x41 0x00 (little-endian).
UTF-32 encodes U+0041 as the four bytes 0x00 0x00 0x00 0x41 (big-endian).

All three encodings represent the same character; they differ only in the bytes they produce. UTF-8 is by far the most prevalent on the web and in file storage; UTF-16 is used internally by Windows, Java, and JavaScript; UTF-32 is rare but allows constant-time indexing by code point, because every code point occupies exactly four bytes.
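The byte sequences above can be checked directly in Python. This sketch encodes one character under several encoding forms and prints the bytes in hex. The helper name show_encodings is hypothetical, and the explicit -be/-le codec names are used so that no byte order mark is added.

```python
def show_encodings(ch: str) -> dict:
    """Return the hex bytes of `ch` under several Unicode encoding forms."""
    codecs = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be")
    return {codec: ch.encode(codec).hex(" ") for codec in codecs}

for ch in ("A", "😀"):
    print(ch, show_encodings(ch))
```

For "A" this yields 41, 00 41, 41 00, and 00 00 00 41. For the emoji, UTF-8 needs four bytes and UTF-16 needs a surrogate pair (d8 3d de 00 in big-endian order).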

Unicode in Practice

Python

Python 3's str type is a sequence of Unicode code points. The interpreter handles encoding internally:

text = "Hello, 世界! 😀"
print(len(text))          # 12 — code points, not bytes
print(text.encode("utf-8"))   # b'Hello, \xe4\xb8\x96\xe7\x95\x8c! \xf0\x9f\x98\x80'
print(text.encode("utf-16"))  # includes BOM + 2 bytes per BMP char, 4 for emoji

JavaScript

JavaScript strings are sequences of UTF-16 code units. Characters outside the BMP (like most emoji) take two code units ("surrogate pairs"), which can trip up .length:

const text = "Hello 😀";
console.log(text.length);        // 8 — counts UTF-16 code units (emoji = 2)
console.log([...text].length);   // 7 — spread iterates by code point
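For comparison, the same three "lengths" can be computed from Python, which makes the difference between code points, UTF-16 code units, and UTF-8 bytes explicit. The function name text_lengths is an illustrative assumption, not a standard API.

```python
def text_lengths(text: str) -> dict:
    """Count a string in code points, UTF-16 code units, and UTF-8 bytes."""
    return {
        "code_points": len(text),
        # Each UTF-16 code unit is 2 bytes; -le avoids adding a BOM.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }

print(text_lengths("Hello 😀"))
# {'code_points': 7, 'utf16_code_units': 8, 'utf8_bytes': 10}
```

The UTF-16 figure matches what JavaScript reports for .length, and the code point figure matches the spread-based count.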

Web (HTML/HTTP)

Always declare your encoding. For HTML5:

<meta charset="UTF-8">

For HTTP responses, set the Content-Type header:

Content-Type: text/html; charset=utf-8

Without an explicit declaration, browsers may misdetect the encoding and display garbage.
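On the receiving side, a client should decode the body with the charset the server declared. Below is a minimal sketch using only Python's standard library. The helper charset_from_content_type is hypothetical, and falling back to UTF-8 when no charset is given is an assumption of this example rather than a rule from the HTTP specification.

```python
from email.message import Message

def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset() or default

body = "Ünïcödé ✓".encode("utf-8")  # bytes as they would arrive on the wire

charset = charset_from_content_type("text/html; charset=utf-8")
decoded = body.decode(charset)
print(decoded)

# Guessing the wrong encoding produces mojibake instead of an error.
wrong = body.decode("cp1252", errors="replace")
print(wrong)
```

Decoding with the declared charset round-trips the text, while the Windows-1252 guess silently yields garbage, which is why the explicit declaration matters.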

Key Takeaways

Unicode assigns a unique integer (code point, written U+XXXX) to every character in every human writing system.
Mojibake — garbled text from encoding mismatches — was the problem Unicode was created to solve.
The Unicode code point space covers U+0000 to U+10FFFF (1,114,112 slots across 17 planes).
Plane 0 (BMP) covers nearly all characters in everyday modern use; Plane 1 (SMP) holds most emoji and many historic scripts.
Unicode defines characters and their properties; UTF-8/UTF-16/UTF-32 are the encodings that serialize those characters as bytes.
UTF-8 has won the web: over 98% of web pages use it.
Use unicodedata in Python to introspect character name, category, and other properties.