What is Unicode? A Complete Guide
Unicode is the universal character encoding standard that assigns a unique number to every character in every writing system.
Unicode is the single most important standard in modern computing that you've probably never thought about — yet it's silently at work every time you send a message, load a web page, or write code. Before Unicode existed, computers around the world spoke dozens of incompatible dialects of text, and moving data between systems was a constant source of garbled output. This guide explains what Unicode is, why it was created, and why it matters to every developer working with text.
The Problem Unicode Solved
In the early days of computing, each manufacturer and country invented its own system for mapping bytes to characters. The American ASCII standard used 7 bits to encode 128 characters — enough for English letters, digits, and punctuation. When computers spread globally, vendors extended ASCII into hundreds of competing "code pages": Windows-1252 for Western Europe, KOI8-R for Russian, GB2312 for Simplified Chinese, Shift-JIS for Japanese.
The result was chaos. A file created on a Japanese system would display as garbage on a French system. Emails sent between countries arrived as strings of question marks and boxes. This phenomenon got a name: mojibake (文字化け) — Japanese for "character transformation" — and it was the universal nightmare of internationalized software.
The root cause was simple: there was no agreement on which number meant which character. Every encoding made up its own rules, and none of them were compatible.
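The mismatch is easy to reproduce with Python's built-in codecs. The sketch below is illustrative only (the variable names are made up for this example): it shows one byte decoding to different characters under two legacy code pages, and UTF-8 bytes turning into mojibake when read as Windows-1252.

```python
# One byte, two meanings: 0xE9 under two legacy single-byte encodings.
raw = bytes([0xE9])
as_western = raw.decode("cp1252")   # Windows-1252 (Western Europe)
as_russian = raw.decode("koi8_r")   # KOI8-R (Russian)
print(as_western, as_russian)       # é И

# Classic mojibake: UTF-8 bytes misread as Windows-1252.
original = "café"
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)                      # cafÃ©

# Undoing the wrong decode recovers the text, provided no bytes were lost.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)         # True
```

The repair step only works here because every byte happened to survive the round trip; in general, a wrong decode can destroy information.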
Enter the Unicode Consortium
In 1987, engineers at Xerox and Apple began collaborating on a universal character set. The goal was radical: assign a unique number to every character in every writing system used by humans, past or present. In 1991, the Unicode Consortium was formally incorporated, and Unicode 1.0 was published.
Today the Consortium is a non-profit organization whose members include Apple, Google, Microsoft, IBM, Adobe, Facebook (now Meta), and dozens of universities and governments. Its primary output is the Unicode Standard, a specification updated regularly (version 15.1 at the time of writing) that defines:
- The universal character repertoire (every assigned character)
- Properties of each character (category, directionality, case, combining behavior)
- Algorithms for sorting, rendering, bidirectional text, and line breaking
- Encoding forms (UTF-8, UTF-16, UTF-32)
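The Unicode Standard has a parallel international standard, ISO/IEC 10646, which defines the same character repertoire. It is maintained by an ISO/IEC committee (JTC 1/SC 2) rather than by the Consortium, but the two bodies coordinate closely. Both standards are kept in sync and are effectively interchangeable at the code point level.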
Code Points: The Foundation
The central concept in Unicode is the code point: a unique integer assigned to each character. Code points are written in the format U+XXXX, where the Xs are four to six hexadecimal digits.
Some examples:
| Character | Code Point | Name |
|---|---|---|
| A | U+0041 | LATIN CAPITAL LETTER A |
| é | U+00E9 | LATIN SMALL LETTER E WITH ACUTE |
| 中 | U+4E2D | CJK UNIFIED IDEOGRAPH-4E2D |
| 😀 | U+1F600 | GRINNING FACE |
| ☃ | U+2603 | SNOWMAN |
| ∞ | U+221E | INFINITY |
Code points range from U+0000 to U+10FFFF, giving a total capacity of 1,114,112 possible code points. As of Unicode 15.1, 149,813 of those slots are assigned to characters.
Code points are abstract numbers — they say what a character is, but not how it is stored in memory or on disk. That job belongs to the encoding forms (UTF-8, UTF-16, UTF-32), which map code points to actual bytes.
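In Python, the built-in functions ord() and chr() convert between a character and its code point. The following is a minimal sketch; the helper name format_code_point is hypothetical and exists only for this illustration.

```python
def format_code_point(char: str) -> str:
    """Return the U+XXXX notation for a single character (illustrative helper)."""
    # ord() gives the code point as an integer; pad to at least four hex digits.
    return f"U+{ord(char):04X}"

for char in "Aé中😀":
    print(char, format_code_point(char), ord(char))

# chr() is the inverse of ord(): integer code point back to a character.
snowman = chr(0x2603)
print(snowman)  # ☃
```

Note that nothing here involves bytes: ord() and chr() work purely at the level of abstract code points.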
Unicode Planes
The 1,114,112 code points are divided into 17 planes, each containing 65,536 code points:
| Plane | Range | Name | Notable Contents |
|---|---|---|---|
| 0 | U+0000–U+FFFF | Basic Multilingual Plane (BMP) | Latin, Greek, Cyrillic, CJK, most common symbols |
| 1 | U+10000–U+1FFFF | Supplementary Multilingual Plane (SMP) | Emoji, historic scripts, musical notation |
| 2 | U+20000–U+2FFFF | Supplementary Ideographic Plane (SIP) | Rare CJK ideographs |
| 3 | U+30000–U+3FFFF | Tertiary Ideographic Plane (TIP) | Very rare CJK ideographs (added Unicode 13) |
| 4–13 | U+40000–U+DFFFF | Unassigned | Reserved for future use |
| 14 | U+E0000–U+EFFFF | Supplementary Special-purpose Plane | Language tags, variation selectors |
| 15–16 | U+F0000–U+10FFFF | Private Use Area Planes | Custom characters (not interoperable) |
The BMP is by far the most important plane. It covers the characters needed for virtually every modern language in daily use. The SMP is where you'll find emoji (starting around U+1F300) and historic scripts like Linear B, Cuneiform, and Egyptian Hieroglyphs.
The range U+D800–U+DFFF (2,048 code points) is permanently reserved and will never contain characters — these are "surrogate" code points used as a technical mechanism by UTF-16 to encode characters outside the BMP.
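Because each plane spans exactly 0x10000 code points, the plane number is just the code point shifted right by 16 bits. The sketch below shows that arithmetic, plus the way UTF-16 splits a supplementary code point into a surrogate pair. The function names are assumptions made for this example, not part of any library.

```python
from typing import Tuple

def plane_of(code_point: int) -> int:
    """Return the plane number (0-16) that a code point belongs to."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code point range")
    return code_point >> 16

def is_surrogate(code_point: int) -> bool:
    """True for the reserved range U+D800-U+DFFF."""
    return 0xD800 <= code_point <= 0xDFFF

def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP use surrogate pairs")
    offset = code_point - 0x10000      # a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

print(plane_of(ord("A")), plane_of(ord("😀")))        # 0 1
print([hex(u) for u in to_surrogate_pair(0x1F600)])   # ['0xd83d', '0xde00']
```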
Character Properties
Every Unicode character carries a rich set of properties — metadata that tells text rendering engines and programming libraries how to handle it:
- General Category: Is it a letter (L), mark (M), number (N), punctuation (P), symbol (S), separator (Z), or control or other character (C)?
- Name: A unique, stable uppercase string like LATIN CAPITAL LETTER A.
- Bidirectional Class: Is it left-to-right (like English), right-to-left (like Arabic), or neutral?
- Combining Class: For combining marks such as diacritics, a number that determines how marks are ordered relative to one another when a sequence is normalized.
- Case Mapping: Uppercase, lowercase, and titlecase equivalents.
- Numeric Value: For digit and number characters, the actual numeric value.
- Script: Which writing system the character belongs to (Latin, Arabic, Han, etc.).
- Block: The Unicode block the character is in (Basic Latin, CJK Unified Ideographs, etc.).
In Python, you can access many of these properties via the unicodedata module:
import unicodedata
char = "é" # U+00E9
print(unicodedata.name(char)) # LATIN SMALL LETTER E WITH ACUTE
print(unicodedata.category(char)) # Ll (letter, lowercase)
print(unicodedata.bidirectional(char)) # L (left-to-right)
print(unicodedata.combining(char)) # 0 (not a combining mark)
print(unicodedata.normalize("NFD", char)) # e + U+0301 (combining acute accent)
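A few of the other properties listed above can be probed the same way. This is a small sketch using only the standard library; as far as the standard unicodedata module goes, it does not expose the Script or Block properties, so those are left out here.

```python
import unicodedata

# Numeric Value: digits and fractions from any script carry a numeric value.
half = unicodedata.numeric("½")
arabic_three = unicodedata.decimal("٣")   # ARABIC-INDIC DIGIT THREE, U+0663
print(half, arabic_three)                 # 0.5 3

# Case Mapping: str methods apply Unicode case mappings, which can change length.
print("ß".upper())                        # SS

# General Category: the first letter is the major class, the second the subclass.
for char in ["A", "5", "!", "€", " ", "\u0301"]:
    print(repr(char), unicodedata.category(char))

# Combining Class: nonzero for combining marks such as U+0301.
print(unicodedata.combining("\u0301"))    # 230
```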
Unicode vs. Encoding: A Critical Distinction
One of the most common sources of confusion is conflating Unicode with a specific encoding. Unicode is the standard; UTF-8, UTF-16, and UTF-32 are encodings — different ways to serialize code points as bytes.
Think of it this way:
- Unicode defines that the character "A" has code point U+0041 (the number 65).
- UTF-8 encodes U+0041 as the single byte 0x41.
- UTF-16 encodes U+0041 as the two bytes 0x00 0x41 (big-endian) or 0x41 0x00 (little-endian).
- UTF-32 encodes U+0041 as the four bytes 0x00 0x00 0x00 0x41.
All three encodings represent the exact same character — they just store it differently in memory. UTF-8 is by far the most prevalent on the web and in file storage; UTF-16 is used internally by Windows, Java, and JavaScript; UTF-32 is rare but useful for constant-time indexing.
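The difference is easiest to see by encoding the same characters three ways. The following sketch uses the explicit big-endian codec names so that no byte order mark is added; the helper encoded_sizes is a hypothetical name used only here.

```python
from typing import Tuple

def encoded_sizes(char: str) -> Tuple[int, int, int]:
    """Return the byte length of one character in UTF-8, UTF-16, and UTF-32."""
    return (
        len(char.encode("utf-8")),
        len(char.encode("utf-16-be")),   # explicit byte order, so no BOM
        len(char.encode("utf-32-be")),
    )

for char in ["A", "é", "中", "😀"]:
    utf8 = char.encode("utf-8").hex(" ")
    utf16 = char.encode("utf-16-be").hex(" ")
    utf32 = char.encode("utf-32-be").hex(" ")
    print(f"U+{ord(char):04X}  utf-8: {utf8}  utf-16: {utf16}  utf-32: {utf32}")

print(encoded_sizes("A"))    # (1, 2, 4)
print(encoded_sizes("😀"))   # (4, 4, 4)
```

UTF-8 and UTF-16 are variable-width, so the size depends on the character, while UTF-32 always spends four bytes per code point.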
Unicode in Practice
Python
Python 3's str type is a sequence of Unicode code points. The interpreter handles encoding internally:
text = "Hello, 世界! 😀"
print(len(text)) # 12 — code points, not bytes
print(text.encode("utf-8")) # b'Hello, \xe4\xb8\x96\xe7\x95\x8c! \xf0\x9f\x98\x80'
print(text.encode("utf-16")) # includes BOM + 2 bytes per BMP char, 4 for emoji
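To make the byte counts concrete, the sketch below measures the same string in both encodings and checks the byte order mark that the generic "utf-16" codec prepends. It relies on codecs.BOM_UTF16, which is the BOM in the platform's native byte order.

```python
import codecs

text = "Hello, 世界! 😀"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16")   # generic "utf-16" prepends a BOM in native byte order
print(len(text), len(utf8), len(utf16))   # 12 19 28

# The first two bytes of the UTF-16 output are the byte order mark.
print(utf16[:2] == codecs.BOM_UTF16)      # True

# Decoding with the matching codec restores the original str.
print(utf8.decode("utf-8") == text, utf16.decode("utf-16") == text)   # True True
```

The 28 bytes break down as 2 for the BOM, 22 for the eleven BMP characters, and 4 for the emoji.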
JavaScript
JavaScript strings are sequences of UTF-16 code units. Characters outside the BMP (like most emoji) take two code units ("surrogate pairs"), which can trip up .length:
const text = "Hello 😀";
console.log(text.length); // 8 — counts UTF-16 code units (emoji = 2)
console.log([...text].length); // 7 — spread iterates by code point
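For comparison, the same two counts can be reproduced in Python, which is handy when a Python backend has to enforce a length limit that a JavaScript front end measures with .length. This is a sketch; utf16_length is a hypothetical helper name.

```python
def utf16_length(text: str) -> int:
    """Count UTF-16 code units, the quantity JavaScript's .length reports."""
    # Each UTF-16 code unit is two bytes; "utf-16-le" adds no BOM.
    return len(text.encode("utf-16-le")) // 2

text = "Hello 😀"
print(len(text))            # 7 code points
print(utf16_length(text))   # 8 UTF-16 code units
```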
Web (HTML/HTTP)
Always declare your encoding. For HTML5:
<meta charset="UTF-8">
For HTTP responses, set the Content-Type header:
Content-Type: text/html; charset=utf-8
Without an explicit declaration, browsers may misdetect the encoding and display garbage.
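On the receiving side, a client can read the declared charset from the Content-Type header before decoding the body. The following is a minimal sketch using the standard library's email.message.Message to parse the header. The function names are hypothetical, and falling back to UTF-8 when no charset is declared is an assumption of this example, not a description of what browsers do.

```python
from email.message import Message

def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # Returns the charset in lower case, or the default if none is declared.
    return msg.get_content_charset(default)

def decode_body(body: bytes, content_type: str) -> str:
    """Decode response bytes using the charset named in Content-Type."""
    charset = charset_from_content_type(content_type)
    return body.decode(charset, errors="replace")

body = "<p>naïve café</p>".encode("utf-8")
print(charset_from_content_type("text/html; charset=utf-8"))   # utf-8
print(decode_body(body, "text/html; charset=utf-8"))           # <p>naïve café</p>
```

Decoding the same bytes with a mismatched declaration, such as charset=windows-1252, would produce exactly the kind of garbled output the declaration is meant to prevent.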
Key Takeaways
- Unicode assigns a unique integer (code point, written U+XXXX) to every character in every human writing system.
- Mojibake, the garbled text that results from encoding mismatches, was the problem Unicode was created to solve.
- The Unicode code point space covers U+0000 to U+10FFFF (1,114,112 slots across 17 planes).
- Plane 0 (BMP) covers modern languages; Plane 1 (SMP) covers emoji and historic scripts.
- Unicode defines characters and their properties; UTF-8/UTF-16/UTF-32 are the encodings that serialize those characters as bytes.
- UTF-8 has won the web: over 98% of web pages use it.
- Use unicodedata in Python to introspect character name, category, and other properties.