📚 Unicode Fundamentals

What is a Unicode Code Point?

A Unicode code point is the unique number assigned to each character in the Unicode standard, written in the form U+0041. This guide explains what code points are, how they are structured, and how they relate to the bytes stored in a file.

Published 2021-04-28 · Updated 2024-07-15

Every character you see on screen — a Latin letter, a Chinese ideograph, an emoji, a mathematical symbol — has a number behind it. In Unicode, that number is called a code point. Code points are the atomic building blocks of the entire Unicode standard: before any encoding happens, before bytes hit the wire, before a font turns shapes into pixels, there is a code point that says this number means this character. Understanding code points is the single most important step toward understanding how text works in modern computing.

Definition

A Unicode code point is a unique non-negative integer assigned to an abstract character (or to a special-purpose entry like a control code or a reserved slot). The full range runs from 0 to 1,114,111 in decimal, or U+0000 to U+10FFFF in hexadecimal.

The U+ prefix is the standard notation. You will see it everywhere: in the Unicode Character Database, in programming documentation, in font specifications, and on this site. The digits after the plus sign are hexadecimal, zero-padded to at least four digits:

Notation	Decimal	Character	Name
U+0041	65	A	LATIN CAPITAL LETTER A
U+00E9	233	é	LATIN SMALL LETTER E WITH ACUTE
U+4E2D	20,013	中	CJK UNIFIED IDEOGRAPH-4E2D
U+1F600	128,512	😀	GRINNING FACE
U+10FFFD	1,114,109	(PUA)	Last private-use code point in Plane 16

Code points above U+FFFF use five or six hex digits. There is no ambiguity because the U+ prefix signals that everything following it is a hex number.
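As a rough illustration of the notation rules, the following Python sketch converts between integers and U+ strings. The helper names `format_code_point` and `parse_code_point` are made up for this example and are not part of any library.

```python
def format_code_point(cp: int) -> str:
    """Return the U+ notation for an integer code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"{cp} is outside the Unicode code point range")
    # :04X pads to at least four hex digits; longer values are not truncated
    return f"U+{cp:04X}"


def parse_code_point(notation: str) -> int:
    """Parse a string such as 'U+1F600' back into an integer."""
    if not notation.upper().startswith("U+"):
        raise ValueError(f"expected a 'U+' prefix, got {notation!r}")
    cp = int(notation[2:], 16)
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"{notation} is outside the Unicode code point range")
    return cp


print(format_code_point(65))          # U+0041
print(format_code_point(0x1F600))     # U+1F600
print(parse_code_point("U+10FFFD"))   # 1114109
```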

Code Points Are Not Characters

This is a subtle but critical distinction. A code point is a number. A character is what a human reads. Most of the time the mapping is one-to-one — U+0041 is "A" and nothing else — but there are important exceptions:

Combining characters: U+0301 (COMBINING ACUTE ACCENT) is a code point, but it is not a standalone character. It modifies the preceding base character. The sequence U+0065 U+0301 produces "é": one visible character built from two code points.
Surrogate code points: U+D800 through U+DFFF are reserved for UTF-16's internal machinery. They are code points but they are never valid characters.
Noncharacters: 66 code points (e.g., U+FFFE, U+FFFF, U+1FFFE) are permanently reserved as "noncharacters" that applications may use internally but must never exchange.
Private Use Areas: Ranges like U+E000–U+F8FF are code points without standard character assignments. Applications can use them for custom symbols, but there is no universal meaning.

The Unicode Standard formalises this with the concept of an abstract character: a unit of information used for the organization, control, or representation of textual data. Code points are the numbers used to identify abstract characters, but not every code point has one assigned.
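One way to observe these exceptions is through the General Category that Python's standard `unicodedata` module reports for each code point. The sketch below is only illustrative; the helper `general_category` and the sample labels are assumptions of this example.

```python
import unicodedata


def general_category(cp: int) -> str:
    """Return the two-letter Unicode General Category of a code point."""
    return unicodedata.category(chr(cp))


samples = {
    0x0041: "ordinary letter",
    0x0301: "combining mark",
    0xD800: "surrogate",
    0xE000: "private use",
    0xFFFF: "noncharacter",
}
for cp, label in samples.items():
    # Print only the number and category, never the raw code point itself
    print(f"U+{cp:04X}  {general_category(cp)}  {label}")

# Two code points can normalize to one precomposed code point (NFC)
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))  # 2 1
```

The categories printed are Lu, Mn, Cs, Co, and Cn respectively; only the first describes something a reader would call a character on its own.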

The Code Point Space

The 1,114,112 possible code points (0x000000 to 0x10FFFF) are organized into 17 planes, each containing 65,536 (0x10000) code points:

Plane	Range	Name	Assigned Characters
0	U+0000–U+FFFF	Basic Multilingual Plane (BMP)	~64,000
1	U+10000–U+1FFFF	Supplementary Multilingual Plane (SMP)	~25,000
2	U+20000–U+2FFFF	Supplementary Ideographic Plane (SIP)	~60,000
3	U+30000–U+3FFFF	Tertiary Ideographic Plane (TIP)	~10,000
4–13	U+40000–U+DFFFF	Unassigned	0
14	U+E0000–U+EFFFF	Supplementary Special-purpose Plane (SSP)	~370
15–16	U+F0000–U+10FFFF	Private Use Area Planes	131,068 (PUA)

As of Unicode 16.0, 154,998 code points have been assigned to characters. That leaves over 800,000 code points available for future allocation, which is plenty of room for undeciphered historical scripts, newly created writing systems, and, inevitably, more emoji.
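Because every plane holds exactly 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. A minimal Python sketch, with `plane_of` and `offset_in_plane` as illustrative names chosen for this example:

```python
def plane_of(cp: int) -> int:
    """Return the plane number (0-16) that contains a code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"{cp} is outside the Unicode code point range")
    return cp >> 16


def offset_in_plane(cp: int) -> int:
    """Return the position of a code point within its plane (0x0000-0xFFFF)."""
    return cp & 0xFFFF


for cp in (0x0041, 0x4E2D, 0x1F600, 0x10FFFD):
    print(f"U+{cp:04X}: plane {plane_of(cp)}, offset 0x{offset_in_plane(cp):04X}")
# U+0041: plane 0, offset 0x0041
# U+4E2D: plane 0, offset 0x4E2D
# U+1F600: plane 1, offset 0xF600
# U+10FFFD: plane 16, offset 0xFFFD
```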

Code Points vs. Encodings

A code point is an abstract number. To store it in a file or transmit it over a network, you need an encoding — a rule that converts the number into a sequence of bytes:

Encoding	How U+0041 (A) is stored	How U+1F600 (😀) is stored
UTF-8	`0x41` (1 byte)	`0xF0 0x9F 0x98 0x80` (4 bytes)
UTF-16LE	`0x41 0x00` (2 bytes)	`0x3D 0xD8 0x00 0xDE` (4 bytes)
UTF-32LE	`0x41 0x00 0x00 0x00` (4 bytes)	`0x00 0xF6 0x01 0x00` (4 bytes)

The code point is always the same — U+0041 or U+1F600. What changes is the byte representation. This is why the Unicode Standard carefully separates the character repertoire (code points) from the encoding forms (UTF-8, UTF-16, UTF-32).

A common mistake is to say "the UTF-8 code for A is 0x41." More precisely: the code point for A is U+0041 (decimal 65), and UTF-8 encodes that code point as the byte 0x41.
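The byte sequences in the table can be reproduced with Python's built-in `str.encode`. This is a small sketch; `encoded_bytes` is a hypothetical helper written for this example.

```python
def encoded_bytes(ch: str, encoding: str) -> str:
    """Encode a string and render the resulting bytes as 0xNN values."""
    return " ".join(f"0x{b:02X}" for b in ch.encode(encoding))


for ch in ("A", "\U0001F600"):
    for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
        # The code point stays the same; only the byte representation changes
        print(f"U+{ord(ch):04X}  {encoding:<9}  {encoded_bytes(ch, encoding)}")
```

The "-le" suffix selects little-endian byte order and omits the byte order mark, which is why the output matches the UTF-16LE and UTF-32LE rows exactly.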

Working with Code Points in Code

Python

Python 3 strings are sequences of Unicode code points. The built-in ord() function returns a code point as an integer, and chr() converts an integer back to a character:

# Getting a code point
print(ord("A"))        # 65
print(hex(ord("A")))   # 0x41
print(f"U+{ord('A'):04X}")  # U+0041

# Going from code point to character
print(chr(0x0041))     # A
print(chr(0x1F600))    # 😀

# Escape syntax — use \u for BMP, \U for all planes
bmp_char = "\u00e9"       # é (U+00E9)
astral_char = "\U0001F600"  # 😀 (U+1F600)

# Iterating code points in a string
text = "Cafe\u0301"  # "Café": e + combining acute
for i, ch in enumerate(text):
    # !a escapes non-ASCII characters, so the combining mark stays visible
    print(f"  [{i}] U+{ord(ch):04X}  {ch!a}")
# [0] U+0043  'C'
# [1] U+0061  'a'
# [2] U+0066  'f'
# [3] U+0065  'e'
# [4] U+0301  '\u0301'

JavaScript

JavaScript strings are sequences of UTF-16 code units, not code points. For characters outside the BMP, a single code point is stored as two code units (a surrogate pair):

// Getting a code point
"A".codePointAt(0)           // 65
"A".codePointAt(0).toString(16)  // "41"

// Going from code point to character
String.fromCodePoint(0x0041)     // "A"
String.fromCodePoint(0x1F600)    // "😀"

// ⚠️ .length counts UTF-16 code units, not code points
"😀".length                      // 2 (surrogate pair)
[..."😀"].length                 // 1 (spread iterates code points)

// Iterating by code point (not code unit)
for (const ch of "Cafe\u0301") {
    console.log(`U+${ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")}`);
}

Rust

Rust's char type represents a single Unicode scalar value, that is, any code point except the surrogates U+D800–U+DFFF:

fn main() {
    let ch: char = 'A';
    println!("U+{:04X}", ch as u32);  // U+0041

    // Iterating code points
    for ch in "Cafe\u{0301}".chars() {
        println!("U+{:04X}", ch as u32);
    }
}

HTML

In HTML, you can insert a character by its code point using a numeric character reference, written in either decimal or hexadecimal:

<!-- Decimal reference -->
&#65;       <!-- A -->
&#128512;   <!-- 😀 -->

<!-- Hexadecimal reference -->
&#x0041;    <!-- A -->
&#x1F600;   <!-- 😀 -->
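Numeric character references can also be generated and decoded programmatically. The Python sketch below uses the standard library's `html.unescape` for decoding; `to_hex_reference` is a hypothetical helper written for this example, not a library function.

```python
import html


def to_hex_reference(text: str) -> str:
    """Rewrite every code point in text as a hexadecimal character reference."""
    return "".join(f"&#x{ord(ch):X};" for ch in text)


print(to_hex_reference("A"))                    # &#x41;
print(to_hex_reference("\U0001F600"))           # &#x1F600;

# Decimal and hexadecimal references decode to the same code point
print(html.unescape("&#65;") == html.unescape("&#x0041;"))   # True
print(html.unescape("&#128512;") == "\U0001F600")            # True
```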

Surrogate Code Points (U+D800–U+DFFF)

The 2,048 code points from U+D800 to U+DFFF are permanently reserved as surrogates. UTF-16 uses pairs of surrogates to represent code points above U+FFFF:

A high surrogate (U+D800–U+DBFF) followed by a low surrogate (U+DC00–U+DFFF) together encode one code point in the range U+10000–U+10FFFF.
These code points never appear alone in well-formed Unicode text.
They are not characters and have no visual representation.

For example, U+1F600 (😀) is encoded in UTF-16 as the surrogate pair 0xD83D 0xDE00. The formula is:

High surrogate = 0xD800 + ((code_point - 0x10000) >> 10)
Low surrogate  = 0xDC00 + ((code_point - 0x10000) & 0x3FF)

For U+1F600:

0x1F600 - 0x10000 = 0xF600
High: 0xD800 + (0xF600 >> 10) = 0xD800 + 0x3D = 0xD83D
Low: 0xDC00 + (0xF600 & 0x3FF) = 0xDC00 + 0x200 = 0xDE00
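The same arithmetic can be written as a pair of Python functions, one for each direction. This is a sketch under the formula given above; the function names are illustrative.

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = cp - 0x10000                # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(f"0x{high:04X} 0x{low:04X}")                  # 0xD83D 0xDE00
print(f"U+{from_surrogate_pair(high, low):04X}")    # U+1F600
```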

Private Use Code Points

Three ranges are reserved for private use — applications can assign them any meaning they want:

Range	Name	Code Points
U+E000–U+F8FF	BMP Private Use Area	6,400
U+F0000–U+FFFFD	Supplementary PUA-A	65,534
U+100000–U+10FFFD	Supplementary PUA-B	65,534

Companies and communities use these for custom symbols. For example, Apple uses the PUA code point U+F8FF for the Apple logo in its system fonts. Conlang communities use them for constructed scripts. The key limitation is that PUA characters have no universal meaning: they only work when sender and receiver agree on the mapping.
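A membership check for the three ranges might look like the Python sketch below. It takes Supplementary PUA-A to end at U+FFFFD, which is what yields the 65,534 count, since the last two code points of each plane are noncharacters. The constant and function names are assumptions of this example.

```python
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),       # BMP Private Use Area
    (0xF0000, 0xFFFFD),     # Supplementary PUA-A
    (0x100000, 0x10FFFD),   # Supplementary PUA-B
)


def is_private_use(cp: int) -> bool:
    """Return True if the code point lies in one of the private-use ranges."""
    return any(start <= cp <= end for start, end in PRIVATE_USE_RANGES)


total = sum(end - start + 1 for start, end in PRIVATE_USE_RANGES)
print(total)                     # 137468
print(is_private_use(0xF8FF))    # True
print(is_private_use(0x0041))    # False
print(is_private_use(0xFFFFE))   # False (noncharacter, not private use)
```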

Noncharacters

Unicode designates 66 code points as noncharacters — permanently reserved for internal use and never to be exchanged in open text:

U+FDD0 through U+FDEF (32 code points)
The last two code points of every plane: U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF (34 code points)

Applications may use these internally (for example, as sentinel values), but conformant processes must not remove them from text they are processing — they should simply not generate them for interchange.
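Both groups of noncharacters follow a simple pattern, so a check fits in one expression. The following Python sketch uses a hypothetical `is_noncharacter` helper and confirms the total of 66 by scanning the whole code point space.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True for the 66 code points designated as noncharacters."""
    in_fdd0_block = 0xFDD0 <= cp <= 0xFDEF
    # The low 16 bits are FFFE or FFFF for the last two code points of a plane
    last_two_of_plane = (cp & 0xFFFE) == 0xFFFE
    return in_fdd0_block or last_two_of_plane


count = sum(1 for cp in range(0x110000) if is_noncharacter(cp))
print(count)                       # 66
print(is_noncharacter(0x1FFFE))    # True
print(is_noncharacter(0xFFFD))     # False (REPLACEMENT CHARACTER)
```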

Key Takeaways

A code point is a unique number from U+0000 to U+10FFFF assigned to an abstract character in the Unicode standard.
The U+ notation uses hexadecimal, zero-padded to at least four digits.
Code points are not the same as bytes — encodings (UTF-8, UTF-16, UTF-32) convert code points into byte sequences.
Not all code points are characters: surrogates, noncharacters, and unassigned code points exist within the code point space.
Python ord() / chr(), JavaScript codePointAt() / String.fromCodePoint(), and HTML &#xHHHH; all work directly with code points.
The full code point space has room for over 1.1 million entries across 17 planes — with plenty of space remaining for the future of human writing.