📚 Unicode Fundamentals

What is a Unicode Code Point?

A Unicode code point is the unique number assigned to each character in the Unicode standard, written in the form U+0041. This guide explains what code points are, how they are structured, and how they relate to the bytes stored in a file.

Every character you see on screen — a Latin letter, a Chinese ideograph, an emoji, a mathematical symbol — has a number behind it. In Unicode, that number is called a code point. Code points are the atomic building blocks of the entire Unicode standard: before any encoding happens, before bytes hit the wire, before a font turns shapes into pixels, there is a code point that says this number means this character. Understanding code points is the single most important step toward understanding how text works in modern computing.

Definition

A Unicode code point is a unique non-negative integer assigned to an abstract character (or to a special-purpose entry like a control code or a reserved slot). The full range runs from 0 to 1,114,111 in decimal, or U+0000 to U+10FFFF in hexadecimal.

The U+ prefix is the standard notation. You will see it everywhere: in the Unicode Character Database, in programming documentation, in font specifications, and on this site. The digits after the plus sign are hexadecimal, zero-padded to at least four digits:

Notation | Decimal | Character | Name
U+0041 | 65 | A | LATIN CAPITAL LETTER A
U+00E9 | 233 | é | LATIN SMALL LETTER E WITH ACUTE
U+4E2D | 20,013 | 中 | CJK UNIFIED IDEOGRAPH-4E2D
U+1F600 | 128,512 | 😀 | GRINNING FACE
U+10FFFD | 1,114,109 | (private use) | Last private-use code point in Plane 16

Code points above U+FFFF use five or six hex digits. There is no ambiguity because the U+ prefix signals that everything following it is a hex number.
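As a minimal sketch of the notation rules above, the following Python converts between integers and U+ strings. The helper names format_code_point and parse_code_point are illustrative assumptions, not part of any standard library.

```python
MAX_CODE_POINT = 0x10FFFF  # upper bound of the Unicode code space


def format_code_point(cp: int) -> str:
    """Return U+ notation for an integer code point (illustrative helper)."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError(f"{cp} is outside the Unicode code point range")
    # 04X pads to at least four uppercase hex digits and never truncates.
    return f"U+{cp:04X}"


def parse_code_point(notation: str) -> int:
    """Parse a string such as 'U+1F600' into an integer (illustrative helper)."""
    if not notation.upper().startswith("U+"):
        raise ValueError(f"expected U+ notation, got {notation!r}")
    cp = int(notation[2:], 16)
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError(f"{notation} is outside the Unicode code point range")
    return cp


print(format_code_point(65))       # U+0041
print(format_code_point(128512))   # U+1F600
print(parse_code_point("U+4E2D"))  # 20013
```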

Code Points Are Not Characters

This is a subtle but critical distinction. A code point is a number. A character is what a human reads. Most of the time the mapping is one-to-one — U+0041 is "A" and nothing else — but there are important exceptions:

  • Combining characters: U+0301 (COMBINING ACUTE ACCENT) is a code point, but it is not a standalone character. It modifies the preceding base character. The sequence U+0065 U+0301 produces "é": one visible character built from two code points.
  • Surrogate code points: U+D800 through U+DFFF are reserved for UTF-16's internal machinery. They are code points but they are never valid characters.
  • Noncharacters: 66 code points (e.g., U+FFFE, U+FFFF, U+1FFFE) are permanently reserved as "noncharacters" that applications may use internally but must never exchange.
  • Private Use Areas: Ranges like U+E000–U+F8FF are code points without standard character assignments. Applications can use them for custom symbols, but there is no universal meaning.

The Unicode Standard formalises this with the concept of an abstract character: a unit of information used for the organization, control, or representation of textual data. An assigned code point identifies an abstract character, but not every code point is assigned to one.
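One way to see these distinctions in practice is to look at a code point's General Category. The sketch below uses Python's unicodedata module; the function name describe and its labels are assumptions made for illustration, and the result for unassigned code points depends on the Unicode version bundled with your Python build.

```python
import unicodedata


def describe(cp: int) -> str:
    """Give a rough label for a code point based on its General Category."""
    category = unicodedata.category(chr(cp))
    if category == "Cs":
        return "surrogate"
    if category == "Co":
        return "private use"
    if category == "Cn":
        # Cn covers both unassigned code points and noncharacters.
        return "unassigned or noncharacter"
    if category.startswith("M"):
        return "combining mark"
    return "other assigned code point"


for cp in (0x0041, 0x0301, 0xD800, 0xE000, 0xFFFE):
    print(f"U+{cp:04X}: {describe(cp)}")
```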

The Code Point Space

The 1,114,112 possible code points (0x000000 to 0x10FFFF) are organized into 17 planes, each containing 65,536 (0x10000) code points:

Plane | Range | Name | Assigned Characters
0 | U+0000–U+FFFF | Basic Multilingual Plane (BMP) | ~64,000
1 | U+10000–U+1FFFF | Supplementary Multilingual Plane (SMP) | ~25,000
2 | U+20000–U+2FFFF | Supplementary Ideographic Plane (SIP) | ~60,000
3 | U+30000–U+3FFFF | Tertiary Ideographic Plane (TIP) | ~10,000
4–13 | U+40000–U+DFFFF | Unassigned | 0
14 | U+E0000–U+EFFFF | Supplementary Special-purpose Plane (SSP) | ~370
15–16 | U+F0000–U+10FFFF | Private Use Area Planes | 131,068 (PUA)

As of Unicode 16.0, 154,998 code points have been assigned to characters. That leaves over 800,000 code points available for future allocation: plenty of room for undeciphered historical scripts, newly created writing systems, and, inevitably, more emoji.
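Because every plane holds exactly 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. A minimal sketch, assuming a hypothetical helper named plane_of:

```python
PLANE_SIZE = 0x10000  # 65,536 code points per plane


def plane_of(cp: int) -> int:
    """Return the plane number (0 to 16) of a code point (illustrative helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"{cp} is outside the Unicode code point range")
    return cp >> 16  # equivalent to cp // PLANE_SIZE


for ch in ("A", "中", "😀"):
    print(f"U+{ord(ch):04X} is in plane {plane_of(ord(ch))}")
# U+0041 is in plane 0
# U+4E2D is in plane 0
# U+1F600 is in plane 1
```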

Code Points vs. Encodings

A code point is an abstract number. To store it in a file or transmit it over a network, you need an encoding — a rule that converts the number into a sequence of bytes:

Encoding | How U+0041 (A) is stored | How U+1F600 (😀) is stored
UTF-8 | 0x41 (1 byte) | 0xF0 0x9F 0x98 0x80 (4 bytes)
UTF-16LE | 0x41 0x00 (2 bytes) | 0x3D 0xD8 0x00 0xDE (4 bytes)
UTF-32LE | 0x41 0x00 0x00 0x00 (4 bytes) | 0x00 0xF6 0x01 0x00 (4 bytes)

The code point is always the same — U+0041 or U+1F600. What changes is the byte representation. This is why the Unicode Standard carefully separates the character repertoire (code points) from the encoding forms (UTF-8, UTF-16, UTF-32).

A common mistake is to say "the UTF-8 code for A is 0x41." More precisely: the code point for A is U+0041 (decimal 65), and UTF-8 encodes that code point as the byte 0x41.
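The separation between code point and bytes can be checked directly with Python's built-in codecs. This is a small sketch; encoded_hex is an illustrative helper name, while "utf-8", "utf-16-le", and "utf-32-le" are standard Python codec names.

```python
def encoded_hex(ch: str, codec: str) -> str:
    """Return the bytes of ch in the given codec as spaced hex (illustrative helper)."""
    return " ".join(f"{byte:02X}" for byte in ch.encode(codec))


for ch in ("A", "😀"):
    for codec in ("utf-8", "utf-16-le", "utf-32-le"):
        # Same code point each time; only the byte sequence differs.
        print(f"U+{ord(ch):04X}  {codec:<9}  {encoded_hex(ch, codec)}")
```

The code point printed in the first column stays constant across the three encodings, which is the point the paragraph above makes.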

Working with Code Points in Code

Python

Python 3 strings are sequences of Unicode code points. The built-in ord() function returns a code point as an integer, and chr() converts an integer back to a character:

# Getting a code point
print(ord("A"))        # 65
print(hex(ord("A")))   # 0x41
print(f"U+{ord('A'):04X}")  # U+0041

# Going from code point to character
print(chr(0x0041))     # A
print(chr(0x1F600))    # 😀

# Escape syntax — use \u for BMP, \U for all planes
bmp_char = "\u00e9"       # é (U+00E9)
astral_char = "\U0001F600"  # 😀 (U+1F600)

# Iterating code points in a string
text = "Cafe\u0301"  # "Café": e + combining acute
for i, ch in enumerate(text):
    # !a gives an ASCII-only repr, so the combining mark prints as an escape
    print(f"[{i}] U+{ord(ch):04X}  {ch!a}")
# [0] U+0043  'C'
# [1] U+0061  'a'
# [2] U+0066  'f'
# [3] U+0065  'e'
# [4] U+0301  '\u0301'
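Beyond ord() and chr(), the standard library's unicodedata module can map between code points and their official character names. The sketch below wraps it in a hypothetical helper, name_of, that falls back to a placeholder for code points without a name; the placeholder text is an assumption for illustration.

```python
import unicodedata


def name_of(ch: str) -> str:
    """Return the Unicode name of ch, or a placeholder if it has none."""
    # unicodedata.name raises ValueError unless a default is supplied.
    return unicodedata.name(ch, "<no name>")


for ch in ("A", "\u00e9", "\u4e2d", "\U0001F600", "\ue000"):
    print(f"U+{ord(ch):04X}  {name_of(ch)}")

# lookup() goes the other way: from a name to the character.
print(unicodedata.lookup("GRINNING FACE") == "\U0001F600")  # True
```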

JavaScript

JavaScript strings are sequences of UTF-16 code units, not code points. For characters outside the BMP, a single code point is stored as two code units (a surrogate pair):

// Getting a code point
"A".codePointAt(0)           // 65
"A".codePointAt(0).toString(16)  // "41"

// Going from code point to character
String.fromCodePoint(0x0041)     // "A"
String.fromCodePoint(0x1F600)    // "😀"

// ⚠️ .length counts UTF-16 code units, not code points
"😀".length                      // 2 (surrogate pair)
[..."😀"].length                 // 1 (spread iterates code points)

// Iterating by code point (not code unit)
for (const ch of "Cafe\u0301") {
    console.log(`U+${ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")}`);
}
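For comparison, Python's len() counts code points, so the number JavaScript reports as .length has to be computed explicitly. A minimal sketch, assuming an illustrative helper named utf16_code_units:

```python
def utf16_code_units(text: str) -> int:
    """Count the UTF-16 code units needed for text (illustrative helper)."""
    # Each UTF-16 code unit is two bytes; the -le codec adds no BOM.
    return len(text.encode("utf-16-le")) // 2


emoji = "\U0001F600"
print(len(emoji))               # 1 code point
print(utf16_code_units(emoji))  # 2 code units (a surrogate pair)
print(utf16_code_units("A"))    # 1
```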

Rust

Rust's char type represents a single Unicode scalar value, that is, any code point except the surrogates U+D800–U+DFFF:

fn main() {
    // Cast a char to u32 to get its code point
    let ch: char = 'A';
    println!("U+{:04X}", ch as u32);  // U+0041

    // Iterating code points
    for ch in "Cafe\u{0301}".chars() {
        println!("U+{:04X}", ch as u32);
    }
}
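The scalar-value restriction can be illustrated in Python as well. Unlike Rust's char, a Python str can hold a lone surrogate, but the UTF-8 codec refuses to encode it. The helper name is_scalar_value is an assumption for this sketch.

```python
def is_scalar_value(cp: int) -> bool:
    """True for any code point that is not a surrogate (illustrative helper)."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


lone = chr(0xD800)  # Python permits a lone surrogate inside a str
try:
    lone.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    # Surrogates are not scalar values, so UTF-8 has no encoding for them.
    encodable = False

print(is_scalar_value(0x1F600))  # True
print(is_scalar_value(0xD800))   # False
print(encodable)                 # False
```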

HTML

In HTML, you can insert a character by its code point using a numeric character reference:

<!-- Decimal reference -->
&#65;       <!-- A -->
&#128512;   <!-- 😀 -->

<!-- Hexadecimal reference -->
&#x0041;    <!-- A -->
&#x1F600;   <!-- 😀 -->
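If you generate or consume HTML from a script, the same references can be produced and decoded programmatically. The sketch below uses Python's standard html.unescape for decoding; to_hex_reference is a hypothetical helper written for this example.

```python
import html


def to_hex_reference(ch: str) -> str:
    """Build a hexadecimal numeric character reference (illustrative helper)."""
    return f"&#x{ord(ch):X};"


print(to_hex_reference("A"))           # &#x41;
print(to_hex_reference("\U0001F600"))  # &#x1F600;

# html.unescape decodes both decimal and hexadecimal references.
print(html.unescape("&#65; &#x1F600;") == "A \U0001F600")  # True
```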

Surrogate Code Points (U+D800–U+DFFF)

The 2,048 code points from U+D800 to U+DFFF are permanently reserved as surrogates. UTF-16 uses pairs of surrogates to represent code points above U+FFFF:

  • A high surrogate (U+D800–U+DBFF) followed by a low surrogate (U+DC00–U+DFFF) together encode one code point in the range U+10000–U+10FFFF.
  • These code points never appear alone in well-formed Unicode text.
  • They are not characters and have no visual representation.

For example, U+1F600 (😀) is encoded in UTF-16 as the surrogate pair 0xD83D 0xDE00. The formula is:

High surrogate = 0xD800 + ((code_point - 0x10000) >> 10)
Low surrogate  = 0xDC00 + ((code_point - 0x10000) & 0x3FF)

For U+1F600:

  • Offset: 0x1F600 - 0x10000 = 0xF600
  • High: 0xD800 + (0xF600 >> 10) = 0xD800 + 0x3D = 0xD83D
  • Low: 0xDC00 + (0xF600 & 0x3FF) = 0xDC00 + 0x200 = 0xDE00
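The surrogate formula translates almost line for line into code. The following is a sketch rather than a production codec; the function names to_surrogate_pair and from_surrogate_pair are assumptions made for illustration.

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = cp - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")                # D83D DE00
print(f"U+{from_surrogate_pair(high, low):04X}")  # U+1F600
```

The result can be cross-checked against Python's own UTF-16 encoder: "\U0001F600".encode("utf-16-be") yields the bytes D8 3D DE 00.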

Private Use Code Points

Three ranges are reserved for private use — applications can assign them any meaning they want:

Range | Name | Code Points
U+E000–U+F8FF | BMP Private Use Area | 6,400
U+F0000–U+FFFFD | Supplementary PUA-A | 65,534
U+100000–U+10FFFD | Supplementary PUA-B | 65,534

Companies and communities use these for custom symbols. For example, Apple's system fonts map the Apple logo to the PUA code point U+F8FF, and conlang communities use PUA code points for constructed scripts. The key limitation is that PUA characters have no universal meaning: they only work when sender and receiver agree on the mapping.
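A range check is enough to detect private use code points. In this sketch the constant PRIVATE_USE_RANGES and the helper is_private_use are illustrative names; the two supplementary ranges end at ...FFFD because the last two code points of each plane are noncharacters, which is why each holds 65,534 code points.

```python
# (first, last) code point of each private use range, inclusive
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),      # BMP Private Use Area
    (0xF0000, 0xFFFFD),    # Supplementary PUA-A
    (0x100000, 0x10FFFD),  # Supplementary PUA-B
)


def is_private_use(cp: int) -> bool:
    """True if cp falls in one of the three private use ranges."""
    return any(first <= cp <= last for first, last in PRIVATE_USE_RANGES)


total = sum(last - first + 1 for first, last in PRIVATE_USE_RANGES)
print(total)                    # 137468 = 6,400 + 65,534 + 65,534
print(is_private_use(0xF8FF))   # True
print(is_private_use(0xFFFFE))  # False (a noncharacter)
```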

Noncharacters

Unicode designates 66 code points as noncharacters — permanently reserved for internal use and never to be exchanged in open text:

  • U+FDD0 through U+FDEF (32 code points)
  • The last two code points of every plane: U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF (34 code points)

Applications may use these internally (for example, as sentinel values), but conformant processes must not remove them from text they are processing — they should simply not generate them for interchange.
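The two rules above can be expressed as a short predicate, and counting its matches over the whole code space confirms the figure of 66. This is a sketch; is_noncharacter is a hypothetical helper name.

```python
def is_noncharacter(cp: int) -> bool:
    """True if cp is one of the 66 Unicode noncharacters."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The low 16 bits are FFFE or FFFF: the last two code points of a plane.
    return (cp & 0xFFFE) == 0xFFFE


count = sum(is_noncharacter(cp) for cp in range(0x110000))
print(count)                     # 66
print(is_noncharacter(0x1FFFE))  # True
print(is_noncharacter(0xFFFD))   # False (REPLACEMENT CHARACTER)
```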

Key Takeaways

  • A code point is a unique number from U+0000 to U+10FFFF assigned to an abstract character in the Unicode standard.
  • The U+ notation uses hexadecimal, zero-padded to at least four digits.
  • Code points are not the same as bytes — encodings (UTF-8, UTF-16, UTF-32) convert code points into byte sequences.
  • Not all code points are characters: surrogates, noncharacters, and unassigned code points exist within the code point space.
  • Python ord() / chr(), JavaScript codePointAt() / String.fromCodePoint(), and HTML &#xHHHH; all work directly with code points.
  • The full code point space has room for over 1.1 million entries across 17 planes — with plenty of space remaining for the future of human writing.
