What is a Unicode Code Point?
A Unicode code point is the unique number assigned to each character in the Unicode standard, written in the form U+0041. This guide explains what code points are, how they are structured, and how they relate to the bytes stored in a file.
Every character you see on screen — a Latin letter, a Chinese ideograph, an emoji, a mathematical symbol — has a number behind it. In Unicode, that number is called a code point. Code points are the atomic building blocks of the entire Unicode standard: before any encoding happens, before bytes hit the wire, before a font turns shapes into pixels, there is a code point that says this number means this character. Understanding code points is the single most important step toward understanding how text works in modern computing.
Definition
A Unicode code point is a unique non-negative integer assigned to an abstract character (or to a
special-purpose entry like a control code or a reserved slot). The full range runs from 0 to
1,114,111 in decimal, or U+0000 to U+10FFFF in hexadecimal.
The U+ prefix is the standard notation. You will see it everywhere: in the Unicode Character
Database, in programming documentation, in font specifications, and on this site. The digits
after the plus sign are hexadecimal, zero-padded to at least four digits:
| Notation | Decimal | Character | Name |
|---|---|---|---|
| U+0041 | 65 | A | LATIN CAPITAL LETTER A |
| U+00E9 | 233 | é | LATIN SMALL LETTER E WITH ACUTE |
| U+4E2D | 20,013 | 中 | CJK UNIFIED IDEOGRAPH-4E2D |
| U+1F600 | 128,512 | 😀 | GRINNING FACE |
| U+10FFFD | 1,114,109 | (PUA) | Last private-use code point in Plane 16 |
Code points above U+FFFF use five or six hex digits. There is no ambiguity because the U+
prefix signals that everything following it is a hex number.
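The notation rule above can be sketched in a few lines of Python. The helper names `format_code_point` and `parse_code_point` are made up for this illustration and are not part of any standard library; the sketch assumes the input is a plain integer or a string that starts with the U+ prefix.

```python
def format_code_point(cp: int) -> str:
    """Render an integer as U+XXXX, zero-padded to at least four hex digits."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"{cp} is outside the Unicode code point range")
    # 04X pads to four digits but never truncates longer values
    return f"U+{cp:04X}"


def parse_code_point(notation: str) -> int:
    """Turn a string such as 'U+1F600' back into an integer."""
    if not notation.upper().startswith("U+"):
        raise ValueError(f"expected a U+ prefix, got {notation!r}")
    cp = int(notation[2:], 16)
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"{notation} is outside the Unicode code point range")
    return cp


print(format_code_point(65))        # U+0041
print(format_code_point(128512))    # U+1F600
print(format_code_point(0x10FFFD))  # U+10FFFD
print(parse_code_point("U+4E2D"))   # 20013
```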
Code Points Are Not Characters
This is a subtle but critical distinction. A code point is a number. A character is what a human reads. Most of the time the mapping is one-to-one — U+0041 is "A" and nothing else — but there are important exceptions:
- Combining characters: U+0301 (COMBINING ACUTE ACCENT) is a code point, but it is not a standalone character. It modifies the preceding base character. The sequence U+0065 U+0301 produces "é": one visible character built from two code points.
- Surrogate code points: U+D800 through U+DFFF are reserved for UTF-16's internal machinery. They are code points but they are never valid characters.
- Noncharacters: 66 code points (e.g., U+FFFE, U+FFFF, U+1FFFE) are permanently reserved as "noncharacters" that applications may use internally but must never exchange.
- Private Use Areas: Ranges like U+E000–U+F8FF are code points without standard character assignments. Applications can use them for custom symbols, but there is no universal meaning.
The Unicode Standard formalises this with the concept of an abstract character: a unit of information used for the organization, control, or representation of textual data. Most code points are assigned to an abstract character, but as the exceptions above show, not every code point is.
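One way to see these exceptions from code is to ask Python's standard `unicodedata` module for each code point's general category. This is only a rough sketch: the sample code points are picked for illustration, and the category values depend on the Unicode version bundled with your Python build (the ones checked here are stable properties).

```python
import unicodedata

# One representative code point for each case (chosen for this example)
samples = {
    "letter": "\u0041",
    "combining mark": "\u0301",
    "surrogate": "\ud800",
    "noncharacter": "\uffff",
    "private use": "\ue000",
}

# General categories: Lu = uppercase letter, Mn = nonspacing mark,
# Cs = surrogate, Cn = not assigned (includes noncharacters), Co = private use
categories = {label: unicodedata.category(ch) for label, ch in samples.items()}
for label, ch in samples.items():
    print(f"U+{ord(ch):04X}  {categories[label]}  {label}")

# Two code points can also collapse into one precomposed code point under NFC
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))  # 2 1
print(composed == "\u00e9")            # True
```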
The Code Point Space
The 1,114,112 possible code points (0x000000 to 0x10FFFF) are organized into 17 planes, each containing 65,536 (0x10000) code points:
| Plane | Range | Name | Assigned Characters |
|---|---|---|---|
| 0 | U+0000–U+FFFF | Basic Multilingual Plane (BMP) | ~64,000 |
| 1 | U+10000–U+1FFFF | Supplementary Multilingual Plane (SMP) | ~25,000 |
| 2 | U+20000–U+2FFFF | Supplementary Ideographic Plane (SIP) | ~60,000 |
| 3 | U+30000–U+3FFFF | Tertiary Ideographic Plane (TIP) | ~10,000 |
| 4–13 | U+40000–U+DFFFF | Unassigned | 0 |
| 14 | U+E0000–U+EFFFF | Supplementary Special-purpose Plane (SSP) | ~370 |
| 15–16 | U+F0000–U+10FFFF | Private Use Area Planes | 131,068 (PUA) |
As of Unicode 16.0, 154,998 code points have been assigned to characters. That leaves over 800,000 code points available for future allocation: plenty of room for undeciphered historical scripts, newly created writing systems, and, inevitably, more emoji.
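Because every plane holds exactly 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. The following is a small sketch of that arithmetic; `plane_of` and the `PLANE_NAMES` abbreviations are names invented for this example.

```python
# Abbreviations follow the plane table above; PUA-A and PUA-B are planes 15 and 16
PLANE_NAMES = {0: "BMP", 1: "SMP", 2: "SIP", 3: "TIP", 14: "SSP", 15: "PUA-A", 16: "PUA-B"}


def plane_of(cp: int) -> int:
    """Return the plane number (0 to 16) that contains the code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"{cp} is outside the Unicode code point range")
    return cp >> 16  # same as cp // 0x10000


for cp in (0x0041, 0x4E2D, 0x1F600, 0x20000, 0x10FFFD):
    plane = plane_of(cp)
    print(f"U+{cp:04X} -> plane {plane} ({PLANE_NAMES.get(plane, 'unassigned')})")

# 17 planes of 65,536 code points each give the full code point space
print(17 * 0x10000)  # 1114112
```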
Code Points vs. Encodings
A code point is an abstract number. To store it in a file or transmit it over a network, you need an encoding — a rule that converts the number into a sequence of bytes:
| Encoding | How U+0041 (A) is stored | How U+1F600 (😀) is stored |
|---|---|---|
| UTF-8 | 0x41 (1 byte) | 0xF0 0x9F 0x98 0x80 (4 bytes) |
| UTF-16LE | 0x41 0x00 (2 bytes) | 0x3D 0xD8 0x00 0xDE (4 bytes) |
| UTF-32LE | 0x41 0x00 0x00 0x00 (4 bytes) | 0x00 0xF6 0x01 0x00 (4 bytes) |
The code point is always the same — U+0041 or U+1F600. What changes is the byte representation. This is why the Unicode Standard carefully separates the character repertoire (code points) from the encoding forms (UTF-8, UTF-16, UTF-32).
A common mistake is to say "the UTF-8 code for A is 0x41." More precisely: the code point for
A is U+0041 (decimal 65), and UTF-8 encodes that code point as the byte 0x41.
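The byte sequences in the table can be reproduced with Python's built-in `str.encode()`. This is a quick sketch rather than a full encoder; `hex_bytes` is a small formatting helper written for this example.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes the way the table above shows them, e.g. '0x41 0x00'."""
    return " ".join(f"0x{b:02X}" for b in data)


for ch in ("A", "\U0001F600"):
    print(f"U+{ord(ch):04X}")
    # The code point stays the same; only the byte representation changes
    for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
        encoded = ch.encode(encoding)
        print(f"  {encoding:<9} {hex_bytes(encoded)} ({len(encoded)} byte(s))")
```

The explicit `-le` suffix selects little-endian byte order and suppresses the byte order mark that Python's plain `utf-16` and `utf-32` codecs would otherwise prepend.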
Working with Code Points in Code
Python
Python 3 strings are sequences of Unicode code points. The built-in ord() function returns a
code point as an integer, and chr() converts an integer back to a character:
# Getting a code point
print(ord("A")) # 65
print(hex(ord("A"))) # 0x41
print(f"U+{ord('A'):04X}") # U+0041
# Going from code point to character
print(chr(0x0041)) # A
print(chr(0x1F600)) # 😀
# Escape syntax — use \u for BMP, \U for all planes
bmp_char = "\u00e9" # é (U+00E9)
astral_char = "\U0001F600" # 😀 (U+1F600)
# Iterating code points in a string
text = "Cafe\u0301" # "Café": e + combining acute
for i, ch in enumerate(text):
    # !a escapes non-ASCII characters, so the combining mark stays visible
    print(f" [{i}] U+{ord(ch):04X} {ch!a}")
# [0] U+0043 'C'
# [1] U+0061 'a'
# [2] U+0066 'f'
# [3] U+0065 'e'
# [4] U+0301 '\u0301'
JavaScript
JavaScript strings are sequences of UTF-16 code units, not code points. For characters outside the BMP, a single code point is stored as two code units (a surrogate pair):
// Getting a code point
"A".codePointAt(0) // 65
"A".codePointAt(0).toString(16) // "41"
// Going from code point to character
String.fromCodePoint(0x0041) // "A"
String.fromCodePoint(0x1F600) // "😀"
// ⚠️ .length counts UTF-16 code units, not code points
"😀".length // 2 (surrogate pair)
[..."😀"].length // 1 (spread iterates code points)
// Iterating by code point (not code unit)
for (const ch of "Cafe\u0301") {
  console.log(`U+${ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")}`);
}
Rust
Rust's char type represents a single Unicode scalar value: any code point except the
surrogates (U+D800–U+DFFF):
fn main() {
    let ch: char = 'A';
    println!("U+{:04X}", ch as u32); // U+0041
    // Iterating code points
    for ch in "Cafe\u{0301}".chars() {
        println!("U+{:04X}", ch as u32);
    }
}
HTML
In HTML, you can insert any character by its code point using numeric character references:
<!-- Decimal reference -->
&#65; <!-- A -->
&#128512; <!-- 😀 -->
<!-- Hexadecimal reference -->
&#x41; <!-- A -->
&#x1F600; <!-- 😀 -->
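If you generate HTML from a program, the same references can be built and decoded with Python's standard `html` module. In this sketch, `to_numeric_reference` is a helper name invented for the example; `html.unescape()` is the standard library call that resolves character references.

```python
import html


def to_numeric_reference(ch: str, hexadecimal: bool = True) -> str:
    """Build an HTML numeric character reference for a single code point."""
    cp = ord(ch)  # ord() raises TypeError if ch is not exactly one code point
    return f"&#x{cp:X};" if hexadecimal else f"&#{cp};"


print(to_numeric_reference("A"))                     # &#x41;
print(to_numeric_reference("A", hexadecimal=False))  # &#65;
print(to_numeric_reference("\U0001F600"))            # &#x1F600;

# Decoding goes the other way: both forms resolve to the same code point
print(html.unescape("&#128512;") == html.unescape("&#x1F600;"))  # True
```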
Surrogate Code Points (U+D800–U+DFFF)
The 2,048 code points from U+D800 to U+DFFF are permanently reserved as surrogates. UTF-16 uses pairs of surrogates to represent code points above U+FFFF:
- A high surrogate (U+D800–U+DBFF) followed by a low surrogate (U+DC00–U+DFFF) together encode one code point in the range U+10000–U+10FFFF.
- These code points never appear alone in well-formed Unicode text.
- They are not characters and have no visual representation.
For example, U+1F600 (😀) is encoded in UTF-16 as the surrogate pair 0xD83D 0xDE00. The
formula is:
High surrogate = 0xD800 + ((code_point - 0x10000) >> 10)
Low surrogate = 0xDC00 + ((code_point - 0x10000) & 0x3FF)
For U+1F600:
- 0x1F600 - 0x10000 = 0xF600
- High: 0xD800 + (0xF600 >> 10) = 0xD800 + 0x3D = 0xD83D
- Low: 0xDC00 + (0xF600 & 0x3FF) = 0xDC00 + 0x200 = 0xDE00
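The formula and its inverse can be written as a pair of small Python functions. This is a minimal sketch of the arithmetic above, not a full UTF-16 codec; the function names are chosen for this example.

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = cp - 0x10000                # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(f"0x{high:04X} 0x{low:04X}")                 # 0xD83D 0xDE00
print(f"U+{from_surrogate_pair(high, low):04X}")   # U+1F600
```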
Private Use Code Points
Three ranges are reserved for private use — applications can assign them any meaning they want:
| Range | Name | Code Points |
|---|---|---|
| U+E000–U+F8FF | BMP Private Use Area | 6,400 |
| U+F0000–U+FFFFD | Supplementary PUA-A | 65,534 |
| U+100000–U+10FFFD | Supplementary PUA-B | 65,534 |
Companies and communities use these for custom symbols. For example, Apple maps U+F8FF to the Apple logo in its system fonts. Conlang communities use them for constructed scripts. The key limitation is that PUA characters have no universal meaning: they only work when sender and receiver agree on the mapping.
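A range check is enough to detect private-use code points. The sketch below hard-codes the three ranges, taking U+FFFFD and U+10FFFD as the last private-use code points of planes 15 and 16 (the final two code points of each plane belong to the noncharacter set); `is_private_use` is a name made up for this example.

```python
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),      # BMP Private Use Area
    (0xF0000, 0xFFFFD),    # Supplementary PUA-A (plane 15)
    (0x100000, 0x10FFFD),  # Supplementary PUA-B (plane 16)
)


def is_private_use(cp: int) -> bool:
    """Return True if the code point lies in one of the private-use ranges."""
    return any(first <= cp <= last for first, last in PRIVATE_USE_RANGES)


# Range sizes match the table above
sizes = [last - first + 1 for first, last in PRIVATE_USE_RANGES]
print(sizes)                    # [6400, 65534, 65534]

print(is_private_use(0xF8FF))   # True
print(is_private_use(0x0041))   # False
```

An alternative is to test `unicodedata.category(chr(cp)) == "Co"`, which reads the same information from the Unicode Character Database.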
Noncharacters
Unicode designates 66 code points as noncharacters — permanently reserved for internal use and never to be exchanged in open text:
- U+FDD0 through U+FDEF (32 code points)
- The last two code points of every plane: U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF (34 code points)
Applications may use these internally (for example, as sentinel values), but conformant processes must not remove them from text they are processing — they should simply not generate them for interchange.
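Both groups can be tested with simple arithmetic, and counting over the whole code point space confirms the total of 66. This is a small sketch; `is_noncharacter` is a helper name chosen for this example.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True for the 66 code points designated as noncharacters."""
    in_fdd0_block = 0xFDD0 <= cp <= 0xFDEF
    # The low 16 bits are FFFE or FFFF for the last two code points of a plane
    last_two_of_plane = (cp & 0xFFFE) == 0xFFFE
    return in_fdd0_block or last_two_of_plane


count = sum(is_noncharacter(cp) for cp in range(0x110000))
print(count)                      # 66

print(is_noncharacter(0xFFFE))    # True
print(is_noncharacter(0x1FFFF))   # True
print(is_noncharacter(0xFFFD))    # False (REPLACEMENT CHARACTER)
```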
Key Takeaways
- A code point is a unique number from U+0000 to U+10FFFF assigned to an abstract character in the Unicode standard.
- The U+ notation uses hexadecimal, zero-padded to at least four digits.
- Code points are not the same as bytes: encodings (UTF-8, UTF-16, UTF-32) convert code points into byte sequences.
- Not all code points are characters: surrogates, noncharacters, and unassigned code points exist within the code point space.
- Python ord()/chr(), JavaScript codePointAt()/String.fromCodePoint(), and HTML &#xHHHH; all work directly with code points.
- The full code point space has room for over 1.1 million entries across 17 planes, with plenty of space remaining for the future of human writing.