What is a Code Unit?

The minimal unit of encoding: an 8-bit byte in UTF-8, a 16-bit word in UTF-16, a 32-bit word in UTF-32. A single character may require multiple code units.

What is Unicode?

The universal character encoding standard, which assigns a unique number (a code point) to every character in every writing system. Version 16.0 contains 154,998 assigned characters.

What is a Code Space?

The complete range of possible Unicode code points: U+0000 to U+10FFFF (1,114,112 in total), divided into 17 planes of 65,536 code points each.

Unicode Standard

Code Point

A numeric value in the Unicode code space (U+0000 to U+10FFFF), written as U+XXXX. Not every code point is assigned to a character.

2021-05-17 · Updated 2024-12-03

What is a Code Point?

A code point is the fundamental unit of the Unicode standard: a unique integer assigned to a single character, symbol, or abstract entity. Code points are written in the format U+XXXX where the Xs are hexadecimal digits — for example, U+0041 for the Latin capital letter A, or U+1F600 for the grinning face emoji 😀.

The Unicode code space spans from U+0000 to U+10FFFF, providing 1,114,112 possible positions. Not every position is occupied: as of Unicode 16.0, 154,998 characters are assigned. The remaining code points are reserved for future assignment, set aside for private use or as surrogates, or permanently designated as noncharacters.
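To make the distinction between occupied and unoccupied positions concrete, here is a minimal sketch using Python's standard `unicodedata` module. It assumes that the general category `Cn` is what the interpreter's bundled Unicode database reports for both reserved code points and noncharacters. The helper name `has_assigned_category` is illustrative, and results for recently added characters depend on the Unicode version your Python ships with.

```python
import unicodedata

def has_assigned_category(cp: int) -> bool:
    """Return True if the code point's general category is not Cn.

    Cn covers reserved (unassigned) code points and noncharacters.
    Surrogates (Cs) and private-use code points (Co) return True here,
    even though they never represent a standard character.
    """
    return unicodedata.category(chr(cp)) != "Cn"

print(has_assigned_category(0x0041))  # LATIN CAPITAL LETTER A
print(has_assigned_category(0x0378))  # reserved gap in the Greek and Coptic block
print(has_assigned_category(0xFFFF))  # noncharacter
print(unicodedata.unidata_version)    # Unicode version bundled with this Python
```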

Anatomy of a Code Point Notation

U+1F600
│ └─┬─┘
│   └── Hexadecimal value (4 digits in the BMP, 5 or 6 for supplementary code points)
└────── "U+" prefix (Unicode notation)

Decimal equivalent: 128,512
Binary: 1 1111 0110 0000 0000
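The same conversions can be reproduced in a few lines of Python. This is a small sketch rather than a canonical formatter; `format_code_point` is a hypothetical helper name, and the padding rule (at least four hex digits) simply follows the U+XXXX convention described here.

```python
def format_code_point(cp: int) -> str:
    """Format an integer as U+XXXX, padding to at least four hex digits."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"{cp} is outside the Unicode code space")
    return f"U+{cp:04X}"

cp = ord("\U0001F600")               # grinning face
print(format_code_point(cp))         # U+1F600
print(cp)                            # 128512
print(f"{cp:b}")                     # 11111011000000000
print(format_code_point(ord("A")))   # U+0041
```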

The "U+" prefix is a notation convention; it is not part of the value itself. When code points appear in source code or markup, each language uses its own escape syntax instead:

Language	Escape syntax	Example (U+1F600)
Python	`\U0001F600`	`"\U0001F600"`
JavaScript	`\u{1F600}`	`"\u{1F600}"`
Java	surrogate pair	`"\uD83D\uDE00"`
CSS	`\1F600`	`content: "\1F600"`
HTML	`&#x1F600;`	`&#x1F600;`
Rust	`\u{1F600}`	`'\u{1F600}'`
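The Java row uses two escapes because `\u` in Java takes exactly four hex digits, so a supplementary code point has to be written as a UTF-16 surrogate pair. As a rough sketch of where `\uD83D\uDE00` comes from, the standard UTF-16 arithmetic can be written as follows; `to_surrogate_pair` is an illustrative name, not a library function.

```python
def to_surrogate_pair(cp: int):
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    offset = cp - 0x10000              # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(f"\\u{high:04X}\\u{low:04X}")    # \uD83D\uDE00

# Python's own escape forms all yield the same one-code-point string.
assert "\U0001F600" == chr(0x1F600) == "\N{GRINNING FACE}"
```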

Code Points vs Characters

A code point is not always identical to what a user perceives as a "character" (called a grapheme cluster). Consider:

é  can be represented as:
  U+00E9  (LATIN SMALL LETTER E WITH ACUTE — single code point)
  U+0065 + U+0301  (e + combining acute accent — two code points)

🏳️‍🌈  (rainbow flag) is:
  U+1F3F3 + U+FE0F + U+200D + U+1F308  (four code points, one visible character)

This distinction matters for string length calculations, cursor movement, and text editing.
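A short Python sketch of the two examples above, using only the standard `unicodedata` module. It shows that the two spellings of é compare unequal until normalized, and that the flag is four code points. Counting grapheme clusters themselves is not shown, since the standard library has no segmentation function and that would need a third-party package.

```python
import unicodedata

precomposed = "\u00E9"   # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

print(len(precomposed), len(decomposed))                        # 1 2
print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

flag = "\U0001F3F3\uFE0F\u200D\U0001F308"  # rainbow flag sequence
print([f"U+{ord(ch):04X}" for ch in flag])
```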

Ranges and Planes

Code points are organized into 17 planes of 65,536 values each:

Plane 0 (U+0000–U+FFFF): Basic Multilingual Plane — most everyday characters
Plane 1 (U+10000–U+1FFFF): Supplementary Multilingual Plane — historic scripts, emoji
Plane 2 (U+20000–U+2FFFF): Supplementary Ideographic Plane — CJK extension ideographs
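Plane 3 (U+30000–U+3FFFF): Tertiary Ideographic Plane (further CJK extension ideographs)
Planes 4–13 (U+40000–U+DFFFF): Unassigned
Plane 14 (U+E0000–U+EFFFF): Supplementary Special-purpose Plane (tag characters, supplementary variation selectors)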
Planes 15–16 (U+F0000–U+10FFFF): Private Use Areas
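Because each plane holds exactly 65,536 (0x10000) values, the plane number is just the code point shifted right by 16 bits. A minimal sketch, with `plane_of` and `plane_range` as hypothetical helper names:

```python
def plane_of(cp: int) -> int:
    """Return the plane number (0-16) that contains the code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"{cp} is outside the Unicode code space")
    return cp >> 16

def plane_range(plane: int):
    """Return the (first, last) code points of a plane."""
    if not 0 <= plane <= 16:
        raise ValueError("Unicode has planes 0 through 16")
    first = plane << 16
    return first, first | 0xFFFF

print(plane_of(0x0041))    # 0  (BMP)
print(plane_of(0x1F600))   # 1  (Supplementary Multilingual Plane)
print(plane_of(0x10FFFF))  # 16
print(17 * 0x10000)        # 1114112 code points in total
```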

Common Pitfalls

String length confusion: In many languages, length returns the number of code units (bytes or 16-bit words), not code points, and neither matches the number of visible characters.

# Python 3
s = "😀"
len(s)          # 1 (one code point)
len(s.encode()) # 4 (bytes in UTF-8)

// JavaScript
"😀".length       // 2 (UTF-16 surrogate pair!)
[..."😀"].length  // 1 (iterating by code point)
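To compare the counts side by side, the following sketch reports the number of code points and the number of code units in each encoding form for a given string. `code_unit_counts` is an illustrative helper; it relies on Python 3 strings being sequences of code points and on the little-endian codecs not emitting a byte order mark.

```python
def code_unit_counts(text: str) -> dict:
    """Count code points and code units per encoding form."""
    return {
        "code_points": len(text),
        "utf8_units": len(text.encode("utf-8")),             # 8-bit units
        "utf16_units": len(text.encode("utf-16-le")) // 2,   # 16-bit units
        "utf32_units": len(text.encode("utf-32-le")) // 4,   # 32-bit units
    }

print(code_unit_counts("\U0001F600"))  # grinning face
print(code_unit_counts("\u00E9"))      # precomposed é
```

The UTF-16 count for the emoji matches what JavaScript's `length` reports, which is why the two languages disagree on the same string.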

BMP assumption: Legacy code that assumes all characters fit in 16 bits (a BMP-only assumption) breaks on emoji, historic scripts, and rare CJK extensions.
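One hedged way to audit input for this problem is to scan for code points above U+FFFF before handing text to a component that may be BMP-only. The function name `non_bmp_chars` is made up for this sketch.

```python
def non_bmp_chars(text: str):
    """List (index, U+XXXX) for every code point outside the BMP."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) > 0xFFFF
    ]

sample = "A\U0001F600\U00020000"   # A, grinning face, a CJK Extension B ideograph
print(non_bmp_chars(sample))       # [(1, 'U+1F600'), (2, 'U+20000')]
```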

Quick Facts

Property	Value
Notation	U+XXXX (4–6 hex digits)
Minimum	U+0000 (NULL)
Maximum	U+10FFFF
Total possible	1,114,112
Assigned (v16.0)	154,998 characters
First emoji	U+00AE ® (registered sign, Unicode 1.1)
Highest assigned emoji	U+1FAF8 (rightwards pushing hand, v15.0)

Related terms

Code unit, Unicode, Code space
