What is a code point?

A numeric value in the Unicode code space (U+0000 to U+10FFFF), written as U+ followed by four to six hexadecimal digits (for example, U+0041). Not all code points are assigned to characters.

What is the Unicode Consortium?

A nonprofit organization that develops and maintains the Unicode Standard. Members include Apple, Google, Microsoft, Meta, and many other companies.

What is UTF-8?

A variable-length Unicode encoding that uses 1 to 4 bytes per character. It is the dominant encoding on the web (more than 98% of websites) and is fully backward compatible with ASCII.

Unicode

Universal character encoding standard that assigns a unique number (code point) to every character in every writing system. Version 16.0 contains 154,998 assigned characters.

2021-05-01 · Updated 2024-11-18

What is Unicode?

Unicode is the universal character encoding standard that assigns a unique number — called a code point — to every character in every writing system on Earth. Before Unicode existed, computers relied on hundreds of incompatible encoding systems: Windows-1252 for Western Europe, Shift-JIS for Japanese, GB2312 for Simplified Chinese. Moving text between these systems produced mojibake (文字化け), garbled output caused by each system interpreting the same byte sequence differently.
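The mojibake effect described above is easy to reproduce. The following is a minimal sketch using only Python's built-in codecs; the sample byte and the word "café" are illustrative choices, not taken from any particular system.

```python
# The same byte means different things under different legacy encodings.
raw = b"\xe9"                        # a single byte with value 0xE9
as_western = raw.decode("cp1252")    # Windows-1252 (Western Europe)
as_cyrillic = raw.decode("cp1251")   # Windows-1251 (Cyrillic)
print(as_western, as_cyrillic)       # é й

# Classic mojibake: UTF-8 bytes misread as Windows-1252.
original = "café"
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)                       # cafÃ©

# Reversing the wrong decode recovers the text, provided no bytes were lost.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)          # True
```

The round trip at the end works only because every byte survived the wrong decode; once a system replaces undecodable bytes with a placeholder, the original text cannot be recovered.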

Unicode solved this by establishing a single, shared namespace: one number, one character, no ambiguity. The standard covers scripts from Latin to Arabic, emoji, mathematical symbols, ancient scripts like Linear B, and even private-use areas for custom characters.

How Unicode Works

Unicode separates two concerns that older encodings conflated:

The character repertoire — which characters exist and what their code points are
The encoding form — how those code points are serialized into bytes (UTF-8, UTF-16, UTF-32)

This separation means you can transmit Unicode text in the encoding best suited to your context. UTF-8 is dominant on the web; UTF-16 is used internally by Java, JavaScript, and Windows; UTF-32 offers fixed-width simplicity for internal processing.
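To make the split between repertoire and encoding form concrete, the sketch below encodes one short string three ways with Python's standard codecs. The sample string is an arbitrary choice, and the "-le" codec variants are used here only so that no byte order mark inflates the byte counts.

```python
text = "Hi 世🌍"

# Code points are the same no matter which encoding form is chosen.
code_points = [f"U+{ord(ch):04X}" for ch in text]
print(code_points)  # ['U+0048', 'U+0069', 'U+0020', 'U+4E16', 'U+1F30D']

# The "-le" variants fix the byte order and omit the byte order mark (BOM),
# so the sizes below count only the encoded characters.
sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
print(sizes)        # {'utf-8': 10, 'utf-16-le': 12, 'utf-32-le': 20}

# Per-character view of the UTF-8 bytes (1, 1, 1, 3, and 4 bytes).
for ch in text:
    print(f"U+{ord(ch):04X}", ch.encode("utf-8").hex(" "))
```

UTF-8 is the most compact here because most of the string is ASCII; text dominated by CJK characters would tip the balance toward UTF-16.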

The Unicode Standard

The Unicode Standard is a living specification maintained by the Unicode Consortium. Each version adds new characters, scripts, and emoji. Version 16.0 (September 2024) contains 154,998 assigned characters across 168 scripts. The standard defines not just code points, but also:

Character properties: General category (letter, digit, punctuation...), bidirectional class, combining class, case mappings, and dozens more
Algorithms: Unicode Bidirectional Algorithm (UBA) for mixed-direction text, Unicode Collation Algorithm (UCA) for sorting, line-breaking rules, normalization forms
Named sequences: Pre-defined sequences of code points with official names
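Several of these properties can be inspected from Python's standard `unicodedata` module. This is a small sketch rather than a full tour: it covers only the properties that module exposes, and the values reflect whichever Unicode version is bundled with the running interpreter.

```python
import unicodedata

# Character properties
print(unicodedata.name("界"))            # CJK UNIFIED IDEOGRAPH-754C
print(unicodedata.category("A"))         # Lu (letter, uppercase)
print(unicodedata.category("5"))         # Nd (number, decimal digit)
print(unicodedata.bidirectional("א"))    # R (strong right-to-left)
print(unicodedata.combining("\u0301"))   # 230 (combining acute accent)
print("ß".upper())                       # SS (a case mapping that changes length)

# Normalization: "e" + combining acute composes to a single code point in NFC.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))    # 2 1
print(composed == "\u00e9")              # True
```

Collation and the bidirectional algorithm are not part of the Python standard library, so they are left out of this sketch.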

Concrete Examples

# Python 3: every str is a sequence of Unicode code points
s = "Hello, 世界! 🌍"
print(len(s))        # 12 code points
print(s[7])          # 世
print(ord(s[7]))     # 19990 (decimal) = 0x4E16
print(f"U+{ord(s[7]):04X}")  # U+4E16

// JavaScript: strings are sequences of UTF-16 code units
const s = "Hello, 世界! 🌍";
console.log(s.length);          // 13 (🌍 counts as 2 UTF-16 code units)
console.log([...s].length);     // 12 (the string iterator yields code points)
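The two code units that JavaScript reports for 🌍 come from the UTF-16 surrogate mechanism. The sketch below works through that arithmetic, along with the UTF-8 length rule, in Python. The helper names `utf8_length` and `utf16_units` are hypothetical, and both assume a valid Unicode scalar value as input (no range or surrogate checks).

```python
def utf8_length(cp):
    """Bytes needed to encode one code point in UTF-8."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4

def utf16_units(cp):
    """UTF-16 code units for one code point, as a list of integers."""
    if cp < 0x10000:
        return [cp]                      # BMP: one 16-bit unit
    offset = cp - 0x10000                # 20 bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return [high, low]

for cp in (0x41, 0xE9, 0x4E16, 0x1F30D):
    units = " ".join(f"{u:04X}" for u in utf16_units(cp))
    print(f"U+{cp:04X}: {utf8_length(cp)} UTF-8 byte(s), UTF-16 units {units}")
    # Cross-check against Python's built-in encoders.
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
    assert len(utf16_units(cp)) * 2 == len(chr(cp).encode("utf-16-le"))
```

For U+1F30D (🌍) the loop reports 4 UTF-8 bytes and the surrogate pair D83C DF0D.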

Common Misconceptions

"Unicode is an encoding": Unicode is a character set standard; UTF-8, UTF-16, and UTF-32 are the encodings that serialize Unicode code points into bytes.

"Unicode only covers modern scripts": Unicode includes dozens of historic scripts (Egyptian Hieroglyphs, Cuneiform, Old Persian) and some invented scripts such as Deseret and Shavian. Proposals for Tengwar exist but have not been accepted.

"All Unicode characters fit in 2 bytes": Only the Basic Multilingual Plane (U+0000–U+FFFF) fits in a single 16-bit code unit. Characters above U+FFFF require 4 bytes in UTF-8 and a surrogate pair (two 16-bit code units) in UTF-16. Even within the BMP, UTF-8 uses up to 3 bytes per character.

Quick Facts

Property	Value
First version	Unicode 1.0 (1991)
Current version	16.0 (September 2024)
Total code space	1,114,112 code points (U+0000–U+10FFFF)
Assigned characters (v16.0)	154,998
Number of scripts	168
Maintained by	Unicode Consortium
Synchronized standard	ISO/IEC 10646
Dominant web encoding	UTF-8 (98%+ of websites)
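The code space figure in the table follows from simple arithmetic. The sketch below reproduces it, together with two related counts: the 2,048 surrogate code points and the 66 noncharacters. The constant names are illustrative.

```python
PLANES = 17
PLANE_SIZE = 0x10000                     # 65,536 code points per plane

code_space = PLANES * PLANE_SIZE
print(f"{code_space:,}")                 # 1,114,112
print(code_space == 0x10FFFF + 1)        # True

# Surrogates (U+D800 to U+DFFF) are code points but never characters.
surrogates = 0xDFFF - 0xD800 + 1
scalar_values = code_space - surrogates
print(f"{surrogates:,} {scalar_values:,}")   # 2,048 1,112,064

# Noncharacters: U+FDD0 to U+FDEF, plus the last two code points of each plane.
noncharacters = (0xFDEF - 0xFDD0 + 1) + 2 * PLANES
print(noncharacters)                     # 66
```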

Related Terms

Code point · Unicode Consortium · UTF-8

More in Unicode Standard

Basic Multilingual Plane (BMP)

Plane 0 (U+0000–U+FFFF), containing the most commonly used characters, including Latin, Greek, Cyrillic, …

Abstract character

A unit of information used to organize, control, or represent textual data …

Assigned character

A code point that has been assigned a character designation in …

CJK

Chinese, Japanese, and Korean: the collective term for the block of …

Unicode Consortium

Nonprofit organization that develops and maintains the Unicode Standard. The …

Code space

The full range of possible Unicode code points: U+0000 to U+10FFFF …

Han Unification

The process of mapping Chinese, Japanese, and Korean ideographs that share a …

Hangul Jamo

The individual consonant and vowel components (jamo) of the Korean Hangul writing …

ISO 10646 / Universal Character Set

International standard (ISO/IEC 10646) synchronized with Unicode, defining the same repertoire …

Noncharacter

Code points permanently reserved for internal use (66 in total): U+FDD0–U+FDEF …
