Unicode
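The universal character encoding standard that assigns a unique number (code point) to every character in every writing system; version 16.0 contains 154,998 assigned characters.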
What is Unicode?
Unicode is the universal character encoding standard that assigns a unique number — called a code point — to every character in every writing system on Earth. Before Unicode existed, computers relied on hundreds of incompatible encoding systems: Windows-1252 for Western Europe, Shift-JIS for Japanese, GB2312 for Simplified Chinese. Moving text between these systems produced mojibake (文字化け), garbled output caused by each system interpreting the same byte sequence differently.
Unicode solved this by establishing a single, shared namespace: one number, one character, no ambiguity. The standard covers scripts from Latin to Arabic, emoji, mathematical symbols, ancient scripts such as Linear B, and even Private Use Areas for custom characters.
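The mojibake described above is easy to reproduce. The following is a minimal sketch, assuming a sample word and a UTF-8/Windows-1252 mismatch chosen purely for illustration; any pair of incompatible encodings would show the same effect.

```python
# Illustrative only: the sample word and the cp1252 mismatch are assumptions.
original = "caf\u00e9"  # "café" with a precomposed é (U+00E9)

# Serialize the text with one encoding...
raw = original.encode("utf-8")  # é becomes the two bytes C3 A9

# ...then decode the same bytes with another. Each byte of the
# two-byte é is interpreted as a separate Windows-1252 character.
garbled = raw.decode("cp1252")
print(garbled)  # cafÃ©

# No bytes were lost, so reversing the wrong step recovers the text.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)  # True
```

The repair works here only because every byte survived the wrong decode; once garbled text has been re-saved lossily, the original may not be recoverable.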
How Unicode Works
Unicode separates two concerns that older encodings conflated:
- The character repertoire — which characters exist and what their code points are
- The encoding form — how those code points are serialized into bytes (UTF-8, UTF-16, UTF-32)
This separation means you can transmit Unicode text in the encoding best suited to your context. UTF-8 is dominant on the web; UTF-16 is used internally by Java, JavaScript, and Windows; UTF-32 offers fixed-width simplicity for internal processing.
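To make the split between code points and encoding forms concrete, here is a small sketch that serializes the same characters three ways. The helper name `encoded_forms` and the sample characters are hypothetical, introduced only for this illustration.

```python
def encoded_forms(ch):
    """Return the bytes of one character in three Unicode encoding forms.

    The big-endian codec names are used so that no byte order mark (BOM)
    is prepended to the output.
    """
    return {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}


for ch in "A界🌍":
    forms = encoded_forms(ch)
    sizes = ", ".join(f"{name}: {len(data)} bytes" for name, data in forms.items())
    print(f"U+{ord(ch):04X}  {sizes}")

# The code point of 界 is always U+754C; only its serialization changes.
print(encoded_forms("界")["utf-8"].hex())      # e7958c
print(encoded_forms("界")["utf-16-be"].hex())  # 754c
print(encoded_forms("界")["utf-32-be"].hex())  # 0000754c
```

UTF-8 spends one byte on ASCII and up to four on other characters, UTF-16 uses two or four, and UTF-32 always uses four, which is the trade-off behind the usage patterns described above.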
The Unicode Standard
The Unicode Standard is a living specification maintained by the Unicode Consortium. Each version adds new characters, scripts, and emoji. Version 16.0 (September 2024) contains 154,998 assigned characters across 168 scripts. The standard defines not just code points, but also:
- Character properties: General category (letter, digit, punctuation...), bidirectional class, combining class, case mappings, and dozens more
- Algorithms: Unicode Bidirectional Algorithm (UBA) for mixed-direction text, Unicode Collation Algorithm (UCA) for sorting, line-breaking rules, normalization forms
- Named sequences: Pre-defined sequences of code points with official names
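Several of these properties and algorithms can be inspected from Python's standard `unicodedata` module. The sketch below is only a sampling; the characters were picked for illustration, and the data reflects whichever Unicode version ships with the interpreter in use.

```python
import unicodedata

# Character properties
name = unicodedata.name("\u754c")               # official character name
upper_cat = unicodedata.category("A")           # general category
digit_cat = unicodedata.category("7")
bidi_class = unicodedata.bidirectional("\u05d0")  # Hebrew letter alef
combining = unicodedata.combining("\u0301")     # combining acute accent

print(name)        # CJK UNIFIED IDEOGRAPH-754C
print(upper_cat)   # Lu (uppercase letter)
print(digit_cat)   # Nd (decimal digit)
print(bidi_class)  # R (right-to-left)
print(combining)   # 230

# Normalization: "e" + combining acute composes to a single code point in NFC.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))  # 2 1
print(composed == "\u00e9")            # True

# Version of the Unicode Character Database bundled with this Python build.
print(unicodedata.unidata_version)
```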
Concrete Examples
# Python: every string is Unicode by default (Python 3)
s = "Hello, 世界! 🌍"
print(len(s)) # 12 code points
print(s[8]) # 界
print(ord(s[8])) # 30028 (decimal) = U+754C
print(f"U+{ord(s[8]):04X}") # U+754C
// JavaScript: strings are UTF-16 internally
const s = "Hello, 世界! 🌍";
console.log(s.length); // 13 (🌍 counts as 2 UTF-16 code units)
console.log([...s].length); // 12 (the string iterator steps by code point, not by UTF-16 code unit)
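The `s.length` of 13 in the JavaScript example comes from 🌍 (U+1F30D) lying above U+FFFF, so UTF-16 stores it as a surrogate pair. The arithmetic can be sketched in Python as follows; the function name `to_surrogate_pair` is hypothetical and not part of any library.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(ord("🌍"))
print(f"{high:04X} {low:04X}")  # D83C DF0D

# Cross-check against Python's own UTF-16 encoder.
expected = high.to_bytes(2, "big") + low.to_bytes(2, "big")
print("🌍".encode("utf-16-be") == expected)  # True

# The same character takes 4 bytes in UTF-8; a BMP character such as 界 takes 3.
print(len("🌍".encode("utf-8")), len("界".encode("utf-8")))  # 4 3
```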
Common Misconceptions
"Unicode is an encoding": Unicode is a character set standard; UTF-8, UTF-16, and UTF-32 are the encodings that serialize Unicode code points into bytes.
"Unicode only covers modern scripts": Unicode includes dozens of historic scripts (Egyptian Hieroglyphs, Cuneiform, Old Persian) and a few invented ones such as Deseret and Shavian. Proposals to encode Tengwar exist but have not been accepted.
"All Unicode characters fit in 2 bytes": Only the Basic Multilingual Plane (U+0000–U+FFFF) fits in 16 bits. Characters above U+FFFF require 4 bytes in UTF-8 and a surrogate pair (two 16-bit code units) in UTF-16.
Quick Facts
| Property | Value |
|---|---|
| First version | Unicode 1.0 (1991) |
| Current version | 16.0 (September 2024) |
| Total code space | 1,114,112 code points (U+0000–U+10FFFF) |
| Assigned characters (v16.0) | 154,998 |
| Number of scripts | 168 |
| Maintained by | Unicode Consortium |
| Synchronized standard | ISO/IEC 10646 |
| Dominant web encoding | UTF-8 (98%+ of websites) |
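The code-space figure in the table can be checked with a little arithmetic. This sketch assumes the usual description of the code space as 17 planes of 65,536 code points each, which is not stated in the table itself, and uses the surrogate range U+D800–U+DFFF.

```python
PLANES = 17                     # assumption: planes 0 through 16
CODE_POINTS_PER_PLANE = 0x10000  # 65,536

total = PLANES * CODE_POINTS_PER_PLANE
print(f"{total:,}")           # 1,114,112
print(total == 0x10FFFF + 1)  # True

# Surrogate code points are reserved for UTF-16 and never represent characters.
surrogates = 0xDFFF - 0xD800 + 1
scalar_values = total - surrogates
print(f"{surrogates:,}")      # 2,048
print(f"{scalar_values:,}")   # 1,112,064

# Share of the code space assigned as of version 16.0.
assigned = 154_998
print(f"{assigned / total:.1%}")  # 13.9%
```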
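Related Terms
More in Unicode Standard
CJK (Chinese, Japanese, Korean): the collective term for the unified Han ideograph blocks and related scripts in Unicode; CJK Unified Ideographs comprise more than 20,992 characters.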
The process of mapping Chinese, Japanese, and Korean ideographs that share a …
The individual consonant and vowel components (jamo) of the Korean Hangul writing …
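The international standard synchronized with Unicode (ISO/IEC 10646), which defines the same character set and code points but does not include Unicode's additional algorithms and properties.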
Normative or informative documents that are integral parts of the Unicode Standard. …
Informational documents published by the Unicode Consortium covering specific topics like security …
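The collection of machine-readable data files that define all Unicode character properties, including UnicodeData.txt, Blocks.txt, Scripts.txt, and others.
All code points except the surrogate code points (U+D800–U+DFFF); the set of valid values that can represent actual characters, 1,112,064 in total.
Major releases of the Unicode Standard; each release adds characters, scripts, and features. The current version is Unicode 16.0 (September 2024).
The policy guaranteeing that once a character is assigned, its code point and name never change. Properties may be refined, but the assignment is permanent.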