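Code Unit
The smallest unit of an encoding: an 8-bit byte in UTF-8, a 16-bit word in UTF-16, and a 32-bit word in UTF-32. A single character may require more than one code unit.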
What is a Code Unit?
A code unit is the minimal bit combination used in a Unicode encoding. Different encodings use different code unit sizes:
- UTF-8: 8-bit code units (bytes)
- UTF-16: 16-bit code units (2-byte words)
- UTF-32: 32-bit code units (4-byte words)
A code unit is not the same as a code point. Code points are abstract Unicode values (U+0000–U+10FFFF); code units are the concrete building blocks that encodings use to represent those values. One code point may require one or more code units depending on the encoding and the code point's value.
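As a minimal sketch of the distinction, the following Python snippet counts code points and code units for the same text. The helper name `code_unit_counts` is hypothetical; it relies only on Python's built-in codecs, and the `-le` codec variants are chosen so that no byte order mark is added to the output.

```python
def code_unit_counts(text: str) -> dict:
    """Count code points and code units of `text` in each encoding form."""
    return {
        "code_points": len(text),                      # Python str is indexed by code point
        "utf8": len(text.encode("utf-8")),             # 1 byte per UTF-8 code unit
        "utf16": len(text.encode("utf-16-le")) // 2,   # 2 bytes per UTF-16 code unit
        "utf32": len(text.encode("utf-32-le")) // 4,   # 4 bytes per UTF-32 code unit
    }

print(code_unit_counts("Aé中😀"))
# {'code_points': 4, 'utf8': 10, 'utf16': 5, 'utf32': 4}
```

The same four code points need 10, 5, or 4 code units depending on the encoding.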
Code Units by Encoding
UTF-8 (8-bit code units)
UTF-8 uses 1 to 4 bytes per code point, following a variable-length scheme:
| Code point range | Code units | Byte pattern |
|---|---|---|
| U+0000–U+007F | 1 | 0xxxxxxx |
| U+0080–U+07FF | 2 | 110xxxxx 10xxxxxx |
| U+0800–U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx |
| U+10000–U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
'A' (U+0041) → 1 code unit: 0x41
'é' (U+00E9) → 2 code units: 0xC3 0xA9
'中' (U+4E2D) → 3 code units: 0xE4 0xB8 0xAD
'😀' (U+1F600) → 4 code units: 0xF0 0x9F 0x98 0x80
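The byte patterns in the table can be applied by hand. The sketch below is an illustrative encoder, not a replacement for `str.encode`; the function name `utf8_code_units` is an assumption made for this example. It checks its own output against Python's built-in UTF-8 codec.

```python
def utf8_code_units(code_point: int) -> bytes:
    """Encode one Unicode scalar value into UTF-8 code units (bytes)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {code_point:#x}")
    if code_point <= 0x7F:        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

for ch in "Aé中😀":
    units = utf8_code_units(ord(ch))
    assert units == ch.encode("utf-8")  # agrees with the built-in codec
    print(f"U+{ord(ch):04X}", " ".join(f"0x{b:02X}" for b in units))
```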
UTF-16 (16-bit code units)
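BMP characters (U+0000–U+FFFF) use 1 code unit. Supplementary characters (U+10000–U+10FFFF) use 2 code units, a surrogate pair made of a high surrogate (0xD800–0xDBFF) followed by a low surrogate (0xDC00–0xDFFF):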
'A' (U+0041) → 1 code unit: 0x0041
'中' (U+4E2D) → 1 code unit: 0x4E2D
'😀' (U+1F600) → 2 code units: 0xD83D 0xDE00 (surrogate pair)
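The surrogate pair for a supplementary code point follows from simple arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. The sketch below illustrates this; the function name `utf16_code_units` is hypothetical, and the result is cross-checked against Python's `utf-16-be` codec.

```python
import struct

def utf16_code_units(code_point: int) -> list:
    """Encode one Unicode scalar value into a list of UTF-16 code units."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {code_point:#x}")
    if code_point <= 0xFFFF:
        return [code_point]                 # BMP: a single code unit
    offset = code_point - 0x10000           # 20-bit value
    high = 0xD800 + (offset >> 10)          # top 10 bits: high surrogate
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits: low surrogate
    return [high, low]

for ch in "A中😀":
    units = utf16_code_units(ord(ch))
    raw = ch.encode("utf-16-be")
    # Unpack the codec output into 16-bit big-endian integers for comparison.
    assert units == list(struct.unpack(f">{len(raw) // 2}H", raw))
    print(f"U+{ord(ch):04X}", [f"0x{u:04X}" for u in units])
```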
UTF-32 (32-bit code units)
Every code point uses exactly 1 code unit — UTF-32 is the only fixed-width Unicode encoding:
'A' (U+0041) → 0x00000041
'中' (U+4E2D) → 0x00004E2D
'😀' (U+1F600) → 0x0001F600
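One practical consequence of the fixed width is constant-time indexing: the n-th code point always starts at byte offset 4n. A minimal sketch, assuming little-endian UTF-32 without a byte order mark (the helper name `nth_code_point_utf32` is hypothetical):

```python
import struct

def nth_code_point_utf32(data: bytes, n: int) -> int:
    """Return the n-th code point (0-based) from little-endian UTF-32 bytes."""
    # Every code point is exactly one 4-byte code unit, so the offset is 4 * n.
    (value,) = struct.unpack_from("<I", data, 4 * n)
    return value

data = "A中😀".encode("utf-32-le")  # the "-le" codec emits no byte order mark
print(hex(nth_code_point_utf32(data, 2)))  # 0x1f600
```

Doing the same in UTF-8 or UTF-16 requires scanning from the start of the text, because earlier code points may occupy a varying number of code units.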
Why Code Units Matter in Programming
Many programming languages expose string length in terms of code units, not code points or grapheme clusters:
Python:
s = "😀"
len(s)                    # 1: Python 3 counts code points
len(s.encode("utf-8"))    # 4: UTF-8 code units (bytes)
JavaScript:
"😀".length       // 2: JavaScript counts UTF-16 code units
[..."😀"].length  // 1: the spread iterator yields code points
Java:
String s = "😀";
s.length()                       // 2: String.length() counts UTF-16 code units
s.codePointCount(0, s.length())  // 1: code point count
This is a common source of bugs: naive string slicing by index in JavaScript or Java can split a surrogate pair, producing invalid text.
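The check that guards against such a split is the same in any language: an index is unsafe when the code unit before it is a high surrogate and the code unit at it is a low surrogate. The sketch below models a UTF-16 string in Python as a list of code units; both helper names are hypothetical.

```python
import struct

def to_utf16_units(text: str) -> list:
    """Return the UTF-16 code units of `text` as a list of integers."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))

def splits_surrogate_pair(units: list, index: int) -> bool:
    """True if slicing at `index` would separate a high surrogate from its low surrogate."""
    if index <= 0 or index >= len(units):
        return False
    return 0xD800 <= units[index - 1] <= 0xDBFF and 0xDC00 <= units[index] <= 0xDFFF

units = to_utf16_units("a😀b")  # [0x0061, 0xD83D, 0xDE00, 0x0062]
for i in range(len(units) + 1):
    print(i, "unsafe" if splits_surrogate_pair(units, i) else "safe")
```

Only index 2 is reported as unsafe here, which corresponds to cutting between 0xD83D and 0xDE00.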
Common Pitfalls
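Confusing code units with bytes: A UTF-16 code unit is 2 bytes, not 1. A string of n UTF-16 code units occupies 2n bytes when serialized as UTF-16. A Java String's length() always counts code units, although JVMs since Java 9 may store Latin-1-only strings more compactly in memory.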
Assuming 1 code unit = 1 character: A single user-visible character (grapheme cluster) may require multiple code points, each potentially requiring multiple code units.
Slicing strings at byte offsets: In UTF-8, every continuation byte has the form 10xxxxxx. Cutting a byte string immediately before a continuation byte splits a multi-byte sequence and produces invalid UTF-8.
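For the byte-offset pitfall, one common approach is to move the cut point backwards until it no longer lands on a continuation byte. The following is a minimal sketch that assumes the input is already valid UTF-8; the function name `truncate_utf8` is hypothetical.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut valid UTF-8 `data` to at most `max_bytes` bytes without splitting a sequence."""
    if max_bytes >= len(data):
        return data
    cut = max(max_bytes, 0)
    # A byte of the form 10xxxxxx is a continuation byte; cutting right before
    # it would split a sequence, so back up to the sequence's lead byte.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]

encoded = "é中".encode("utf-8")  # C3 A9 | E4 B8 AD
for limit in range(len(encoded) + 1):
    piece = truncate_utf8(encoded, limit)
    print(limit, repr(piece.decode("utf-8")))  # decoding never fails
```

The result may be shorter than the requested limit, since an incomplete trailing sequence is dropped rather than kept.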
Quick Facts
| Property | Value |
|---|---|
| UTF-8 code unit size | 8 bits (1 byte) |
| UTF-16 code unit size | 16 bits (2 bytes) |
| UTF-32 code unit size | 32 bits (4 bytes) |
| UTF-8 code units per BMP char | 1–3 |
| UTF-16 code units per BMP char | 1 |
| UTF-16 code units per supplementary char | 2 (surrogate pair) |
| Only fixed-width encoding | UTF-32 |
| Languages and platforms using UTF-16 strings | Java, JavaScript, C# (.NET), Windows API |
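Related Terms
More in Unicode Standard
CJK (Chinese, Japanese, Korean): a collective term for the unified Han ideograph blocks and related writing systems in Unicode. CJK Unified Ideographs comprise more than 20,992 characters.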
The process of mapping Chinese, Japanese, and Korean ideographs that share a …
The individual consonant and vowel components (jamo) of the Korean Hangul writing …
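The international standard (ISO/IEC 10646) synchronized with Unicode. It defines the same character set and code points but does not include Unicode's additional algorithms and properties.
The universal character encoding standard that assigns a unique number (code point) to every character in every writing system. Version 16.0 contains 154,998 assigned characters.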
Normative or informative documents that are integral parts of the Unicode Standard. …
Informational documents published by the Unicode Consortium covering specific topics like security …
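The collection of machine-readable data files that define all Unicode character properties, including UnicodeData.txt, Blocks.txt, Scripts.txt, and others.
All code points except the surrogate code points (U+D800–U+DFFF). This is the set of valid values that can represent actual characters, 1,112,064 in total.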
A major version of the Unicode Standard. Each release adds new characters, scripts, and features. Unicode 16.0 was released in September 2024.