UTF-32
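A fixed-width Unicode encoding that uses exactly 4 bytes per code point. It is simple but space-inefficient; CPython 3 uses an equivalent 4-byte layout internally only for strings that contain code points above U+FFFF.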
What is UTF-32?
UTF-32 (Unicode Transformation Format — 32-bit) is a fixed-length Unicode encoding that represents every code point as exactly 4 bytes (a 32-bit integer). Its defining characteristic is simplicity: code point U+XXXX is stored as the 32-bit integer 0x0000XXXX, with no variable-length tricks, no surrogate pairs, and no ambiguity about where characters start and end.
UTF-32 is primarily used as an internal processing format where random access to code points by index matters more than storage efficiency. CPython 3 stores strings in a fixed-width layout that widens to 4 bytes per code point (equivalent to UTF-32) only when a string contains code points above U+FFFF, and various text-processing applications use UTF-32 as a working representation precisely because indexing by code point position is O(1) rather than the O(n) scan a variable-length encoding requires.
How UTF-32 Works
The encoding is direct: the code point value becomes the 32-bit integer stored in memory. Like UTF-16, UTF-32 comes in big-endian and little-endian variants, and a BOM (U+FEFF stored as 4 bytes) can signal byte order at the start of a file.
| Code Point | UTF-32 LE bytes | UTF-32 BE bytes |
|---|---|---|
| U+0041 (A) | 41 00 00 00 | 00 00 00 41 |
| U+00E9 (é) | E9 00 00 00 | 00 00 00 E9 |
| U+4E2D (中) | 2D 4E 00 00 | 00 00 4E 2D |
| U+1F600 (😀) | 00 F6 01 00 | 00 01 F6 00 |
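The byte layouts in the table can be reproduced with a few lines of Python. This is a minimal sketch: `encode_utf32` is a hypothetical helper written for illustration, not a standard-library function, and it is cross-checked against Python's built-in `utf-32-le` and `utf-32-be` codecs.

```python
def encode_utf32(code_point: int, byteorder: str = "little") -> bytes:
    """Store one code point as a 4-byte integer (hypothetical helper)."""
    # Valid code points are U+0000..U+10FFFF, excluding the surrogate range.
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    return code_point.to_bytes(4, byteorder)


for ch in "A\u00e9\u4e2d\U0001F600":
    le = encode_utf32(ord(ch), "little")
    be = encode_utf32(ord(ch), "big")
    # Cross-check against Python's built-in codecs.
    assert le == ch.encode("utf-32-le")
    assert be == ch.encode("utf-32-be")
    print(f"U+{ord(ch):04X}  LE: {le.hex(' ').upper()}  BE: {be.hex(' ').upper()}")
```

The only difference between the two variants is the order in which the same four bytes are written.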
Random access advantage: To get the Nth code point (counting from zero) of a UTF-32 string without a BOM, simply read 4 bytes at offset N * 4. No scanning, no surrogate-pair awareness, no variable-length parsing.
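The offset arithmetic can be sketched directly on a byte buffer. `code_point_at` is a hypothetical helper for illustration, and it assumes the buffer holds UTF-32 data without a BOM in a known byte order.

```python
def code_point_at(buf: bytes, n: int, byteorder: str = "little") -> str:
    """Return the n-th code point of BOM-less UTF-32 data (hypothetical helper)."""
    offset = n * 4  # every code point occupies exactly 4 bytes
    if n < 0 or offset + 4 > len(buf):
        raise IndexError("code point index out of range")
    return chr(int.from_bytes(buf[offset:offset + 4], byteorder))


data = "Hello, 😀 world".encode("utf-32-le")
print(code_point_at(data, 7))  # 😀
print(len(data) // 4)          # 14 code points in the buffer
```

The lookup touches only four bytes no matter how long the buffer is, which is where the O(1) claim comes from.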
Code Examples
text = 'Hello, 😀'
# UTF-32 encoding
utf32 = text.encode('utf-32')
print(len(utf32)) # 36: 4-byte BOM + 8 chars * 4 bytes
utf32_le = text.encode('utf-32-le')
print(len(utf32_le)) # 32: 8 chars * 4 bytes, no BOM
# Python's internal representation (a CPython implementation detail)
import sys
# CPython picks 1, 2, or 4 bytes per code point based on the widest one in the string
s = '中文'
print(sys.getsizeof(s)) # 2 bytes per code point here (UCS-2), plus object overhead
# Random access in native Python strings
s = 'Hello, 😀 world'
print(s[7]) # '😀' — correct, O(1) indexing
print(s[8]) # ' '
// C: wchar_t is typically 4 bytes (UTF-32) on Linux, but only 2 bytes (UTF-16) on Windows
#include <wchar.h>
wchar_t greeting[] = L"Hello, 😀";
// With a 4-byte wchar_t: sizeof(greeting) == (8 + 1) * 4 bytes (including null terminator)
// Random access: greeting[7] == 0x1F600 (emoji code point)
Python's Internal String Representation
Python 3 does not use UTF-32 uniformly. CPython uses three internal formats depending on the highest code point in the string:
- Latin-1 (1 byte/char): if all code points ≤ U+00FF
- UCS-2 (2 bytes/char): if all code points ≤ U+FFFF
- UCS-4 (4 bytes/char): if any code point > U+FFFF
This means sys.getsizeof('abc') differs from sys.getsizeof('abc😀') because the emoji forces the 4-byte representation for the entire string. This design minimizes memory while providing O(1) indexing.
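The three storage widths can be observed from Python itself. The sketch below assumes CPython 3.3 or later; other interpreters may store strings differently and report other numbers. `bytes_per_code_point` is a hypothetical helper that compares two freshly built strings so the fixed object overhead cancels out.

```python
import sys


def bytes_per_code_point(sample: str, n: int = 1000) -> float:
    """Estimate CPython's storage cost per code point (hypothetical helper).

    Two strings built from the same sample are compared, so the fixed
    object header cancels and only the per-code-point payload remains.
    """
    short = sample * 2
    long = sample * (n + 2)
    return (sys.getsizeof(long) - sys.getsizeof(short)) / (len(long) - len(short))


# One sample per width class: ASCII, Latin-1, BMP, and beyond the BMP.
for sample in ("abc", "abc\u00e9", "abc\u4e2d", "abc\U0001F600"):
    print(repr(sample), bytes_per_code_point(sample))
```

On CPython the last two samples should show the jump to 2 and then 4 bytes per code point, even though only one character in each string needs the wider form.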
Quick Facts
| Property | Value |
|---|---|
| Full Name | Unicode Transformation Format — 32-bit |
| Bytes per code point | Always 4 |
| Code unit size | 32 bits |
| BOM | 00 00 FE FF (big endian) or FF FE 00 00 (little endian) |
| Random access | O(1) by character index |
| Used internally by | CPython 3 (only for strings with code points above U+FFFF), various text processing systems |
| Space efficiency | Poor (4× larger than UTF-8 for ASCII text, about 1.3× for CJK text) |
Common Pitfalls
Confusing code points with grapheme clusters. Even in UTF-32, some visible "characters" are composed of multiple code points: a base character plus combining diacritical marks, or emoji with skin tone modifiers. UTF-32 gives you O(1) access to code points, not to perceived characters. 'é' in NFD form is two code points (U+0065 + U+0301) and thus 8 bytes in UTF-32, not 4.
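The NFD example can be checked with the standard `unicodedata` module. This is a small illustration only; the variable names are arbitrary.

```python
import unicodedata

precomposed = unicodedata.normalize("NFC", "\u00e9")  # single code point U+00E9
decomposed = unicodedata.normalize("NFD", "\u00e9")   # U+0065 followed by U+0301

for label, s in (("NFC", precomposed), ("NFD", decomposed)):
    code_points = " ".join(f"U+{ord(c):04X}" for c in s)
    size = len(s.encode("utf-32-le"))
    print(f"{label}: {len(s)} code point(s) [{code_points}], {size} bytes in UTF-32")
```

Both strings render as the same single character, yet indexing position 0 of the decomposed form returns a plain 'e' without its accent.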
Assuming UTF-32 is fastest. Random access is O(1), but UTF-32 uses 4× as much memory as UTF-8 for ASCII text, so cache efficiency degrades and I/O takes longer. For many real-world text-processing workloads, which mostly scan text sequentially, UTF-8 with occasional decode overhead can outperform UTF-32 on modern hardware.
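The size overhead depends on the script being stored. The sketch below measures it for a few sample strings chosen for illustration; `size_ratio` is a hypothetical helper, and it compares encoded sizes only, not processing speed.

```python
def size_ratio(text: str) -> float:
    """Return UTF-32 size divided by UTF-8 size, both without a BOM (hypothetical helper)."""
    return len(text.encode("utf-32-le")) / len(text.encode("utf-8"))


samples = {
    "ASCII": "Hello, world",
    "Latin with accents": "caf\u00e9",
    "CJK": "\u4e2d\u6587\u7f16\u7801",
    "Emoji": "\U0001F600\U0001F600",
}
for name, text in samples.items():
    print(f"{name}: UTF-32 is {size_ratio(text):.2f}x the size of UTF-8")
```

The ratio is highest for pure ASCII and drops to 1.0 for text made entirely of code points above U+FFFF, which UTF-8 also encodes in 4 bytes.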
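Byte order ambiguity. A UTF-32 file without a BOM requires the reader to know the byte order. With a BOM, the little-endian form FF FE 00 00 begins with the same two bytes as the UTF-16 little-endian BOM (FF FE), so a detector that checks only the first 2 bytes will misidentify UTF-32 LE as UTF-16 LE. The big-endian form 00 00 FE FF does not have this problem because it starts with 00 00.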
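One way to avoid the BOM mix-up is to test the longer UTF-32 signatures before the shorter UTF-16 ones. The sketch below uses the BOM constants from Python's `codecs` module; `sniff_bom` is a hypothetical helper, and it returns only a guess, since a UTF-16 LE file whose first character is U+0000 would produce the same leading bytes as a UTF-32 LE BOM.

```python
import codecs

# Order matters: the 4-byte UTF-32 LE BOM starts with the 2-byte UTF-16 LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(data: bytes):
    """Guess an encoding from a leading BOM, or return None (hypothetical helper)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(codecs.BOM_UTF32_LE + "hi".encode("utf-32-le")))  # utf-32-le
print(sniff_bom(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")))  # utf-16-le
```

Swapping the first and fourth entries in `_BOMS` would reproduce the misdetection described above.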
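Related Terms
More in Encoding
American Standard Code for Information Interchange. A 7-bit encoding covering 128 characters (0–127), including control characters, digits, Latin letters, and basic symbols.
Visual art created from text characters, originally limited to the 95 printable …
Binary-to-text encoding that represents binary data using 64 ASCII characters (A–Z, a–z, …
A Traditional Chinese character encoding used mainly in Taiwan and Hong Kong, covering about 13,000 CJK characters.
Extended Binary Coded Decimal Interchange Code. An IBM mainframe encoding with non-contiguous letter ranges, still used on banking and enterprise mainframes.
A Korean character encoding based on KS X 1001 that maps Hangul syllables and Hanja to double-byte sequences.
The Simplified Chinese character encoding family: GB2312 (6,763 characters) evolved through GBK into GB18030, a mandatory Chinese national standard compatible with Unicode.
The official registry of character encoding names maintained by IANA, used in HTTP Content-Type headers and MIME (e.g. charset=utf-8).
A family of 8-bit single-byte encodings for different language groups; ISO 8859-1 (Latin-1) is the basis of the first 256 Unicode code points.
A Japanese character encoding that combines single-byte ASCII/JIS Roman with double-byte JIS X 0208 kanji, still used in legacy Japanese systems.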