UTF-32
A fixed-width Unicode encoding that uses exactly 4 bytes per code point. Simple but memory-inefficient; CPython (Python 3) uses it internally for strings that contain code points above U+FFFF.
What is UTF-32?
UTF-32 (Unicode Transformation Format — 32-bit) is a fixed-length Unicode encoding that represents every code point as exactly 4 bytes (a 32-bit integer). Its defining characteristic is simplicity: code point U+XXXX is stored as the 32-bit integer 0x0000XXXX, with no variable-length tricks, no surrogate pairs, and no ambiguity about where characters start and end.
UTF-32 is primarily used as an internal processing format where random access to characters by index matters more than storage efficiency. CPython, the reference Python 3 implementation, stores each string at a fixed width of 1, 2, or 4 bytes per code point and uses the 4-byte (UTF-32-equivalent) form only when the string needs it. Various text-processing applications use UTF-32 as a working representation because indexing into a UTF-32 string by code point position is O(1) rather than O(n).
How UTF-32 Works
The encoding is direct: the code point value becomes the 32-bit integer stored in memory. Like UTF-16, UTF-32 comes in big-endian and little-endian variants, and a BOM (U+FEFF stored as 4 bytes) can signal byte order at the start of a file.
| Code Point | UTF-32 LE bytes | UTF-32 BE bytes |
|---|---|---|
| U+0041 (A) | 41 00 00 00 | 00 00 00 41 |
| U+00E9 (é) | E9 00 00 00 | 00 00 00 E9 |
| U+4E2D (中) | 2D 4E 00 00 | 00 00 4E 2D |
| U+1F600 (😀) | 00 F6 01 00 | 00 01 F6 00 |
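The byte layouts in the table can be reproduced by hand. The following is a minimal sketch; `utf32_bytes` is a hypothetical helper name introduced here for illustration, and the check against Python's built-in `utf-32-le` and `utf-32-be` codecs is only there to confirm the hand-rolled bytes.

```python
def utf32_bytes(code_point: int, byteorder: str) -> bytes:
    """Store a code point as one 4-byte integer ('little' or 'big')."""
    return code_point.to_bytes(4, byteorder)

# A, é (U+00E9), 中 (U+4E2D), 😀 (U+1F600), written as escapes to avoid
# any ambiguity about normalization in the source file
for ch in "A\u00e9\u4e2d\U0001F600":
    cp = ord(ch)
    le = utf32_bytes(cp, "little")
    be = utf32_bytes(cp, "big")
    # The hand-rolled bytes agree with Python's built-in codecs
    assert le == ch.encode("utf-32-le")
    assert be == ch.encode("utf-32-be")
    print(f"U+{cp:04X}  LE: {le.hex(' ').upper()}  BE: {be.hex(' ').upper()}")
```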
Random access advantage: To get the Nth character of a UTF-32 string, simply read 4 bytes at offset N * 4. No scanning, no surrogate-pair awareness, no variable-length parsing.
Code Examples
```python
import sys

text = 'Hello, 😀'

# UTF-32 encoding
utf32 = text.encode('utf-32')
print(len(utf32))     # 36: 4-byte BOM + 8 chars * 4 bytes
utf32_le = text.encode('utf-32-le')
print(len(utf32_le))  # 32: 8 chars * 4 bytes, no BOM

# Python's internal representation
# CPython picks 1, 2, or 4 bytes per code point for each string;
# '中文' fits in the 2-byte form, so it is not stored as UTF-32
s = '中文'
print(sys.getsizeof(s))  # character data + object overhead

# Random access in native Python strings
s = 'Hello, 😀 world'
print(s[7])  # '😀': correct, O(1) indexing
print(s[8])  # ' '
```
```c
// C: wchar_t on Linux is typically 4 bytes (UTF-32); on Windows it is 2 bytes (UTF-16)
#include <wchar.h>

wchar_t greeting[] = L"Hello, 😀";
// With a 4-byte wchar_t: sizeof(greeting) == (8 + 1) * 4 bytes (including null terminator)
// Random access: greeting[7] == 0x1F600 (emoji code point)
```
Python's Internal String Representation
Python 3 does not use UTF-32 uniformly. CPython uses three internal formats depending on the highest code point in the string:
- Latin-1 (1 byte/char): if all code points ≤ U+00FF
- UCS-2 (2 bytes/char): if all code points ≤ U+FFFF
- UCS-4 (4 bytes/char): if any code point > U+FFFF
This means sys.getsizeof('abc') differs from sys.getsizeof('abc😀') because the emoji forces the 4-byte representation for the entire string. This design minimizes memory while providing O(1) indexing.
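The width CPython picks can be observed indirectly. The sketch below is CPython-specific and relies on the assumption that growing a string by one code point grows the object by exactly one storage unit; `bytes_per_code_point` is a hypothetical helper name. Absolute `sys.getsizeof` values vary between Python versions, so only the difference is used.

```python
import sys

def bytes_per_code_point(ch: str) -> int:
    """Estimate CPython's storage width for strings made only of `ch`."""
    # One extra code point should add exactly one storage unit.
    return sys.getsizeof(ch * 101) - sys.getsizeof(ch * 100)

samples = [
    ("ASCII", "a"),
    ("Latin-1", "\u00e9"),     # é
    ("BMP", "\u4e2d"),         # 中
    ("astral", "\U0001F600"),  # 😀
]
for label, ch in samples:
    print(label, bytes_per_code_point(ch))
```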
Quick Facts
| Property | Value |
|---|---|
| Full Name | Unicode Transformation Format — 32-bit |
| Bytes per character | Always 4 |
| Code unit size | 32 bits |
| BOM | 00 00 FE FF (big endian) or FF FE 00 00 (little endian) |
| Random access | O(1) by character index |
| Used internally by | CPython (for strings with code points above U+FFFF), various text processing systems |
| Space efficiency | Poor (4× larger than UTF-8 for ASCII text, about 1.3× for CJK) |
Common Pitfalls
Confusing code points with grapheme clusters. Even in UTF-32, some visible "characters" are composed of multiple code points: a base character plus combining diacritical marks, or emoji with skin tone modifiers. UTF-32 gives you O(1) access to code points, not to perceived characters. 'é' in NFD form is two code points (U+0065 + U+0301) and thus 8 bytes in UTF-32, not 4.
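The 'é' case can be checked with the standard `unicodedata` module. This is a small sketch of the point above, not a grapheme-cluster segmenter; proper segmentation needs the rules in Unicode Standard Annex #29, which the standard library does not implement.

```python
import unicodedata

composed = "\u00e9"                                  # é as one code point (NFC)
decomposed = unicodedata.normalize("NFD", composed)  # e + U+0301 combining acute

print(len(composed), len(composed.encode("utf-32-le")))      # 1 code point, 4 bytes
print(len(decomposed), len(decomposed.encode("utf-32-le")))  # 2 code points, 8 bytes
print(composed == decomposed)                                # False, though they render alike
```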
Assuming UTF-32 is fastest. Random access is O(1), but UTF-32 uses 4× more memory than UTF-8 for ASCII text, so fewer characters fit in each cache line and I/O moves more bytes. For workloads that mostly scan text sequentially, UTF-8 often outperforms UTF-32 on modern hardware despite its variable-length decoding.
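The size side of this trade-off is easy to measure; speed depends on the workload and is not measured here. The sketch below compares encoded lengths for a few sample strings chosen for illustration (`encoded_sizes` is a hypothetical helper name).

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-32 byte count without BOM)."""
    return len(text.encode("utf-8")), len(text.encode("utf-32-le"))

samples = {
    "ASCII": "Hello, world",
    "CJK": "\u4e2d\u6587",    # 中文
    "emoji": "\U0001F600",    # 😀
}
for label, text in samples.items():
    utf8, utf32 = encoded_sizes(text)
    print(f"{label}: UTF-8 {utf8} bytes, UTF-32 {utf32} bytes")
```

The 4× gap applies to ASCII; it narrows for CJK text and disappears for code points that already need 4 bytes in UTF-8.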
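Byte order ambiguity. A UTF-32 file without a BOM requires the reader to know the byte order in advance. The BOMs are 00 00 FE FF (big endian) and FF FE 00 00 (little endian). The little-endian BOM begins with FF FE, which is also the UTF-16 LE BOM, so a detector that checks only the first 2 bytes will misread UTF-32 LE as UTF-16 LE; test for the 4-byte BOMs first.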
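One way to avoid the mix-up is to test the longer BOMs before the shorter ones, because the UTF-32 LE BOM starts with the same two bytes as the UTF-16 LE BOM. The sketch below uses the BOM constants from Python's `codecs` module; `sniff_bom` is a hypothetical helper name. It cannot resolve one genuine ambiguity: UTF-16 LE data whose first character after the BOM is U+0000 looks identical to a UTF-32 LE BOM.

```python
import codecs

# Order matters: FF FE 00 00 (UTF-32 LE) must be tried before FF FE (UTF-16 LE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

raw = codecs.BOM_UTF32_LE + "hi".encode("utf-32-le")
encoding, skip = sniff_bom(raw)
print(encoding, raw[skip:].decode(encoding))
```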