UTF-32
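A fixed-length Unicode encoding that uses exactly 4 bytes per character. It is simple but space-inefficient; Python 3 (CPython) uses a 4-byte representation internally for strings that contain code points above U+FFFF.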
What is UTF-32?
UTF-32 (Unicode Transformation Format — 32-bit) is a fixed-length Unicode encoding that represents every code point as exactly 4 bytes (a 32-bit integer). Its defining characteristic is simplicity: code point U+XXXX is stored as the 32-bit integer 0x0000XXXX, with no variable-length tricks, no surrogate pairs, and no ambiguity about where characters start and end.
UTF-32 is primarily used as an internal processing format where random access to characters by index matters more than storage efficiency. CPython 3 stores strings that contain code points above U+FFFF in a 4-byte-per-code-point form (UCS-4, equivalent to UTF-32) and uses narrower forms for other strings. Various text-processing applications use UTF-32 as a working representation because indexing into a UTF-32 string by code point position is O(1) rather than O(n).
How UTF-32 Works
The encoding is direct: the code point value becomes the 32-bit integer stored in memory. Like UTF-16, UTF-32 comes in big-endian and little-endian variants, and a BOM (U+FEFF stored as 4 bytes) can signal byte order at the start of a file.
| Code Point | UTF-32 LE bytes | UTF-32 BE bytes |
|---|---|---|
| U+0041 (A) | 41 00 00 00 | 00 00 00 41 |
| U+00E9 (é) | E9 00 00 00 | 00 00 00 E9 |
| U+4E2D (中) | 2D 4E 00 00 | 00 00 4E 2D |
| U+1F600 (😀) | 00 F6 01 00 | 00 01 F6 00 |
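The byte layouts in the table can be reproduced by hand. The following is a minimal sketch, assuming a hypothetical helper named `encode_utf32` (not part of any library) that writes a code point as one 32-bit integer and cross-checks the result against Python's built-in `utf-32-le` and `utf-32-be` codecs.

```python
def encode_utf32(code_point: int, byteorder: str = "little") -> bytes:
    """Store one code point as a 32-bit integer (hypothetical helper)."""
    # Reject values outside the Unicode range and the surrogate block.
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {code_point:#x}")
    return code_point.to_bytes(4, byteorder)


for cp in (0x41, 0xE9, 0x4E2D, 0x1F600):
    le = encode_utf32(cp, "little")
    be = encode_utf32(cp, "big")
    # Cross-check against the built-in codecs.
    assert le == chr(cp).encode("utf-32-le")
    assert be == chr(cp).encode("utf-32-be")
    print(f"U+{cp:04X}  LE={le.hex(' ')}  BE={be.hex(' ')}")
```

Note that `bytes.hex` accepts a separator argument only on Python 3.8 and later.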
Random access advantage: To get the Nth character of a UTF-32 string, simply read 4 bytes at offset N * 4. No scanning, no surrogate-pair awareness, no variable-length parsing.
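As a rough illustration of the offset arithmetic, the sketch below reads the Nth code point straight out of a raw little-endian UTF-32 buffer. The helper name `code_point_at` is made up for this example, and the buffer is assumed to carry no BOM.

```python
import struct


def code_point_at(buf: bytes, n: int) -> int:
    """Return the Nth code point of a BOM-less UTF-32 LE buffer (hypothetical helper)."""
    # One unsigned 32-bit little-endian integer at byte offset n * 4.
    (cp,) = struct.unpack_from("<I", buf, n * 4)
    return cp


data = "Hello, 😀 world".encode("utf-32-le")
print(hex(code_point_at(data, 7)))  # 0x1f600
print(chr(code_point_at(data, 0)))  # H
```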
Code Examples
import sys

text = 'Hello, 😀'
# UTF-32 encoding: the generic 'utf-32' codec prepends a BOM in native byte order
utf32 = text.encode('utf-32')
print(len(utf32))     # 36: 4-byte BOM + 8 chars * 4 bytes
utf32_le = text.encode('utf-32-le')
print(len(utf32_le))  # 32: 8 chars * 4 bytes, no BOM
# Python's internal representation
# CPython picks 1, 2, or 4 bytes per code point for each string; '中文' needs only 2
s = '中文'
print(sys.getsizeof(s))  # character data + object overhead (exact value varies by version)
# Random access in native Python strings
s = 'Hello, 😀 world'
print(s[7])  # '😀': correct, O(1) indexing
print(s[8])  # ' '
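// C: wchar_t on Linux is typically 4 bytes (UTF-32); on Windows it is 2 bytes (UTF-16)
#include <stdio.h>
#include <wchar.h>
int main(void) {
    wchar_t greeting[] = L"Hello, \U0001F600";  // 😀 written as a universal character name
    printf("%zu\n", sizeof greeting);  // (8 + 1) * 4 = 36 with a 4-byte wchar_t (including null terminator)
    printf("U+%lX\n", (unsigned long)greeting[7]);  // random access: U+1F600 (emoji code point)
    return 0;
}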
Python's Internal String Representation
Python 3 does not use UTF-32 uniformly. CPython uses three internal formats depending on the highest code point in the string:
- Latin-1 (1 byte/char): if all code points ≤ U+00FF
- UCS-2 (2 bytes/char): if all code points ≤ U+FFFF
- UCS-4 (4 bytes/char): if any code point > U+FFFF
This means sys.getsizeof('abc😀') exceeds sys.getsizeof('abc') by far more than the 4 bytes of one extra character, because the emoji forces the 4-byte representation for the entire string. This design minimizes memory while preserving O(1) indexing.
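The per-character width can be observed indirectly. This is a sketch that assumes CPython 3.3 or later; other Python implementations may store strings differently. The helper name `bytes_per_char` is made up for this example. It compares the sizes of two strings of different lengths so that the fixed object header cancels out.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> float:
    """Estimate per-character storage for strings made of `ch` (CPython-specific)."""
    # Subtracting the two sizes cancels the fixed object header.
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) / n


for ch in ("a", "\u00e9", "\u4e2d", "\U0001F600"):
    print(f"U+{ord(ch):04X}: {bytes_per_char(ch):.0f} byte(s) per character")
```

On CPython this should report 1, 1, 2, and 4 bytes respectively, matching the three storage widths listed above.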
Quick Facts
| Property | Value |
|---|---|
| Full Name | Unicode Transformation Format — 32-bit |
| Bytes per code point | Always 4 |
| Code unit size | 32 bits |
| BOM | 00 00 FE FF (big endian) or FF FE 00 00 (little endian) |
| Random access | O(1) by character index |
| Used internally by | Python 3 (CPython, for strings with code points above U+FFFF), various text processing systems |
| Space efficiency | Poor (4× larger than UTF-8 for ASCII text, 1–2× for other text) |
Common Pitfalls
Confusing code points with grapheme clusters. Even in UTF-32, some visible "characters" are composed of multiple code points: a base character plus combining diacritical marks, or emoji with skin tone modifiers. UTF-32 gives you O(1) access to code points, not to perceived characters. 'é' in NFD form is two code points (U+0065 + U+0301) and thus 8 bytes in UTF-32, not 4.
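The NFD case can be checked with the standard `unicodedata` module. This is a small sketch; the variable names are illustrative only.

```python
import unicodedata

composed = "\u00e9"                                  # é as a single code point (NFC)
decomposed = unicodedata.normalize("NFD", composed)  # e + combining acute accent

print(len(composed), len(composed.encode("utf-32-le")))      # 1 4
print(len(decomposed), len(decomposed.encode("utf-32-le")))  # 2 8
print([f"U+{ord(c):04X}" for c in decomposed])               # ['U+0065', 'U+0301']
```

Both strings render as the same visible character, yet indexing the decomposed form at position 0 yields a bare "e".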
Assuming UTF-32 is fastest. Random access is O(1), but UTF-32 uses 4× more memory than UTF-8 for ASCII text. Cache efficiency degrades, and I/O takes longer. For much real-world text processing, UTF-8 with occasional decode overhead can outperform UTF-32 on modern hardware.
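The storage side of this trade-off is easy to measure. The sketch below compares encoded sizes only and does not measure speed. The sample strings and the helper name `size_ratio` are chosen for this example.

```python
def size_ratio(text: str) -> float:
    """Return UTF-32 size divided by UTF-8 size for `text` (no BOM in either)."""
    return len(text.encode("utf-32-le")) / len(text.encode("utf-8"))


samples = {
    "ASCII": "Hello, world",
    "CJK": "\u4e2d\u6587",
    "Emoji": "\U0001F600",
}
for label, text in samples.items():
    print(f"{label}: UTF-32 is {size_ratio(text):.2f}x the size of UTF-8")
```

The ratio is 4.00 for the ASCII sample, about 1.33 for the CJK sample, and 1.00 for the emoji, so the memory penalty depends heavily on the script being stored.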
Byte order ambiguity. A UTF-32 file without a BOM requires the reader to know the byte order. The 4-byte BOMs are 00 00 FE FF (big endian) and FF FE 00 00 (little endian). The little-endian BOM begins with FF FE, which is also the UTF-16 little-endian BOM, so a detector that checks only the first 2 bytes will misidentify a UTF-32 LE file as UTF-16 LE.
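One way to avoid the 2-byte trap is to test the longer BOMs first. The following is a minimal sketch, assuming a hypothetical helper named `sniff_bom`; the BOM constants come from the standard `codecs` module. It is not a complete encoding detector: a UTF-16 LE file whose first character after the BOM is U+0000 would still be reported as UTF-32 LE.

```python
import codecs
from typing import Optional


def sniff_bom(data: bytes) -> Optional[str]:
    """Guess an encoding from a leading BOM (hypothetical helper)."""
    # Order matters: the UTF-32 LE BOM starts with the UTF-16 LE BOM,
    # so the 4-byte candidates must be checked before the 2-byte ones.
    candidates = (
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    )
    for bom, name in candidates:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(codecs.BOM_UTF32_LE + "hi".encode("utf-32-le")))  # utf-32-le
print(sniff_bom(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")))  # utf-16-le
```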