UTF-32
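A fixed-length Unicode encoding that uses exactly 4 bytes per code point. It is simple but space-inefficient. CPython (Python 3) uses a 4-byte-per-code-point layout internally only for strings that contain code points above U+FFFF.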
What is UTF-32?
UTF-32 (Unicode Transformation Format — 32-bit) is a fixed-length Unicode encoding that represents every code point as exactly 4 bytes (a 32-bit integer). Its defining characteristic is simplicity: code point U+XXXX is stored as the 32-bit integer 0x0000XXXX, with no variable-length tricks, no surrogate pairs, and no ambiguity about where characters start and end.
UTF-32 is primarily used as an internal processing format where random access to code points by index matters more than storage efficiency. CPython, the reference implementation of Python 3, stores strings in a closely related fixed-width layout that uses 1, 2, or 4 bytes per code point depending on the string's contents, and various text-processing applications use UTF-32 as a working representation precisely because indexing into a UTF-32 string by code point position is O(1) rather than O(n).
How UTF-32 Works
The encoding is direct: the code point value becomes the 32-bit integer stored in memory. Like UTF-16, UTF-32 comes in big-endian and little-endian variants, and a BOM (U+FEFF stored as 4 bytes) can signal byte order at the start of a file.
| Code Point | UTF-32 LE bytes | UTF-32 BE bytes |
|---|---|---|
| U+0041 (A) | 41 00 00 00 | 00 00 00 41 |
| U+00E9 (é) | E9 00 00 00 | 00 00 00 E9 |
| U+4E2D (中) | 2D 4E 00 00 | 00 00 4E 2D |
| U+1F600 (😀) | 00 F6 01 00 | 00 01 F6 00 |
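The rows of the table can be reproduced with a minimal sketch that writes each code point as a 4-byte integer in either byte order. The helper name `utf32_bytes` is hypothetical, introduced only for illustration; it relies on Python's built-in `int.to_bytes`.

```python
def utf32_bytes(code_point: int, byteorder: str) -> bytes:
    """Encode one code point as 4 bytes; byteorder is 'little' or 'big'."""
    # UTF-32 only covers U+0000..U+10FFFF, excluding the surrogate range.
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    return code_point.to_bytes(4, byteorder)


for cp in (0x41, 0xE9, 0x4E2D, 0x1F600):
    le = utf32_bytes(cp, "little").hex(" ").upper()
    be = utf32_bytes(cp, "big").hex(" ").upper()
    print(f"U+{cp:04X}  LE: {le}  BE: {be}")
```

The result should match what the built-in `utf-32-le` and `utf-32-be` codecs produce for the same character.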
Random access advantage: To get the Nth character of a UTF-32 string, simply read 4 bytes at offset N * 4. No scanning, no surrogate-pair awareness, no variable-length parsing.
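As a rough illustration of the offset arithmetic, the following sketch reads the Nth code point directly from a BOM-less UTF-32 LE byte buffer. The function name `code_point_at` is an assumption made for this example, not part of any library.

```python
def code_point_at(data: bytes, n: int) -> str:
    """Return the n-th code point of BOM-less UTF-32 LE data."""
    if n < 0:
        raise IndexError("negative index not supported in this sketch")
    offset = n * 4                      # every code point occupies 4 bytes
    chunk = data[offset:offset + 4]
    if len(chunk) != 4:
        raise IndexError("code point index out of range")
    return chr(int.from_bytes(chunk, "little"))


data = "Hello, 😀 world".encode("utf-32-le")
print(code_point_at(data, 7))  # 😀
```

No earlier bytes are inspected, which is what makes the lookup constant-time.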
Code Examples
text = 'Hello, 😀'
# UTF-32 encoding
utf32 = text.encode('utf-32')
print(len(utf32)) # 36: 4-byte BOM + 8 chars * 4 bytes
utf32_le = text.encode('utf-32-le')
print(len(utf32_le)) # 32: 8 chars * 4 bytes, no BOM
# Python's internal representation
import sys
# CPython picks 1, 2, or 4 bytes per code point for each string; it is not always UTF-32
s = '中文'
print(sys.getsizeof(s)) # 2 bytes per code point here (UCS-2) + object overhead
# Random access in native Python strings
s = 'Hello, 😀 world'
print(s[7]) # '😀' — correct, O(1) indexing
print(s[8]) # ' '
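// C: wchar_t is typically 4 bytes (UTF-32) on Linux, but 2 bytes (UTF-16) on Windows
#include <stdio.h>
#include <wchar.h>
int main(void) {
    wchar_t greeting[] = L"Hello, \U0001F600";  /* 😀 */
    printf("%zu\n", sizeof(greeting));              // (8 + 1) * 4 = 36 on Linux, including the null terminator
    printf("U+%lX\n", (unsigned long)greeting[7]);  // random access: U+1F600 on Linux
    return 0;
}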
Python's Internal String Representation
Python 3 does not use UTF-32 uniformly. CPython uses three internal formats depending on the highest code point in the string:
- Latin-1 (1 byte/char): if all code points ≤ U+00FF
- UCS-2 (2 bytes/char): if all code points ≤ U+FFFF
- UCS-4 (4 bytes/char): if any code point > U+FFFF
This means sys.getsizeof('abc') differs from sys.getsizeof('abc😀') because the emoji forces the 4-byte representation for the entire string. This design minimizes memory while providing O(1) indexing.
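The per-string width can be observed indirectly. The sketch below assumes CPython 3.3 or later (where this flexible representation applies) and estimates bytes per code point by comparing the sizes of two strings of different lengths, so that the fixed object overhead cancels out. The helper name `bytes_per_code_point` is hypothetical, and the numbers are a CPython implementation detail rather than a language guarantee.

```python
import sys


def bytes_per_code_point(ch: str) -> int:
    """Estimate CPython's storage width for strings made only of `ch`."""
    short = ch * 10
    long = ch * 110
    # Object header overhead is identical for both, so it cancels out.
    return (sys.getsizeof(long) - sys.getsizeof(short)) // 100


for ch in ("a", "é", "中", "😀"):
    print(f"U+{ord(ch):04X}: {bytes_per_code_point(ch)} byte(s) per code point")
```

On CPython this should report 1 for `a` and `é`, 2 for `中`, and 4 for `😀`.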
Quick Facts
| Property | Value |
|---|---|
| Full Name | Unicode Transformation Format — 32-bit |
| Bytes per code point | Always 4 |
| Code unit size | 32 bits |
| BOM | 00 00 FE FF (big endian) or FF FE 00 00 (little endian) |
| Random access | O(1) by code point index |
| Used internally by | CPython (only for strings containing code points above U+FFFF), various text processing systems |
| Space efficiency | Poor (4× larger than UTF-8 for ASCII text, 1–2× for non-ASCII text) |
Common Pitfalls
Confusing code points with grapheme clusters. Even in UTF-32, some visible "characters" are composed of multiple code points: a base character plus combining diacritical marks, or emoji with skin tone modifiers. UTF-32 gives you O(1) access to code points, not to perceived characters. 'é' in NFD form is two code points (U+0065 + U+0301) and thus 8 bytes in UTF-32, not 4.
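The 'é' case can be checked with a short sketch using the standard library's `unicodedata` module. The variable names are illustrative only.

```python
import unicodedata

composed = unicodedata.normalize("NFC", "e\u0301")    # single code point U+00E9
decomposed = unicodedata.normalize("NFD", "\u00e9")   # U+0065 followed by U+0301

for label, s in (("NFC", composed), ("NFD", decomposed)):
    # Columns: form, number of code points, number of UTF-32 bytes (no BOM)
    print(label, len(s), len(s.encode("utf-32-le")))
```

Both strings render as the same visible character, yet they differ in code point count and therefore in UTF-32 size.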
Assuming UTF-32 is fastest. Random access is O(1), but UTF-32 uses 4× more memory than UTF-8 for ASCII text, so fewer characters fit in each cache line and I/O moves more bytes. For many real-world text-processing workloads, which scan text sequentially rather than index into it at random, UTF-8 with occasional decode overhead can outperform UTF-32 on modern hardware.
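The memory side of this trade-off is easy to quantify. The sketch below compares encoded sizes only and says nothing about speed; the sample strings and the helper name `size_ratio` are chosen purely for illustration.

```python
def size_ratio(text: str) -> float:
    """Return UTF-32 size divided by UTF-8 size for the same text (no BOMs)."""
    return len(text.encode("utf-32-le")) / len(text.encode("utf-8"))


samples = {"ASCII": "Hello, world", "CJK": "中文字符", "Emoji": "😀😀"}
for label, text in samples.items():
    print(f"{label}: UTF-32 is {size_ratio(text):.2f}x the size of UTF-8")
```

The ratio is largest for ASCII-heavy text and shrinks toward 1 as more code points need 3 or 4 bytes in UTF-8.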
Byte order ambiguity. A UTF-32 file without a BOM requires the reader to know the byte order in advance. The 4-byte BOMs are 00 00 FE FF (big endian) and FF FE 00 00 (little endian). The little-endian one begins with the same two bytes as the UTF-16 LE BOM (FF FE), so a detector that checks only the first 2 bytes will misidentify UTF-32 LE data as UTF-16 LE.
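One way to avoid that misdetection is to test the longer BOMs first. This is a minimal sketch, not a complete encoding detector: the function name `sniff_bom` is hypothetical, and it uses the BOM constants from Python's standard `codecs` module.

```python
import codecs


def sniff_bom(data: bytes):
    """Return a codec name guessed from a leading BOM, or None if there is none."""
    # Order matters: the UTF-32 LE BOM (FF FE 00 00) starts with the
    # UTF-16 LE BOM (FF FE), so the 4-byte BOMs must be tested first.
    candidates = (
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    )
    for bom, name in candidates:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(codecs.BOM_UTF32_LE + "A".encode("utf-32-le")))  # utf-32-le
print(sniff_bom(codecs.BOM_UTF16_LE + "A".encode("utf-16-le")))  # utf-16-le
```

A residual ambiguity remains: UTF-16 LE text whose first character after the BOM is U+0000 also begins with FF FE 00 00, so a BOM check alone cannot be fully conclusive.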