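UTF-8: A variable-length Unicode encoding that uses 1 to 4 bytes per character. It is the dominant encoding on the Web (over 98%) and is fully backward compatible with ASCII.

UTF-16: A variable-length Unicode encoding that uses one or two 16-bit code units (2 or 4 bytes) per character. It is used internally by Java, JavaScript, and Windows.

What is a code point?

A numeric value in the Unicode code space (U+0000 to U+10FFFF), written as U+XXXX. Not every code point is assigned to a character.

Encoding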

UTF-32

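A fixed-length Unicode encoding that uses exactly 4 bytes per character. It is simple but space-inefficient. CPython (Python 3) uses a 4-byte form internally only for strings that contain code points above U+FFFF.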

2021-02-22 · Updated 2024-07-03

What is UTF-32?

UTF-32 (Unicode Transformation Format — 32-bit) is a fixed-length Unicode encoding that represents every code point as exactly 4 bytes (a 32-bit integer). Its defining characteristic is simplicity: code point U+XXXX is stored as the 32-bit integer 0x0000XXXX, with no variable-length tricks, no surrogate pairs, and no ambiguity about where characters start and end.

UTF-32 is primarily used as an internal processing format where random access to code points by index matters more than storage efficiency. CPython (Python 3) stores a string at 4 bytes per code point, equivalent to UTF-32, only when the string contains a code point above U+FFFF, and uses 1 or 2 bytes per code point otherwise. Various text-processing applications use UTF-32 as a working representation because indexing by code point position is O(1) rather than the O(n) scan that UTF-8 or UTF-16 requires.

How UTF-32 Works

The encoding is direct: the code point value becomes the 32-bit integer stored in memory. Like UTF-16, UTF-32 comes in big-endian and little-endian variants, and a BOM (U+FEFF stored as 4 bytes) can signal byte order at the start of a file.

Code Point	UTF-32 LE bytes	UTF-32 BE bytes
U+0041 (A)	`41 00 00 00`	`00 00 00 41`
U+00E9 (é)	`E9 00 00 00`	`00 00 00 E9`
U+4E2D (中)	`2D 4E 00 00`	`00 00 4E 2D`
U+1F600 (😀)	`00 F6 01 00`	`00 01 F6 00`
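The rows above can be reproduced in a few lines of Python. This is a minimal sketch: the helper name `utf32_bytes` is hypothetical, and it relies only on the built-in `int.to_bytes` and `bytes.hex` (the separator argument needs Python 3.8 or later).

```python
def utf32_bytes(code_point: int, byteorder: str) -> str:
    """Return the 4 UTF-32 bytes of one code point as a spaced hex string."""
    # UTF-32 stores the code point value directly as a 32-bit integer.
    raw = code_point.to_bytes(4, byteorder)
    return raw.hex(" ").upper()

for cp in (0x41, 0xE9, 0x4E2D, 0x1F600):
    print(f"U+{cp:04X}", utf32_bytes(cp, "little"), "|", utf32_bytes(cp, "big"))
```

The same bytes come out of `chr(cp).encode('utf-32-le')` and `chr(cp).encode('utf-32-be')`, which is a quick way to cross-check the helper.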

Random access advantage: To get the code point at zero-based index N of a UTF-32 string, simply read the 4 bytes at offset N * 4 (after any BOM). No scanning, no surrogate-pair awareness, no variable-length parsing.
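The offset arithmetic can be sketched directly on raw bytes. The function name `code_point_at` is hypothetical, and the sketch assumes little-endian UTF-32 data with no BOM.

```python
def code_point_at(data: bytes, n: int) -> str:
    """Return the character at zero-based index n of BOM-less UTF-32 LE data."""
    if n < 0:
        raise IndexError("index out of range")
    offset = n * 4                      # every code point occupies exactly 4 bytes
    unit = data[offset:offset + 4]
    if len(unit) != 4:
        raise IndexError("index out of range")
    return chr(int.from_bytes(unit, "little"))

data = "Hello, \U0001F600 world".encode("utf-32-le")
print(code_point_at(data, 7))   # the emoji U+1F600
print(code_point_at(data, 8))   # a space
```

The lookup touches only 4 bytes regardless of how long the string is or which characters precede index n.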

Code Examples

text = 'Hello, 😀'

# UTF-32 encoding
utf32 = text.encode('utf-32')
print(len(utf32))           # 36: 4-byte BOM + 8 chars * 4 bytes

utf32_le = text.encode('utf-32-le')
print(len(utf32_le))        # 32: 8 chars * 4 bytes, no BOM

# Python's internal representation
import sys
# CPython picks 1, 2, or 4 bytes per code point depending on the string's contents
s = '中文'
print(sys.getsizeof(s))     # 2 bytes per code point here (all ≤ U+FFFF) + object overhead

# Random access in native Python strings
s = 'Hello, 😀 world'
print(s[7])   # '😀' — correct, O(1) indexing
print(s[8])   # ' '

// C: wchar_t on Linux is typically 4 bytes (UTF-32); on Windows it is 2 bytes (UTF-16)
#include <wchar.h>
wchar_t greeting[] = L"Hello, 😀";
// With a 4-byte wchar_t: sizeof(greeting) == (8 + 1) * 4 bytes (including null terminator)
// Random access: greeting[7] == 0x1F600 (emoji code point)

Python's Internal String Representation

Python 3 does not use UTF-32 uniformly. CPython uses three internal formats depending on the highest code point in the string:

Latin-1 (1 byte/char): if all code points ≤ U+00FF
UCS-2 (2 bytes/char): if all code points ≤ U+FFFF
UCS-4 (4 bytes/char): if any code point > U+FFFF

This means sys.getsizeof('abc') differs from sys.getsizeof('abc😀') because the emoji forces the 4-byte representation for the entire string. This design minimizes memory while providing O(1) indexing.
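The three widths can be observed from Python itself. This is a rough sketch that assumes CPython 3.3 or later; other implementations may store strings differently. The helper name `bytes_per_code_point` is hypothetical.

```python
import sys

def bytes_per_code_point(ch: str) -> int:
    """Estimate CPython's storage width for strings made only of ch."""
    # Both strings are freshly built and share the same fixed object
    # overhead, so the size difference is 10 storage units.
    return (sys.getsizeof(ch * 20) - sys.getsizeof(ch * 10)) // 10

for label, ch in [("ASCII", "a"), ("Latin-1", "\u00e9"),
                  ("BMP", "\u4e2d"), ("above U+FFFF", "\U0001F600")]:
    print(label, bytes_per_code_point(ch))
```

Measuring the difference between two lengths, rather than a single `sys.getsizeof` call, keeps the object header out of the result.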

Quick Facts

Property	Value
Full Name	Unicode Transformation Format — 32-bit
Bytes per character	Always 4
Code unit size	32 bits
BOM	00 00 FE FF (big endian) or FF FE 00 00 (little endian)
Random access	O(1) by character index
Used internally by	Python 3, various text processing systems
Space efficiency	Poor (4× the size of UTF-8 for ASCII text; between 1× and 2× for text whose characters need 2–4 bytes in UTF-8)

Common Pitfalls

Confusing code points with grapheme clusters. Even in UTF-32, some visible "characters" are composed of multiple code points: a base character plus combining diacritical marks, or emoji with skin tone modifiers. UTF-32 gives you O(1) access to code points, not to perceived characters. 'é' in NFD form is two code points (U+0065 + U+0301) and thus 8 bytes in UTF-32, not 4.
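The 'é' case can be checked with the standard `unicodedata` module. A small sketch, with variable names chosen here for illustration:

```python
import unicodedata

composed = "\u00e9"                                   # é as a single code point
decomposed = unicodedata.normalize("NFD", composed)   # e + combining acute accent

print(len(composed), len(composed.encode("utf-32-le")))      # 1 4
print(len(decomposed), len(decomposed.encode("utf-32-le")))  # 2 8
print([f"U+{ord(c):04X}" for c in decomposed])               # ['U+0065', 'U+0301']
```

Both strings render as one visible character, yet indexing the decomposed form at position 0 yields a bare 'e'.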

Assuming UTF-32 is fastest. Random access is O(1), but UTF-32 takes 4× as much memory as UTF-8 for ASCII text, so caches hold less text and I/O takes longer. Many real-world workloads scan text sequentially rather than index into it, and for those UTF-8 can outperform UTF-32 despite its variable-length decoding.
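The size gap depends on the script, which a quick measurement makes concrete. This sketch compares encoded lengths only and says nothing about speed; the helper `size_ratio` and the sample strings are illustrative choices.

```python
def size_ratio(text: str) -> float:
    """Return UTF-32 size divided by UTF-8 size for the given text."""
    return len(text.encode("utf-32-le")) / len(text.encode("utf-8"))

samples = {
    "ASCII": "Hello, world",
    "CJK": "\u4e2d\u6587",
    "Emoji": "\U0001F600",
}
for label, text in samples.items():
    print(f"{label}: UTF-32 is {size_ratio(text):.2f}x the size of UTF-8")
```

ASCII text shows the full 4× penalty, while text made of code points above U+FFFF is the same size in both encodings.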

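Byte order ambiguity. A UTF-32 file without a BOM requires the reader to know the byte order in advance. The UTF-32 BOM is 00 00 FE FF (big endian) or FF FE 00 00 (little endian). The little-endian form begins with FF FE, which is the UTF-16 little-endian BOM, so a detector that checks only the first 2 bytes will misidentify the file as UTF-16. Compare all 4 bytes first.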
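One way to avoid the mix-up is to test the longer BOMs before the shorter ones. The sketch below uses the BOM constants from Python's standard `codecs` module; the function name `sniff_bom` and the returned encoding labels are choices made here, not part of any standard API.

```python
import codecs

# Longest BOMs first, so FF FE 00 00 is not mistaken for UTF-16 LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom(codecs.BOM_UTF32_LE + "A".encode("utf-32-le")))  # utf-32-le
print(sniff_bom(codecs.BOM_UTF16_LE + "A".encode("utf-16-le")))  # utf-16-le
```

A residual ambiguity remains: UTF-16 LE text whose first character after the BOM is U+0000 also starts with FF FE 00 00, so BOM sniffing is a heuristic rather than a proof.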

Other Encoding Terms

ASCII

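American Standard Code for Information Interchange. A 7-bit encoding of 128 characters (0–127) covering control characters, digits, Latin letters, and basic symbols.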

ASCII Art

Visual art created from text characters, originally limited to the 95 printable …

Base64

Binary-to-text encoding that represents binary data using 64 ASCII characters (A–Z, a–z, …

Big5

A Traditional Chinese character encoding used mainly in Taiwan and Hong Kong, encoding about 13,000 CJK characters.

EBCDIC

Extended Binary Coded Decimal Interchange Code. An IBM mainframe encoding with non-contiguous character ranges, still used on financial and enterprise mainframes.

EUC-KR

A Korean character encoding based on KS X 1001 that maps Hangul syllables and Hanja to 2-byte sequences.

GB2312 / GB18030

A family of Simplified Chinese character encodings: GB2312 (6,763 characters) evolved through GBK into GB18030, China's national standard, which is compatible with Unicode.

IANA Character Sets

The official registry of character encoding names maintained by IANA, used in HTTP Content-Type headers and MIME (e.g., charset=utf-8).

ISO 8859

A family of 8-bit single-byte encodings for different language groups. ISO 8859-1 (Latin-1) became the basis for the first 256 Unicode code points.

Shift JIS

A Japanese character encoding that combines single-byte ASCII/JIS Roman with double-byte JIS X 0208 kanji. Still used in legacy Japanese systems.
