UTF-16 is a variable-length Unicode encoding that uses one or two 16-bit code units (2 or 4 bytes) per character. It is used internally by Java, JavaScript, and Windows.

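UTF-32 is a fixed-length Unicode encoding that uses exactly 4 bytes per code point. It is simple but space-inefficient. CPython (Python 3.3 and later) stores a string at 4 bytes per code point only when it contains a code point above U+FFFF, and uses 1 or 2 bytes per code point otherwise.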

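What is a byte order mark (BOM)?

The code point U+FEFF placed at the start of a text stream to signal byte order and encoding. It is common in UTF-16 and UTF-32, where byte order would otherwise have to be declared by other means, but it is not strictly required there. In UTF-8 it is optional and not recommended.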

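What is character encoding?

A system that maps characters to byte sequences for digital storage and transmission. Every text file has an encoding, and what matters is whether that encoding is declared correctly.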

Encoding

UTF-8

A variable-length Unicode encoding that uses 1 to 4 bytes per character. It is the dominant encoding on the web (over 98%) and is fully backward-compatible with ASCII.

2021-02-08 · Updated 2024-08-22

What is UTF-8?

UTF-8 (Unicode Transformation Format — 8-bit) is a variable-length character encoding for Unicode. It represents each Unicode code point using 1 to 4 bytes, with a clever design that makes it fully backward-compatible with ASCII and self-synchronizing. As of 2024, over 98% of websites use UTF-8, making it the universal default for text on the internet.

UTF-8 was designed by Ken Thompson and Rob Pike in September 1992. The design goals were ambitious: encode all Unicode code points, maintain ASCII backward compatibility, be self-synchronizing (so you can determine character boundaries without reading from the start), and be space-efficient for Latin-script text.

How UTF-8 Works

The encoding uses a variable number of bytes based on the code point value:

Code Point Range	Bytes	Bit Pattern
U+0000–U+007F	1	`0xxxxxxx`
U+0080–U+07FF	2	`110xxxxx 10xxxxxx`
U+0800–U+FFFF	3	`1110xxxx 10xxxxxx 10xxxxxx`
U+10000–U+10FFFF	4	`11110xxx 10xxxxxx 10xxxxxx 10xxxxxx`

The leading bits of the first byte encode the byte count: 0 means 1 byte, 110 means 2 bytes, 1110 means 3 bytes, 11110 means 4 bytes. Continuation bytes always start with 10, making them immediately distinguishable from start bytes.
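As a rough illustration of the prefix rule, the sketch below reads the sequence length off a lead byte using bit masks. The helper name `utf8_sequence_length` is hypothetical, and the function only inspects the prefix bits; it does not validate the rest of the sequence.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the sequence length announced by a lead byte, or 0 if the byte cannot start a sequence."""
    if first_byte & 0b1000_0000 == 0b0000_0000:   # 0xxxxxxx
        return 1
    if first_byte & 0b1110_0000 == 0b1100_0000:   # 110xxxxx
        return 2
    if first_byte & 0b1111_0000 == 0b1110_0000:   # 1110xxxx
        return 3
    if first_byte & 0b1111_1000 == 0b1111_0000:   # 11110xxx
        return 4
    return 0  # 10xxxxxx (continuation byte) or 11111xxx (not used by the patterns above)


for ch in "Aé中𠀀":
    lead = ch.encode("utf-8")[0]
    print(f"{ch!r}: lead byte {lead:08b} -> {utf8_sequence_length(lead)} byte(s)")
```

A strict decoder would reject more than this sketch does: lead bytes such as 0xC0 and 0xC1 match the 2-byte prefix but never appear in well-formed UTF-8.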

Example: encoding U+00E9 (é, Latin small letter e with acute)

U+00E9 = 0xE9 = 233, which falls in the U+0080–U+07FF range (2 bytes).

Binary of 0xE9: 11101001

Pad to 11 bits (00011101001) and split into 5+6 bits: 00011 | 101001

Apply pattern 110xxxxx 10xxxxxx: 11000011 10101001 = 0xC3 0xA9

>>> 'é'.encode('utf-8')
b'\xc3\xa9'
>>> b'\xc3\xa9'.decode('utf-8')
'é'
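The same steps can be generalized to all four rows of the table. The following is a minimal sketch of a hand-written encoder, checked against Python's built-in codec; `encode_code_point` is a hypothetical name used only for illustration, and real code should simply call `str.encode('utf-8')`.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by filling in the bit patterns by hand."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare against the built-in codec for one character from each row of the table.
for ch in "Aé中𠀀":
    assert encode_code_point(ord(ch)) == ch.encode("utf-8")

print(encode_code_point(0xE9).hex(" "))  # c3 a9
```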

Self-Synchronization

One of UTF-8's most important properties is that you can determine character boundaries without reading from the start of the stream. Any byte of the form 10xxxxxx is a continuation byte; any other byte begins a new character. If you're dropped into the middle of a UTF-8 stream, you can scan forward (or backward, by at most three bytes) until you find a non-continuation byte and know you've found a character boundary.

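This property also enables robust error recovery: if a byte is corrupted, a decoder can resynchronize at the next start byte, so the damage stays confined to the affected character and at most its immediate neighbors instead of propagating through the rest of the stream.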
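A small sketch of both ideas, assuming a hypothetical helper named `next_boundary`: it skips continuation bytes to find where the next character starts, and then shows Python's decoder recovering after a single corrupted byte.

```python
def next_boundary(data: bytes, offset: int) -> int:
    """Return the index of the first byte at or after offset that is not a continuation byte."""
    while offset < len(data) and (data[offset] & 0b1100_0000) == 0b1000_0000:
        offset += 1
    return offset


data = "Hello, 世界!".encode("utf-8")

# Offset 8 falls inside the three bytes of 世 (offsets 7 to 9).
start = next_boundary(data, 8)
print(start)                          # 10
print(data[start:].decode("utf-8"))   # 界!

# Corrupt one byte inside 世 and decode leniently.
damaged = bytearray(data)
damaged[8] = 0xFF
recovered = bytes(damaged).decode("utf-8", errors="replace")
print(recovered.startswith("Hello, "), recovered.endswith("界!"))  # True True
```

The text before and after the damaged character decodes normally; only the bytes of the broken character are turned into replacement characters.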

Code Examples

# Python 3 — strings are Unicode by default
text = 'Hello, 世界!'
encoded = text.encode('utf-8')
print(encoded)
# b'Hello, \xe4\xb8\x96\xe7\x95\x8c!'

# ASCII characters stay as single bytes
print(len('A'.encode('utf-8')))    # 1
print(len('é'.encode('utf-8')))    # 2
print(len('中'.encode('utf-8')))    # 3
print(len('𠀀'.encode('utf-8')))   # 4 (rare CJK extension)

# Reading files: always declare encoding
with open('data.txt', 'r', encoding='utf-8') as f:
    content = f.read()

// Node.js: Buffer handles UTF-8 by default
const buf = Buffer.from('Hello, 世界!', 'utf-8');
console.log(buf.byteLength);  // 14 (8 ASCII bytes + 6 bytes for the two CJK characters)

// TextEncoder/TextDecoder in browser and Node.js
const encoder = new TextEncoder();  // always UTF-8
const bytes = encoder.encode('é');
console.log(bytes);  // Uint8Array [195, 169]

Quick Facts

Property	Value
Full Name	Unicode Transformation Format — 8-bit
Designed by	Ken Thompson, Rob Pike (1992)
Bytes per character	1–4
ASCII compatible	Yes (U+0000–U+007F identical)
Web adoption	~98% of websites (2024)
BOM	Optional (U+FEFF = `EF BB BF`), not recommended
Self-synchronizing	Yes
Standard	RFC 3629, Unicode Standard

Common Pitfalls

Confusing byte length with character length. In Python 3, len('中文') returns 2 (characters), but len('中文'.encode('utf-8')) returns 6 (bytes). Always know whether you're counting characters or bytes.
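The distinction matters most when a limit is expressed in bytes, such as a database column size. One possible approach, sketched below with a hypothetical `truncate_utf8` helper, is to cut the encoded bytes and then back up to the nearest character boundary so the result still decodes.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return the longest prefix of text whose UTF-8 encoding fits in max_bytes."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    if max_bytes <= 0:
        return ""
    cut = max_bytes
    # encoded[cut] is the first byte left out; if it is a continuation byte,
    # the cut would split a character, so move back to that character's start.
    while cut > 0 and (encoded[cut] & 0xC0) == 0x80:
        cut -= 1
    return encoded[:cut].decode("utf-8")


print(truncate_utf8("中文", 4))   # 中
print(truncate_utf8("中文", 6))   # 中文
```

This keeps code points intact but can still separate a base letter from a following combining mark, which is the subject of the next pitfall.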

Assuming one character = one code point. Some characters are made of multiple code points combined (e.g., a base letter + combining diacritical mark). len('é') can be 1 or 2 depending on whether it's NFC or NFD normalized. This is a Unicode normalization issue, not a UTF-8 issue per se.
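The difference is easy to reproduce with the standard library's `unicodedata` module. The snippet below builds both forms of é explicitly with escape sequences, so the result does not depend on how the source file itself was normalized.

```python
import unicodedata

composed = "\u00e9"      # é as a single code point (NFC form)
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT (NFD form)

print(len(composed), len(decomposed))   # 1 2
print(composed == decomposed)           # False

# Normalizing both sides to the same form makes them compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True

# The two forms also differ in UTF-8 byte length.
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))   # 2 3
```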

Opening UTF-8 files without specifying encoding. On Windows, open('file.txt', 'r') defaults to the system code page (often Windows-1252). Always pass encoding='utf-8' explicitly.
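To see what goes wrong, the sketch below decodes UTF-8 bytes as Windows-1252 (`cp1252` in Python), which is what effectively happens when a UTF-8 file is read under that default. It then round-trips the same text through a temporary file with the encoding stated on both sides; the file name is arbitrary.

```python
import tempfile
from pathlib import Path

text = "café"
raw = text.encode("utf-8")

# Wrong codec: each byte of the 2-byte sequence for é becomes its own character.
print(raw.decode("cp1252"))   # cafÃ©
print(raw.decode("utf-8"))    # café

# Stating the encoding on write and read gives the same result on every platform.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.txt"
    path.write_text(text, encoding="utf-8")
    round_trip = path.read_text(encoding="utf-8")

print(round_trip == text)     # True
```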

The "UTF-8 with BOM" variant. Windows Notepad historically saved UTF-8 files with a 3-byte BOM (EF BB BF). Tools that do not expect a BOM may treat those bytes as content, so they can end up at the start of the first line or field. Prefer UTF-8 without BOM for interchange.
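When input files may or may not carry a BOM, Python's `utf-8-sig` codec is one way to read them: it strips a leading BOM if present and otherwise behaves like `utf-8`. A short sketch, using a made-up CSV header as sample data:

```python
import codecs

with_bom = codecs.BOM_UTF8 + "id,name".encode("utf-8")
without_bom = "id,name".encode("utf-8")

# The plain codec keeps the BOM as an invisible U+FEFF in the first field.
print(repr(with_bom.decode("utf-8")))         # '\ufeffid,name'

# utf-8-sig removes a leading BOM and accepts input without one.
print(repr(with_bom.decode("utf-8-sig")))     # 'id,name'
print(repr(without_bom.decode("utf-8-sig")))  # 'id,name'
```

The same codec name works with `open(..., encoding='utf-8-sig')`. Note that writing with it adds a BOM, so it is best kept to the reading side.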

More terms in Encoding

ASCII

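American Standard Code for Information Interchange. A 7-bit encoding covering 128 characters (0–127), including control characters, digits, Latin letters, and basic symbols.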

ASCII Art

Visual art created from text characters, originally limited to the 95 printable …

Base64

Binary-to-text encoding that represents binary data using 64 ASCII characters (A–Z, a–z, …

Big5

A Traditional Chinese character encoding used mainly in Taiwan and Hong Kong, encoding about 13,000 CJK characters.

EBCDIC

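Extended Binary Coded Decimal Interchange Code. An IBM mainframe encoding whose character ranges are not contiguous, still used on financial and enterprise mainframes.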

EUC-KR

A Korean character encoding based on KS X 1001 that maps Hangul syllables and Hanja to 2-byte sequences.

GB2312 / GB18030

A family of Simplified Chinese character encodings: GB2312 (6,763 characters) evolved through GBK into GB18030, China's national standard, which is compatible with Unicode.

IANA Character Sets

The official registry of character encoding names maintained by IANA, used in HTTP Content-Type headers and MIME (e.g., charset=utf-8).

ISO 8859

A family of 8-bit single-byte encodings for different language groups. ISO 8859-1 (Latin-1) became the basis for the first 256 Unicode code points.

Shift JIS

A Japanese character encoding that combines single-byte ASCII/JIS Roman with double-byte JIS X 0208 kanji. It is still used in legacy Japanese systems.
