Character Encoding
A system that maps characters to sequences of bytes for digital storage and transmission. Every text file has an encoding, and whether that encoding is correctly declared matters.
What is Character Encoding?
Character encoding is the system by which abstract characters (letters, digits, symbols, ideographs) are mapped to numeric values that computers can store and process. Every piece of text in a computer — every email, web page, source code file, database record — exists as a sequence of bytes, and a character encoding is the specification that says which byte patterns correspond to which characters.
Without an agreed-upon character encoding, there is no text: only meaningless bytes. The history of character encoding is a history of different communities independently building their own mappings, the problems that caused when those systems met, and ultimately the creation of Unicode to provide a single universal system.
The Three-Layer Model
Understanding character encoding requires separating three distinct concepts that are often conflated:
1. Character repertoire (the "what"): The set of abstract characters that the system can represent. Unicode's repertoire includes all human writing systems — over 149,000 characters as of Unicode 15.1. ASCII's repertoire is 128 characters.
2. Coded character set (the "which number"): An assignment of a unique number (code point) to each character in the repertoire. In Unicode, the letter 'A' is always U+0041 regardless of how it is stored. In ASCII, 'A' is 65. These numbers are abstract — they do not specify how bytes are arranged in memory.
3. Character encoding form / transfer encoding (the "how stored"): The specification for how code point numbers are serialized into bytes. For Unicode code point U+0041, UTF-8 stores it as a single byte 0x41, UTF-16 stores it as two bytes 0x41 0x00 (LE) or 0x00 0x41 (BE), and UTF-32 stores it as four bytes 0x41 0x00 0x00 0x00 (LE).
For pre-Unicode single-byte encodings like ASCII or ISO 8859-1, the code point number and the byte value are the same, so the distinction collapses. For Unicode with multiple transfer encodings (UTF-8, UTF-16, UTF-32), the distinction is crucial.
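The split between the code point and its encoding form can be observed directly in Python, where str.encode serializes the same code point into the different byte layouts described above (a small sketch using only built-in codecs):

```python
# One abstract character, one code point, three byte serializations
ch = 'A'
print(hex(ord(ch)))                  # code point: 0x41 (U+0041)
print(ch.encode('utf-8').hex())      # '41'       (1 byte)
print(ch.encode('utf-16-le').hex())  # '4100'     (2 bytes, little-endian)
print(ch.encode('utf-32-le').hex())  # '41000000' (4 bytes, little-endian)
```

The code point 0x41 never changes; only its serialized byte layout does.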
Why Encoding Declaration Matters
Text has no intrinsic meaning without its encoding. The byte sequence 0x63 0x61 0x66 0xE9 could mean:
- café (if decoded as ISO 8859-1 or Windows-1252)
- a decoding error (if decoded as UTF-8, because 0xE9 begins a three-byte sequence but no continuation bytes follow)
- different garbage entirely (if decoded as Shift-JIS or EUC-KR)
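This ambiguity can be reproduced directly in Python by decoding those four bytes under different declared encodings (a minimal sketch using only standard-library codecs):

```python
data = bytes([0x63, 0x61, 0x66, 0xE9])

print(data.decode('latin-1'))   # 'café': every byte is a valid Latin-1 character
print(data.decode('cp1252'))    # 'café': Windows-1252 agrees for these bytes

try:
    # 0xE9 opens a three-byte UTF-8 sequence, but no continuation bytes follow
    data.decode('utf-8')
except UnicodeDecodeError as e:
    print('UTF-8 decode failed:', e.reason)
```

Identical bytes, three different outcomes, depending entirely on the declared encoding.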
This is why encoding declarations are required in HTML (<meta charset="utf-8">), HTTP headers (Content-Type: text/html; charset=utf-8), XML (<?xml version="1.0" encoding="utf-8"?>), and Python source files (# -*- coding: utf-8 -*-, though Python 3 defaults to UTF-8).
Code Examples
# The same string stored differently by encoding
text = 'café'
encodings = ['ascii', 'latin-1', 'utf-8', 'utf-16-le', 'utf-32-le']
for enc in encodings:
    try:
        encoded = text.encode(enc)
        print(f'{enc:12}: {encoded.hex()} ({len(encoded)} bytes)')
    except UnicodeEncodeError as e:
        print(f'{enc:12}: ERROR — {e}')
# ascii       : ERROR — 'ascii' codec can't encode character '\xe9' (é is not ASCII)
# latin-1     : 636166e9 (4 bytes)
# utf-8       : 636166c3a9 (5 bytes — é becomes 2 bytes)
# utf-16-le   : 630061006600e900 (8 bytes — 2 bytes per char)
# utf-32-le   : 630000006100000066000000e9000000 (16 bytes — 4 bytes per char)
<!-- HTML: always declare encoding in the first 1024 bytes -->
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"> <!-- Browser uses this to decode the rest of the page -->
<title>My Page</title>
</head>
The Encoding Detection Problem
When encoding is not declared, software must guess. Encoding detection (charset detection) is an imperfect science:
- BOM detection: Byte Order Marks are reliable when present.
- Statistical analysis: Libraries like chardet analyze byte frequency distributions and multi-byte patterns to guess encodings. They work well for large samples of natural-language text but can be fooled by short strings or unusual content.
- Context heuristics: A file served by a Japanese web server with a Japanese domain name is likely Shift-JIS or UTF-8; a file from a Russian mail server is likely KOI8-R or Windows-1251.
The best practice is always to declare the encoding explicitly and never rely on detection.
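BOM detection, the one reliable branch of this process, is simple enough to sketch with the standard library's codecs constants. The detect_bom helper below is illustrative only, not a standard-library function, and it is far from a full detector (statistical libraries like chardet go much further):

```python
import codecs

def detect_bom(data):
    """Return the encoding implied by a leading Byte Order Mark, or None."""
    # Order matters: the UTF-32-LE BOM begins with the UTF-16-LE BOM bytes,
    # so the longer patterns must be checked first.
    boms = [
        (codecs.BOM_UTF32_LE, 'utf-32-le'),
        (codecs.BOM_UTF32_BE, 'utf-32-be'),
        (codecs.BOM_UTF8,     'utf-8-sig'),   # the codec that strips the BOM
        (codecs.BOM_UTF16_LE, 'utf-16-le'),
        (codecs.BOM_UTF16_BE, 'utf-16-be'),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    return None

print(detect_bom(b'\xef\xbb\xbfhello'))                  # utf-8-sig
print(detect_bom(codecs.BOM_UTF16_LE + b'h\x00i\x00'))   # utf-16-le
print(detect_bom(b'hello'))                              # None: no BOM, must guess
```

The None case is exactly where statistical and contextual heuristics take over.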
Quick Facts
| Property | Value |
|---|---|
| Synonym | Charset, codepage, text encoding |
| Key components | Repertoire, coded character set, encoding form |
| Universal standard | Unicode (with UTF-8/16/32 encoding forms) |
| Web default | UTF-8 (WHATWG Encoding Standard; RFC 8259 mandates it for JSON) |
| Python 3 default | UTF-8 for source files; file I/O uses the locale encoding unless UTF-8 mode is enabled |
| Detection library | chardet, charset-normalizer (Python) |
Common Pitfalls
Conflating "Unicode" with "UTF-8." Unicode is a coded character set (assigning code points to characters). UTF-8 is one encoding form for those code points. A string can be Unicode but stored as UTF-16 or UTF-32 — it is still "Unicode." Saying "encode this string as Unicode" is ambiguous; saying "encode as UTF-8" is precise.
Assuming text files have a consistent encoding. A file might be mostly UTF-8 but contain a few Windows-1252 bytes embedded by a text editor that mixed encodings. The Python errors parameter (errors='replace', errors='ignore', errors='surrogateescape') controls how to handle such mixed content.
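How each errors mode behaves on such mixed content can be shown in a few lines (standard library only; here the byte 0xE9 plays the stray Windows-1252 é):

```python
# A mostly-UTF-8 byte string with one stray Windows-1252 byte (0xE9 is é)
mixed = b'caf\xe9 latte'

print(mixed.decode('utf-8', errors='replace'))  # 'caf\ufffd latte' (U+FFFD marks the bad byte)
print(mixed.decode('utf-8', errors='ignore'))   # 'caf latte' (bad byte silently dropped)

# surrogateescape smuggles the raw byte through as the surrogate U+DCE9
# and restores it exactly on re-encoding, so unknown data round-trips
text = mixed.decode('utf-8', errors='surrogateescape')
assert text.encode('utf-8', errors='surrogateescape') == mixed
```

errors='replace' preserves string length at the cost of information; errors='ignore' silently loses data; surrogateescape is the only lossless option of the three.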
The Python 2 vs. 3 transition. Python 2 str was bytes; Python 3 str is Unicode. The most common Python 2→3 migration bug is forgetting to handle encoding explicitly when reading files or making HTTP requests.