
UTF-16

A variable-length Unicode encoding that uses one or two 16-bit code units (2 or 4 bytes). Used internally by Java, JavaScript, and Windows.


What is UTF-16?

UTF-16 (Unicode Transformation Format — 16-bit) is a variable-length Unicode encoding that uses either one or two 16-bit code units (2 or 4 bytes) per character. It is the internal string representation used by Java, JavaScript, Windows (Win32 API), macOS Core Foundation, and .NET, making it the dominant encoding in application-level code even though UTF-8 has won the web.

UTF-16 evolved from UCS-2, an earlier fixed-width 2-byte encoding that could only represent the Basic Multilingual Plane (BMP, U+0000–U+FFFF). When Unicode expanded beyond the BMP to include emoji, rare CJK extensions, and historic scripts, UTF-16 extended UCS-2 with surrogate pairs to cover the full range up to U+10FFFF.

How UTF-16 Works

BMP characters (U+0000–U+FFFF): Encoded as a single 16-bit code unit whose value equals the code point. U+0041 ('A') encodes as 0x0041, U+4E2D ('中') encodes as 0x4E2D.

Supplementary characters (U+10000–U+10FFFF): Encoded as a surrogate pair — two 16-bit code units. The first is a high surrogate (U+D800–U+DBFF), the second is a low surrogate (U+DC00–U+DFFF).
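The size difference is easy to observe from the encoded byte length; a quick Python check:

```python
# BMP characters occupy one 16-bit code unit; supplementary characters
# require a surrogate pair (two code units).
assert len('A'.encode('utf-16-le')) == 2   # U+0041, BMP: 1 code unit
assert len('中'.encode('utf-16-le')) == 2  # U+4E2D, BMP: 1 code unit
assert len('😀'.encode('utf-16-le')) == 4  # U+1F600, supplementary: 2 code units
```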

Surrogate pair formula for code point C (C ≥ U+10000):

  • Subtract 0x10000 to get a 20-bit value V
  • High surrogate: 0xD800 + (V >> 10)
  • Low surrogate: 0xDC00 + (V & 0x3FF)

Example: encoding U+1F600 (😀 GRINNING FACE)

V = 0x1F600 - 0x10000 = 0xF600

High surrogate: 0xD800 + (0xF600 >> 10) = 0xD800 + 0x3D = 0xD83D

Low surrogate: 0xDC00 + (0xF600 & 0x3FF) = 0xDC00 + 0x200 = 0xDE00

>>> '😀'.encode('utf-16-le')
b'=\xd8\x00\xde'
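The formula is straightforward to implement in both directions. A minimal sketch (the function names are illustrative, not from any standard library):

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point (>= 0x10000) into a surrogate pair."""
    v = cp - 0x10000                              # 20-bit value
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a high/low surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# U+1F600 (GRINNING FACE) round-trips through the pair from the example above.
assert to_surrogate_pair(0x1F600) == (0xD83D, 0xDE00)
assert from_surrogate_pair(0xD83D, 0xDE00) == 0x1F600
```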

Byte Order and BOM

UTF-16 comes in two flavors based on byte order (endianness):

  • UTF-16 BE (Big Endian): high byte first. U+0041 → 00 41
  • UTF-16 LE (Little Endian): low byte first. U+0041 → 41 00

A file using UTF-16 without a BOM requires the reader to know the byte order in advance. The Byte Order Mark (U+FEFF) at the start of a file signals the encoding: FE FF means big endian, FF FE means little endian.
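BOM detection amounts to checking the first two bytes. A minimal sketch (the function name is illustrative; the big-endian fallback for BOM-less data follows RFC 2781's recommendation for the bare "UTF-16" label):

```python
def detect_utf16_byte_order(data: bytes) -> str:
    """Guess the UTF-16 byte order from a leading BOM (U+FEFF).

    Returns a Python codec name. Defaults to big endian when no BOM
    is present, per RFC 2781's recommendation.
    """
    if data[:2] == b'\xfe\xff':
        return 'utf-16-be'
    if data[:2] == b'\xff\xfe':
        return 'utf-16-le'
    return 'utf-16-be'

payload = 'A'.encode('utf-16')            # BOM + native-order code units
codec = detect_utf16_byte_order(payload)
assert payload[2:].decode(codec) == 'A'   # strip the 2-byte BOM, then decode
```

Python's plain 'utf-16' codec performs the same BOM handling automatically on decode; an explicit check like this is only needed when working with raw byte streams.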

Code Examples

text = 'Hello, 😀'

# UTF-16 with BOM (platform default byte order)
utf16_bom = text.encode('utf-16')
print(utf16_bom[:2])  # b'\xff\xfe'  (LE BOM on x86)

# Explicit byte order
utf16_le = text.encode('utf-16-le')
utf16_be = text.encode('utf-16-be')

# Character count vs code unit count
s = '😀'
print(len(s))               # 1 — Python counts Unicode code points
print(len(s.encode('utf-16-le')) // 2)  # 2 — two 16-bit code units

// JavaScript strings are UTF-16 internally
const emoji = '😀';
console.log(emoji.length);          // 2  (two UTF-16 code units!)
console.log([...emoji].length);     // 1  (one Unicode code point)
console.log(emoji.codePointAt(0));  // 128512 (0x1F600) — correct
console.log(emoji.charCodeAt(0));   // 55357 (0xD83D) — high surrogate only

Quick Facts

Property             Value
Full name            Unicode Transformation Format — 16-bit
Code unit size       16 bits (2 bytes)
Bytes per character  2 (BMP) or 4 (supplementary)
BOM                  FE FF (big endian) or FF FE (little endian)
Used by              Java, JavaScript, Windows, .NET, macOS
Max code point       U+10FFFF
Predecessor          UCS-2

Common Pitfalls

The JavaScript length trap. '😀'.length === 2 in JavaScript because .length counts UTF-16 code units, not characters. Use spread ([...str].length) or Array.from(str).length to count actual characters. This affects string slicing too: '😀'[0] returns the lone high surrogate, which is not a valid character.

Lone surrogates. A high surrogate without a following low surrogate (or vice versa) is an ill-formed UTF-16 sequence. This can arise from JavaScript's charCodeAt() / String.fromCharCode() used with emoji, or from incorrectly slicing UTF-16 strings.
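Python's strict UTF-16 codec refuses to produce or accept ill-formed sequences; the 'surrogatepass' error handler (which does support the utf-16 codecs) can force a lone surrogate through, producing bytes that a strict decoder then rejects:

```python
lone_high = '\ud83d'   # high surrogate with no following low surrogate

# Strict encoding rejects the ill-formed sequence.
try:
    lone_high.encode('utf-16-le')
except UnicodeEncodeError:
    print('lone surrogate rejected on encode')

# 'surrogatepass' forces it through as a raw code unit...
ill_formed = lone_high.encode('utf-16-le', 'surrogatepass')
assert ill_formed == b'\x3d\xd8'

# ...but a strict decoder refuses the resulting bytes.
try:
    ill_formed.decode('utf-16-le')
except UnicodeDecodeError:
    print('ill-formed sequence rejected on decode')
```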

Windows paths and UTF-16. Windows APIs use UTF-16LE internally. Python's os.path and pathlib handle this transparently, but programs that call the Win32 API via ctypes must pass wide strings (wchar_t *, e.g. ctypes.c_wchar_p) explicitly.

Confusing UTF-16 with UCS-2. UCS-2 cannot represent supplementary characters at all — it has no surrogate mechanism. A UCS-2 decoder encountering surrogate code units produces garbage rather than emoji.
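You can simulate a UCS-2-style reader in Python by unpacking the raw 16-bit code units instead of decoding with the UTF-16 codec:

```python
import struct

data = '😀'.encode('utf-16-le')           # b'=\xd8\x00\xde'

# A UTF-16 decoder recombines the surrogate pair into one code point.
assert data.decode('utf-16-le') == '😀'

# A UCS-2-style reader sees two separate 16-bit values, both falling in
# the reserved surrogate range, where no standalone character is assigned.
units = struct.unpack('<2H', data)
assert units == (0xD83D, 0xDE00)
assert all(0xD800 <= u <= 0xDFFF for u in units)
```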

Related Terms

Other terms in Encoding

ASCII

American Standard Code for Information Interchange. A 7-bit encoding covering 128 characters (0–127), including control characters, digits, Latin letters, and basic symbols.

ASCII Art

Visual art created from text characters, originally limited to the 95 printable …

Base64

Binary-to-text encoding that represents binary data using 64 ASCII characters (A–Z, a–z, …

Big5

A Traditional Chinese character encoding used mainly in Taiwan and Hong Kong, encoding roughly 13,000 CJK characters.

EBCDIC

Extended Binary Coded Decimal Interchange Code. An IBM mainframe encoding with non-contiguous letter ranges, still used on financial and enterprise mainframes.

EUC-KR

A Korean character encoding based on KS X 1001, mapping Hangul syllables and Hanja to two-byte sequences.

GB2312 / GB18030

A family of Simplified Chinese character encodings: GB2312 (6,763 characters) evolved through GBK into GB18030, China's national standard with full Unicode compatibility.

IANA Character Sets

The official registry of character encoding names maintained by IANA, used in HTTP Content-Type headers and MIME (e.g. charset=utf-8).

ISO 8859

A family of 8-bit single-byte encodings for different language groups. ISO 8859-1 (Latin-1) became the basis of Unicode's first 256 code points.

Shift JIS

A Japanese character encoding combining single-byte ASCII/JIS-Roman with double-byte JIS X 0208 kanji. Still used on legacy Japanese systems.