
UTF-16

A variable-length Unicode encoding that uses one or two 16-bit code units (2 or 4 bytes). Used internally by Java, JavaScript, and Windows.


What is UTF-16?

UTF-16 (Unicode Transformation Format — 16-bit) is a variable-length Unicode encoding that uses either one or two 16-bit code units (2 or 4 bytes) per character. It is the internal string representation used by Java, JavaScript, Windows (Win32 API), macOS Core Foundation, and .NET, making it the dominant encoding in application-level code even though UTF-8 has won the web.

UTF-16 evolved from UCS-2, an earlier fixed-width 2-byte encoding that could only represent the Basic Multilingual Plane (BMP, U+0000–U+FFFF). When Unicode expanded beyond the BMP to include emoji, rare CJK extensions, and historic scripts, UTF-16 extended UCS-2 with surrogate pairs to cover the full range up to U+10FFFF.

How UTF-16 Works

BMP characters (U+0000–U+FFFF): Encoded as a single 16-bit code unit whose value equals the code point. U+0041 ('A') encodes as 0x0041, U+4E2D ('中') encodes as 0x4E2D.

Supplementary characters (U+10000–U+10FFFF): Encoded as a surrogate pair — two 16-bit code units. The first is a high surrogate (U+D800–U+DBFF), the second is a low surrogate (U+DC00–U+DFFF).
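The size difference is easy to observe from the encoded byte length; a quick Python check:

```python
# BMP characters occupy one 16-bit code unit; supplementary characters
# require a surrogate pair (two code units).
assert len('A'.encode('utf-16-le')) == 2   # U+0041, BMP: 1 code unit
assert len('中'.encode('utf-16-le')) == 2  # U+4E2D, BMP: 1 code unit
assert len('😀'.encode('utf-16-le')) == 4  # U+1F600, supplementary: 2 code units
```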

Surrogate pair formula for code point C (C ≥ U+10000):

  • Subtract 0x10000 to get a 20-bit value V
  • High surrogate: 0xD800 + (V >> 10)
  • Low surrogate: 0xDC00 + (V & 0x3FF)

Example: encoding U+1F600 (😀 GRINNING FACE)

V = 0x1F600 - 0x10000 = 0xF600

High surrogate: 0xD800 + (0xF600 >> 10) = 0xD800 + 0x3D = 0xD83D

Low surrogate: 0xDC00 + (0xF600 & 0x3FF) = 0xDC00 + 0x200 = 0xDE00

>>> '😀'.encode('utf-16-le')
b'=\xd8\x00\xde'
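The formula is straightforward to implement in both directions. A minimal sketch (the function names are illustrative, not from any standard library):

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point (>= 0x10000) into a surrogate pair."""
    v = cp - 0x10000                              # 20-bit value
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a high/low surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# U+1F600 (GRINNING FACE) round-trips through the pair from the example above.
assert to_surrogate_pair(0x1F600) == (0xD83D, 0xDE00)
assert from_surrogate_pair(0xD83D, 0xDE00) == 0x1F600
```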

Byte Order and BOM

UTF-16 comes in two flavors based on byte order (endianness):

  • UTF-16 BE (Big Endian): high byte first. U+0041 → 00 41
  • UTF-16 LE (Little Endian): low byte first. U+0041 → 41 00

A file using UTF-16 without a BOM requires the reader to know the byte order in advance. The Byte Order Mark (U+FEFF) at the start of a file signals the encoding: FE FF means big endian, FF FE means little endian.
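BOM detection amounts to checking the first two bytes. A minimal sketch (the function name is illustrative; the big-endian fallback for BOM-less data follows RFC 2781's recommendation for the bare "UTF-16" label):

```python
def detect_utf16_byte_order(data: bytes) -> str:
    """Guess the UTF-16 byte order from a leading BOM (U+FEFF).

    Returns a Python codec name. Defaults to big endian when no BOM
    is present, per RFC 2781's recommendation.
    """
    if data[:2] == b'\xfe\xff':
        return 'utf-16-be'
    if data[:2] == b'\xff\xfe':
        return 'utf-16-le'
    return 'utf-16-be'

payload = 'A'.encode('utf-16')            # BOM + native-order code units
codec = detect_utf16_byte_order(payload)
assert payload[2:].decode(codec) == 'A'   # strip the 2-byte BOM, then decode
```

Python's plain 'utf-16' codec performs the same BOM handling automatically on decode; an explicit check like this is only needed when working with raw byte streams.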

Code Examples

text = 'Hello, 😀'

# UTF-16 with BOM (platform default byte order)
utf16_bom = text.encode('utf-16')
print(utf16_bom[:2])  # b'\xff\xfe'  (LE BOM on x86)

# Explicit byte order
utf16_le = text.encode('utf-16-le')
utf16_be = text.encode('utf-16-be')

# Character count vs code unit count
s = '😀'
print(len(s))               # 1 — Python counts Unicode code points
print(len(s.encode('utf-16-le')) // 2)  # 2 — two 16-bit code units

// JavaScript strings are UTF-16 internally
const emoji = '😀';
console.log(emoji.length);          // 2  (two UTF-16 code units!)
console.log([...emoji].length);     // 1  (one Unicode code point)
console.log(emoji.codePointAt(0));  // 128512 (0x1F600) — correct
console.log(emoji.charCodeAt(0));   // 55357 (0xD83D) — high surrogate only

Quick Facts

Property             Value
Full name            Unicode Transformation Format — 16-bit
Code unit size       16 bits (2 bytes)
Bytes per character  2 (BMP) or 4 (supplementary)
BOM                  FE FF (big endian) or FF FE (little endian)
Used by              Java, JavaScript, Windows, .NET, macOS
Max code point       U+10FFFF
Predecessor          UCS-2

Common Pitfalls

The JavaScript length trap. '😀'.length === 2 in JavaScript because .length counts UTF-16 code units, not characters. Use spread ([...str].length) or Array.from(str).length to count actual characters. This affects string slicing too: '😀'[0] returns the lone high surrogate, which is not a valid character.

Lone surrogates. A high surrogate without a following low surrogate (or vice versa) is an ill-formed UTF-16 sequence. This can arise from JavaScript's charCodeAt() / String.fromCharCode() used with emoji, or from incorrectly slicing UTF-16 strings.
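Python's strict UTF-16 codec refuses to produce or accept ill-formed sequences; the 'surrogatepass' error handler (which does support the utf-16 codecs) can force a lone surrogate through, producing bytes that a strict decoder then rejects:

```python
lone_high = '\ud83d'   # high surrogate with no following low surrogate

# Strict encoding rejects the ill-formed sequence.
try:
    lone_high.encode('utf-16-le')
except UnicodeEncodeError:
    print('lone surrogate rejected on encode')

# 'surrogatepass' forces it through as a raw code unit...
ill_formed = lone_high.encode('utf-16-le', 'surrogatepass')
assert ill_formed == b'\x3d\xd8'

# ...but a strict decoder refuses the resulting bytes.
try:
    ill_formed.decode('utf-16-le')
except UnicodeDecodeError:
    print('ill-formed sequence rejected on decode')
```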

Windows paths and UTF-16. Windows APIs use UTF-16LE internally. Python's os.path and pathlib handle this transparently, but programs that call the Win32 API via ctypes must pass wide strings (wchar_t *, e.g. ctypes.c_wchar_p) explicitly.

Confusing UTF-16 with UCS-2. UCS-2 cannot represent supplementary characters at all — it has no surrogate mechanism. A UCS-2 decoder encountering surrogate code units produces garbage rather than emoji.
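You can simulate a UCS-2-style reader in Python by unpacking the raw 16-bit code units instead of decoding with the UTF-16 codec:

```python
import struct

data = '😀'.encode('utf-16-le')           # b'=\xd8\x00\xde'

# A UTF-16 decoder recombines the surrogate pair into one code point.
assert data.decode('utf-16-le') == '😀'

# A UCS-2-style reader sees two separate 16-bit values, both falling in
# the reserved surrogate range, where no standalone character is assigned.
units = struct.unpack('<2H', data)
assert units == (0xD83D, 0xDE00)
assert all(0xD800 <= u <= 0xDFFF for u in units)
```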

Related Terms

Other terms in Encoding

ASCII

American Standard Code for Information Interchange. A 7-bit encoding covering 128 characters (0–127), including control characters, digits, Latin letters, and basic symbols.

ASCII Art

Visual art created from text characters, originally limited to the 95 printable …

Base64

Binary-to-text encoding that represents binary data using 64 ASCII characters (A–Z, a–z, …

Big5

A Traditional Chinese character encoding used mainly in Taiwan and Hong Kong, encoding roughly 13,000 CJK characters.

EBCDIC

Extended Binary Coded Decimal Interchange Code. An IBM mainframe encoding with non-contiguous letter ranges, still used on financial and enterprise mainframes.

EUC-KR

A Korean character encoding based on KS X 1001, mapping Hangul syllables and Hanja to two-byte sequences.

GB2312 / GB18030

A family of Simplified Chinese character encodings: GB2312 (6,763 characters) evolved through GBK into GB18030, China's national standard with full Unicode compatibility.

IANA Character Sets

The official registry of character encoding names maintained by IANA, used in HTTP Content-Type headers and MIME (e.g. charset=utf-8).

ISO 8859

A family of 8-bit single-byte encodings for different language groups. ISO 8859-1 (Latin-1) became the basis of Unicode's first 256 code points.

Shift JIS

A Japanese character encoding combining single-byte ASCII/JIS-Roman with double-byte JIS X 0208 kanji. Still used on legacy Japanese systems.