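UTF-8: a variable-length Unicode encoding that uses 1 to 4 bytes per character. It is the dominant encoding on the web (over 98% of websites) and is fully backward compatible with ASCII.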

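UTF-32: a fixed-length Unicode encoding that uses exactly 4 bytes per character. It is simple but space-inefficient. CPython 3.3 and later uses a 4-byte-per-character layout internally only for strings that contain a character above U+FFFF.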

What is a surrogate pair?

Two 16-bit code units used together to encode a supplementary character in UTF-16: a high surrogate (U+D800–U+DBFF) followed by a low surrogate (U+DC00–U+DFFF). For example, 😀 = D83D DE00.

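What is a byte order mark (BOM)?

The code point U+FEFF, placed at the start of a text stream to indicate byte order and encoding. It is optional in UTF-16 and UTF-32, where it tells the reader which byte order is in use; in UTF-8 it is permitted but not recommended.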

What is a code unit?

The smallest unit of an encoding: an 8-bit byte in UTF-8, a 16-bit word in UTF-16, and a 32-bit word in UTF-32. A single character may require more than one code unit.

Encoding

UTF-16

A variable-length Unicode encoding that uses one or two 16-bit code units (2 or 4 bytes) per character. Used internally by Java, JavaScript, and Windows.

2021-02-15 · Updated 2024-05-10

What is UTF-16?

UTF-16 (Unicode Transformation Format, 16-bit) is a variable-length Unicode encoding that uses either one or two 16-bit code units (2 or 4 bytes) per character. It is the internal string representation used by Java, JavaScript, Windows (Win32 API), macOS Core Foundation, and .NET, which keeps it widespread inside application runtimes even though UTF-8 has won the web.

UTF-16 evolved from UCS-2, an earlier fixed-width 2-byte encoding that could only represent the Basic Multilingual Plane (BMP, U+0000–U+FFFF). When Unicode expanded beyond the BMP to include emoji, rare CJK extensions, and historic scripts, UTF-16 extended UCS-2 with surrogate pairs to cover the full range up to U+10FFFF.

How UTF-16 Works

BMP characters (U+0000–U+FFFF, excluding the surrogate range U+D800–U+DFFF): Encoded as a single 16-bit code unit whose value equals the code point. U+0041 ('A') encodes as 0x0041, U+4E2D ('中') encodes as 0x4E2D.

Supplementary characters (U+10000–U+10FFFF): Encoded as a surrogate pair — two 16-bit code units. The first is a high surrogate (U+D800–U+DBFF), the second is a low surrogate (U+DC00–U+DFFF).

Surrogate pair formula for code point C (C ≥ U+10000):
- Subtract 0x10000 from C to get a 20-bit value V
- High surrogate: 0xD800 + (V >> 10)
- Low surrogate: 0xDC00 + (V & 0x3FF)

Example: encoding U+1F600 (😀 GRINNING FACE)

V = 0x1F600 - 0x10000 = 0xF600

High surrogate: 0xD800 + (0xF600 >> 10) = 0xD800 + 0x3D = 0xD83D

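Low surrogate: 0xDC00 + (0xF600 & 0x3FF) = 0xDC00 + 0x200 = 0xDE00

The pair is therefore D83D DE00. In little-endian byte order it is stored as 3D D8 00 DE, and Python prints the byte 0x3D as the ASCII character '=':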

>>> '😀'.encode('utf-16-le')
b'=\xd8\x00\xde'
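The formula above can be sketched as a pair of small Python helpers. The function names `to_surrogate_pair` and `from_surrogate_pair` are hypothetical, chosen for illustration only; they are not part of any standard library.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("surrogate pairs only encode U+10000..U+10FFFF")
    v = code_point - 0x10000       # 20-bit value V
    high = 0xD800 + (v >> 10)      # top 10 bits of V
    low = 0xDC00 + (v & 0x3FF)     # bottom 10 bits of V
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")               # D83D DE00
print(hex(from_surrogate_pair(high, low)))   # 0x1f600
```

In practice Python's built-in codecs do this work; the sketch only makes the bit arithmetic visible.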

Byte Order and BOM

UTF-16 comes in two flavors based on byte order (endianness):

UTF-16 BE (Big Endian): high byte first. U+0041 → 00 41
UTF-16 LE (Little Endian): low byte first. U+0041 → 41 00

A file using UTF-16 without a BOM requires the reader to know the byte order in advance. The Byte Order Mark (U+FEFF) at the start of a file signals the encoding: FE FF means big endian, FF FE means little endian.
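As a minimal sketch of BOM sniffing, the following Python helper checks the first two bytes against the constants in the standard `codecs` module. The function name `detect_utf16_bom` is an assumption made for this example.

```python
import codecs


def detect_utf16_bom(data: bytes):
    """Return (codec_name, bom_length) for a leading UTF-16 BOM, else (None, 0)."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be", 2
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16-le", 2
    return None, 0


sample = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")
codec, skip = detect_utf16_bom(sample)
print(codec)                          # utf-16-be
print(sample[skip:].decode(codec))    # A
```

One limitation of this sketch: the UTF-32 LE BOM (FF FE 00 00) begins with the same two bytes as the UTF-16 LE BOM, so a reader that must also accept UTF-32 would need to check for the longer mark first.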

Code Examples

text = 'Hello, 😀'

# UTF-16 with BOM (platform default byte order)
utf16_bom = text.encode('utf-16')
print(utf16_bom[:2])  # b'\xff\xfe'  (LE BOM on x86)

# Explicit byte order
utf16_le = text.encode('utf-16-le')
utf16_be = text.encode('utf-16-be')

# Character count vs code unit count
s = '😀'
print(len(s))               # 1 — Python counts Unicode code points
print(len(s.encode('utf-16-le')) // 2)  # 2 — two 16-bit code units

// JavaScript strings are UTF-16 internally
const emoji = '😀';
console.log(emoji.length);          // 2  (two UTF-16 code units!)
console.log([...emoji].length);     // 1  (one Unicode code point)
console.log(emoji.codePointAt(0));  // 128512 (0x1F600) — correct
console.log(emoji.charCodeAt(0));   // 55357 (0xD83D) — high surrogate only

Quick Facts

Property	Value
Full Name	Unicode Transformation Format — 16-bit
Code unit size	16 bits (2 bytes)
Bytes per character	2 (BMP) or 4 (supplementary)
BOM	FE FF (big endian) or FF FE (little endian)
Used by	Java, JavaScript, Windows, .NET, macOS
Max code point	U+10FFFF
Predecessor	UCS-2

Common Pitfalls

The JavaScript length trap. '😀'.length === 2 in JavaScript because .length counts UTF-16 code units, not characters. Use spread ([...str].length) or Array.from(str).length to count code points instead; note that a user-perceived character made of several code points (such as a flag or a skin-tone emoji) still counts as more than one. Indexing is affected too: '😀'[0] returns the lone high surrogate, which is not a valid character.

Lone surrogates. A high surrogate without a following low surrogate (or vice versa) is an ill-formed UTF-16 sequence. This can arise from JavaScript's charCodeAt() / String.fromCharCode() used with emoji, or from incorrectly slicing UTF-16 strings.
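One way to check for this condition is to walk the 16-bit code units and flag any surrogate that is not part of a high-then-low pair. The sketch below assumes little-endian input with no BOM; `find_lone_surrogates` is a hypothetical helper name.

```python
import struct


def find_lone_surrogates(data: bytes) -> list:
    """Return the code-unit indexes of unpaired surrogates in UTF-16LE bytes."""
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    units = struct.unpack(f"<{len(data) // 2}H", data)
    lone = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate is valid only if a low surrogate follows.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            lone.append(i)
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate reached here has no preceding high surrogate.
            lone.append(i)
        i += 1
    return lone


well_formed = "a😀".encode("utf-16-le")
print(find_lone_surrogates(well_formed))       # []
print(find_lone_surrogates(well_formed[:4]))   # [1]
```

The second call truncates the data in the middle of the pair, which is the same kind of damage a careless slice of a UTF-16 string produces.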

Windows paths and UTF-16. Windows APIs use UTF-16LE internally. Python's os.path and pathlib handle this transparently, but programs that call the Win32 API through ctypes must pass wide strings explicitly, for example as ctypes.c_wchar_p arguments to the W-suffixed functions.

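Confusing UTF-16 with UCS-2. UCS-2 cannot represent supplementary characters at all because it has no surrogate mechanism. A UCS-2 decoder that encounters a surrogate pair treats each code unit as a separate 16-bit character, so an emoji typically shows up as two unrenderable characters instead of one glyph.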
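The difference can be illustrated by reading the same bytes both ways. The helper `decode_as_ucs2` below is hypothetical and only mimics a UCS-2 reader by mapping every little-endian 16-bit unit to one code point; it is not a real codec.

```python
import struct


def decode_as_ucs2(data: bytes) -> list:
    """Interpret UTF-16LE bytes as a UCS-2 reader would: one unit, one code point."""
    if len(data) % 2:
        raise ValueError("UCS-2 data must have an even number of bytes")
    units = struct.unpack(f"<{len(data) // 2}H", data)
    return [f"U+{unit:04X}" for unit in units]


data = "😀".encode("utf-16-le")
print(decode_as_ucs2(data))                 # ['U+D83D', 'U+DE00']
print(data.decode("utf-16-le") == "😀")     # True
```

The UCS-2 view yields two values in the surrogate range, neither of which is a character on its own, while the UTF-16 decoder combines them into U+1F600.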