UTF-16
A variable-length Unicode encoding that uses one or two 16-bit code units (2 or 4 bytes). Used internally by Java, JavaScript, and Windows.
What is UTF-16?
UTF-16 (Unicode Transformation Format, 16-bit) is a variable-length Unicode encoding that uses either one or two 16-bit code units (2 or 4 bytes) per code point. It is the internal string representation used by Java, JavaScript, Windows (Win32 API), macOS Core Foundation, and .NET, so it remains widespread in application-level code even though UTF-8 has won the web.
UTF-16 evolved from UCS-2, an earlier fixed-width 2-byte encoding that could only represent the Basic Multilingual Plane (BMP, U+0000–U+FFFF). When Unicode expanded beyond the BMP to include emoji, rare CJK extensions, and historic scripts, UTF-16 extended UCS-2 with surrogate pairs to cover the full range up to U+10FFFF.
How UTF-16 Works
BMP characters (U+0000–U+FFFF, excluding the surrogate range U+D800–U+DFFF): Encoded as a single 16-bit code unit whose value equals the code point. U+0041 ('A') encodes as 0x0041, U+4E2D ('中') encodes as 0x4E2D.
Supplementary characters (U+10000–U+10FFFF): Encoded as a surrogate pair — two 16-bit code units. The first is a high surrogate (U+D800–U+DBFF), the second is a low surrogate (U+DC00–U+DFFF).
Surrogate pair formula for code point C (C ≥ U+10000):
- Subtract 0x10000 to get a 20-bit value V
- High surrogate: 0xD800 + (V >> 10)
- Low surrogate: 0xDC00 + (V & 0x3FF)
Example: encoding U+1F600 (😀 GRINNING FACE)
V = 0x1F600 - 0x10000 = 0xF600
High surrogate: 0xD800 + (0xF600 >> 10) = 0xD800 + 0x3D = 0xD83D
Low surrogate: 0xDC00 + (0xF600 & 0x3FF) = 0xDC00 + 0x200 = 0xDE00
>>> '😀'.encode('utf-16-le')
b'=\xd8\x00\xde'
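The surrogate pair formula above can be sketched in a few lines of Python. The function names `encode_surrogate_pair` and `decode_surrogate_pair` are hypothetical helpers written for this illustration, not part of any standard library; the decode direction simply reverses the arithmetic.

```python
def encode_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    v = code_point - 0x10000          # 20-bit value V
    high = 0xD800 + (v >> 10)         # top 10 bits of V
    low = 0xDC00 + (v & 0x3FF)        # bottom 10 bits of V
    return high, low


def decode_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = encode_surrogate_pair(0x1F600)
print(hex(high), hex(low))                    # 0xd83d 0xde00
print(hex(decode_surrogate_pair(high, low)))  # 0x1f600
```

In practice Python's built-in `utf-16` codecs do this work; the sketch only makes the bit arithmetic visible.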
Byte Order and BOM
UTF-16 comes in two flavors based on byte order (endianness):
- UTF-16 BE (Big Endian): high byte first. U+0041 → 00 41
- UTF-16 LE (Little Endian): low byte first. U+0041 → 41 00
A file using UTF-16 without a BOM requires the reader to know the byte order in advance. The Byte Order Mark (U+FEFF) at the start of a file signals the encoding: FE FF means big endian, FF FE means little endian.
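As a rough illustration of BOM detection, the sketch below checks the first two bytes of a buffer. `sniff_utf16_bom` is a hypothetical helper name; the `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` constants come from Python's standard library. One limitation to keep in mind: the bytes FF FE 00 00 are also the UTF-32 LE BOM, which this minimal version does not try to distinguish.

```python
import codecs


def sniff_utf16_bom(data: bytes):
    """Return (codec_name, bom_length), or (None, 0) if no UTF-16 BOM is found."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be", 2
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16-le", 2
    return None, 0


raw = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")
codec, skip = sniff_utf16_bom(raw)
print(codec, raw[skip:].decode(codec))   # utf-16-be A
```

When no BOM is present the function returns `None`, and the caller still has to know the byte order from some other source.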
Code Examples
text = 'Hello, 😀'
# UTF-16 with BOM (platform default byte order)
utf16_bom = text.encode('utf-16')
print(utf16_bom[:2]) # b'\xff\xfe' (LE BOM on x86)
# Explicit byte order
utf16_le = text.encode('utf-16-le')
utf16_be = text.encode('utf-16-be')
# Character count vs code unit count
s = '😀'
print(len(s)) # 1 — Python counts Unicode code points
print(len(s.encode('utf-16-le')) // 2) # 2 — two 16-bit code units
// JavaScript strings are UTF-16 internally
const emoji = '😀';
console.log(emoji.length); // 2 (two UTF-16 code units!)
console.log([...emoji].length); // 1 (one Unicode code point)
console.log(emoji.codePointAt(0)); // 128512 (0x1F600) — correct
console.log(emoji.charCodeAt(0)); // 55357 (0xD83D) — high surrogate only
Quick Facts
| Property | Value |
|---|---|
| Full Name | Unicode Transformation Format — 16-bit |
| Code unit size | 16 bits (2 bytes) |
| Bytes per character | 2 (BMP) or 4 (supplementary) |
| BOM | FE FF (big endian) or FF FE (little endian) |
| Used by | Java, JavaScript, Windows, .NET, macOS |
| Max code point | U+10FFFF |
| Predecessor | UCS-2 |
Common Pitfalls
The JavaScript length trap. '😀'.length === 2 in JavaScript because .length counts UTF-16 code units, not characters. Use spread ([...str].length) or Array.from(str).length to count code points instead. Note that a user-perceived character can still span several code points (for example, an emoji with a skin-tone modifier). This affects string slicing too: '😀'[0] returns the lone high surrogate, which is not a valid character.
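The same mismatch shows up when Python code exchanges string offsets with a UTF-16 based runtime. The sketch below is one possible way to compute the UTF-16 length of a Python string and to translate a UTF-16 offset into a Python index; both function names are hypothetical, and offsets that land inside a surrogate pair are rounded forward to the next character as a simplifying assumption.

```python
def utf16_length(text: str) -> int:
    """Number of UTF-16 code units, i.e. what JavaScript's .length would report."""
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


def utf16_offset_to_index(text: str, offset: int) -> int:
    """Map a UTF-16 code unit offset to a Python str index."""
    units = 0
    for index, ch in enumerate(text):
        if units >= offset:
            return index
        units += 2 if ord(ch) > 0xFFFF else 1   # supplementary chars take two units
    return len(text)


s = "a😀b"
print(len(s), utf16_length(s))        # 3 4
print(utf16_offset_to_index(s, 3))    # 2 (the 'b')
```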
Lone surrogates. A high surrogate without a following low surrogate (or vice versa) is an ill-formed UTF-16 sequence. This can arise from JavaScript's charCodeAt() / String.fromCharCode() used with emoji, or from incorrectly slicing UTF-16 strings.
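A minimal sketch of how unpaired surrogates could be detected in a sequence of 16-bit code units follows. `find_lone_surrogates` is a hypothetical helper, and it assumes the input is already a list of integers rather than raw bytes.

```python
def find_lone_surrogates(units):
    """Return indexes of unpaired surrogates in a sequence of 16-bit code units."""
    lone = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:                       # high surrogate
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2                                  # well-formed pair, skip both
                continue
            lone.append(i)                              # high with no low after it
        elif 0xDC00 <= u <= 0xDFFF:                     # low with no high before it
            lone.append(i)
        i += 1
    return lone


print(find_lone_surrogates([0x0041, 0xD83D, 0xDE00]))  # []
print(find_lone_surrogates([0xD83D, 0x0041, 0xDE00]))  # [0, 2]
```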
Windows paths and UTF-16. Windows APIs use UTF-16LE internally. Python's os.path and pathlib handle this transparently, but programs that interact with the Win32 API via ctypes must pass wide strings explicitly (for example, ctypes.c_wchar_p or a buffer from ctypes.create_unicode_buffer).
Confusing UTF-16 with UCS-2. UCS-2 cannot represent supplementary characters at all because it has no surrogate mechanism. A UCS-2 decoder that encounters surrogate code units treats each one as a separate 16-bit character instead of combining the pair, so it produces garbage rather than emoji.
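To make the difference concrete, the sketch below imitates a UCS-2 style decoder by mapping every 16-bit unit to its own character. `decode_as_ucs2_le` is a hypothetical function written for this comparison; Python does not ship a UCS-2 codec, so this is an approximation of the behavior rather than a real decoder.

```python
import struct


def decode_as_ucs2_le(data: bytes) -> str:
    """Treat every 16-bit little-endian unit as its own character (no surrogate pairing)."""
    count = len(data) // 2
    units = struct.unpack(f"<{count}H", data[: count * 2])
    return "".join(chr(u) for u in units)


raw = "😀".encode("utf-16-le")
print(len(raw.decode("utf-16-le")))   # 1: UTF-16 pairs the surrogates
print(len(decode_as_ucs2_le(raw)))    # 2: two separate surrogate code units
```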