Encoding / Decoding
Encoding converts characters to bytes (str.encode('utf-8')); decoding converts bytes back to characters (bytes.decode('utf-8')). Handling both with the correct codec prevents garbled text.
What Is Encoding and Decoding?
Encoding converts a Unicode string (sequence of abstract code points) into bytes. Decoding converts bytes back into a Unicode string. The encoding specifies the mapping: how each code point is represented as a sequence of bytes.
The distinction matters because computers transmit and store bytes, not abstract characters. The string "Hello, 世界" must be encoded before saving to a file, sending over a network, or inserting into a database. When reading it back, you must use the same encoding to decode correctly.
The Most Important Encodings
| Encoding | Variable width | ASCII compat | All Unicode | Common use |
|---|---|---|---|---|
| UTF-8 | Yes (1–4 bytes) | Yes | Yes | Web, Linux, Git, JSON |
| UTF-16 | Yes (2 or 4 bytes) | No | Yes | Windows, Java, JS internals |
| UTF-32 | No (4 bytes) | No | Yes | Internal processing |
| Latin-1 (ISO-8859-1) | No (1 byte) | Yes | No | Legacy Western European |
| ASCII | No (1 byte) | Yes | No | 128 characters only |
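The widths in the table can be checked directly. A quick sketch in Python (the endian-specific -le codecs are used so no byte order mark inflates the counts):

```python
# Byte counts for one character under each listed encoding.
for ch in "Aé世😀":
    widths = (
        len(ch.encode("utf-8")),      # variable: 1-4 bytes
        len(ch.encode("utf-16-le")),  # variable: 2 or 4 bytes
        len(ch.encode("utf-32-le")),  # fixed: always 4 bytes
    )
    print(ch, widths)
# A (1, 2, 4)
# é (2, 2, 4)
# 世 (3, 2, 4)
# 😀 (4, 4, 4)
```

Note that "A" is 1 byte in UTF-8 — the same byte as in ASCII, which is what "ASCII compat" in the table means.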
Python Encoding and Decoding
# str → bytes (encoding)
text = "Hello, 世界"
utf8_bytes = text.encode("utf-8") # b"Hello, \xe4\xb8\x96\xe7\x95\x8c"
utf16_bytes = text.encode("utf-16") # includes BOM
utf32_bytes = text.encode("utf-32")
# bytes → str (decoding)
utf8_bytes.decode("utf-8") # "Hello, 世界"
utf8_bytes.decode("latin-1") # garbled — wrong codec!
# Error handling
bad_bytes = b"\xff\xfe"
bad_bytes.decode("utf-8", errors="strict") # UnicodeDecodeError
bad_bytes.decode("utf-8", errors="replace") # '��' — two U+FFFD replacement characters
bad_bytes.decode("utf-8", errors="ignore") # "" (silently drops bad bytes)
bad_bytes.decode("utf-8", errors="backslashreplace") # "\\xff\\xfe"
# File I/O
with open("file.txt", "w", encoding="utf-8") as f:
    f.write("Hello, 世界")
with open("file.txt", "r", encoding="utf-8") as f:
    text = f.read()
# Detect encoding (unreliable but useful)
import chardet
result = chardet.detect(unknown_bytes)
encoding = result["encoding"] # e.g., "UTF-8", "ISO-8859-1"
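One detail from the snippet above worth making explicit: the plain "utf-16" codec prepends a byte order mark (BOM) so a reader can detect byte order, while the endian-specific codecs write none. A small sketch:

```python
import codecs

# "utf-16" writes a BOM in the machine's native byte order.
data = "A".encode("utf-16")
print(data.startswith(codecs.BOM))           # True

# "utf-16-le" / "utf-16-be" assume the order is known — no BOM.
print("A".encode("utf-16-le"))               # b'A\x00'
print("A".encode("utf-16-be"))               # b'\x00A'

# "utf-8-sig" is UTF-8 plus a BOM, expected by some Windows tools;
# decoding with it strips a leading BOM if present.
print("A".encode("utf-8-sig"))               # b'\xef\xbb\xbfA'
print(b"\xef\xbb\xbfA".decode("utf-8-sig"))  # 'A'
```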
JavaScript Encoding and Decoding
// TextEncoder / TextDecoder (modern, available in browsers and Node.js)
const encoder = new TextEncoder(); // Always UTF-8
const bytes = encoder.encode("Hello, 世界");
// Uint8Array [72, 101, 108, 108, 111, 44, 32, 228, 184, 150, ...]
const decoder = new TextDecoder("utf-8");
decoder.decode(bytes); // "Hello, 世界"
const latin1decoder = new TextDecoder("iso-8859-1");
latin1decoder.decode(bytes); // garbled — wrong codec
// Buffer in Node.js
Buffer.from("Hello, 世界", "utf8");
Buffer.from("Hello, 世界", "utf8").toString("utf8");
The Encoding Mismatch Problem
The single most common Unicode bug is encoding/decoding with the wrong codec:
# File saved as UTF-8, opened as Latin-1
with open("utf8_file.txt", "r", encoding="latin-1") as f:
    content = f.read()
# content == 'Hello, ä¸\x96ç\x95\x8c' — mojibake: UTF-8 bytes misinterpreted as Latin-1
The UTF-8 lead bytes for common CJK characters fall in the 0xE4–0xE9 range. Latin-1 maps each of those bytes to an accented letter or a control character, so 世界 comes out as the garbled ä¸\x96ç\x95\x8c above.
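Because Latin-1 maps every byte to a code point, this particular mojibake is lossless and can often be reversed by undoing the wrong decode. A sketch (this works only if nothing further mangled the garbled string):

```python
original = "Hello, 世界"
raw = original.encode("utf-8")

# The mistake: UTF-8 bytes decoded as Latin-1. Lossless — every byte maps
# to some code point, so the original bytes survive inside the bad string.
mojibake = raw.decode("latin-1")
print(repr(mojibake))  # 'Hello, ä¸\x96ç\x95\x8c'

# The repair: re-encode with Latin-1 to recover the bytes, then decode correctly.
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired == original)  # True
```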
Encoding Best Practices
- Always use UTF-8 for new files, APIs, and databases.
- Declare encoding explicitly — never rely on system defaults.
- Handle errors deliberately — use errors="replace" for resilience, errors="strict" for correctness.
- Validate input: decode user-provided bytes early, work with strings internally, and encode at output boundaries.
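The last point — decode at input, work with str, encode at output — can be sketched as a pair of boundary functions (the function names here are illustrative, not a real API):

```python
def read_message(raw: bytes) -> str:
    """Input boundary: bytes in, str out. Decode exactly once."""
    return raw.decode("utf-8", errors="replace")

def write_message(text: str) -> bytes:
    """Output boundary: str in, bytes out. Encode exactly once."""
    return text.encode("utf-8")

# Everything between the boundaries deals only in str:
incoming = b"Hello, \xe4\xb8\x96\xe7\x95\x8c"
text = read_message(incoming)   # 'Hello, 世界'
text = text.upper()             # safe string operations, no byte math
outgoing = write_message(text)  # bytes again, only at the edge
```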
Quick Facts
| Operation | Python | JavaScript |
|---|---|---|
| str → bytes | s.encode("utf-8") | new TextEncoder().encode(s) |
| bytes → str | b.decode("utf-8") | new TextDecoder("utf-8").decode(b) |
| Default encoding | System-dependent (set PYTHONUTF8=1) | TextEncoder always UTF-8 |
| Error modes | strict, replace, ignore, backslashreplace | replacement (default) or fatal: true |
| Recommended | UTF-8 everywhere | UTF-8 everywhere |