What is character encoding?

A system that maps characters to byte sequences for digital storage and transmission. Every text file has an encoding; the question is whether it is declared correctly.

UTF-8: a variable-length Unicode encoding that uses 1–4 bytes per character. It is the most widely used encoding on the web (more than 98% of websites) and is fully backward-compatible with ASCII.

Programming & Development

Encoding / Decoding

Encoding converts characters to bytes (str.encode('utf-8')); decoding converts bytes to characters (bytes.decode('utf-8')). Getting both right prevents mojibake.

2024-07-08 · Updated 2024-12-30

What Is Encoding and Decoding?

Encoding converts a Unicode string (sequence of abstract code points) into bytes. Decoding converts bytes back into a Unicode string. The encoding specifies the mapping: how each code point is represented as a sequence of bytes.

The distinction matters because computers transmit and store bytes, not abstract characters. The string "Hello, 世界" must be encoded before saving to a file, sending over a network, or inserting into a database. When reading it back, you must use the same encoding to decode correctly.
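To make the difference between code points and bytes concrete, here is a minimal sketch that lists each character of the example string with its code point and its UTF-8 bytes. The variable names are illustrative only, and `bytes.hex(' ')` assumes Python 3.8 or later.

```python
text = "Hello, 世界"

# Each character is one abstract code point; its UTF-8 form may span several bytes.
for ch in text:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {ch!r} -> {encoded.hex(' ')} ({len(encoded)} byte(s))")

data = text.encode("utf-8")
code_point_count = len(text)       # counts code points, not bytes
byte_count = len(data)             # counts bytes after encoding
round_trip = data.decode("utf-8")  # same codec on the way back
```

The string has 9 code points but 13 UTF-8 bytes, because each of the two CJK characters takes three bytes.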

The Most Important Encodings

Encoding	Variable width	ASCII compat	All Unicode	Common use
UTF-8	Yes (1–4 bytes)	Yes	Yes	Web, Linux, Git, JSON
UTF-16	Yes (2 or 4 bytes)	No	Yes	Windows, Java, JS internals
UTF-32	No (4 bytes)	No	Yes	Internal processing
Latin-1 (ISO-8859-1)	No (1 byte)	Yes	No	Legacy Western European
ASCII	No (1 byte)	Yes	No	128 characters only
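The widths in the table can be checked directly. The sketch below is one way to do it; `width_table` is a hypothetical helper, and the `-le` codec variants are used so that no byte order mark is counted.

```python
samples = ["A", "\u00e9", "世", "😀"]           # ASCII, Latin-1 range, BMP CJK, supplementary plane
codecs = ["utf-8", "utf-16-le", "utf-32-le"]  # -le variants omit the BOM

def width_table(chars, encodings):
    """Return {char: {encoding: byte_length}} for the given characters."""
    return {ch: {enc: len(ch.encode(enc)) for enc in encodings} for ch in chars}

widths = width_table(samples, codecs)
for ch, row in widths.items():
    print(ch, row)

# ASCII compatibility: pure-ASCII text yields identical bytes in all three codecs.
ascii_compatible = (
    "Hello".encode("utf-8") == "Hello".encode("ascii") == "Hello".encode("latin-1")
)

# Latin-1 cannot represent every code point.
try:
    "世".encode("latin-1")
    latin1_covers_cjk = True
except UnicodeEncodeError:
    latin1_covers_cjk = False
```

UTF-8 grows from 1 to 4 bytes across the samples, UTF-16 uses 2 bytes until the supplementary-plane emoji needs a surrogate pair (4 bytes), and UTF-32 is always 4.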

Python Encoding and Decoding

# str → bytes (encoding)
text = "Hello, 世界"
utf8_bytes  = text.encode("utf-8")     # b"Hello, \xe4\xb8\x96\xe7\x95\x8c"
utf16_bytes = text.encode("utf-16")    # includes BOM
utf32_bytes = text.encode("utf-32")

# bytes → str (decoding)
utf8_bytes.decode("utf-8")             # "Hello, 世界"
utf8_bytes.decode("latin-1")           # garbled — wrong codec!

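# Error handling
bad_bytes = b"\xff\xfe"                      # neither byte is valid in UTF-8
try:
    bad_bytes.decode("utf-8", errors="strict")   # "strict" is the default mode
except UnicodeDecodeError as exc:
    print(exc)                                   # reports the offending byte and position
bad_bytes.decode("utf-8", errors="replace")  # "\ufffd\ufffd" (two replacement characters)
bad_bytes.decode("utf-8", errors="ignore")   # "" (silently drops bad bytes)
bad_bytes.decode("utf-8", errors="backslashreplace")  # "\\xff\\xfe" (literal backslash escapes)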

# File I/O
with open("file.txt", "w", encoding="utf-8") as f:
    f.write("Hello, 世界")

with open("file.txt", "r", encoding="utf-8") as f:
    text = f.read()

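# Detect encoding (unreliable but useful)
import chardet  # third-party package: pip install chardet
unknown_bytes = "Hello, 世界".encode("utf-8")  # stand-in for bytes of unknown origin
result = chardet.detect(unknown_bytes)        # dict with "encoding" and "confidence" keys
encoding = result["encoding"]  # e.g., "UTF-8", "ISO-8859-1"; None if detection fails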
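When a detection library is unavailable, a common alternative is to try a short list of candidate codecs in order. The following is a sketch under assumptions, not a general solution: `decode_with_fallback` is a hypothetical helper, and the candidate list (UTF-8 first, then Windows-1252) is an assumption that should be adapted to the data you expect.

```python
def decode_with_fallback(data, candidates=("utf-8", "cp1252")):
    """Try each candidate codec in order and return (text, codec_used).

    Hypothetical helper: order the candidates from strictest to most
    permissive, because permissive codecs rarely raise an error.
    """
    for codec in candidates:
        try:
            return data.decode(codec), codec
        except UnicodeDecodeError:
            continue
    # Last resort: Latin-1 maps every byte to a code point, so it never fails.
    return data.decode("latin-1"), "latin-1"

utf8_result = decode_with_fallback("Hello, 世界".encode("utf-8"))
legacy_result = decode_with_fallback(b"caf\xe9")  # a lone 0xE9 is invalid UTF-8
```

UTF-8 goes first because it rejects most byte sequences that were not produced by a UTF-8 encoder. A successful decode with a single-byte codec proves much less.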

JavaScript Encoding and Decoding

// TextEncoder / TextDecoder (modern, available in browsers and Node.js)
const encoder = new TextEncoder();  // Always UTF-8
const bytes = encoder.encode("Hello, 世界");
// Uint8Array [72, 101, 108, 108, 111, 44, 32, 228, 184, 150, ...]

const decoder = new TextDecoder("utf-8");
decoder.decode(bytes);  // "Hello, 世界"

const latin1decoder = new TextDecoder("iso-8859-1");
latin1decoder.decode(bytes);  // "Hello, ä¸–ç•Œ": wrong codec (the Encoding Standard maps the "iso-8859-1" label to windows-1252)

// Buffer in Node.js
Buffer.from("Hello, 世界", "utf8");
Buffer.from("Hello, 世界", "utf8").toString("utf8");

The Encoding Mismatch Problem

One of the most common Unicode bugs is decoding bytes with a different codec than the one used to encode them:

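# File saved as UTF-8, opened as Windows-1252
with open("utf8_file.txt", "r", encoding="cp1252") as f:
    content = f.read()
# "Hello, ä¸–ç•Œ" (mojibake: UTF-8 bytes misinterpreted as Windows-1252)
# With encoding="latin-1" the accented letters are the same, but bytes
# 0x96, 0x95, and 0x8C become invisible C1 control characters instead.

Each of these CJK characters encodes to three UTF-8 bytes: a lead byte (0xE4 and 0xE7 here; 0xE4–0xE9 for the main CJK Unified Ideographs block) followed by two continuation bytes in 0x80–0xBF. A single-byte codec reads every byte as a separate character, so the lead bytes become accented letters (ä, ç) and the continuation bytes become symbols, producing the garbled ä¸–ç•Œ output.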
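If text has been mis-decoded exactly once and no bytes were lost, the damage can often be reversed by re-encoding with the wrong codec and decoding with the right one. The sketch below assumes the wrong codec was Windows-1252 (`cp1252`), which is what produces the visible punctuation in ä¸–ç•Œ; `repair_mojibake` is a hypothetical helper name.

```python
def repair_mojibake(garbled, wrong_codec="cp1252", right_codec="utf-8"):
    """Undo one round of mis-decoding.

    Re-encodes with the codec that was wrongly used to decode, then decodes
    with the intended codec. Returns the input unchanged if either step fails.
    """
    try:
        return garbled.encode(wrong_codec).decode(right_codec)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return garbled

original = "Hello, 世界"
garbled = original.encode("utf-8").decode("cp1252")  # simulate the mismatch
repaired = repair_mojibake(garbled)
```

This works only when the wrong codec can round-trip every byte involved. Text decoded with `errors="replace"` or `errors="ignore"` has already lost information and cannot be recovered this way.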

Encoding Best Practices

Always use UTF-8 for new files, APIs, and databases.
Declare the encoding explicitly; never rely on system defaults.
Handle errors deliberately: use errors="strict" when invalid input must be rejected, and errors="replace" when substituting U+FFFD and continuing is acceptable.
Validate input: decode user-provided bytes early, work with strings internally, and encode at output boundaries.
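The last practice, decoding at the input boundary and encoding at the output boundary, might look like the sketch below. `normalize_lines` and its whitespace-stripping step are hypothetical stand-ins for real application logic.

```python
def normalize_lines(raw, source_encoding="utf-8"):
    """Decode at the input boundary, process as str, encode at the output boundary."""
    # Input boundary: fail fast on bytes that are not valid in the declared encoding.
    text = raw.decode(source_encoding, errors="strict")
    # Internal logic works on str only and never touches bytes.
    cleaned = "\n".join(line.strip() for line in text.splitlines())
    # Output boundary: always emit UTF-8.
    return cleaned.encode("utf-8")

incoming = "  caf\u00e9  \n 世界 ".encode("utf-8")
outgoing = normalize_lines(incoming)
```

Keeping bytes at the edges means an encoding problem surfaces as a `UnicodeDecodeError` at the point of entry instead of as mojibake deep inside the program.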

Quick Facts

Operation	Python	JavaScript
str → bytes	`s.encode("utf-8")`	`new TextEncoder().encode(s)`
bytes → str	`b.decode("utf-8")`	`new TextDecoder("utf-8").decode(b)`
Default encoding	`open()`: locale-dependent (set `PYTHONUTF8=1` to force UTF-8); `str.encode()`: UTF-8	TextEncoder always UTF-8
Error modes	`strict` (default) `replace` `ignore` `backslashreplace`	Replacement with U+FFFD (default) or `{ fatal: true }`
Recommended	UTF-8 everywhere	UTF-8 everywhere
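To see which defaults apply on a given machine, a short check along these lines can help. It is a sketch that assumes Python 3.7 or later, where `sys.flags.utf8_mode` is available.

```python
import locale
import sys

# str.encode() and bytes.decode() default to UTF-8 in Python 3, regardless of locale.
codec_default = sys.getdefaultencoding()

# open() without encoding= follows the locale's preferred encoding unless UTF-8 mode is on.
open_default = locale.getpreferredencoding(False)
utf8_mode = bool(sys.flags.utf8_mode)

print(f"str.encode default: {codec_default}")
print(f"open() default:     {open_default}")
print(f"UTF-8 mode enabled: {utf8_mode}")

# Calling encode() with no argument is the same as passing "utf-8".
same_bytes = "世界".encode() == "世界".encode("utf-8")
```

If the `open()` default is not UTF-8, passing `encoding="utf-8"` explicitly is the portable fix.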

Related terms

Character encoding · UTF-8 · Mojibake

More in Programming & Development

Java Unicode

Java strings use UTF-16 internally. char is 16-bit (only BMP). For supplementary …

Mojibake

Garbled text that results from decoding bytes with the wrong encoding. The term is Japanese (文字化け). Example: 'café' stored as UTF-8 but read as Latin-1 → 'cafÃ©'

Python Unicode

Python 3 uses Unicode strings by default (str = UTF-8 internally via …

Rust Unicode

Rust strings (str/String) are guaranteed valid UTF-8. char type represents a Unicode …

String length ambiguity

The "length" of a Unicode string depends on the unit: code units (JavaScript .length), code points (Python len()) …

Surrogate pair

Two 16-bit code units (high surrogate U+D800–U+DBFF + low surrogate U+DC00–U+DFFF) that encode a supplementary character in UTF-16 …

Unicode regular expressions

Regex patterns that use Unicode properties: \p{L} (any letter), \p{Script=Greek} (Greek script), \p{Emoji}. Support varies by language and regex engine.

Unicode escape sequences

Syntax for representing Unicode characters in source code. It varies by language: \u2713 (Python/Java/JS), \u{2713} (JS/Ruby/Rust), \U00012345 (Python/C)

String

A sequence of characters in a programming language. The internal representation varies: UTF-8 (Go, Rust, newer Python builds), UTF-16 (Java, JavaScript, C#), or …

Replacement character

U+FFFD (�), displayed when a decoder encounters an invalid byte sequence. It is the universal symbol for "something went wrong with decoding."
