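What is character encoding?

A system that maps characters to byte sequences for digital storage and transmission. Every text file has an encoding; what matters is whether that encoding is declared correctly.

What is CJK (Chinese, Japanese, Korean)?

Chinese, Japanese, and Korean: the collective term for the unified Han ideograph blocks and related writing systems in Unicode. The CJK Unified Ideographs contain upward of 20,992 characters.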

Encoding

GB2312 / GB18030

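The Simplified Chinese character encoding family: GB2312 (6,763 characters) evolved through GBK into GB18030, China's mandatory national standard, which is compatible with Unicode.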

2021-05-05 · Updated 2024-04-09

What is GB2312?

GB2312 (国家标准 2312, Guójiā Biāozhǔn 2312, "National Standard 2312") is the foundational character encoding standard for Simplified Chinese, established in mainland China in 1980 by the National Standards Bureau. It covers 6,763 Chinese characters in two frequency levels plus 682 non-Chinese characters including Latin letters, Greek alphabet, Cyrillic letters, Japanese kana, punctuation, and special symbols.

GB2312 forms the basis of the Simplified Chinese encoding family: GBK extended it, GB18030 extended GBK further, and each successor keeps the byte values that GB2312 assigns to its characters. Understanding GB2312 means understanding the entire lineage of Simplified Chinese text encoding.

Structure and Coverage

GB2312 organizes its characters in a grid of 94 "zones" (区, qū) by 94 "positions" (位, wèi), known as the zone-position (区位, qūwèi) system:

Zones	Content
1–9	Non-Chinese: symbols, numerals, Latin, Greek, Cyrillic, Japanese kana, pinyin
10–15	Unused
16–55	Level 1 Chinese characters (3,755 chars, ordered by pinyin)
56–87	Level 2 Chinese characters (3,008 chars, ordered by radical)
88–94	Unused

Level 1 characters cover the most frequently used vocabulary and are sufficient for everyday modern Chinese text. Level 2 characters include rare characters, place names, and specialized vocabulary.
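The zone layout in the table above can be written as a small lookup. This is a minimal sketch: the function name `classify_zone` and the category labels are invented here for illustration and do not come from any library or from the standard itself.

```python
def classify_zone(zone: int) -> str:
    """Return the GB2312 content category for a zone number (1-94)."""
    if not 1 <= zone <= 94:
        raise ValueError(f"zone must be in 1..94, got {zone}")
    if zone <= 9:
        return "non-Chinese symbols"
    if zone <= 15:
        return "unused"
    if zone <= 55:
        return "level 1 hanzi"   # ordered by pinyin
    if zone <= 87:
        return "level 2 hanzi"   # ordered by radical
    return "unused"


print(classify_zone(54))  # level 1 hanzi
```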

EUC-CN vs. HZ

GB2312 itself defines only the zone-position grid, not how characters are stored as bytes. In practice it is encountered in two byte-level encodings:

EUC-CN (Extended Unix Code for Chinese): Each Chinese character becomes 2 bytes where each byte has its high bit set (0xA1–0xFE range). Zone Z, position P encodes as (Z + 0xA0) and (P + 0xA0). ASCII bytes are single-byte as usual.
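The zone-position arithmetic can be checked directly in Python. The sketch below is illustrative: the helper names `zone_pos_to_euc_cn` and `euc_cn_to_zone_pos` are hypothetical, and the only library feature relied on is Python's built-in `gb2312` codec.

```python
from typing import Tuple


def zone_pos_to_euc_cn(zone: int, pos: int) -> bytes:
    """Convert a GB2312 zone/position pair (each 1-94) to its 2 EUC-CN bytes."""
    if not (1 <= zone <= 94 and 1 <= pos <= 94):
        raise ValueError("zone and position must both be in 1..94")
    return bytes([zone + 0xA0, pos + 0xA0])


def euc_cn_to_zone_pos(pair: bytes) -> Tuple[int, int]:
    """Convert a 2-byte EUC-CN sequence back to (zone, position)."""
    if len(pair) != 2 or not all(0xA1 <= b <= 0xFE for b in pair):
        raise ValueError("expected two bytes in the range 0xA1-0xFE")
    return pair[0] - 0xA0, pair[1] - 0xA0


raw = zone_pos_to_euc_cn(54, 48)        # zone 54, position 48
print(raw)                              # b'\xd6\xd0'
print(raw.decode('gb2312'))             # 中
print(euc_cn_to_zone_pos('文'.encode('gb2312')))  # (46, 36)
```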

HZ (Hanzi encoding for Internet messages): An older encoding for email where Chinese characters are escaped with ~{ and ~} markers, avoiding 8-bit bytes entirely for compatibility with 7-bit email gateways. Largely obsolete.
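The relationship between HZ and EUC-CN can be illustrated by clearing the high bit of each EUC-CN byte and wrapping the result in the escape markers. This is a simplified sketch that only handles a string made entirely of GB2312 characters; real HZ text also mixes in plain ASCII runs. Python's built-in `hz` codec is used to confirm that the hand-built bytes decode correctly.

```python
text = '中文'
euc = text.encode('gb2312')             # b'\xd6\xd0\xce\xc4'

# Clear the high bit of every byte and wrap in ~{ ... ~}
hz_manual = b'~{' + bytes(b & 0x7F for b in euc) + b'~}'

print(hz_manual)                        # b'~{VPND~}'
print(all(b < 0x80 for b in hz_manual)) # True: 7-bit safe
print(hz_manual.decode('hz'))           # 中文
```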

Code Examples

# Python: GB2312 encoding
text = '中文'  # "Chinese language"

# GB2312 (via EUC-CN)
encoded = text.encode('gb2312')
print(encoded)          # b'\xd6\xd0\xce\xc4'
print(len(encoded))     # 4 bytes for 2 characters

# Decoding
b'\xd6\xd0\xce\xc4'.decode('gb2312')  # '中文'

# GBK is a superset of GB2312 — usually safe to use GBK
text.encode('gbk')       # Same bytes for GB2312 chars
text.encode('gb18030')   # Same bytes for GB2312 chars

# Traditional-only forms are not in GB2312
# (繁 itself is in GB2312; 體, the traditional form of 体, is not)
try:
    '體'.encode('gb2312')  # raises UnicodeEncodeError
except UnicodeEncodeError:
    '體'.encode('gbk')     # GBK covers it
    '體'.encode('gb18030') # GB18030 covers it too

The GBK Extension

GBK (汉字内码扩展规范, "Chinese Internal Code Extension Specification"; the abbreviation comes from Guójiā Biāozhǔn Kuòzhǎn, "national standard extension") extends GB2312 to cover 20,902 Chinese characters: all characters in the CJK Unified Ideographs block of Unicode 1.1, which brings in Traditional Chinese forms and characters needed for personal names that GB2312 lacks. GBK keeps the 2-byte structure of GB2312's EUC-CN but widens both byte ranges: the lead byte spans 0x81–0xFE and the trail byte spans 0x40–0xFE (excluding 0x7F), so the second byte can take values below 0xA1.

Windows code page CP936 is Microsoft's implementation of GBK.
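One way to see the widened byte ranges is to encode a character that GBK adds and inspect its two bytes. The sketch below uses 體 (the traditional form of 体) as an assumed example of a GBK-only character and relies only on Python's built-in `gb2312` and `gbk` codecs.

```python
char = '體'  # traditional form of 体; absent from GB2312

try:
    char.encode('gb2312')
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False

lead, trail = char.encode('gbk')  # unpack the two GBK bytes as integers

print(in_gb2312)                                   # False
print(0x81 <= lead <= 0xFE)                        # True: valid GBK lead byte
print(0x40 <= trail <= 0xFE and trail != 0x7F)     # True: valid GBK trail byte
print(lead < 0xA1 or trail < 0xA1)                 # True: outside the EUC-CN range
```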

The GB18030 Standard

GB18030 (issued 2000, revised 2005, updated 2022) is a mandatory national standard in China: software products sold in the Chinese market are required to support it. GB18030 extends GBK to cover the full Unicode range (U+0000–U+10FFFF) using:

1-byte encoding: 0x00–0x7F (ASCII)
2-byte encoding: GB2312/GBK characters
4-byte encoding: All remaining Unicode code points, both BMP characters outside GBK and the supplementary planes

Unlike most legacy national encodings, GB18030 can encode every Unicode code point, so it works in effect as a Unicode transformation format. A properly implemented GB18030 system can represent any Unicode character.
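The three encoded lengths can be observed with Python's built-in `gb18030` codec. The sample characters below are arbitrary picks for illustration: an ASCII letter, a GB2312 hanzi, a BMP character outside GBK (Devanagari), and a supplementary-plane emoji.

```python
samples = ['A', '中', 'अ', '😀']

lengths = {}
for ch in samples:
    encoded = ch.encode('gb18030')
    assert encoded.decode('gb18030') == ch   # round trip is lossless
    lengths[ch] = len(encoded)
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s)")

# U+0041 -> 1 byte(s)
# U+4E2D -> 2 byte(s)
# U+0905 -> 4 byte(s)
# U+1F600 -> 4 byte(s)
```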

Quick Facts

Property	Value
Full Name	Guójiā Biāozhǔn 2312 (National Standard 2312)
Established	1980, mainland China
Characters	6,763 Chinese + 682 other
Transfer encoding	EUC-CN (most common)
Superseded by	GBK (1993), GB18030 (2000)
Windows code page	CP20936 (for GB2312), CP936 (for GBK)
IANA charset	GB2312

Common Pitfalls

Confusing GB2312 with GBK and GB18030. Many systems accept any of these as "Chinese encoding," but they differ in coverage. A character in GBK may not be in GB2312. Always prefer GB18030 for new systems that need Simplified Chinese support, as it covers all Unicode characters.
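A quick way to see the coverage difference is to find the narrowest member of the family that can encode a given string. This is a minimal sketch; the helper name `narrowest_gb_codec` is hypothetical, and it relies only on Python's built-in `gb2312`, `gbk`, and `gb18030` codecs.

```python
from typing import Optional


def narrowest_gb_codec(text: str) -> Optional[str]:
    """Return the first of gb2312, gbk, gb18030 that can encode the text."""
    for codec in ('gb2312', 'gbk', 'gb18030'):
        try:
            text.encode(codec)
            return codec
        except UnicodeEncodeError:
            continue
    return None  # e.g. a string containing a lone surrogate


print(narrowest_gb_codec('中文'))  # gb2312
print(narrowest_gb_codec('體'))    # gbk
print(narrowest_gb_codec('😀'))    # gb18030
```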

The EUC-CN / GB2312 ambiguity. The IANA charset name "GB2312" is technically the zone-position standard, but in practice it refers to the EUC-CN transfer encoding. Python's 'gb2312' codec and 'euc_cn' codec both decode GB2312 content; they are aliases.
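The alias relationship can be confirmed with the standard `codecs` module. This short check assumes a stock CPython installation with the usual CJK codecs available.

```python
import codecs

# Both names resolve to the same underlying codec
same_codec = codecs.lookup('euc_cn').name == codecs.lookup('gb2312').name
same_bytes = '中文'.encode('euc_cn') == '中文'.encode('gb2312')

print(same_codec)  # True
print(same_bytes)  # True
```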

Big5 vs. GB2312 confusion. Traditional Chinese text from Taiwan uses Big5; Simplified Chinese from mainland China uses GB2312/GBK/GB18030. These are mutually incompatible. Mixing them produces garbage output that often consists of valid-looking Chinese characters with completely unrelated meanings, sometimes embarrassingly so.
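The mismatch is easy to reproduce by decoding GB2312 bytes with the wrong codec. The sketch below uses Python's built-in `gb2312` and `big5` codecs; `errors='replace'` is passed as a precaution so that any byte pair Big5 cannot map becomes U+FFFD instead of raising.

```python
original = '中文'
gb_bytes = original.encode('gb2312')

# Wrong: interpret the GB2312 bytes as Big5
misread = gb_bytes.decode('big5', errors='replace')

# Right: decode with the encoding that produced the bytes
correct = gb_bytes.decode('gb2312')

print(misread == original)   # False: mojibake
print(correct == original)   # True
```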