GB2312 / GB18030
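A family of Simplified Chinese character encodings: GB2312 (6,763 Chinese characters) evolved through GBK into GB18030, China's mandatory national standard, which is compatible with Unicode.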
What is GB2312?
GB2312 (国家标准 2312, Guójiā Biāozhǔn 2312, "National Standard 2312") is the foundational character encoding standard for Simplified Chinese, established in mainland China in 1980 by the National Standards Bureau. It covers 6,763 Chinese characters in two frequency levels plus 682 non-Chinese characters including Latin letters, Greek alphabet, Cyrillic letters, Japanese kana, punctuation, and special symbols.
GB2312 forms the basis of the Simplified Chinese encoding family: GBK extended it, GB18030 extended GBK further, and all three are backward compatible. Understanding GB2312 means understanding the entire lineage of Simplified Chinese text encoding.
Structure and Coverage
GB2312 organizes its characters in a grid of 94 "zones" (区, qū) by 94 "positions" (位, wèi), a scheme known as the zone-position code (区位码, qūwèi mǎ):
| Zones | Content |
|---|---|
| 1–9 | Non-Chinese: symbols, numerals, Latin, Greek, Cyrillic, Japanese kana, pinyin |
| 10–15 | Unused |
| 16–55 | Level 1 Chinese characters (3,755 chars, ordered by pinyin) |
| 56–87 | Level 2 Chinese characters (3,008 chars, ordered by radical) |
| 88–94 | Unused |
Level 1 characters cover the most frequently used vocabulary and are sufficient for everyday modern Chinese text. Level 2 characters include rare characters, place names, and specialized vocabulary.
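The zone table can be checked against Python's built-in `gb2312` codec. The sketch below is illustrative only: the helper names `zone_position` and `classify` are hypothetical, and it assumes that the codec emits EUC-CN bytes, so subtracting 0xA0 from each byte recovers the zone and position.

```python
def zone_position(ch):
    """Return the (zone, position) of a single GB2312 character.

    Raises UnicodeEncodeError if ch is not in GB2312, and ValueError
    for ASCII input, which encodes to one byte and has no zone.
    """
    hi, lo = ch.encode('gb2312')
    return hi - 0xA0, lo - 0xA0


def classify(ch):
    """Map a character to its row in the zone table above."""
    zone, _ = zone_position(ch)
    if 1 <= zone <= 9:
        return 'non-Chinese'
    if 16 <= zone <= 55:
        return 'level 1'
    if 56 <= zone <= 87:
        return 'level 2'
    return 'unused'


print(zone_position('中'), classify('中'))  # (54, 48) level 1
print(zone_position('啊'), classify('啊'))  # (16, 1) level 1
```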
EUC-CN vs. HZ
GB2312 itself defines a zone-position model. It is most commonly encountered in two transfer encodings:
EUC-CN (Extended Unix Code for Chinese): Each Chinese character becomes 2 bytes where each byte has its high bit set (0xA1–0xFE range). Zone Z, position P encodes as (Z + 0xA0) and (P + 0xA0). ASCII bytes are single-byte as usual.
HZ (Hanzi encoding for Internet messages): An older encoding for email where Chinese characters are escaped with ~{ and ~} markers, avoiding 8-bit bytes entirely for compatibility with 7-bit email gateways. Largely obsolete.
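Both transfer encodings can be tried from Python. This is a minimal sketch, not a reference implementation: `euc_cn_bytes` is a hypothetical helper that applies the (Z + 0xA0, P + 0xA0) rule directly, and the HZ part relies on the `hz` codec that ships with CPython.

```python
def euc_cn_bytes(zone, position):
    """Apply the EUC-CN rule: add 0xA0 to both the zone and the position."""
    if not (1 <= zone <= 94 and 1 <= position <= 94):
        raise ValueError('zone and position must be in 1..94')
    return bytes([zone + 0xA0, position + 0xA0])


pair = euc_cn_bytes(54, 48)
print(pair, pair.decode('gb2312'))  # b'\xd6\xd0' 中

# HZ wraps the same characters in ~{ ... ~} and keeps every byte 7-bit clean.
hz = '中文'.encode('hz')
print(hz)
print(all(b < 0x80 for b in hz))  # True
```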
Code Examples
```python
# Python: GB2312 encoding
text = '中文'  # "Chinese language"

# GB2312 (via EUC-CN)
encoded = text.encode('gb2312')
print(encoded)       # b'\xd6\xd0\xce\xc4'
print(len(encoded))  # 4 bytes for 2 characters

# Decoding
print(b'\xd6\xd0\xce\xc4'.decode('gb2312'))  # 中文

# GBK and GB18030 are supersets of GB2312: same bytes for GB2312 characters
assert text.encode('gbk') == encoded
assert text.encode('gb18030') == encoded

# Traditional-only characters are not in GB2312.
# (繁 would not work as an example: it is also a Simplified character.)
try:
    '漢'.encode('gb2312')  # traditional form of 汉, absent from GB2312
except UnicodeEncodeError:
    print('漢'.encode('gbk'))      # GBK covers it
    print('漢'.encode('gb18030'))  # GB18030 covers it too
```
The GBK Extension
GBK (汉字内码扩展规范, "Chinese Internal Code Extension Specification"; the abbreviation comes from Guójiā Biāozhǔn Kuòzhǎn, "national standard extension") extends GB2312 to cover all 20,902 characters in the CJK Unified Ideographs block of Unicode 1.1 (U+4E00–U+9FA5). That brings in the traditional forms GB2312 lacks, including those still used in mainland China for personal names. GBK keeps the 2-byte structure of GB2312's EUC-CN but widens both byte ranges: the lead byte runs from 0x81 to 0xFE and the trail byte from 0x40 to 0xFE (excluding 0x7F), so either byte may fall below 0xA1.
Windows code page CP936 is Microsoft's implementation of GBK.
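One way to see the widened byte ranges is to check whether a character's GBK bytes fall outside the 0xA1–0xFE range that GB2312 uses. The helper name `uses_gbk_extension` is hypothetical, and this is only a rough sketch: GBK also adds some symbols inside the original byte range, which this check does not detect.

```python
def uses_gbk_extension(ch):
    """True if ch is a 2-byte GBK character with a byte below 0xA1."""
    encoded = ch.encode('gbk')
    if len(encoded) != 2:
        return False  # ASCII is a single byte
    return any(b < 0xA1 for b in encoded)


for ch in '中漢':
    print(ch, ch.encode('gbk').hex(), uses_gbk_extension(ch))
```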
The GB18030 Standard
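GB18030 (issued 2000, revised 2005, updated 2022) is a mandatory national standard in China: software products sold there are required to support it. GB18030 extends GBK to cover the full Unicode range (U+0000–U+10FFFF) using:
- 1-byte encoding: 0x00–0x7F (ASCII)
- 2-byte encoding: GB2312/GBK characters (lead byte 0x81–0xFE, trail byte 0x40–0xFE excluding 0x7F)
- 4-byte encoding: everything else, meaning the BMP characters that GBK lacks as well as all supplementary-plane characters (bytes 1 and 3 in 0x81–0xFE, bytes 2 and 4 in 0x30–0x39)
Because every Unicode character has a GB18030 byte sequence, GB18030 functions as a Unicode transformation format in the same sense as UTF-8 or UTF-16. A properly implemented GB18030 system can represent any character that Unicode can.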
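The three encoded lengths can be observed with Python's built-in `gb18030` codec. The sample characters are arbitrary picks for illustration, and `gb18030_length` is a hypothetical helper name.

```python
def gb18030_length(ch):
    """Number of bytes GB18030 uses for a single character."""
    return len(ch.encode('gb18030'))


# ASCII letter, GB2312 hanzi, BMP character outside GBK (Thai), emoji
for ch in ['A', '中', 'ก', '😀']:
    encoded = ch.encode('gb18030')
    assert encoded.decode('gb18030') == ch  # lossless round trip
    print(f'U+{ord(ch):04X}', gb18030_length(ch), encoded.hex())
```

Note that the Thai letter sits in the BMP yet still needs 4 bytes, since it has no 2-byte GBK code.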
Quick Facts
| Property | Value |
|---|---|
| Full Name | Guójiā Biāozhǔn 2312 (National Standard 2312) |
| Established | 1980, mainland China |
| Characters | 6,763 Chinese + 682 other |
| Transfer encoding | EUC-CN (most common) |
| Superseded by | GBK (1995), GB18030 (2000) |
| Windows code page | CP20936 (for GB2312), CP936 (for GBK) |
| IANA charset | GB2312 |
Common Pitfalls
Confusing GB2312 with GBK and GB18030. Many systems accept any of these as "Chinese encoding," but they differ in coverage. A character in GBK may not be in GB2312. Always prefer GB18030 for new systems that need Simplified Chinese support, as it covers all Unicode characters.
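To see which member of the family a given string actually requires, one option is to try the codecs from narrowest to widest. The function name `narrowest_gb_codec` is hypothetical; this is a sketch built on Python's standard codecs, not a recommended detection method.

```python
def narrowest_gb_codec(text):
    """Return the first of gb2312, gbk, gb18030 that can encode text."""
    for codec in ('gb2312', 'gbk', 'gb18030'):
        try:
            text.encode(codec)
        except UnicodeEncodeError:
            continue
        return codec
    return None  # e.g. text containing lone surrogates


print(narrowest_gb_codec('中文'))  # gb2312
print(narrowest_gb_codec('漢字'))  # gbk
print(narrowest_gb_codec('😀'))   # gb18030
```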
The EUC-CN / GB2312 ambiguity. The IANA charset name "GB2312" is technically the zone-position standard, but in practice it refers to the EUC-CN transfer encoding. Python's 'gb2312' codec and 'euc_cn' codec both decode GB2312 content; they are aliases.
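The alias relationship can be confirmed with `codecs.lookup`, which resolves a name to its canonical codec. The helper `same_codec` is hypothetical and exists only for this illustration.

```python
import codecs


def same_codec(name_a, name_b):
    """True if both names resolve to the same canonical Python codec."""
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name


for name in ('gb2312', 'euc_cn', 'euc-cn'):
    print(name, '->', codecs.lookup(name).name)

print(same_codec('gb2312', 'euc_cn'))  # True
```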
Big5 vs. GB2312 confusion. Traditional Chinese text from Taiwan uses Big5; Simplified Chinese from mainland China uses GB2312/GBK/GB18030. The two families are mutually incompatible. Decoding bytes from one as the other produces mojibake: valid-looking Chinese characters with completely unrelated meanings, sometimes embarrassingly so.
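The effect is easy to reproduce. This small sketch deliberately decodes GB2312 bytes with Python's `big5` codec; `errors='replace'` is an assumption added so that any byte pair with no Big5 mapping becomes U+FFFD instead of raising.

```python
text = '中文'
gb_bytes = text.encode('gb2312')

# Wrong codec: the bytes are read as Big5 and come out as different characters.
garbled = gb_bytes.decode('big5', errors='replace')
print(repr(garbled))

# Right codec: the original text comes back.
print(gb_bytes.decode('gb2312'))  # 中文
```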