GB2312 / GB18030
Simplified Chinese character encoding family: GB2312 (6,763 characters) evolved to GBK and then GB18030, the mandatory Unicode-compatible Chinese national standard.
What is GB2312?
GB2312 (国家标准 2312, Guójiā Biāozhǔn 2312, "National Standard 2312") is the foundational character encoding standard for Simplified Chinese, established in mainland China in 1980 by the National Standards Bureau. It covers 6,763 Chinese characters in two frequency levels plus 682 non-Chinese characters including Latin letters, Greek alphabet, Cyrillic letters, Japanese kana, punctuation, and special symbols.
GB2312 forms the basis of the Simplified Chinese encoding family: GBK extended it, GB18030 extended GBK further, and each standard is backward compatible with its predecessor. Understanding GB2312 means understanding the entire lineage of Simplified Chinese text encoding.
Structure and Coverage
GB2312 organizes its characters in a grid of 94 "zones" (区, qū) by 94 "positions" (位, wèi), known as the zone-position code (区位码, qūwèi mǎ). Each character is identified by its zone number followed by its position number:
| Zones | Content |
|---|---|
| 1–9 | Non-Chinese: symbols, numerals, Latin, Greek, Cyrillic, Japanese kana, pinyin |
| 10–15 | Unused |
| 16–55 | Level 1 Chinese characters (3,755 chars, ordered by pinyin) |
| 56–87 | Level 2 Chinese characters (3,008 chars, ordered by radical) |
| 88–94 | Unused |
Level 1 characters cover the most frequently used vocabulary and are sufficient for everyday modern Chinese text. Level 2 characters include rare characters, place names, and specialized vocabulary.
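The zone table can be turned into a small lookup. The sketch below recovers a character's zone and position by encoding it with Python's built-in `gb2312` codec and subtracting 0xA0 from each byte. The helper names `zone_position` and `classify_zone` are illustrative and not part of any library.

```python
def zone_position(ch: str) -> tuple[int, int]:
    """Return the (zone, position) of one GB2312 double-byte character."""
    raw = ch.encode('gb2312')  # raises UnicodeEncodeError if not in GB2312
    if len(raw) != 2:
        raise ValueError('expected exactly one double-byte GB2312 character')
    return raw[0] - 0xA0, raw[1] - 0xA0


def classify_zone(zone: int) -> str:
    """Map a zone number to the content category from the zone table."""
    if 1 <= zone <= 9:
        return 'non-Chinese'
    if 16 <= zone <= 55:
        return 'level 1'
    if 56 <= zone <= 87:
        return 'level 2'
    if 10 <= zone <= 15 or 88 <= zone <= 94:
        return 'unused'
    raise ValueError(f'zone out of range: {zone}')


# The last sample is the Cyrillic letter А, not the Latin letter A
for ch in '啊中亍А':
    zone, pos = zone_position(ch)
    print(ch, zone, pos, classify_zone(zone))
```

For instance, 中 sits at zone 54, position 48, which places it among the Level 1 characters.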
EUC-CN vs. HZ
GB2312 itself defines a zone-position model. It is most commonly encountered in two transfer encodings:
EUC-CN (Extended Unix Code for Chinese): Each Chinese character becomes 2 bytes where each byte has its high bit set (0xA1–0xFE range). Zone Z, position P encodes as (Z + 0xA0) and (P + 0xA0). ASCII bytes are single-byte as usual.
HZ (Hanzi encoding for Internet messages): An older encoding for email where Chinese characters are escaped with ~{ and ~} markers, avoiding 8-bit bytes entirely for compatibility with 7-bit email gateways. Largely obsolete.
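To make the arithmetic concrete, here is a minimal sketch that builds the EUC-CN bytes from a zone and position, then compares them with what Python's built-in `gb2312` and `hz` codecs produce. The function name `euc_cn_bytes` is made up for this illustration.

```python
def euc_cn_bytes(zone: int, position: int) -> bytes:
    """Encode a GB2312 zone/position pair as two EUC-CN bytes."""
    if not (1 <= zone <= 94 and 1 <= position <= 94):
        raise ValueError('zone and position must both be in 1..94')
    return bytes([zone + 0xA0, position + 0xA0])


# 中 sits at zone 54, position 48
raw = euc_cn_bytes(54, 48)
print(raw)                   # b'\xd6\xd0'
print(raw.decode('gb2312'))  # 中

# HZ carries the same two bytes with the high bit cleared,
# wrapped between the ~{ and ~} escape markers
hz = '中'.encode('hz')
seven_bit = bytes(b & 0x7F for b in raw)
print(hz)
print(seven_bit in hz)
```

Because every HZ byte stays below 0x80, the encoded text passes through 7-bit channels that would corrupt EUC-CN.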
Code Examples
```python
# Python: GB2312 encoding
text = '中文'  # "Chinese language"

# GB2312 (via EUC-CN)
encoded = text.encode('gb2312')
print(encoded)       # b'\xd6\xd0\xce\xc4'
print(len(encoded))  # 4 bytes for 2 characters

# Decoding
print(b'\xd6\xd0\xce\xc4'.decode('gb2312'))  # 中文

# GBK and GB18030 are supersets of GB2312, so GB2312 characters
# produce the same bytes in all three codecs
assert text.encode('gbk') == encoded
assert text.encode('gb18030') == encoded

# Traditional-only forms such as 體 (Simplified: 体) are not in GB2312.
# Note that 繁 itself is used in Simplified Chinese and is in GB2312.
try:
    '體'.encode('gb2312')
except UnicodeEncodeError:
    print('體'.encode('gbk'))      # GBK covers it
    print('體'.encode('gb18030'))  # GB18030 covers it too
```
The GBK Extension
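GBK (汉字内码扩展规范, Hànzì Nèimǎ Kuòzhǎn Guīfàn, "Chinese Character Internal Code Extension Specification"; the abbreviation comes from Guójiā Biāozhǔn Kuòzhǎn, "national standard extension") extends GB2312 to cover the 20,902 Chinese characters of the CJK Unified Ideographs block in Unicode 1.1. That set includes the Traditional Chinese forms still needed in mainland China, for example in personal names. GBK keeps the 2-byte structure of GB2312's EUC-CN but widens both byte ranges: the first byte runs from 0x81 to 0xFE and the second from 0x40 to 0xFE (excluding 0x7F), so the second byte can fall below 0xA1 and even into the ASCII range.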
Windows code page CP936 is Microsoft's implementation of GBK.
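The wider byte ranges are easy to observe. The sketch below checks encoded characters against GBK's commonly documented ranges (first byte 0x81–0xFE, second byte 0x40–0xFE excluding 0x7F) and against the narrower EUC-CN range. The helper names `in_gbk_range` and `in_euc_cn_range` are hypothetical.

```python
import codecs


def in_gbk_range(pair: bytes) -> bool:
    """Check a 2-byte sequence against the GBK lead/trail byte ranges."""
    lead, trail = pair
    return 0x81 <= lead <= 0xFE and 0x40 <= trail <= 0xFE and trail != 0x7F


def in_euc_cn_range(pair: bytes) -> bool:
    """Check that both bytes fall in the EUC-CN range 0xA1-0xFE."""
    return all(0xA1 <= b <= 0xFE for b in pair)


# 中 is a GB2312 character; 體 is a Traditional form that only GBK covers
for ch in '中體':
    raw = ch.encode('gbk')
    print(ch, raw.hex(), in_gbk_range(raw), in_euc_cn_range(raw))

# Python resolves the name cp936 to its gbk codec
print(codecs.lookup('cp936').name)
```

A GB2312 character passes both checks, while a GBK-only ideograph passes only the GBK check because at least one of its bytes lies below 0xA1.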
The GB18030 Standard
GB18030 (issued 2000, revised 2005, updated 2022) is a mandatory national standard in China — all software sold in China must support it. GB18030 extends GBK to cover the full Unicode range (U+0000–U+10FFFF) using:
- 1-byte encoding: 0x00–0x7F (ASCII)
- 2-byte encoding: GB2312/GBK characters
- 4-byte encoding: all remaining Unicode code points, both BMP characters outside GBK and the supplementary planes
GB18030 is the only national standard that covers all Unicode code points. A properly implemented GB18030 system can represent any Unicode character.
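The three encoded lengths can be observed with Python's built-in `gb18030` codec. This is a small sketch; the sample characters and the helper name `gb18030_len` are chosen for illustration only.

```python
def gb18030_len(ch: str) -> int:
    """Return how many bytes one character occupies in GB18030."""
    return len(ch.encode('gb18030'))


samples = {
    'A': 'ASCII',
    '中': 'GB2312 character',
    '體': 'GBK character outside GB2312',
    '한': 'BMP character outside GBK',
    '😀': 'supplementary-plane character',
}

for ch, label in samples.items():
    raw = ch.encode('gb18030')
    assert raw.decode('gb18030') == ch  # the round trip is lossless
    print(f'U+{ord(ch):04X} {label}: {gb18030_len(ch)} byte(s), {raw.hex()}')
```

The Korean syllable shows that 4-byte sequences are not reserved for supplementary planes: any BMP character without a 1-byte or 2-byte code also takes four bytes.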
Quick Facts
| Property | Value |
|---|---|
| Full Name | Guójiā Biāozhǔn 2312 (National Standard 2312) |
| Established | 1980, mainland China |
| Characters | 6,763 Chinese + 682 other |
| Transfer encoding | EUC-CN (most common) |
| Superseded by | GBK (1995), GB18030 (2000) |
| Windows code page | CP20936 (for GB2312), CP936 (for GBK) |
| IANA charset | GB2312 |
Common Pitfalls
Confusing GB2312 with GBK and GB18030. Many systems accept any of these as "Chinese encoding," but they differ in coverage. A character in GBK may not be in GB2312. Always prefer GB18030 for new systems that need Simplified Chinese support, as it covers all Unicode characters.
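One way to see the coverage difference is to ask which codec is the narrowest one that can encode a given string. The following is a sketch using only Python's built-in codecs; `narrowest_gb_codec` is a hypothetical helper name.

```python
def narrowest_gb_codec(text: str) -> str:
    """Return the smallest GB-family codec able to encode all of `text`."""
    for codec in ('gb2312', 'gbk', 'gb18030'):
        try:
            text.encode(codec)
        except UnicodeEncodeError:
            continue  # this codec lacks a character, try the next superset
        return codec
    raise ValueError('text contains code points no GB codec can encode')


print(narrowest_gb_codec('中文'))  # gb2312
print(narrowest_gb_codec('體'))    # gbk
print(narrowest_gb_codec('😀'))    # gb18030
```

A check like this is mainly useful for diagnosing legacy data. For new output, encoding directly as GB18030 avoids the question.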
The EUC-CN / GB2312 ambiguity. The IANA charset name "GB2312" is technically the zone-position standard, but in practice it refers to the EUC-CN transfer encoding. Python's 'gb2312' codec and 'euc_cn' codec both decode GB2312 content; they are aliases.
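The alias relationship can be confirmed from Python itself. This short check uses the standard `codecs.lookup` function, which returns the canonical codec name for any accepted label.

```python
import codecs

# Each label resolves to the same underlying codec
for label in ('gb2312', 'euc_cn', 'euc-cn'):
    print(label, '->', codecs.lookup(label).name)

# Identical codec, therefore identical bytes
same_bytes = '中文'.encode('euc_cn') == '中文'.encode('gb2312')
print(same_bytes)  # True
```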
Big5 vs. GB2312 confusion. Traditional Chinese text from Taiwan uses Big5; Simplified Chinese from mainland China uses GB2312/GBK/GB18030. These are mutually incompatible. Mixing them produces garbage output that happens to consist of valid-looking Chinese characters with completely different meanings — sometimes embarrassingly so.
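The effect is easy to reproduce. In this sketch, bytes produced by the `gb2312` codec are deliberately decoded with Python's `big5` codec; `errors='replace'` is used so that any byte pair with no Big5 mapping becomes U+FFFD instead of raising.

```python
text = '中文'
gb_bytes = text.encode('gb2312')

# Misread the Simplified Chinese bytes as Big5
misread = gb_bytes.decode('big5', errors='replace')
print(misread)
print(misread == text)  # False

# Decoding with the codec that produced the bytes restores the text
restored = gb_bytes.decode('gb2312')
print(restored == text)  # True
```

No exception is raised in the misread case, which is why this kind of mix-up tends to go unnoticed until someone reads the output.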