
GB2312 / GB18030

The Simplified Chinese character encoding family: GB2312 (6,763 Chinese characters) evolved into GBK and then into GB18030, China's mandatory national standard, which is compatible with Unicode.


What is GB2312?

GB2312 (国家标准 2312, Guójiā Biāozhǔn 2312, "National Standard 2312") is the foundational character encoding standard for Simplified Chinese, established in mainland China in 1980 by the National Standards Bureau. It covers 6,763 Chinese characters in two frequency levels plus 682 non-Chinese characters including Latin letters, Greek alphabet, Cyrillic letters, Japanese kana, punctuation, and special symbols.

GB2312 forms the basis of the Simplified Chinese encoding family: GBK extended it, GB18030 extended GBK further, and each successor keeps the byte sequences of the GB2312 characters unchanged. Understanding GB2312 is therefore the starting point for understanding the entire lineage of Simplified Chinese text encoding.

Structure and Coverage

GB2312 organizes its characters in a grid of 94 "zones" (区, qū) by 94 "positions" (位, wèi), known as the zone-position code (区位码, qūwèi mǎ):

Zones    Content
1–9      Non-Chinese: symbols, numerals, Latin, Greek, Cyrillic, Japanese kana, pinyin
10–15    Unused
16–55    Level 1 Chinese characters (3,755 chars, ordered by pinyin)
56–87    Level 2 Chinese characters (3,008 chars, ordered by radical)
88–94    Unused

Level 1 characters cover the most frequently used vocabulary and are sufficient for everyday modern Chinese text. Level 2 characters include rare characters, place names, and specialized vocabulary.
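The zone layout in the table above can be checked from code. The following is a minimal sketch, assuming Python's built-in 'gb2312' codec and the fact that in the common EUC-CN form the first byte equals the zone number plus 0xA0; the helper names zone_of and classify_zone are hypothetical and exist only for this illustration.

```python
def zone_of(char: str) -> int:
    """Return the GB2312 zone (1-94) of one non-ASCII GB2312 character."""
    encoded = char.encode("gb2312")  # raises UnicodeEncodeError if not in GB2312
    if len(encoded) != 2:
        raise ValueError("expected a single two-byte GB2312 character")
    return encoded[0] - 0xA0  # EUC-CN first byte = zone + 0xA0


def classify_zone(zone: int) -> str:
    """Map a zone number to its category in the zone table."""
    if 1 <= zone <= 9:
        return "non-Chinese"
    if 16 <= zone <= 55:
        return "level 1"
    if 56 <= zone <= 87:
        return "level 2"
    if 10 <= zone <= 15 or 88 <= zone <= 94:
        return "unused"
    raise ValueError(f"zone out of range: {zone}")


# One sample per populated region: two Level 1, one Level 2, one Greek letter.
for ch in "啊中亍α":
    zone = zone_of(ch)
    print(ch, zone, classify_zone(zone))
```

Because Level 1 is ordered by pinyin, 啊 (ā) sits at the very start of zone 16, while 亍 opens the radical-ordered Level 2 in zone 56.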

EUC-CN vs. HZ

GB2312 itself defines a zone-position model. It is most commonly encountered in two transfer encodings:

EUC-CN (Extended Unix Code for Chinese): Each Chinese character becomes 2 bytes where each byte has its high bit set (0xA1–0xFE range). Zone Z, position P encodes as (Z + 0xA0) and (P + 0xA0). ASCII bytes are single-byte as usual.
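The zone-position arithmetic described above is small enough to write out directly. This sketch is illustrative only: quwei_to_euc_cn and euc_cn_to_quwei are hypothetical helper names, and Python's built-in 'gb2312' codec is used just to confirm the result.

```python
def quwei_to_euc_cn(zone: int, position: int) -> bytes:
    """Encode a zone-position pair as two EUC-CN bytes."""
    if not (1 <= zone <= 94 and 1 <= position <= 94):
        raise ValueError("zone and position must both be in 1..94")
    return bytes([zone + 0xA0, position + 0xA0])


def euc_cn_to_quwei(pair: bytes) -> tuple:
    """Recover (zone, position) from a two-byte EUC-CN sequence."""
    if len(pair) != 2 or not all(0xA1 <= b <= 0xFE for b in pair):
        raise ValueError("not a two-byte EUC-CN sequence")
    return pair[0] - 0xA0, pair[1] - 0xA0


# Zone 54, position 48 is the character 中.
encoded = quwei_to_euc_cn(54, 48)
print(encoded)                   # b'\xd6\xd0'
print(encoded.decode("gb2312"))  # 中

# Going the other way for 文, whose EUC-CN bytes are 0xCE 0xC4.
print(euc_cn_to_quwei("文".encode("gb2312")))  # (46, 36)
```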

HZ (Hanzi encoding for Internet messages): An older encoding for email where Chinese characters are escaped with ~{ and ~} markers, avoiding 8-bit bytes entirely for compatibility with 7-bit email gateways. Largely obsolete.
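To see how HZ relates to EUC-CN, the sketch below encodes the same text both ways with Python's built-in 'hz' and 'gb2312' codecs. The manual construction (clear the high bit of each EUC-CN byte and wrap the run in ~{ and ~}) is a simplification that only holds for a string made entirely of GB2312 double-byte characters; mixed ASCII text and literal tildes need extra handling that is not shown here.

```python
text = "中文"

euc = text.encode("gb2312")  # 8-bit EUC-CN form
hz = text.encode("hz")       # 7-bit HZ form

# Manual HZ for an all-Chinese string: strip the high bit, add the escapes.
manual = b"~{" + bytes(b & 0x7F for b in euc) + b"~}"

print(euc)                         # b'\xd6\xd0\xce\xc4'
print(hz)                          # b'~{VPND~}'
print(hz == manual)                # True
print(all(b < 0x80 for b in hz))   # True: safe for 7-bit channels
print(hz.decode("hz") == text)     # True
```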

Code Examples

# Python: GB2312 encoding
text = '中文'  # "Chinese language"

# GB2312 (via EUC-CN)
encoded = text.encode('gb2312')
print(encoded)          # b'\xd6\xd0\xce\xc4'
print(len(encoded))     # 4 bytes for 2 characters

# Decoding
b'\xd6\xd0\xce\xc4'.decode('gb2312')  # '中文'

# GBK is a superset of GB2312 — usually safe to use GBK
text.encode('gbk')       # Same bytes for GB2312 chars
text.encode('gb18030')   # Same bytes for GB2312 chars

# Traditional-only forms are not in GB2312
# (note: 繁 itself IS in GB2312, since it is also used in Simplified Chinese)
try:
    '漢'.encode('gb2312')  # Traditional form of 汉; not in GB2312
except UnicodeEncodeError:
    print('漢'.encode('gbk'))      # GBK covers it
    print('漢'.encode('gb18030'))  # GB18030 covers it too

The GBK Extension

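GBK (汉字内码扩展规范, Hànzì Nèimǎ Kuòzhǎn Guīfàn, "Chinese Character Internal Code Extension Specification"; the abbreviation comes from Guójiā Biāozhǔn Kuòzhǎn, "national standard, extended") extends GB2312 to cover the 20,902 Chinese characters of the CJK Unified Ideographs block in Unicode 1.1. That set includes the Traditional Chinese characters still used in mainland China, for example in personal names. GBK keeps the 2-byte structure of GB2312's EUC-CN form but widens both byte ranges: the first byte runs from 0x81 to 0xFE, and the second byte from 0x40 to 0xFE (excluding 0x7F), so the second byte can fall below 0xA1 and even into the ASCII range.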
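The difference in byte ranges can be made concrete with a short check. This is a sketch under the commonly documented GBK ranges (first byte 0x81–0xFE, second byte 0x40–0xFE except 0x7F); the function names are hypothetical, and the range test only looks at byte values, not at whether a code is actually assigned.

```python
def in_gb2312_range(pair: bytes) -> bool:
    """True if both bytes fall in the EUC-CN range 0xA1-0xFE."""
    return len(pair) == 2 and all(0xA1 <= b <= 0xFE for b in pair)


def in_gbk_range(pair: bytes) -> bool:
    """True if the pair falls in the wider GBK two-byte range."""
    if len(pair) != 2:
        return False
    lead, trail = pair
    return 0x81 <= lead <= 0xFE and 0x40 <= trail <= 0xFE and trail != 0x7F


simplified = "汉".encode("gbk")   # also a GB2312 character
traditional = "漢".encode("gbk")  # GBK only

for label, pair in (("汉", simplified), ("漢", traditional)):
    print(label, pair.hex(), in_gb2312_range(pair), in_gbk_range(pair))
```

One practical consequence: because a GBK second byte may look like an ASCII byte, code that scans GBK text byte by byte for ASCII delimiters can split a character in half.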

Windows code page CP936 is Microsoft's implementation of GBK.

The GB18030 Standard

GB18030 (issued 2000, revised 2005, updated 2022) is a mandatory national standard in China — all software sold in China must support it. GB18030 extends GBK to cover the full Unicode range (U+0000–U+10FFFF) using:

  • 1-byte encoding: 0x00–0x7F (ASCII)
  • 2-byte encoding: GB2312/GBK characters
  • 4-byte encoding: all remaining Unicode code points, both BMP characters outside GBK and the supplementary planes

GB18030 is the only national standard that covers all Unicode code points. A properly implemented GB18030 system can therefore represent any Unicode character and convert text to and from UTF-8 or UTF-16 without loss.
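The three encoded widths are easy to observe with Python's built-in 'gb18030' codec. The sample characters below are arbitrary choices for illustration: an ASCII letter, a GB2312 character, an Arabic letter (in the BMP but outside GBK), and an emoji from a supplementary plane. The helper name gb18030_width is hypothetical.

```python
def gb18030_width(char: str) -> int:
    """Number of bytes GB18030 uses for a single character."""
    return len(char.encode("gb18030"))


samples = ["A", "中", "\u0627", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("gb18030")
    # Every sample must survive a round trip through GB18030.
    assert encoded.decode("gb18030") == ch
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s), {encoded.hex()}")
```

Note that the Arabic letter takes four bytes even though it lies in the Basic Multilingual Plane: the 4-byte form is used for everything that has no 1-byte or 2-byte code, not only for supplementary characters.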

Quick Facts

Property            Value
Full Name           Guójiā Biāozhǔn 2312 (National Standard 2312)
Established         1980, mainland China
Characters          6,763 Chinese + 682 other
Transfer encoding   EUC-CN (most common)
Superseded by       GBK (1995), GB18030 (2000)
Windows code page   CP20936 (for GB2312), CP936 (for GBK)
IANA charset        GB2312

Common Pitfalls

Confusing GB2312 with GBK and GB18030. Many systems accept any of these as "Chinese encoding," but they differ in coverage. A character in GBK may not be in GB2312. Always prefer GB18030 for new systems that need Simplified Chinese support, as it covers all Unicode characters.
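A quick way to see the coverage difference is to ask which member of the family is the first that can encode a given string. The sketch below uses Python's built-in codecs; smallest_gb_encoding is a hypothetical helper written for this illustration, not a standard library function.

```python
def smallest_gb_encoding(text: str) -> str:
    """Return the first of gb2312, gbk, gb18030 that can encode `text`."""
    for name in ("gb2312", "gbk", "gb18030"):
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # not representable here, try the next larger set
        return name
    # Only reachable for strings that are not valid Unicode text,
    # such as those containing lone surrogates.
    raise ValueError("text cannot be encoded in any GB encoding")


print(smallest_gb_encoding("中文"))            # gb2312
print(smallest_gb_encoding("漢字"))            # gbk
print(smallest_gb_encoding("中\U0001F600"))    # gb18030
```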

The EUC-CN / GB2312 ambiguity. The IANA charset name "GB2312" is technically the zone-position standard, but in practice it refers to the EUC-CN transfer encoding. Python's 'gb2312' codec and 'euc_cn' codec both decode GB2312 content; they are aliases.
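The alias relationship can be confirmed with the standard codecs module, which resolves any alias to the codec's canonical name. This is a small check rather than a full list of aliases.

```python
import codecs

# Both names resolve to the same underlying codec.
canonical_gb2312 = codecs.lookup("gb2312").name
canonical_euc_cn = codecs.lookup("euc_cn").name
print(canonical_gb2312, canonical_euc_cn)

# So both decode the same bytes identically.
raw = b"\xd6\xd0\xce\xc4"
print(raw.decode("gb2312"), raw.decode("euc_cn"))
```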

Big5 vs. GB2312 confusion. Traditional Chinese text from Taiwan uses Big5; Simplified Chinese from mainland China uses GB2312/GBK/GB18030. These are mutually incompatible. Mixing them produces garbage output that happens to consist of valid-looking Chinese characters with completely different meanings — sometimes embarrassingly so.
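The mismatch is easy to reproduce. The sketch below encodes text as GB2312 and then deliberately decodes the bytes as Big5 with Python's built-in codecs; errors="replace" is used so that any byte pair Big5 does not define becomes U+FFFD instead of raising an exception.

```python
text = "中文"

raw = text.encode("gb2312")

# Wrong: treating GB2312 bytes as Big5 yields different characters.
wrong = raw.decode("big5", errors="replace")

# Right: decoding with the encoding that produced the bytes.
right = raw.decode("gb2312")

print(repr(wrong))
print(repr(right))
print(wrong == text, right == text)
```

Because many byte pairs are valid in both encodings, the wrong decode often succeeds silently, which is why the encoding has to be known from metadata rather than inferred from a lack of errors.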
