Writing Systems of the World · Chapter 3

Chinese Characters: 3,000 Years of Writing

CJK characters are the largest single category in Unicode, with tens of thousands of ideographs. This chapter covers the history, the CJK Unification controversy, and the ever-growing Extension blocks.

~5,000 words · ~20 min read

No writing system tests the limits of Unicode quite like Chinese characters. With over 97,000 characters encoded and tens of thousands more awaiting inclusion, the CJK (Chinese, Japanese, Korean) ideographs are the largest group of characters in Unicode, spread across one main block and a series of extension blocks. They encode not just a writing system but an entire civilization's worth of recorded history (oracle bones, bronze vessels, imperial edicts, literary masterpieces, scientific treatises, and everyday grocery lists), all expressed through a logographic system that has evolved continuously for over three thousand years.

From Oracle Bones to Digital Screens

The earliest confirmed Chinese writing dates to the Shang Dynasty (c. 1250–1050 BCE): inscriptions on animal bones and turtle shells used in divination. These oracle bone script (甲骨文, jiǎgǔwén) characters are already recognizably ancestral to modern Chinese — a remarkable continuity unmatched by any other major writing tradition.

Oracle bone characters were largely pictographic: 日 (sun), 月 (moon), 山 (mountain), 水 (water). Over centuries, characters became more abstract and systematic, incorporating semantic-phonetic components. By the Han Dynasty (206 BCE–220 CE), the majority of characters followed the semantic-phonetic (形声, xíngshēng) principle: one component suggests meaning, another suggests pronunciation. About 80–90% of modern Chinese characters follow this pattern.

The script styles themselves evolved in a long, continuous sequence: oracle bone script (甲骨文) → bronze inscriptions (金文) → seal script (篆书) → clerical script (隶书) → regular script (楷书).

The People's Republic of China introduced simplified characters (简体字) in the 1950s–1960s, reducing stroke counts for approximately 2,000 common characters to improve literacy. Taiwan, Hong Kong, Macau, and many overseas Chinese communities retain traditional characters (繁體字). This split creates a fundamental encoding challenge: both forms must be represented. Where a simplified form differs in shape from its traditional counterpart, Unicode encodes the two at separate code points, so converting between them is a mapping problem, and not always a one-to-one mapping (simplified 发 corresponds to both 發 and 髮).

Han Unification: The Most Controversial Unicode Decision

When the Unicode Consortium was designing the standard in the late 1980s, they faced a profound dilemma: Chinese, Japanese, and Korean each had encoding systems with thousands of overlapping characters. Should Unicode assign separate code points to what was, in origin, the same character simply because Chinese, Japanese, and Korean fonts rendered it with slightly different stroke styles?

The decision, known as Han unification, was to merge characters judged to be the same underlying graph into single code points. A character that appears in all three writing traditions gets one code point, not three. The resulting CJK Unified Ideographs block (U+4E00–U+9FFF, originally 20,902 characters) was published in 1992 as part of Unicode 1.0.1 and remains the core of the Han repertoire.

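The controversy was immediate and has never fully subsided. A single code point such as U+76F4 (直, "straight") or U+9AA8 (骨, "bone") is drawn with visibly different strokes under mainland Chinese, Taiwanese, Japanese, and Korean typographic conventions. Forms that differ in structure were not unified: Japanese 辺 (hen, "vicinity") is U+8FBA, traditional Chinese 邊 is U+908A, and simplified 边 is U+8FB9, three separate code points. For unified characters, software must know the language of the text and use an appropriate font to display the expected glyph. This creates real-world problems: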

  • A Chinese-Japanese bilingual document may display characters incorrectly if the font-language pairing is wrong
  • Search engines must supply their own variant mapping if a query for 邊 should also match 边, because these are separate code points and Unicode normalization does not equate them
  • Sorting and collation differ between Chinese, Japanese, and Korean, even for shared characters

Unicode's response has been pragmatic: the choice of regional glyph is the responsibility of higher-level protocols, such as language tagging (for example, the HTML lang attribute) and font selection, not of the character encoding itself.
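To see what unification does and does not merge, the following minimal sketch prints the code points of 辺, 邊, and 边 and checks that Unicode normalization leaves them distinct. The `VARIANT_FOLD` table and the `fold_variants` helper are hypothetical illustrations of the kind of application-level mapping a search engine would have to provide; real systems use far larger tables.

```python
import unicodedata

# Hypothetical, deliberately tiny folding table: map variant forms to one
# representative so that a search for any of them can match the others.
VARIANT_FOLD = {"邊": "边", "辺": "边"}

def code_point(ch: str) -> str:
    """Format a character's code point in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"

def fold_variants(text: str) -> str:
    """Normalize, then apply the application-level variant table."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(VARIANT_FOLD.get(ch, ch) for ch in text)

for ch in "辺邊边":
    print(ch, code_point(ch))

# Normalization alone does not make the traditional and simplified forms equal.
print(unicodedata.normalize("NFKC", "邊") == unicodedata.normalize("NFKC", "边"))
# The hypothetical folding step does.
print(fold_variants("海邊") == fold_variants("海边"))
```

The point of the design is the layering: normalization handles what Unicode defines as equivalent, and everything else (regional variants, simplified versus traditional) is a separate, explicit table owned by the application.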

The CJK Blocks

Unicode has expanded its CJK coverage repeatedly as more characters were needed:

| Block | Range | Characters | Added in |
|---|---|---|---|
| CJK Unified Ideographs | U+4E00–U+9FFF | 20,902+ | Unicode 1.0.1 (1992) |
| CJK Extension A | U+3400–U+4DBF | 6,592 | Unicode 3.0 (1999) |
| CJK Extension B | U+20000–U+2A6DF | 42,711 | Unicode 3.1 (2001) |
| CJK Extension C | U+2A700–U+2B73F | 4,149 | Unicode 5.2 (2009) |
| CJK Extension D | U+2B740–U+2B81F | 222 | Unicode 6.0 (2010) |
| CJK Extension E | U+2B820–U+2CEAF | 5,762 | Unicode 8.0 (2015) |
| CJK Extension F | U+2CEB0–U+2EBEF | 7,473 | Unicode 10.0 (2017) |
| CJK Extension G | U+30000–U+3134F | 4,939 | Unicode 13.0 (2020) |
| CJK Extension H | U+31350–U+323AF | 4,192 | Unicode 15.0 (2022) |
| CJK Extension I | U+2EBF0–U+2EE5F | 622 | Unicode 15.1 (2023) |
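The ranges in the table translate directly into a lookup. The sketch below is a minimal illustration, with the hypothetical names `CJK_BLOCKS` and `cjk_block`. It reports which block a character's code point falls in and does not check whether that particular code point is actually assigned.

```python
from typing import Optional

# (first, last, name) copied from the block table above.
CJK_BLOCKS = [
    (0x3400, 0x4DBF, "CJK Extension A"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x20000, 0x2A6DF, "CJK Extension B"),
    (0x2A700, 0x2B73F, "CJK Extension C"),
    (0x2B740, 0x2B81F, "CJK Extension D"),
    (0x2B820, 0x2CEAF, "CJK Extension E"),
    (0x2CEB0, 0x2EBEF, "CJK Extension F"),
    (0x2EBF0, 0x2EE5F, "CJK Extension I"),
    (0x30000, 0x3134F, "CJK Extension G"),
    (0x31350, 0x323AF, "CJK Extension H"),
]

def cjk_block(ch: str) -> Optional[str]:
    """Return the CJK ideograph block containing ch, or None if it is in none."""
    cp = ord(ch)
    for first, last, name in CJK_BLOCKS:
        if first <= cp <= last:
            return name
    return None

print(cjk_block("中"))           # a character from the main block
print(cjk_block("\U00020000"))   # the first code point of Extension B
print(cjk_block("A"))            # not a CJK ideograph
```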

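The Supplementary Ideographic Plane (Plane 2) contains Extensions B through F and Extension I, while Extensions G and H sit in the Tertiary Ideographic Plane (Plane 3). Together they hold tens of thousands of rare characters used in historical documents, classical literature, place names, and personal names. Extension B, added in Unicode 3.1, was among the first sets of characters assigned above U+FFFF, so it was the point at which UTF-16 software had to handle surrogate pairs in practice (the mechanism itself had been defined in Unicode 2.0).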
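The surrogate-pair arithmetic for a code point above U+FFFF is short enough to show in full. This is a sketch of the standard UTF-16 calculation; the function name `to_surrogate_pair` is chosen here for illustration, and the result is cross-checked against Python's built-in UTF-16 encoder.

```python
from typing import Tuple

def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x20000)     # first code point of Extension B
print(f"U+20000 -> {high:04X} {low:04X}")

# Cross-check against the built-in encoder (big-endian, no byte order mark).
print("\U00020000".encode("utf-16-be").hex().upper())
```

Code that indexes UTF-16 strings by code unit will see such a character as two units, which is why Extension B characters are a common source of length and truncation bugs.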

Radical-Stroke Indexing

Traditional Chinese dictionaries organize characters by radical (部首, bùshǒu) — a semantic component that serves as the index key — and then by remaining stroke count. The Kangxi Dictionary (1716), an imperial Chinese compilation, established a canonical set of 214 radicals that is still used today.

Unicode includes the Kangxi Radicals block (U+2F00–U+2FD5, 214 characters) and the CJK Radicals Supplement (U+2E80–U+2EFF, 115 characters). These exist primarily for compatibility and reference: each Kangxi radical has a compatibility mapping to the ordinary unified ideograph of the same shape, which is what running text normally uses. The Ideographic Description Characters (U+2FF0–U+2FFB, extended through U+2FFF in Unicode 15.1) are prefix operators that combine into Ideographic Description Sequences, which describe the spatial structure of a character in terms of its components. They are useful for describing unencoded characters and for character research.
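Two of these points can be illustrated briefly. The sketch below first shows the compatibility mapping from a Kangxi radical to its unified ideograph via NFKC normalization, then parses an Ideographic Description Sequence into a tree. The parser (`parse_ids`, `IDC_ARITY`) is a hypothetical minimal implementation that covers only the twelve original operators in U+2FF0–U+2FFB and performs no validation of the components themselves.

```python
import unicodedata

# Arity of the twelve original Ideographic Description Characters.
# All are binary except U+2FF2 and U+2FF3, which take three operands.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2FF2"] = 3
IDC_ARITY["\u2FF3"] = 3

def parse_ids(ids: str):
    """Parse an IDS into a nested tuple: (operator, operand, operand, ...)."""
    def parse(pos: int):
        if pos >= len(ids):
            raise ValueError("IDS ended before all operands were supplied")
        ch = ids[pos]
        pos += 1
        if ch not in IDC_ARITY:
            return ch, pos                  # a plain component
        operands = []
        for _ in range(IDC_ARITY[ch]):
            operand, pos = parse(pos)
            operands.append(operand)
        return (ch, *operands), pos

    tree, end = parse(0)
    if end != len(ids):
        raise ValueError("extra characters after a complete IDS")
    return tree

# KANGXI RADICAL ONE (U+2F00) maps to the unified ideograph U+4E00 under NFKC.
print(unicodedata.normalize("NFKC", "\u2F00") == "\u4E00")

# 江 described as "left-to-right: 氵 then 工".
print(parse_ids("\u2FF0\u6C35\u5DE5"))
```

Because the operators are prefix and have fixed arity, a sequence needs no brackets: nesting falls out of the recursion.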

Encoding History: GB2312, Big5, GB18030

Before Unicode, Chinese computing was fragmented across incompatible encodings:

GB2312 (1981): Mainland China's standard for simplified characters. A two-byte encoding covering 6,763 Chinese characters plus Latin and other symbols. Extended by GBK to cover more characters.

Big5 (1984): Taiwan's standard for traditional characters. Also a two-byte encoding, but incompatible with GB2312 — the same byte sequence could mean different characters depending on which encoding was in use.

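GB18030 (2000, updated 2005 and 2022): The current mandatory Chinese national standard. It is a variable-width encoding (one, two, or four bytes per character) that stays backward compatible with GB2312 and GBK while mapping the entire Unicode code space, so in effect it is an alternative encoding form of Unicode, not a separate or larger repertoire. The 2022 edition extends the required character support further into the CJK Extension blocks. Software sold in mainland China is required to support GB18030.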
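Python ships codecs for these encodings, so the variable width of GB18030 and its compatibility with GB2312 can be observed directly. This is a small sketch using the standard `gb18030` and `gb2312` codec names; the helper `gb18030_width` is a name introduced here for illustration.

```python
def gb18030_width(ch: str) -> int:
    """Number of bytes GB18030 uses for a single character."""
    return len(ch.encode("gb18030"))

for ch in ["A", "中", "\U00020000"]:
    print(f"U+{ord(ch):04X}: {gb18030_width(ch)} byte(s)")

# A character that exists in GB2312 keeps the same two bytes in GB18030.
print("中".encode("gb2312") == "中".encode("gb18030"))

# Supplementary-plane characters round-trip through the four-byte form.
print("\U00020000".encode("gb18030").decode("gb18030") == "\U00020000")
```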

The incompatibility of GB2312 and Big5 meant that pre-Unicode cross-strait digital communication was fraught with mojibake (garbled text). Unicode's Han unification, whatever its flaws, at least created a common encoding space where simplified and traditional text can coexist without ambiguity in the code point layer — even if visual disambiguation requires font intelligence.
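The mojibake problem is easy to reproduce. The sketch below encodes a simplified-Chinese string as GB2312 and then decodes the same bytes as Big5, using Python's standard codecs. The helper name `misread` is hypothetical, and `errors="replace"` is used so that any byte pair without a Big5 mapping becomes U+FFFD instead of raising an exception.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text in one legacy encoding and decode it as another."""
    raw = text.encode(written_as)
    return raw.decode(read_as, errors="replace")

original = "中文"
print(original.encode("gb2312").hex().upper())   # the bytes on the wire
print(misread(original, "gb2312", "big5"))       # garbled: wrong decoder
print(misread(original, "gb2312", "gb2312"))     # intact: matching decoder
```

With Unicode, the same text has one set of code points regardless of region, so the only remaining question is which encoding form (UTF-8, UTF-16, GB18030) carries them.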

Input Methods

Typing Chinese on a standard keyboard requires an Input Method Editor (IME). The main approaches:

Pinyin input: The user types the romanization of a character (e.g., "zhong") and the IME presents candidates. Modern Pinyin IMEs use language models to predict the most likely character sequences from context, enabling fluent touch-typing. Google Pinyin, Microsoft Pinyin, and Sogou Pinyin are popular implementations.
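The candidate-ranking idea can be sketched in a few lines. Everything in the example below is a toy assumption: the `CANDIDATES` frequencies and the `BIGRAM_BONUS` table are invented numbers standing in for the statistical language models that real IMEs use, and the function `candidates` is a hypothetical name.

```python
from typing import List, Optional

# Hypothetical data: pinyin syllable -> (character, made-up base frequency).
CANDIDATES = {
    "zhong": [("中", 100), ("种", 40), ("重", 35), ("钟", 10)],
    "guo": [("过", 95), ("国", 90), ("果", 30)],
}

# Hypothetical context bonus for a character following a given previous one.
BIGRAM_BONUS = {("中", "国"): 500}

def candidates(syllable: str, previous: Optional[str] = None) -> List[str]:
    """Rank characters for a syllable, boosting ones that fit the context."""
    scored = []
    for ch, freq in CANDIDATES.get(syllable, []):
        score = freq + BIGRAM_BONUS.get((previous, ch), 0)
        scored.append((score, ch))
    scored.sort(key=lambda pair: -pair[0])   # highest score first
    return [ch for _, ch in scored]

print(candidates("zhong"))               # ranked by base frequency only
print(candidates("guo"))                 # 过 outranks 国 without context
print(candidates("guo", previous="中"))  # context promotes 国 (as in 中国)
```

The same syllable yields a different first choice depending on what was typed before it, which is the behavior that lets a user type whole phrases without stopping to pick candidates.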

Wubi (五笔字型): A shape-based input method that encodes characters as sequences of their constituent strokes and components. Trained Wubi typists can reach very high speeds because most characters map to a fixed sequence of at most four keystrokes, with little need to choose from a candidate list. Wubi was dominant in mainland China before predictive Pinyin became powerful enough.

Handwriting and OCR: Touchscreen devices have enabled handwriting input, where users write characters with a finger or stylus. Neural-network handwriting recognition has become highly accurate, making this input practical for mobile users who haven't learned Pinyin.

The Living Script

Chinese characters continue to evolve. New characters are occasionally coined, particularly for personal and place names, chemistry, and internet slang. The ongoing digitization of historical texts continually surfaces rare characters needing encoding. The IRG (Ideographic Research Group), a working group under ISO/IEC JTC 1/SC 2 in which the Unicode Consortium participates, coordinates CJK character proposals, a painstaking process that can take years from proposal to inclusion.

The sheer scale of the CJK enterprise — 97,000+ characters and counting, three major national standards, billions of users, millennia of textual history — makes it a singular achievement of international cooperation and the clearest demonstration that Unicode is genuinely building the library of all human writing, not merely the writing of any one civilization.