📖 Unicode History & Culture

CJK Unification: Controversy and Compromise

CJK unification was Unicode's decision to assign the same code points to Chinese, Japanese, and Korean characters with the same historical origin but sometimes different modern forms, saving space at the cost of controversy. This article explores the debate around Han unification, why it was done, and the ongoing disagreements about its impact on East Asian language users.


When Unicode was being designed in the late 1980s, its creators faced a challenge unlike any other in the standard: what to do about the tens of thousands of Chinese, Japanese, and Korean ideographs that share historical origins but have diverged in form across cultures and centuries. The decision they made — Han Unification — remains the most debated choice in the history of Unicode.

The Problem: One Character or Many?

Chinese (Hanzi), Japanese (Kanji), and Korean (Hanja) writing systems all derive from classical Chinese characters developed over millennia. A single concept — the word for "tree," for example — might be written with what is historically the same character across all three languages.

But over time, China, Japan, and Korea developed their own typographic traditions. The same underlying character might be written with slightly different stroke arrangements, different stroke endings, or minor structural variations depending on national conventions. A Japanese reader might recognize the character but notice it looks "wrong" in Chinese form, and vice versa.

Unicode's Han Unification assigns a single code point to characters that share the same historical origin and semantic meaning, even if they have different regional visual forms. The character for "tree" is one code point regardless of whether it appears in Chinese, Japanese, or Korean text.
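This single-code-point design is directly visible from any Unicode-aware programming environment. The sketch below uses Python's standard `unicodedata` module to show that the tree character carries one algorithmic name and one code point, with no separate Chinese, Japanese, or Korean entry:

```python
import unicodedata

# The "tree" character (樹) occupies a single code point, U+6A39,
# whether it appears in Chinese, Japanese, or Korean text.
tree = "\u6a39"

# Unicode names CJK unified ideographs algorithmically by code point;
# nothing in the name or the number records a language.
print(unicodedata.name(tree))   # CJK UNIFIED IDEOGRAPH-6A39
print(f"U+{ord(tree):04X}")     # U+6A39
```

The algorithmic name is itself a consequence of unification: with one code point serving all three writing traditions, there is no per-language identity to record.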

Why Unification Was Chosen

The practical argument for unification was powerful. In the early 1990s, estimates of the total number of CJK characters across all national standards ranged from 20,000 to over 80,000. Encoding each national variant separately could have consumed hundreds of thousands of code points — a significant fraction of the 65,536-character 16-bit space that Unicode originally targeted.
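The arithmetic behind that concern can be made concrete. Taking the upper estimate of roughly 80,000 characters and the four source regions (China, Japan, Korea, Taiwan) as rough working assumptions:

```python
# Back-of-the-envelope arithmetic for the space argument above.
# Both figures are rough assumptions drawn from the estimates cited
# in the text, not exact counts from any national standard.
chars_upper_estimate = 80_000
source_regions = 4           # GB, JIS, KS, CNS
unified_space = 2 ** 16      # Unicode's original 16-bit target: 65,536

separate_encoding = chars_upper_estimate * source_regions
print(separate_encoding)                  # 320000
print(separate_encoding > unified_space)  # True: would not fit
```

Even the unified repertoire alone consumed a large share of the 16-bit space; encoding each region's variants separately could not have fit at all.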

Unification allowed the standard to fit within the available space and to avoid the administrative nightmare of maintaining separate national character sets within a single standard. The Unicode designers argued that regional variation was a rendering concern — a font issue — not an encoding issue. The character-not-glyph principle supported this view.

The Controversy

Critics, particularly from Japan, argued that the unification conflated characters that Japanese typographers, educators, and readers understood as distinct. The Japanese writing standard, JIS, had encoded many characters at separate positions specifically because their visual forms were considered meaningfully different. Having those distinctions erased in Unicode felt, to many Japanese users, like a form of cultural erasure.

The practical consequence: a database or document that stores Chinese and Japanese text as Unicode code points cannot automatically display the correct regional variant without additional metadata. The same code point, U+6A39 (the tree character), must render differently in Japanese and Chinese contexts, yet Unicode itself provides no mechanism within the character stream to specify which form is intended.
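A minimal illustration of the problem: a Japanese word and a Chinese word written with the same unified characters are byte-for-byte identical once encoded, so the language distinction has to travel out of band. The metadata scheme at the end is a hypothetical example, not any standard format:

```python
# The same unified code points appear in a Japanese word and a
# Chinese word; the encoded text alone carries no record of which
# regional glyphs a reader should see.
japanese_word = "樹木"   # jumoku, "trees" (Japanese reading)
chinese_word = "樹木"    # shùmù, "trees" (Chinese reading)

# The two strings are indistinguishable once encoded:
assert japanese_word == chinese_word
assert japanese_word.encode("utf-8") == chinese_word.encode("utf-8")

# Any language information must be stored alongside the text,
# e.g. in a record like this (a hypothetical metadata scheme):
record = {"text": "樹木", "lang": "ja"}
```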

The Source Separation Rule

The Unicode Standard attempts to limit the controversy through the Source Separation Rule: if a character has a different code point in the source standards used by China (GB), Japan (JIS), Korea (KS), or Taiwan (CNS), it gets a separate Unicode code point.

This rule means that characters with different national standard identities are kept separate, even if they look similar. Characters that no source standard distinguished as separate entries, or that were simply missing from some standards, are candidates for unification.

The Source Separation Rule reduced the conflict but did not eliminate it. Many characters that Japanese users considered distinct were nonetheless unified because their source standard positions happened to overlap with Chinese or Korean standards.

The Ideographic Variation Database

The Unicode Consortium introduced a partial solution in 2007: the Ideographic Variation Database (IVD). This system allows character sequences to specify a particular visual variant using Variation Selectors — special combining characters that modify the preceding ideograph.

For example, the sequence U+9089 U+E0100 specifies not just the character U+9089, but specifically the variant registered as Adobe-Japan1-04555 in the IVD — a particular Japanese form of that character.
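This kind of sequence can be inspected directly. The sketch below builds the sequence from the text above in Python: the variation selector is a real, separate code point following the base ideograph, which is why software must be IVD-aware to handle it:

```python
import unicodedata

# An ideographic variation sequence: the base ideograph U+9089
# followed by VARIATION SELECTOR-17 (U+E0100).
base = "\u9089"
selector = "\U000E0100"
sequence = base + selector

# The selector is a distinct code point in the character stream.
assert len(sequence) == 2
print(unicodedata.name(selector))   # VARIATION SELECTOR-17

# Plain string operations treat the bare character and the variation
# sequence as different strings; matching them up requires software
# that understands variation selectors.
assert base != sequence
assert sequence.startswith(base)
```

A renderer that does not support the IVD should ignore the selector and fall back to the base character's default glyph, which is what makes the mechanism backward compatible.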

The IVD provides a standards-compliant mechanism for distinguishing regional variants without assigning new code points. However, adoption in software is uneven. Not all font rendering systems support IVD sequences, and many applications do not generate them. The IVD is a correct solution that has not been universally implemented.

Ongoing Debate

Han Unification remains contentious more than three decades after it was designed. The arguments have evolved:

For unification: The alternative — separate national character sets within Unicode — would have created an entirely different standard, possibly one that failed to achieve global adoption. The rendering problem is real but solvable at the font layer.

Against unification: The rendering layer solution has not been consistently implemented. Systems that render Chinese and Japanese text without proper locale handling produce characters that look wrong to native readers. This undermines trust in Unicode as a culturally respectful standard.

The Unicode Consortium has continued to add CJK characters in separate extension blocks (CJK Extension A through Extension I) as national standards have requested encoding of characters not previously in Unicode. As of Unicode 16.0, there are over 97,000 CJK unified ideographs across the base block and its extensions — far more than the original designers anticipated.

What Has Changed

Modern systems largely address the rendering issue through locale tagging and font selection. Web pages specify lang="ja" or lang="zh-Hans", and browsers use Japanese or Chinese fonts accordingly, rendering the correct regional variant automatically. OpenType fonts include language-specific substitution rules that swap in the correct glyph for a given character based on the document's declared language.
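The selection logic described above can be sketched in a few lines. This is an illustrative model only: the font names and the fallback order are assumptions for the example, not any platform's actual configuration, though the Noto families named do ship region-specific CJK glyphs:

```python
# A sketch of locale-driven font selection. The mapping and defaults
# are illustrative assumptions, not a real system's configuration.
FONTS_BY_LANG = {
    "ja": "Noto Sans JP",        # Japanese glyph forms
    "zh-Hans": "Noto Sans SC",   # Simplified Chinese forms
    "zh-Hant": "Noto Sans TC",   # Traditional Chinese forms
    "ko": "Noto Sans KR",        # Korean forms
}

def pick_font(lang_tag: str, default: str = "Noto Sans SC") -> str:
    """Choose a regional font from a BCP 47-style language tag."""
    # Try the full tag first, then the primary subtag ("ja-JP" -> "ja").
    return (FONTS_BY_LANG.get(lang_tag)
            or FONTS_BY_LANG.get(lang_tag.split("-")[0])
            or default)

print(pick_font("ja"))       # Noto Sans JP
print(pick_font("zh-Hant"))  # Noto Sans TC
```

Real implementations layer more on top of this (OpenType `locl` substitutions inside a single font, per-script fallback chains), but the core dependency is the same: the correct glyph can only be chosen when a language tag is available.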

For most users on modern, properly configured systems, Han Unification is invisible. The controversy is most acute in specialized publishing, academic typography, and systems that must mix Chinese and Japanese text without reliable locale information.

Han Unification represents a genuine trade-off between universal encoding efficiency and cultural fidelity to regional typographic traditions — a trade-off that Unicode's designers made under real constraints, and that the standard's users continue to negotiate.
