CJK Unified Ideographs Overview
The CJK Unified Ideographs block (U+4E00–U+9FFF) is one of the largest Unicode blocks, containing over 20,000 Chinese, Japanese, and Korean characters unified into a single set. This guide explains the CJK unification process, the structure of the block, and how to work with CJK characters in software applications.
The CJK Unified Ideographs block (U+4E00–U+9FFF) is the most significant non-alphabetic character block in Unicode. "CJK" stands for Chinese, Japanese, and Korean: three languages whose writing systems share a common pool of logographic characters derived from Chinese. The main block was introduced with 20,902 ideographs and has since been filled to its capacity of 20,992. Together with its extension blocks, Unicode encodes over 97,000 Han characters, making it one of the most ambitious character unification efforts in the history of writing.
What Is Han Unification?
The defining feature, and the defining controversy, of this block is the concept of Han unification. The same character concept may have acquired distinct written forms in different regions over centuries of cultural and political separation. For example, the Chinese character for "nation" developed a simplified form (国, used in mainland China and Japan) alongside a traditional form (國, used in Taiwan and Hong Kong). Forms that differ this much are encoded separately (国 is U+56FD, 國 is U+570B); unification applies to variants whose differences are limited to minor stroke details.
Before Unicode, each encoding standard treated these as separate characters: GB 2312 (mainland China), Big5 (Taiwan), JIS X 0208 (Japan), and KS X 1001 (Korea) each maintained their own character sets with their own code points. Unicode's Han Unification project determined which variant forms represent the same underlying character and assigned a single code point, leaving font rendering to handle regional variants.
This decision reduced the number of code points needed by tens of thousands but introduced tension: the same code point displays differently depending on whether a Chinese, Japanese, or Korean font is active. A character that looks like 骨 in Japanese might have slightly different stroke details in a Chinese font — not a different character, but a different regional glyph convention.
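A minimal sketch of what this means for stored text, using only Python's standard library: the string holds a single code point for 骨 and nothing about the intended language, so the choice of regional glyph has to come from the font or from markup around the text. The helper name `describe` is made up for this example.

```python
import unicodedata

def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# One code point, whatever language the reader has in mind.
print(describe("骨"))              # U+9AA8 CJK UNIFIED IDEOGRAPH-9AA8

# The encoded bytes are identical too; no language information is stored.
print("骨".encode("utf-8").hex())  # e9aaa8
```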
Block Organization
The main CJK Unified Ideographs block U+4E00–U+9FFF follows a rough ordering by radical and stroke count, based on the Kangxi Dictionary ordering system established in 1716 during the Chinese Qing dynasty. The Kangxi system classifies characters by their primary radical (a semantic or structural component) and then by the number of remaining strokes.
| Block | Range | Count | Unicode Version |
|---|---|---|---|
| CJK Unified Ideographs | U+4E00–U+9FFF | 20,902 | 1.0.1 (1992) |
| Extension A | U+3400–U+4DBF | 6,592 | 3.0 (1999) |
| Extension B | U+20000–U+2A6DF | 42,718 | 3.1 (2001) |
| Extension C | U+2A700–U+2B73F | 4,149 | 5.2 (2009) |
| Extension D | U+2B740–U+2B81F | 222 | 6.0 (2010) |
| Extension E | U+2B820–U+2CEAF | 5,762 | 8.0 (2015) |
| Extension F | U+2CEB0–U+2EBEF | 7,473 | 10.0 (2017) |
| Extension G | U+30000–U+3134F | 4,939 | 13.0 (2020) |
| Extension H | U+31350–U+323AF | 4,192 | 15.0 (2022) |
| Extension I | U+2EBF0–U+2EE5F | 622 | 15.1 (2023) |
The total across all blocks exceeds 97,000 ideographs — more than any other script in Unicode.
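The ranges in the table can be turned into a simple lookup. The sketch below is one possible approach, not an official API: `CJK_BLOCKS` and `cjk_block` are names invented here, and the check is against whole block ranges, so a code point that is still unassigned inside a block would also match.

```python
# Block ranges copied from the table above: (name, first, last).
CJK_BLOCKS = [
    ("CJK Unified Ideographs", 0x4E00, 0x9FFF),
    ("Extension A", 0x3400, 0x4DBF),
    ("Extension B", 0x20000, 0x2A6DF),
    ("Extension C", 0x2A700, 0x2B73F),
    ("Extension D", 0x2B740, 0x2B81F),
    ("Extension E", 0x2B820, 0x2CEAF),
    ("Extension F", 0x2CEB0, 0x2EBEF),
    ("Extension G", 0x30000, 0x3134F),
    ("Extension H", 0x31350, 0x323AF),
    ("Extension I", 0x2EBF0, 0x2EE5F),
]

def cjk_block(ch):
    """Return the name of the CJK ideograph block containing ch, or None."""
    cp = ord(ch)
    for name, first, last in CJK_BLOCKS:
        if first <= cp <= last:
            return name
    return None

print(cjk_block("林"))           # CJK Unified Ideographs
print(cjk_block("\U00020000"))   # Extension B
print(cjk_block("A"))            # None
```

Everything from Extension B onward lies outside the Basic Multilingual Plane, so code that assumes 16-bit characters will not handle those blocks correctly.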
The Radical System
Traditional Chinese lexicography organizes characters by radical — a recurring structural component that often (but not always) hints at a character's semantic category. The Kangxi Dictionary uses 214 radicals. For example:
- Radical 85 氵(water) appears in characters related to water: 海 (sea), 河 (river), 湖 (lake), 泳 (swim)
- Radical 64 扌(hand) appears in action characters: 打 (hit), 拿 (take), 抱 (hold), 指 (point)
- Radical 162 辶(walk/move) in: 道 (road), 送 (send), 近 (near), 运 (move)
After selecting the radical, characters are further organized by stroke count — the number of strokes required to write the remaining non-radical portion. This system, while imperfect (some characters can be assigned to multiple radicals), provides a workable index for traditional dictionaries.
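The two-level ordering can be sketched as a sort on a (radical, residual strokes) pair. The entries below were typed in by hand for illustration and cover only a few of the characters mentioned above; authoritative radical-stroke data is published in the Unihan database (the kRSUnicode field), which Python's standard library does not include.

```python
# (character, Kangxi radical number, residual stroke count)
# Hand-entered sample values; consult Unihan kRSUnicode for real data.
ENTRIES = [
    ("湖", 85, 9),
    ("打", 64, 2),
    ("海", 85, 7),
    ("指", 64, 6),
    ("河", 85, 5),
    ("抱", 64, 5),
]

def radical_stroke_key(entry):
    """Sort first by radical number, then by residual stroke count."""
    _, radical, residual = entry
    return (radical, residual)

ordered = [ch for ch, _, _ in sorted(ENTRIES, key=radical_stroke_key)]
print("".join(ordered))  # 打抱指河海湖
```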
Stroke Order
Chinese characters have conventional stroke orders: specific sequences in which strokes are drawn. Correct stroke order affects the natural flow of handwriting and is required in educational and calligraphic contexts. Unicode does not encode stroke order. The Unihan database records stroke counts (the kTotalStrokes field), but stroke order itself is documented in separate, external resources.
General stroke order principles include:
1. Top to bottom
2. Left to right
3. Horizontal before vertical (with exceptions)
4. Outside before inside
5. Close the enclosure last
Ideographic Description Characters (IDS)
For extremely rare characters not yet encoded in Unicode, the Ideographic Description Sequences (IDS) system provides a way to describe a character's visual structure using composition operators from the Ideographic Description Characters block (U+2FF0–U+2FFF):
- U+2FF0 ⿰ — left-right composition
- U+2FF1 ⿱ — top-bottom composition
- U+2FF2 ⿲ — left-center-right composition
- U+2FF3 ⿳ — top-center-bottom composition
For example, the sequence ⿰木木 describes a character composed of two 木 (wood) components side by side — which happens to be 林 (forest, U+6797).
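Because each operator is a prefix that takes a fixed number of operands, a sequence can be read with a short recursive parser. The following is a rough sketch that handles only the four operators listed above; `IDS_ARITY` and `parse_ids` are illustrative names, and a complete implementation would need the rest of the operators in the block.

```python
# Number of operands taken by each of the four operators listed above.
IDS_ARITY = {"\u2FF0": 2, "\u2FF1": 2, "\u2FF2": 3, "\u2FF3": 3}

def parse_ids(seq):
    """Parse an IDS string into a nested tuple: (operator, operand, ...)."""
    def parse(pos):
        if pos >= len(seq):
            raise ValueError("sequence ended before all operands were supplied")
        ch = seq[pos]
        arity = IDS_ARITY.get(ch)
        if arity is None:
            return ch, pos + 1  # a plain component such as 木
        pos += 1
        operands = []
        for _ in range(arity):
            operand, pos = parse(pos)
            operands.append(operand)
        return (ch, *operands), pos

    tree, end = parse(0)
    if end != len(seq):
        raise ValueError("extra characters after a complete sequence")
    return tree

print(parse_ids("⿰木木"))      # ('⿰', '木', '木')
print(parse_ids("⿱艹⿰木木"))  # ('⿱', '艹', ('⿰', '木', '木'))
```

Operands may themselves be sequences, which is why the second call produces a nested result.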
Han Unification Controversy
The unification decision has been controversial since Unicode's inception. Critics argue:
- Font dependence: The same code point must render differently for Japanese vs. Chinese users, requiring language tagging (the HTML lang attribute or the CSS font-language-override property) to display correctly, a complexity that plain text doesn't handle.
- Character identity: Linguists argue that Chinese 戶 and Japanese 戸 are different characters in their respective writing systems, not variants of the same character.
- Taiwan and mainland China: Simplified and traditional forms of the same character receive different code points only sometimes; the line was drawn inconsistently.
Supporters note that unification was necessary given Unicode's code space constraints in 1991, when the standard was designed as a 16-bit encoding with room for only 65,536 code points, and that the difficulties are manageable with proper font and language infrastructure.
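As a small illustration of the language infrastructure involved, the sketch below wraps text in an HTML span carrying a lang attribute, which lets a browser pick a font with the matching regional glyphs. The function name `tag_language` is hypothetical, and whether the right glyph actually appears still depends on the fonts installed on the reader's system.

```python
from html import escape

def tag_language(text, lang):
    """Wrap text in a span whose lang attribute names the intended language."""
    return f'<span lang="{escape(lang, quote=True)}">{escape(text)}</span>'

# The same code point, tagged for two different regional conventions.
print(tag_language("骨", "ja"))       # <span lang="ja">骨</span>
print(tag_language("骨", "zh-Hans"))  # <span lang="zh-Hans">骨</span>
```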
Frequency and Modern Usage
Of the 97,000+ encoded ideographs, only a few thousand are in common use. Roughly 3,500 characters account for the great majority of modern Chinese text. Japanese high school graduates are expected to know the 2,136 Joyo kanji. The most frequent single ideograph in Chinese is 的 (U+7684, possessive particle), appearing in virtually every text.
Most Extension B and later characters represent historical, dialectal, personal name, or scholarly characters — found in ancient manuscripts, specialized dictionaries, and classical texts but rarely in modern everyday writing.
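To get a feel for how few distinct ideographs a given text uses, one option is to count them directly. This is a minimal sketch that looks only at the main block (U+4E00 to U+9FFF); `ideograph_counts` and the sample string are made up for the example.

```python
from collections import Counter

def ideograph_counts(text):
    """Count the characters of text that fall in the main CJK block."""
    return Counter(ch for ch in text if 0x4E00 <= ord(ch) <= 0x9FFF)

sample = "我的书和你的书。OK"
counts = ideograph_counts(sample)

print(len(counts))            # 5 distinct ideographs
print(counts.most_common(2))  # [('的', 2), ('书', 2)]
```

Punctuation such as 。 and the Latin letters are ignored because they lie outside the range being tested.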