
CJK Unified Ideographs Overview

The CJK Unified Ideographs block (U+4E00–U+9FFF) is one of the largest Unicode blocks, containing over 20,000 Chinese, Japanese, and Korean characters unified into a single set. This guide explains the CJK unification process, the structure of the block, and how to work with CJK characters in software applications.


The CJK Unified Ideographs block (U+4E00–U+9FFF) is the core Han character block in Unicode and one of the largest blocks in the standard. "CJK" stands for Chinese, Japanese, and Korean: three writing systems that share a common pool of logographic characters derived from Chinese. The block held 20,902 ideographs when it was first encoded and has since been filled to its full 20,992 code points. Together with its extension blocks, Unicode encodes over 97,000 Han characters, making this one of the most ambitious character unification efforts in the history of writing.

What Is Han Unification?

The defining feature, and the defining controversy, of this block is Han unification. The same character may have acquired distinct written forms in different regions over centuries of cultural and political separation. Unicode treats two kinds of difference differently. Where the forms differ substantially, they are encoded as separate characters: the character for "nation" has a simplified form (国, U+56FD, used in mainland China and Japan) and a traditional form (國, U+570B, used in Taiwan and Hong Kong), each with its own code point. Where the forms differ only in small regional details of stroke shape, they are unified under a single code point.

Before Unicode, each national standard encoded its own repertoire: GB 2312 (mainland China), Big5 (Taiwan), JIS X 0208 (Japan), and KS X 1001 (Korea) each maintained its own character set with its own code points, so the same character had a different code in each. Unicode's Han unification project determined which regional forms represent the same underlying character and assigned each a single code point, leaving fonts to render the regional variants.

This decision reduced the number of code points needed by tens of thousands but introduced tension: the same code point displays differently depending on whether a Chinese, Japanese, or Korean font is active. A character that looks like 骨 in Japanese might have slightly different stroke details in a Chinese font — not a different character, but a different regional glyph convention.
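
The difference between a code point and its glyph can be checked directly. The following is a minimal sketch using only Python's standard unicodedata module; the helper name describe is illustrative. It shows that 国 and 國 carry different code points, while 骨 is a single code point whose regional shape is left to the font.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Simplified and traditional "nation" are separate code points.
print(describe("国"))  # U+56FD CJK UNIFIED IDEOGRAPH-56FD
print(describe("國"))  # U+570B CJK UNIFIED IDEOGRAPH-570B

# 骨 is one code point; Chinese and Japanese shapes differ only in the font.
print(describe("骨"))  # U+9AA8 CJK UNIFIED IDEOGRAPH-9AA8
```

Nothing in the string itself records which regional shape was intended, which is why language tagging or font selection has to happen at a higher layer.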

Block Organization

The main CJK Unified Ideographs block U+4E00–U+9FFF follows a rough ordering by radical and stroke count, based on the Kangxi Dictionary ordering system established in 1716 during the Chinese Qing dynasty. The Kangxi system classifies characters by their primary radical (a semantic or structural component) and then by the number of remaining strokes.

Block | Range | Ideographs (as of Unicode 15.1) | First encoded
CJK Unified Ideographs | U+4E00–U+9FFF | 20,992 | 1.0.1 (1992)
Extension A | U+3400–U+4DBF | 6,592 | 3.0 (1999)
Extension B | U+20000–U+2A6DF | 42,720 | 3.1 (2001)
Extension C | U+2A700–U+2B73F | 4,154 | 5.2 (2009)
Extension D | U+2B740–U+2B81F | 222 | 6.0 (2010)
Extension E | U+2B820–U+2CEAF | 5,762 | 8.0 (2015)
Extension F | U+2CEB0–U+2EBEF | 7,473 | 10.0 (2017)
Extension G | U+30000–U+3134F | 4,939 | 13.0 (2020)
Extension H | U+31350–U+323AF | 4,192 | 15.0 (2022)
Extension I | U+2EBF0–U+2EE5F | 622 | 15.1 (2023)

Several blocks gained a few characters after their first release, so the counts reflect the current repertoire rather than the original one.

The total across all blocks exceeds 97,000 ideographs — more than any other script in Unicode.
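
In software, the block ranges are typically used as a lookup table. The sketch below copies the ranges from the table above; the names HAN_BLOCKS and han_block are illustrative, not part of any library. A block range can include unassigned code points, so a match only says which block a code point falls in, not that a character is assigned there.

```python
from typing import Optional

# (block name, first code point, last code point), taken from the table above.
HAN_BLOCKS = [
    ("CJK Unified Ideographs", 0x4E00, 0x9FFF),
    ("Extension A", 0x3400, 0x4DBF),
    ("Extension B", 0x20000, 0x2A6DF),
    ("Extension C", 0x2A700, 0x2B73F),
    ("Extension D", 0x2B740, 0x2B81F),
    ("Extension E", 0x2B820, 0x2CEAF),
    ("Extension F", 0x2CEB0, 0x2EBEF),
    ("Extension G", 0x30000, 0x3134F),
    ("Extension H", 0x31350, 0x323AF),
    ("Extension I", 0x2EBF0, 0x2EE5F),
]

def han_block(ch: str) -> Optional[str]:
    """Return the name of the Han block containing ch, or None."""
    cp = ord(ch)
    for name, first, last in HAN_BLOCKS:
        if first <= cp <= last:
            return name
    return None

print(han_block("中"))         # CJK Unified Ideographs
print(han_block("㐀"))         # Extension A
print(han_block(chr(0x20000)))  # Extension B
print(han_block("A"))          # None
```

With only ten ranges a linear scan is enough; a sorted list with binary search would be the usual choice for a table covering every Unicode block.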

The Radical System

Traditional Chinese lexicography organizes characters by radical — a recurring structural component that often (but not always) hints at a character's semantic category. The Kangxi Dictionary uses 214 radicals. For example:

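  • Radical 85 水 (water, written 氵 on the left side of a character) appears in characters related to water: 海 (sea), 河 (river), 湖 (lake), 泳 (swim)
  • Radical 64 手 (hand, written 扌 on the left side) appears in action characters: 打 (hit), 拿 (take), 抱 (hold), 指 (point)
  • Radical 162 辵 (walk/move, written 辶 as a component) appears in characters related to movement: 道 (road), 送 (send), 近 (near), 运 (move)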

After selecting the radical, characters are further organized by stroke count — the number of strokes required to write the remaining non-radical portion. This system, while imperfect (some characters can be assigned to multiple radicals), provides a workable index for traditional dictionaries.
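
Unicode also encodes the 214 radicals themselves, in the separate Kangxi Radicals block (U+2F00–U+2FD5), in dictionary order. The sketch below relies on that ordering to map a radical number to its character, and on NFKC normalization to reach the equivalent unified ideograph; the helper name kangxi_radical is illustrative. The block holds the full radical forms, not positional variants such as 氵 or 扌.

```python
import unicodedata

KANGXI_RADICALS_START = 0x2F00  # Kangxi Radicals block: U+2F00–U+2FD5

def kangxi_radical(number: int) -> str:
    """Return the Kangxi Radicals block character for radical 1..214."""
    if not 1 <= number <= 214:
        raise ValueError("Kangxi radical numbers run from 1 to 214")
    return chr(KANGXI_RADICALS_START + number - 1)

for n in (85, 64, 162):
    radical = kangxi_radical(n)
    # NFKC maps the radical character to its unified ideograph counterpart.
    unified = unicodedata.normalize("NFKC", radical)
    print(n, unicodedata.name(radical), unified)
# 85 KANGXI RADICAL WATER 水
# 64 KANGXI RADICAL HAND 手
# 162 KANGXI RADICAL WALK 辵
```

Looking up the radical and residual stroke count of an arbitrary ideograph needs the Unihan data files, which the Python standard library does not include.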

Stroke Order

Chinese characters have conventional stroke orders: specific sequences in which the strokes are drawn. Correct stroke order affects the natural flow of handwriting and matters in educational and calligraphic contexts. Unicode does not encode stroke order. The Unihan database records stroke counts (for example in its kTotalStrokes field) but not stroke sequences, so stroke-order data has to come from separate resources outside the standard.

General stroke order principles include:

  1. Top to bottom
  2. Left to right
  3. Horizontal before vertical (with exceptions)
  4. Outside before inside
  5. Close the enclosure last

Ideographic Description Sequences (IDS)

For extremely rare characters not yet encoded in Unicode, the Ideographic Description Sequences (IDS) system provides a way to describe a character's visual structure using composition operators from the Ideographic Description Characters block (U+2FF0–U+2FFF):

  • U+2FF0 ⿰ — left-right composition
  • U+2FF1 ⿱ — top-bottom composition
  • U+2FF2 ⿲ — left-center-right composition
  • U+2FF3 ⿳ — top-center-bottom composition

For example, the sequence ⿰木木 describes a character composed of two 木 (wood) components side by side — which happens to be 林 (forest, U+6797).
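
Because each operator takes a fixed number of components, an IDS can be read with a small recursive parser. The sketch below handles only the four operators listed above; the names IDC_ARITY and parse_ids are illustrative, and a full implementation would cover the remaining operators in the block and the length rules of the standard.

```python
# Number of components each operator listed above takes.
IDC_ARITY = {"⿰": 2, "⿱": 2, "⿲": 3, "⿳": 3}

def _parse(s: str):
    """Parse one component from the front of s; return (node, remainder)."""
    if not s:
        raise ValueError("sequence ended before all components were given")
    head, rest = s[0], s[1:]
    if head not in IDC_ARITY:
        return head, rest  # a plain component character
    parts = []
    for _ in range(IDC_ARITY[head]):
        part, rest = _parse(rest)
        parts.append(part)
    return (head, parts), rest

def parse_ids(ids: str):
    """Parse a complete IDS into a nested (operator, [components]) tree."""
    node, rest = _parse(ids)
    if rest:
        raise ValueError(f"unexpected trailing characters: {rest!r}")
    return node

print(parse_ids("⿰木木"))      # ('⿰', ['木', '木'])
print(parse_ids("⿱木⿰木木"))  # ('⿱', ['木', ('⿰', ['木', '木'])])
```

The second call shows nesting: one 木 above two side by side, which is one possible description of 森.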

Han Unification Controversy

The unification decision has been controversial since Unicode's inception. Critics argue:

  1. Font dependence: The same code point must render differently for Japanese vs. Chinese users, requiring language tagging (the lang HTML attribute or CSS font-language-override) to display correctly — complexity that plain text doesn't handle.
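  2. Character identity: Some linguists argue that regional forms unified under one code point, such as the Chinese and Japanese shapes of 直 (U+76F4), are different characters in their respective writing systems, not variants of the same character. Other pairs, such as Chinese 戶 (U+6236) and Japanese 戸 (U+6238), were kept apart and have their own code points.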
  3. Taiwan and mainland China: Simplified and traditional forms of the same character receive different code points only sometimes; the line was drawn inconsistently.

Supporters note that unification was necessary in the early 1990s, when Unicode was designed as a 16-bit encoding with room for only 65,536 code points, and that the difficulties are manageable with proper font and language infrastructure.

Frequency and Modern Usage

Of the 97,000+ encoded ideographs, only a few thousand are in common use. Roughly 3,500 characters cover the vast majority of modern Chinese text such as news articles; the first tier of mainland China's Table of General Standard Chinese Characters lists exactly that many. Japanese high school graduates are expected to know the 2,136 Jōyō kanji. The most frequent single ideograph in Chinese is 的 (U+7684, possessive particle), which appears in virtually every text.

Most Extension B and later characters represent historical, dialectal, personal name, or scholarly characters — found in ancient manuscripts, specialized dictionaries, and classical texts but rarely in modern everyday writing.
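
Two practical consequences can be sketched in a few lines: counting which ideographs a text actually uses, and checking how a rare character from Extension B or later is stored. The helper names below are illustrative, and is_han is a rough range test that ignores compatibility ideographs and treats unassigned gaps between extensions as Han.

```python
from collections import Counter

def is_han(ch: str) -> bool:
    """Rough test against the block ranges in the table above."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF        # main block
        or 0x3400 <= cp <= 0x4DBF     # Extension A
        or 0x20000 <= cp <= 0x2EE5F   # Extensions B-F and I
        or 0x30000 <= cp <= 0x323AF   # Extensions G and H
    )

def han_frequencies(text: str) -> Counter:
    """Count how often each Han character occurs in text."""
    return Counter(ch for ch in text if is_han(ch))

def utf16_units(ch: str) -> int:
    """Number of 16-bit code units ch needs in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2

sample = "我的猫和你的狗"
print(han_frequencies(sample).most_common(1))        # [('的', 2)]

# A main-block character: 3 bytes in UTF-8, one UTF-16 code unit.
print(len("的".encode("utf-8")), utf16_units("的"))  # 3 1

# The first Extension B character: 4 bytes in UTF-8, a surrogate pair in UTF-16.
rare = chr(0x20000)
print(len(rare.encode("utf-8")), utf16_units(rare))  # 4 2
```

Code that assumes one UTF-16 unit or at most three UTF-8 bytes per character works for everyday text and fails on exactly these rare characters, which is why they are worth including in test data.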
