Hangul Block
The Hangul Syllables block (U+AC00–U+D7AF) contains 11,172 precomposed Korean syllable blocks, assigned at U+AC00–U+D7A3 and algorithmically derived from 19 initial consonants, 21 vowels, and 28 final positions (27 final consonants plus the option of no final). This guide explains the structure of Korean Hangul in Unicode, how syllable composition works, and how to handle Korean text in software.
Korean is written in Hangul, one of the most systematically designed writing systems ever created. Unlike alphabets that evolved organically over centuries, Hangul was invented in 1443 by King Sejong the Great with a clear structural logic that maps directly onto the Unicode blocks that represent it today. Understanding the Hangul Unicode blocks means understanding how a syllabic alphabet can be both phonetic and algorithmically composable.
The Three Hangul Blocks
Unicode organizes Hangul across three distinct blocks, each serving a different purpose:
| Block | Range | Count | Purpose |
|---|---|---|---|
| Hangul Jamo | U+1100–U+11FF | 256 | Conjoining Jamo for composition |
| Hangul Compatibility Jamo | U+3130–U+318F | 96 | Standalone display Jamo |
| Hangul Syllables | U+AC00–U+D7AF | 11,172 | Precomposed syllable blocks |
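As a rough illustration, the ranges in the table can be turned into a small lookup. The names `HANGUL_BLOCKS` and `hangul_block` are hypothetical, chosen only for this sketch.

```python
# Block ranges copied from the table above; the names are illustrative only.
HANGUL_BLOCKS = [
    ("Hangul Jamo", 0x1100, 0x11FF),
    ("Hangul Compatibility Jamo", 0x3130, 0x318F),
    ("Hangul Syllables", 0xAC00, 0xD7AF),
]

def hangul_block(ch):
    """Return the name of the Hangul block containing ch, or None."""
    cp = ord(ch)
    for name, first, last in HANGUL_BLOCKS:
        if first <= cp <= last:
            return name
    return None

print(hangul_block("\u1100"))  # Hangul Jamo
print(hangul_block("\u3131"))  # Hangul Compatibility Jamo
print(hangul_block("\uD55C"))  # Hangul Syllables
print(hangul_block("A"))       # None
```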
Hangul Jamo: The Building Blocks (U+1100–U+11FF)
Hangul Jamo are the individual phonetic components — consonants and vowels — that combine to form syllable blocks. The block is divided into three classes:
- Choseong (leading consonants): U+1100–U+1112, 19 consonants used at the start of a syllable (ㄱ, ㄴ, ㄷ, ...)
- Jungseong (vowels): U+1161–U+1175, 21 vowels forming the nucleus (ㅏ, ㅐ, ㅑ, ...)
- Jongseong (trailing consonants): U+11A8–U+11C2, 27 consonants used at the end of a syllable; counting the option of no final consonant gives 28 final positions
These are conjoining Jamo, meant to be used in sequence for algorithmic composition. When a renderer sees a choseong followed by a jungseong (and optionally a jongseong), it stacks them visually into a single syllable block.
Example: 한 is composed as:
- ᄒ U+1112 (Choseong Hieuh)
- ᅡ U+1161 (Jungseong A)
- ᆫ U+11AB (Jongseong Nieun)
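The three classes can be told apart by code point range alone. The following is a minimal sketch covering only the modern Jamo ranges listed above; `jamo_class` is a hypothetical helper name, and the labels L, V, and T are the shorthand used by the composition formula for leading, vowel, and trailing.

```python
def jamo_class(ch):
    """Classify a modern conjoining Jamo as 'L', 'V', 'T', or None."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x1112:
        return "L"  # choseong (leading consonant)
    if 0x1161 <= cp <= 0x1175:
        return "V"  # jungseong (vowel)
    if 0x11A8 <= cp <= 0x11C2:
        return "T"  # jongseong (trailing consonant)
    return None

han = "\u1112\u1161\u11AB"  # the three Jamo of 한
print([jamo_class(ch) for ch in han])  # ['L', 'V', 'T']
```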
The Algorithmic Composition Formula
The most remarkable feature of the Hangul Syllables block is that every one of its 11,172 code points can be derived mathematically. Unicode defines the syllable index as:
SIndex = (LIndex × 21 + VIndex) × 28 + TIndex
SyllableCodePoint = U+AC00 + SIndex
Where:
- LIndex = index of the leading consonant (0–18, 19 possible)
- VIndex = index of the vowel (0–20, 21 possible)
- TIndex = index of the trailing consonant (0–27, where 0 means no coda)
Total syllables: 19 × 21 × 28 = 11,172
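The formula translates almost directly into code. This is a minimal sketch, not a replacement for a Unicode library; `compose_syllable` and the constant names are assumptions made for the example.

```python
S_BASE = 0xAC00   # first code point of the Hangul Syllables block
L_COUNT = 19      # leading consonants
V_COUNT = 21      # vowels
T_COUNT = 28      # trailing positions, where 0 means no coda

def compose_syllable(l_index, v_index, t_index=0):
    """Build a precomposed syllable from its three indices."""
    if not (0 <= l_index < L_COUNT
            and 0 <= v_index < V_COUNT
            and 0 <= t_index < T_COUNT):
        raise ValueError("index out of range")
    s_index = (l_index * V_COUNT + v_index) * T_COUNT + t_index
    return chr(S_BASE + s_index)

print(compose_syllable(0, 0))       # 가 (U+AC00)
print(compose_syllable(18, 0, 4))   # 한 (U+D55C)
print(compose_syllable(18, 20, 27)) # 힣 (U+D7A3), the last syllable
```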
This formula works in both directions. To decompose a precomposed syllable like 글 (U+AE00):
SIndex = 0xAE00 − 0xAC00 = 512
- TIndex = 512 mod 28 = 8 → ᆯ (Rieul, U+11AF)
- LVIndex = 512 div 28 = 18 (integer division)
- VIndex = 18 mod 21 = 18 → ᅳ (Eu, U+1173)
- LIndex = 18 div 21 = 0 → ᄀ (Kiyeok, U+1100)
So 글 = ᄀ + ᅳ + ᆯ, which spells the syllable geul meaning "letter" or "writing."
Hangul Compatibility Jamo (U+3130–U+318F)
This block exists for compatibility with legacy Korean encodings like KS X 1001. While the Jamo in U+1100–U+11FF are conjoining characters designed to be used in sequence, the Compatibility Jamo are standalone characters that display as individual letters. They are commonly used:
- In dictionaries and educational materials to show isolated consonants and vowels
- In keyboard input method displays
- For labeling consonant and vowel charts
Key examples:
- ㄱ U+3131 (Hangul Letter Kiyeok): standalone form of the consonant
- ㅏ U+314F (Hangul Letter A): standalone vowel
Note that ㄱ (U+3131) and ᄀ (U+1100) look identical but are different code points with different properties. Compatibility Jamo have the Unicode property Hangul_Syllable_Type=NA and do not participate in algorithmic syllable composition.
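The difference is easy to observe with Python's standard `unicodedata` module. This is a small sketch for illustration: canonical normalization (NFC) leaves Compatibility Jamo alone, while the compatibility forms (NFKD, NFKC) map them to their conjoining counterparts.

```python
import unicodedata

compat = "\u3131"      # ㄱ from the Compatibility Jamo block
conjoining = "\u1100"  # ᄀ from the Hangul Jamo block

print(unicodedata.name(compat))      # HANGUL LETTER KIYEOK
print(unicodedata.name(conjoining))  # HANGUL CHOSEONG KIYEOK
print(compat == conjoining)          # False

pair = "\u3131\u314F"  # ㄱ followed by ㅏ, both Compatibility Jamo
# NFC does not compose them: the string is still two code points.
print(len(unicodedata.normalize("NFC", pair)))              # 2
# Compatibility normalization maps each letter to a conjoining Jamo.
print(unicodedata.normalize("NFKD", compat) == conjoining)  # True
print(unicodedata.normalize("NFKC", pair) == "\uAC00")      # True
```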
Hangul Syllables (U+AC00–U+D7AF)
The Hangul Syllables block contains all 11,172 precomposed syllable blocks in modern Korean. These are the characters you typically see in everyday Korean text. The block is sorted in dictionary order: all syllables beginning with ᄀ come first, then ᄁ, and so on through the 19 leading consonants.
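Because each leading consonant owns a contiguous run of 21 × 28 = 588 syllables, the leading consonant of any syllable can be read off with one integer division. A minimal sketch, with `leading_index` as a hypothetical helper name:

```python
SYLLABLES_PER_LEAD = 21 * 28  # 588 syllables per leading consonant

def leading_index(syllable):
    """Return the 0-based index (0-18) of a syllable's leading consonant."""
    s_index = ord(syllable) - 0xAC00
    if not 0 <= s_index < 11172:
        raise ValueError("not a precomposed Hangul syllable")
    return s_index // SYLLABLES_PER_LEAD

for ch in "가나사한":
    print(ch, leading_index(ch))  # 가 0, 나 2, 사 9, 한 18
```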
Some frequently encountered syllables:
| Character | Code Point | Romanization | Meaning |
|---|---|---|---|
| 가 | U+AC00 | ga | go; house (in some contexts) |
| 나 | U+B098 | na | I; me |
| 사 | U+C0AC | sa | four; person; death |
| 한 | U+D55C | han | Korean; great; one |
| 글 | U+AE00 | geul | letter; writing |
Normalization and NFC vs NFD
Hangul normalization is a key topic in text processing. Unicode defines two canonically equivalent representations for every precomposed Korean syllable:
- NFC (Composed): A single precomposed code point like 한 (U+D55C)
- NFD (Decomposed): Two or three conjoining Jamo like ᄒ U+1112 + ᅡ U+1161 + ᆫ U+11AB
Both represent the same syllable but are different byte sequences. String comparison and search algorithms must normalize to the same form before comparing. In Python:
import unicodedata
nfc = unicodedata.normalize('NFC', '\u1112\u1161\u11AB')  # composes to '\uD55C'
nfd = unicodedata.normalize('NFD', '\uD55C')  # decomposes to three Jamo
nfc == nfd  # False until both are normalized to the same form
unicodedata.normalize('NFC', nfd) == nfc  # True
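One way to make comparisons robust is to normalize both sides at the point of comparison. The helper below is a sketch; `same_text` is a hypothetical name, and NFC is chosen as the default only because it is the form most Korean text is stored in.

```python
import unicodedata

def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = "\uD55C\uAE00"                            # 한글, precomposed
decomposed = unicodedata.normalize("NFD", composed)  # same text as Jamo

print(composed == decomposed)           # False
print(same_text(composed, decomposed))  # True
```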
Historical Jamo (U+A960–U+A97F and U+D7B0–U+D7FF)
Beyond modern Korean, Unicode also encodes historical Jamo used in older Korean texts. These include archaic consonants and vowels no longer used in contemporary writing. The Hangul Jamo block itself holds archaic letters alongside the modern ones, and the Hangul Jamo Extended-A (U+A960–U+A97F) and Extended-B (U+D7B0–U+D7FF) blocks add further historical forms, supporting scholarly work in classical Korean literature.
Practical Tips for Developers
When working with Hangul in code:
- Always normalize to NFC before storing or comparing Korean text
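- Use len() carefully: a composed syllable is one code point, but its NFD form is 2–3 code points
- Regex character classes like [가-힣] match every assigned syllable in the Hangul Syllables block (U+AC00–U+D7A3)
- Sorting real-world Korean text generally calls for locale-aware collation; plain code point order matches dictionary order only for precomposed syllables, not for text that mixes in Jamo, Hanja, or Latin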
- The syllable composition algorithm can be used to validate whether a sequence of Jamo forms a valid syllable
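Several of these tips can be checked in a few lines. The sketch below assumes only modern Jamo and accepts exactly the sequences L+V or L+V+T; `is_valid_jamo_syllable` and `HANGUL_SYLLABLE` are hypothetical names introduced for the example.

```python
import re
import unicodedata

HANGUL_SYLLABLE = re.compile(r"[가-힣]")  # U+AC00 through U+D7A3

def is_valid_jamo_syllable(jamo):
    """True if jamo is L+V or L+V+T built from modern conjoining Jamo."""
    if len(jamo) not in (2, 3):
        return False
    if not 0x1100 <= ord(jamo[0]) <= 0x1112:   # leading consonant
        return False
    if not 0x1161 <= ord(jamo[1]) <= 0x1175:   # vowel
        return False
    if len(jamo) == 3 and not 0x11A8 <= ord(jamo[2]) <= 0x11C2:
        return False                            # trailing consonant
    return True

han = "\uD55C"
han_nfd = unicodedata.normalize("NFD", han)

print(len(han), len(han_nfd))                     # 1 3
print(bool(HANGUL_SYLLABLE.fullmatch(han)))       # True
print(bool(HANGUL_SYLLABLE.fullmatch("\u1112")))  # False: a Jamo, not a syllable
print(is_valid_jamo_syllable(han_nfd))            # True
print(is_valid_jamo_syllable("\u1161\u1112"))     # False: vowel before consonant
```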
The elegance of Hangul's Unicode representation reflects the elegance of the script itself — a perfectly logical, mathematically expressible system for phonetic writing.