Korean Hangul System
Hangul was invented in 1443 by King Sejong as a scientific alphabet where syllable blocks are algorithmically composed from individual jamo (consonants and vowels), a structure Unicode mirrors with both jamo and precomposed syllable encodings.
Hangul is widely regarded as one of the most scientifically designed writing systems in human history. Created in 1443 by King Sejong the Great of the Joseon Dynasty, Hangul was purpose-built so that "a wise man can learn it in a morning and even a foolish man can learn it in ten days." Unlike scripts that evolved organically over millennia, Hangul was invented with systematic phonological principles — and Unicode's encoding of Hangul reflects this algorithmic design with remarkable elegance. This guide tells the story of Hangul, explains how Unicode encodes it, and covers the technical details of Korean text processing.
The Invention of Hangul
Before Hangul, Korean was written using Chinese characters (hanja), which were poorly suited to Korean grammar and phonology. Literary Chinese was the language of the court and educated elite, leaving the majority of the population effectively illiterate.
In 1443, King Sejong and a team of scholars at the Hall of Worthies (Jiphyeonjeon) created a new alphabet described in the document Hunminjeongeum ("The Correct Sounds for the Instruction of the People"), published in 1446. The script was revolutionary for several reasons:
- Featural design: The shapes of consonant letters are based on the position of the tongue, lips, and throat during pronunciation
- Systematic vowels: Vowel letters are composed from three elements representing heaven (dot/short stroke), earth (horizontal line), and human (vertical line)
- Syllable blocks: Individual letters (jamo) are arranged into square blocks representing syllables, giving text a visual density comparable to Chinese characters
The Structure of Hangul
Jamo: The Building Blocks
Hangul jamo (자모, letters) consist of consonants and vowels:
14 Basic Consonants:
| Jamo | Name | Sound | Unicode (Compatibility) |
|---|---|---|---|
| ㄱ | giyeok | g/k | U+3131 |
| ㄴ | nieun | n | U+3134 |
| ㄷ | digeut | d/t | U+3137 |
| ㄹ | rieul | r/l | U+3139 |
| ㅁ | mieum | m | U+3141 |
| ㅂ | bieup | b/p | U+3142 |
| ㅅ | siot | s | U+3145 |
| ㅇ | ieung | ng/silent | U+3147 |
| ㅈ | jieut | j | U+3148 |
| ㅊ | chieut | ch | U+314A |
| ㅋ | kieuk | k | U+314B |
| ㅌ | tieut | t | U+314C |
| ㅍ | pieup | p | U+314D |
| ㅎ | hieut | h | U+314E |
5 Double Consonants: ㄲ, ㄸ, ㅃ, ㅆ, ㅉ (tense/fortis versions)
10 Basic Vowels:
| Jamo | Name | Sound | Unicode (Compatibility) |
|---|---|---|---|
| ㅏ | a | /a/ | U+314F |
| ㅑ | ya | /ja/ | U+3151 |
| ㅓ | eo | /ʌ/ | U+3153 |
| ㅕ | yeo | /jʌ/ | U+3155 |
| ㅗ | o | /o/ | U+3157 |
| ㅛ | yo | /jo/ | U+315B |
| ㅜ | u | /u/ | U+315C |
| ㅠ | yu | /ju/ | U+3160 |
| ㅡ | eu | /ɯ/ | U+3161 |
| ㅣ | i | /i/ | U+3163 |
11 Compound Vowels: ㅐ, ㅒ, ㅔ, ㅖ, ㅘ, ㅙ, ㅚ, ㅝ, ㅞ, ㅟ, ㅢ
Syllable Block Composition
Every Korean syllable follows one of two patterns:
- CV (Consonant + Vowel): 가 = ㄱ + ㅏ
- CVC (Consonant + Vowel + Consonant): 한 = ㅎ + ㅏ + ㄴ
The leading consonant is the choseong (initial), the vowel is the jungseong (medial), and the optional trailing consonant is the jongseong (final). The jongseong slot can also hold a two-letter cluster, as in 닭 = ㄷ + ㅏ + ㄺ. When there is no initial consonant sound, the silent consonant ㅇ serves as a placeholder.
The visual layout of the block depends on the vowel shape:
| Vowel Type | Layout | Example |
|---|---|---|
| Vertical vowel (ㅏ, ㅓ, ㅣ...) | Consonant left, vowel right | 가 (ㄱ + ㅏ) |
| Horizontal vowel (ㅗ, ㅜ, ㅡ) | Consonant top, vowel bottom | 고 (ㄱ + ㅗ) |
| Compound vowel (ㅘ, ㅝ...) | Mixed arrangement | 과 (ㄱ + ㅘ) |
When a final consonant (jongseong) is present, it occupies the bottom of the block: 한 = ㅎ (top left) + ㅏ (right) + ㄴ (bottom).
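The layout rule in the table above can be sketched as a simple lookup on the medial vowel. The function name `layout_for_vowel` and the three labels are illustrative assumptions, and real fonts adjust glyph shapes more finely than this three-way split suggests.

```python
# Illustrative sketch: pick a block layout from the medial vowel alone.
# The grouping follows the layout table; names and labels are hypothetical.
VERTICAL = set("ㅏㅐㅑㅒㅓㅔㅕㅖㅣ")   # consonant left, vowel right
HORIZONTAL = set("ㅗㅛㅜㅠㅡ")         # consonant top, vowel bottom
MIXED = set("ㅘㅙㅚㅝㅞㅟㅢ")           # vowel sits below and to the right


def layout_for_vowel(vowel: str) -> str:
    """Return 'vertical', 'horizontal', or 'mixed' for a modern vowel jamo."""
    if vowel in VERTICAL:
        return "vertical"
    if vowel in HORIZONTAL:
        return "horizontal"
    if vowel in MIXED:
        return "mixed"
    raise ValueError(f"Not a modern Hangul vowel: {vowel!r}")


for v in "ㅏㅗㅘ":
    print(v, layout_for_vowel(v))
```

The three sets together cover all 21 modern vowels, so any other input is rejected rather than guessed at.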
Unicode Encoding of Hangul
Unicode provides three separate encodings for Hangul, each serving a different purpose:
1. Precomposed Hangul Syllables (U+AC00 – U+D7A3)
This is the primary block for modern Korean text. It contains 11,172 precomposed syllable blocks — every possible combination of:
- 19 leading consonants (choseong)
- 21 medial vowels (jungseong)
- 28 trailing options (27 jongseong plus "no trailing consonant")
19 × 21 × 28 = 11,172 syllables.
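As a quick check of that arithmetic, the short sketch below recomputes the count and confirms that it lands exactly on the last code point of the block. The constant names are chosen here for illustration only.

```python
# Sanity check: the three component counts account for the whole block.
LEADS, VOWELS, TRAILS = 19, 21, 28   # 28 = 27 final consonants + "none"
S_BASE = 0xAC00                      # first precomposed syllable

count = LEADS * VOWELS * TRAILS
last = S_BASE + count - 1            # last code point in the block

print(count)             # 11172
print(f"U+{last:04X}")   # U+D7A3
```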
The block is algorithmically organized, meaning you can compute the components of any syllable from its code point:
def decompose_hangul(syllable: str) -> tuple[int, int, int]:
    # Decompose a precomposed Hangul syllable into LVT indices.
    code = ord(syllable) - 0xAC00
    if not (0 <= code < 11172):
        raise ValueError("Not a Hangul syllable")
    trail = code % 28   # Jongseong index (0 = no trailing)
    code = code // 28
    vowel = code % 21   # Jungseong index
    lead = code // 21   # Choseong index
    return lead, vowel, trail


def compose_hangul(lead: int, vowel: int, trail: int = 0) -> str:
    # Compose a Hangul syllable from LVT indices.
    code = 0xAC00 + (lead * 21 + vowel) * 28 + trail
    return chr(code)


# Example: 한 = lead ㅎ (18) + vowel ㅏ (0) + trail ㄴ (4)
print(decompose_hangul("한"))    # (18, 0, 4)
print(compose_hangul(18, 0, 4))  # 한
This arithmetic layout is unusual in Unicode: Hangul syllables are the one case where the standard defines canonical decomposition and composition by formula rather than by per-character table entries.
2. Hangul Jamo (U+1100 – U+11FF)
This block contains the conjoining jamo — individual consonant and vowel letters that rendering engines combine into syllable blocks on the fly:
| Range | Content | Count |
|---|---|---|
| U+1100 – U+1112 | Leading consonants (choseong) | 19 |
| U+1161 – U+1175 | Medial vowels (jungseong) | 21 |
| U+11A8 – U+11C2 | Trailing consonants (jongseong) | 27 |
| U+1113 – U+115F | Old Korean leading consonants | Historical |
| U+1176 – U+11A7 | Old Korean medial vowels | Historical |
| U+11C3 – U+11FF | Old Korean trailing consonants | Historical |
When conjoining jamo appear in sequence (L + V or L + V + T), the rendering engine forms them into a syllable block visually. This encoding is essential for representing Old Korean text that uses archaic jamo not covered by the 11,172 precomposed syllables.
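To make the relationship between the two encodings concrete, here is a minimal sketch that builds a conjoining jamo sequence from zero-based L, V, T indices and lets Python's `unicodedata.normalize` compose it. The helper name `to_conjoining` and the constant names are assumptions made for this example; the base values come from the ranges in the table above.

```python
import unicodedata

L_BASE = 0x1100   # first modern leading consonant
V_BASE = 0x1161   # first modern medial vowel
T_BASE = 0x11A7   # one below U+11A8, so trail index 0 means "no trailing"


def to_conjoining(lead: int, vowel: int, trail: int = 0) -> str:
    """Build an L+V or L+V+T conjoining jamo sequence (hypothetical helper)."""
    if not (0 <= lead < 19 and 0 <= vowel < 21 and 0 <= trail < 28):
        raise ValueError("Index out of range for modern Hangul")
    seq = chr(L_BASE + lead) + chr(V_BASE + vowel)
    if trail:
        seq += chr(T_BASE + trail)
    return seq


han = to_conjoining(18, 0, 4)
print([f"U+{ord(c):04X}" for c in han])   # ['U+1112', 'U+1161', 'U+11AB']
print(unicodedata.normalize("NFC", han) == "\uD55C")  # True: composes to 한
```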
3. Hangul Compatibility Jamo (U+3130 – U+318F)
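This block contains individual jamo for display as standalone letters (e.g., in dictionaries, linguistics texts, or keyboard labeling). Unlike conjoining jamo, these do not combine into syllable blocks during rendering. The table below lists it alongside the two extension blocks, which hold additional conjoining jamo for Old Korean rather than standalone letters: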
| Block | Range | Purpose |
|---|---|---|
| Hangul Compatibility Jamo | U+3130 – U+318F | Standalone display |
| Hangul Jamo Extended-A | U+A960 – U+A97F | Old Korean choseong |
| Hangul Jamo Extended-B | U+D7B0 – U+D7FF | Old Korean jungseong/jongseong |
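The difference between compatibility and conjoining jamo shows up directly in normalization. The sketch below, which uses only Python's standard `unicodedata` module, compares the two code points for ㄱ and shows that NFC composes a conjoining sequence but leaves a compatibility sequence untouched. The variable names are illustrative.

```python
import unicodedata

compat_g = "\u3131"      # ㄱ from the Compatibility Jamo block
conjoining_g = "\u1100"  # leading ㄱ from the Hangul Jamo block

print(unicodedata.name(compat_g))      # HANGUL LETTER KIYEOK
print(unicodedata.name(conjoining_g))  # HANGUL CHOSEONG KIYEOK

# Three compatibility letters stay three separate letters under NFC.
compat_seq = "\u314E\u314F\u3134"      # ㅎ ㅏ ㄴ
compat_nfc = unicodedata.normalize("NFC", compat_seq)
print(compat_nfc == compat_seq)        # True: nothing was composed

# The matching conjoining sequence composes into one syllable.
conjoining_seq = "\u1112\u1161\u11AB"
conjoining_nfc = unicodedata.normalize("NFC", conjoining_seq)
print(conjoining_nfc == "\uD55C")      # True: 한
```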
Normalization: NFC vs. NFD
The existence of both precomposed syllables and conjoining jamo means the same Korean text can be represented in two ways:
import unicodedata
# NFC: precomposed (standard for Korean text)
nfc = "한글"
print([f"U+{ord(c):04X}" for c in nfc])
# ['U+D55C', 'U+AE00'] — 2 precomposed syllable code points
# NFD: decomposed into conjoining jamo
nfd = unicodedata.normalize("NFD", nfc)
print([f"U+{ord(c):04X}" for c in nfd])
# ['U+1112', 'U+1161', 'U+11AB', 'U+1100', 'U+1173', 'U+11AF']
# ㅎ + ㅏ + ㄴ + ㄱ + ㅡ + ㄹ — 6 conjoining jamo
# Both render identically: 한글
print(nfc == nfd) # False — different code points!
print(nfc == unicodedata.normalize("NFC", nfd)) # True
Always normalize Korean text to NFC for storage and comparison. NFC is the form that most Korean software, websites, and databases produce and expect. Apple's HFS+ file system is the well-known exception: it stores filenames in a decomposed form, which causes filename comparison issues with Korean files.
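One way to apply this advice is to route every comparison through a small helper. The sketch below is a minimal version; the names `nfc` and `same_text` are hypothetical, and a real application would usually normalize once at the input boundary instead of on every comparison.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return text in NFC (precomposed syllables where possible)."""
    return unicodedata.normalize("NFC", text)


def same_text(a: str, b: str) -> bool:
    """Compare two strings regardless of NFC/NFD differences."""
    return nfc(a) == nfc(b)


composed = "\uD55C\uAE00"                             # 한글, 2 code points
decomposed = unicodedata.normalize("NFD", composed)   # 6 conjoining jamo

print(composed == decomposed)            # False
print(same_text(composed, decomposed))   # True
print(len(composed), len(decomposed))    # 2 6
```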
Korean Text Processing
Jamo Extraction
Extracting individual jamo from precomposed syllables is a common operation for Korean search, phonetic analysis, and input:
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
JONGSEONG = ("", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ",
             "ㄺ", "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ",
             "ㅂ", "ㅄ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ",
             "ㅌ", "ㅍ", "ㅎ")


def extract_jamo(text: str) -> str:
    # Extract individual jamo from Korean text.
    result = []
    for char in text:
        code = ord(char) - 0xAC00
        if 0 <= code < 11172:
            lead = code // (21 * 28)
            vowel = (code // 28) % 21
            trail = code % 28
            result.append(CHOSEONG[lead])
            result.append(JUNGSEONG[vowel])
            if trail > 0:
                result.append(JONGSEONG[trail])
        else:
            # Non-syllable characters pass through unchanged.
            result.append(char)
    return "".join(result)

print(extract_jamo("한글")) # ㅎㅏㄴㄱㅡㄹ
Initial Consonant Search (초성 검색)
A uniquely Korean feature is choseong search — searching by typing only the initial consonants of each syllable. For example, typing "ㅎㄱ" matches "한글" because ㅎ is the initial of 한 and ㄱ is the initial of 글:
def get_choseong(text: str) -> str:
    # Extract only initial consonants from Korean text.
    result = []
    for char in text:
        code = ord(char) - 0xAC00
        if 0 <= code < 11172:
            lead = code // (21 * 28)
            result.append(CHOSEONG[lead])
        else:
            result.append(char)
    return "".join(result)


def choseong_matches(query: str, target: str) -> bool:
    # Check if a choseong query matches the target string.
    target_choseong = get_choseong(target)
    return query in target_choseong

print(choseong_matches("ㅎㄱ", "한글")) # True
print(choseong_matches("ㄷㅎ", "대한민국")) # True
This feature is common in Korean search boxes, address books, and autocomplete systems.
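Users often mix full syllables and bare initials in one query, for example "한ㄱ" for "한글". The substring check above does not handle that case, so here is a hedged sketch of one possible extension. The names `lead_of`, `char_matches`, and `mixed_choseong_matches` are hypothetical, and this is a simple quadratic scan, not a description of how any particular search engine works.

```python
CHOSEONG_LIST = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"


def lead_of(char: str) -> str:
    """Return the initial consonant of a syllable, or the char itself."""
    code = ord(char) - 0xAC00
    if 0 <= code < 11172:
        return CHOSEONG_LIST[code // (21 * 28)]
    return char


def char_matches(q: str, t: str) -> bool:
    """A bare initial matches any syllable starting with it; else exact."""
    if q in CHOSEONG_LIST:
        return lead_of(t) == q
    return q == t


def mixed_choseong_matches(query: str, target: str) -> bool:
    """Check whether query matches some contiguous window of target."""
    n = len(query)
    if n == 0:
        return True
    return any(
        all(char_matches(q, t) for q, t in zip(query, target[i:i + n]))
        for i in range(len(target) - n + 1)
    )


print(mixed_choseong_matches("한ㄱ", "한글"))      # True
print(mixed_choseong_matches("ㄷ한", "대한민국"))  # True
print(mixed_choseong_matches("한ㄴ", "한글"))      # False
```

Both the query and the target are assumed to be in NFC already; decomposed input would need normalizing first.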
Sorting Korean Text
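Korean collation sorts by syllable block in dictionary order: first by choseong, then jungseong, then jongseong. Precomposed Hangul syllables (U+AC00–U+D7A3) are arranged in exactly this order, which matches South Korean dictionary order for modern syllables. Text that mixes in hanja, Latin letters, or decomposed jamo needs a proper collation library, but for NFC text made up of precomposed syllables, simple code point sorting is enough, a direct benefit of the algorithmic encoding: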
words = ["바나나", "가나다", "사과", "나무"]
sorted_words = sorted(words)
print(sorted_words) # ['가나다', '나무', '바나나', '사과'] — correct!
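Code point sorting only works when every string is in the same normalization form. The sketch below shows a decomposed word jumping to the front of a naive sort, and a key function that avoids it. The name `nfc_sort_key` is an assumption for this example; the explicit `normalize` calls on the sample words are there only to force a known form regardless of how the source file was saved.

```python
import unicodedata


def nfc_sort_key(word: str) -> str:
    """Sort key that compares words in NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", word)


words = [
    unicodedata.normalize("NFC", "가나다"),
    unicodedata.normalize("NFD", "사과"),   # decomposed, e.g. from a filename
    unicodedata.normalize("NFC", "나무"),
]

naive = sorted(words)                    # conjoining jamo sort before U+AC00
fixed = sorted(words, key=nfc_sort_key)  # compare everything as NFC

print([nfc_sort_key(w) for w in naive])  # ['사과', '가나다', '나무']
print([nfc_sort_key(w) for w in fixed])  # ['가나다', '나무', '사과']
```

The key function leaves the stored strings unchanged; it only affects how they are compared.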
JavaScript Considerations
// Hangul syllable detection
function isHangulSyllable(char) {
  const code = char.codePointAt(0);
  return code >= 0xAC00 && code <= 0xD7A3;
}

// Decompose syllable
function decomposeHangul(syllable) {
  const code = syllable.codePointAt(0) - 0xAC00;
  const trail = code % 28;
  const vowel = Math.floor(code / 28) % 21;
  const lead = Math.floor(code / (28 * 21));
  return { lead, vowel, trail };
}

// String length is straightforward — each precomposed syllable = 1 code unit
console.log("한글".length); // 2
macOS NFD Problem
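Apple's HFS+ file system stores filenames in a variant of NFD normalization. This means a file named "한글.txt" created on an HFS+ volume is stored as a sequence of conjoining jamo, not precomposed syllables. APFS preserves whatever form it is given, so decomposed names can still appear on macOS volumes. Windows and Linux treat filenames as opaque sequences, and most software there produces NFC, so when such a filename is transferred to those systems, comparison and lookup can fail: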
import os
import unicodedata
# Filenames from macOS may be in NFD
for name in os.listdir("."):
    normalized = unicodedata.normalize("NFC", name)
    if name != normalized:
        print(f"NFD filename detected: {name!r}")
        print(f"  NFC equivalent: {normalized!r}")
Always normalize filenames to NFC when processing Korean text across platforms.
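In practice this usually means comparing names in NFC while still opening the file by the name the file system reported. The sketch below works on a plain list of names so that it can run anywhere; the function name `find_name` is hypothetical, and in real code the list would come from something like `os.listdir`.

```python
import unicodedata
from typing import Iterable, Optional


def find_name(names: Iterable[str], wanted: str) -> Optional[str]:
    """Return the stored name that equals `wanted` after NFC, if any."""
    target = unicodedata.normalize("NFC", wanted)
    for name in names:
        if unicodedata.normalize("NFC", name) == target:
            return name   # keep the on-disk spelling for later open() calls
    return None


# Simulated directory listing containing one decomposed (NFD) filename.
listing = [unicodedata.normalize("NFD", "한글.txt"), "readme.md"]
query = unicodedata.normalize("NFC", "한글.txt")

print(query in listing)                     # False: raw comparison misses it
print(find_name(listing, query) is not None)  # True
```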
Key Takeaways
- Hangul is a featural alphabet invented in 1443, where letter shapes encode phonetic features and jamo combine into syllable blocks.
- Unicode provides three encodings: precomposed syllables (11,172 at U+AC00–U+D7A3), conjoining jamo (U+1100–U+11FF), and compatibility jamo (U+3130–U+318F).
- The precomposed block is algorithmically structured: syllable = 0xAC00 + (lead * 21 + vowel) * 28 + trail, enabling decomposition/composition without lookup tables.
- NFC normalization is essential for Korean text: filenames from macOS may be stored in NFD, causing cross-platform comparison issues.
- Choseong search (초성 검색) is a distinctly Korean text feature that relies on extracting initial consonants from the algorithmic encoding.
- Simple code point sorting produces correct Korean dictionary order, thanks to the mathematical arrangement of the precomposed syllable block.