Korean Hangul System

Hangul was invented in 1443 by King Sejong as a scientific alphabet where syllable blocks are algorithmically composed from individual jamo (consonants and vowels), a structure Unicode mirrors with both jamo and precomposed syllable encodings.

Hangul is widely regarded as one of the most scientifically designed writing systems in human history. Created in 1443 by King Sejong the Great of the Joseon Dynasty, Hangul was purpose-built so that "a wise man can learn it in a morning and even a foolish man can learn it in ten days." Unlike scripts that evolved organically over millennia, Hangul was invented with systematic phonological principles — and Unicode's encoding of Hangul reflects this algorithmic design with remarkable elegance. This guide tells the story of Hangul, explains how Unicode encodes it, and covers the technical details of Korean text processing.

The Invention of Hangul

Before Hangul, Korean was written using Chinese characters (hanja), which were poorly suited to Korean grammar and phonology. Literary Chinese was the language of the court and educated elite, leaving the majority of the population effectively illiterate.

In 1443, King Sejong and a team of scholars at the Hall of Worthies (Jiphyeonjeon) created a new alphabet described in the document Hunminjeongeum ("The Correct Sounds for the Instruction of the People"), published in 1446. The script was revolutionary for several reasons:

  1. Featural design: The shapes of consonant letters are based on the position of the tongue, lips, and throat during pronunciation
  2. Systematic vowels: Vowel letters are composed from three elements representing heaven (dot/short stroke), earth (horizontal line), and human (vertical line)
  3. Syllable blocks: Individual letters (jamo) are arranged into square blocks representing syllables, giving text a visual density comparable to Chinese characters

The Structure of Hangul

Jamo: The Building Blocks

Hangul jamo (자모, letters) consist of consonants and vowels:

14 Basic Consonants:

Jamo Name Sound Unicode (Compatibility)
ㄱ giyeok g/k U+3131
ㄴ nieun n U+3134
ㄷ digeut d/t U+3137
ㄹ rieul r/l U+3139
ㅁ mieum m U+3141
ㅂ bieup b/p U+3142
ㅅ siot s U+3145
ㅇ ieung ng/silent U+3147
ㅈ jieut j U+3148
ㅊ chieut ch U+314A
ㅋ kieuk k U+314B
ㅌ tieut t U+314C
ㅍ pieup p U+314D
ㅎ hieut h U+314E

5 Double Consonants: ㄲ, ㄸ, ㅃ, ㅆ, ㅉ (tense/fortis versions)

10 Basic Vowels:

Jamo Name Sound Unicode (Compatibility)
ㅏ a /a/ U+314F
ㅑ ya /ja/ U+3151
ㅓ eo /ʌ/ U+3153
ㅕ yeo /jʌ/ U+3155
ㅗ o /o/ U+3157
ㅛ yo /jo/ U+315B
ㅜ u /u/ U+315C
ㅠ yu /ju/ U+3160
ㅡ eu /ɯ/ U+3161
ㅣ i /i/ U+3163

11 Compound Vowels: ㅐ, ㅒ, ㅔ, ㅖ, ㅘ, ㅙ, ㅚ, ㅝ, ㅞ, ㅟ, ㅢ

Syllable Block Composition

Every Korean syllable follows one of two patterns:

  1. CV (Consonant + Vowel): 가 = ㄱ + ㅏ
  2. CVC (Consonant + Vowel + Consonant): 한 = ㅎ + ㅏ + ㄴ

The leading consonant is the choseong (initial), the vowel is the jungseong (medial), and the optional trailing consonant is the jongseong (final), which may be a single consonant or a written cluster such as ㄺ in 닭. When there is no initial consonant sound, the silent consonant ㅇ serves as a placeholder, as in 아 = ㅇ + ㅏ.

The visual layout of the block depends on the vowel shape:

Vowel Type Layout Example
Vertical vowel (ㅏ, ㅓ, ㅣ...) Consonant left, vowel right 가 (ㄱ + ㅏ)
Horizontal vowel (ㅗ, ㅜ, ㅡ) Consonant top, vowel bottom 고 (ㄱ + ㅗ)
Compound vowel (ㅘ, ㅝ...) Mixed arrangement 과 (ㄱ + ㅘ)

When a final consonant (jongseong) is present, it occupies the bottom of the block, and the initial and vowel share the space above it: 한 = ㅎ (top left) + ㅏ (top right) + ㄴ (bottom).

Unicode Encoding of Hangul

Unicode provides three separate encodings for Hangul, each serving a different purpose:

1. Precomposed Hangul Syllables (U+AC00 – U+D7A3)

This is the primary block for modern Korean text. It contains 11,172 precomposed syllable blocks — every possible combination of:

  • 19 leading consonants (choseong)
  • 21 medial vowels (jungseong)
  • 28 trailing consonants (jongseong, including "no trailing consonant")

19 × 21 × 28 = 11,172 syllables.

The block is algorithmically organized, meaning you can compute the components of any syllable from its code point:

def decompose_hangul(syllable: str) -> tuple[int, int, int]:
    # Decompose a precomposed Hangul syllable into LVT indices.
    code = ord(syllable) - 0xAC00
    if not (0 <= code < 11172):
        raise ValueError("Not a Hangul syllable")

    trail = code % 28       # Jongseong index (0 = no trailing)
    code = code // 28
    vowel = code % 21       # Jungseong index
    lead = code // 21       # Choseong index

    return lead, vowel, trail

def compose_hangul(lead: int, vowel: int, trail: int = 0) -> str:
    # Compose a Hangul syllable from LVT indices.
    code = 0xAC00 + (lead * 21 + vowel) * 28 + trail
    return chr(code)

# Example: 한 = lead ㅎ (18) + vowel ㅏ (0) + trail ㄴ (4)
print(decompose_hangul("한"))  # (18, 0, 4)
print(compose_hangul(18, 0, 4))  # 한

This arithmetic structure is unusual in Unicode: the Unicode Standard itself defines Hangul syllable composition and decomposition by formula rather than through per-character mapping tables.
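The arithmetic can be checked exhaustively. The following is a minimal sketch, not a reference implementation: the constant names and the helper `check_round_trip` are illustrative choices, and the functions are redefined here with range validation so the block runs on its own.

```python
S_BASE = 0xAC00                          # first precomposed syllable
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28   # choseong, jungseong, jongseong slots
S_COUNT = L_COUNT * V_COUNT * T_COUNT    # 11,172 syllables in total

def decompose(syllable: str) -> tuple[int, int, int]:
    """Split one precomposed syllable into (lead, vowel, trail) indices."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < S_COUNT:
        raise ValueError("Not a Hangul syllable")
    lead, rest = divmod(index, V_COUNT * T_COUNT)
    vowel, trail = divmod(rest, T_COUNT)
    return lead, vowel, trail

def compose(lead: int, vowel: int, trail: int = 0) -> str:
    """Build a precomposed syllable, rejecting out-of-range indices."""
    if not (0 <= lead < L_COUNT and 0 <= vowel < V_COUNT and 0 <= trail < T_COUNT):
        raise ValueError("LVT index out of range")
    return chr(S_BASE + (lead * V_COUNT + vowel) * T_COUNT + trail)

def check_round_trip() -> int:
    """Count the syllables that survive decompose -> compose unchanged."""
    return sum(
        compose(*decompose(chr(cp))) == chr(cp)
        for cp in range(S_BASE, S_BASE + S_COUNT)
    )

print(check_round_trip())          # 11172
print(hex(S_BASE + S_COUNT - 1))   # 0xd7a3, the last code point of the block
```

Validating the indices in `compose` matters in practice: without the check, an out-of-range trail index silently produces a different, valid-looking syllable.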

2. Hangul Jamo (U+1100 – U+11FF)

This block contains the conjoining jamo — individual consonant and vowel letters that rendering engines combine into syllable blocks on the fly:

Range Content Count
U+1100 – U+1112 Leading consonants (choseong) 19
U+1161 – U+1175 Medial vowels (jungseong) 21
U+11A8 – U+11C2 Trailing consonants (jongseong) 27
U+1113 – U+115F Old Korean leading consonants Historical
U+1176 – U+11A7 Old Korean medial vowels Historical
U+11C3 – U+11FF Old Korean trailing consonants Historical

When conjoining jamo appear in sequence (L + V or L + V + T), the rendering engine forms them into a syllable block visually. This encoding is essential for representing Old Korean text that uses archaic jamo not covered by the 11,172 precomposed syllables.
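For modern syllables, the conjoining jamo code points follow directly from the LVT indices and the range starts in the table above. The sketch below assumes the helper name `to_conjoining_jamo` (not part of any library) and cross-checks its output against Python's `unicodedata` NFD normalization.

```python
import unicodedata

S_BASE = 0xAC00
L_BASE = 0x1100   # first leading consonant
V_BASE = 0x1161   # first medial vowel
T_BASE = 0x11A7   # one below U+11A8, because trail index 0 means "no final"
V_COUNT, T_COUNT = 21, 28

def to_conjoining_jamo(syllable: str) -> str:
    """Map one precomposed syllable to its L + V (+ T) conjoining jamo."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < 11172:
        raise ValueError("Not a precomposed Hangul syllable")
    lead, rest = divmod(index, V_COUNT * T_COUNT)
    vowel, trail = divmod(rest, T_COUNT)
    jamo = chr(L_BASE + lead) + chr(V_BASE + vowel)
    if trail:
        jamo += chr(T_BASE + trail)
    return jamo

han = "\ud55c"  # 한
jamo = to_conjoining_jamo(han)
print([f"U+{ord(c):04X}" for c in jamo])           # ['U+1112', 'U+1161', 'U+11AB']
print(jamo == unicodedata.normalize("NFD", han))   # True
```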

3. Hangul Compatibility Jamo (U+3130 – U+318F)

This block contains individual jamo for display as standalone letters (e.g., in dictionaries, linguistics texts, or keyboard labeling). Unlike conjoining jamo, these do not combine into syllable blocks during rendering.

Block Range Purpose
Hangul Compatibility Jamo U+3130 – U+318F Standalone display
Hangul Jamo Extended-A U+A960 – U+A97F Old Korean choseong
Hangul Jamo Extended-B U+D7B0 – U+D7FF Old Korean jungseong/jongseong
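The difference between compatibility and conjoining jamo shows up under normalization. The following sketch uses only Python's standard `unicodedata` module; the helper `describe` is an illustrative name.

```python
import unicodedata

compat = "\u3131"      # ㄱ from Hangul Compatibility Jamo
conjoining = "\u1100"  # leading ㄱ from Hangul Jamo

print(unicodedata.name(compat))      # HANGUL LETTER KIYEOK
print(unicodedata.name(conjoining))  # HANGUL CHOSEONG KIYEOK

def describe(text: str) -> list[str]:
    """List the code points of a string in U+XXXX form."""
    return [f"U+{ord(c):04X}" for c in text]

# NFC leaves compatibility jamo alone: two standalone letters stay two letters.
print(describe(unicodedata.normalize("NFC", "\u3131\u314f")))  # ['U+3131', 'U+314F']

# Conjoining L + V compose into one precomposed syllable under NFC.
print(describe(unicodedata.normalize("NFC", "\u1100\u1161")))  # ['U+AC00']

# Compatibility normalization maps the standalone letter to a conjoining one.
print(unicodedata.normalize("NFKD", compat) == conjoining)     # True
```

The last line is worth keeping in mind when choosing a normalization form: the compatibility forms (NFKC/NFKD) change standalone letters into conjoining jamo, which may not be what a dictionary or keyboard-label application wants.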

Normalization: NFC vs. NFD

The existence of both precomposed syllables and conjoining jamo means the same Korean text can be represented in two ways:

import unicodedata

# NFC: precomposed (standard for Korean text)
nfc = "한글"
print([f"U+{ord(c):04X}" for c in nfc])
# ['U+D55C', 'U+AE00']  — 2 precomposed syllable code points

# NFD: decomposed into conjoining jamo
nfd = unicodedata.normalize("NFD", nfc)
print([f"U+{ord(c):04X}" for c in nfd])
# ['U+1112', 'U+1161', 'U+11AB', 'U+1100', 'U+1173', 'U+11AF']
# ㅎ + ㅏ + ㄴ + ㄱ + ㅡ + ㄹ  — 6 conjoining jamo

# Both render identically: 한글
print(nfc == nfd)  # False — different code points!
print(nfc == unicodedata.normalize("NFC", nfd))  # True

Always normalize Korean text to NFC for storage and comparison. NFC is the standard form used by Korean operating systems, websites, and databases. Filenames coming from macOS are a notorious exception: they are often in NFD, which causes comparison issues with Korean files.
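One way to apply this advice is to normalize at the boundary where text enters the system and to compare only normalized values. The helper names `nfc` and `same_text` below are illustrative, not part of any library.

```python
import unicodedata

def nfc(text: str) -> str:
    """Return the NFC form of text (precomposed syllables for Korean)."""
    return unicodedata.normalize("NFC", text)

def same_text(a: str, b: str) -> bool:
    """Compare two strings while ignoring NFC/NFD differences."""
    return nfc(a) == nfc(b)

stored = nfc("한글")                              # what the database holds
incoming = unicodedata.normalize("NFD", stored)   # e.g. a decomposed filename

print(stored == incoming)                  # False
print(same_text(stored, incoming))         # True
print(len(incoming), len(nfc(incoming)))   # 6 2
```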

Korean Text Processing

Jamo Extraction

Extracting individual jamo from precomposed syllables is a common operation for Korean search, phonetic analysis, and input:

CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
JONGSEONG = ("", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ",
             "ㄺ", "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ",
             "ㅂ", "ㅄ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ",
             "ㅌ", "ㅍ", "ㅎ")

def extract_jamo(text: str) -> str:
    # Extract individual jamo from Korean text.
    result = []
    for char in text:
        code = ord(char) - 0xAC00
        if 0 <= code < 11172:
            lead = code // (21 * 28)
            vowel = (code // 28) % 21
            trail = code % 28
            result.append(CHOSEONG[lead])
            result.append(JUNGSEONG[vowel])
            if trail > 0:
                result.append(JONGSEONG[trail])
        else:
            result.append(char)
    return "".join(result)

print(extract_jamo("한글"))  # ㅎㅏㄴㄱㅡㄹ
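Some search and analysis pipelines go one step further and split compound finals such as ㄺ into their two component consonants. Whether that is desirable depends on the application; the sketch below is one possible post-processing step for a jamo string, with `COMPOUND_FINALS` and `split_compound_finals` as illustrative names.

```python
# Each compound final maps to the two consonants it is written with.
COMPOUND_FINALS = {
    "ㄳ": "ㄱㅅ", "ㄵ": "ㄴㅈ", "ㄶ": "ㄴㅎ", "ㄺ": "ㄹㄱ",
    "ㄻ": "ㄹㅁ", "ㄼ": "ㄹㅂ", "ㄽ": "ㄹㅅ", "ㄾ": "ㄹㅌ",
    "ㄿ": "ㄹㅍ", "ㅀ": "ㄹㅎ", "ㅄ": "ㅂㅅ",
}

def split_compound_finals(jamo: str) -> str:
    """Replace each compound final with its two component consonants."""
    return "".join(COMPOUND_FINALS.get(ch, ch) for ch in jamo)

# 닭 extracts to ㄷ + ㅏ + ㄺ; splitting the final gives four letters.
print(split_compound_finals("ㄷㅏㄺ"))  # ㄷㅏㄹㄱ
```

Double consonants such as ㄲ and ㅆ are left untouched because they are single tense consonants rather than clusters.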

A uniquely Korean feature is choseong search — searching by typing only the initial consonants of each syllable. For example, typing "ㅎㄱ" matches "한글" because ㅎ is the initial of 한 and ㄱ is the initial of 글:

def get_choseong(text: str) -> str:
    # Extract only initial consonants from Korean text.
    result = []
    for char in text:
        code = ord(char) - 0xAC00
        if 0 <= code < 11172:
            lead = code // (21 * 28)
            result.append(CHOSEONG[lead])
        else:
            result.append(char)
    return "".join(result)

def choseong_matches(query: str, target: str) -> bool:
    # Check if a choseong query matches the target string.
    target_choseong = get_choseong(target)
    return query in target_choseong

print(choseong_matches("ㅎㄱ", "한글"))  # True
print(choseong_matches("ㄷㅎ", "대한민국"))  # True

This feature is implemented in virtually every Korean search engine, address book, and autocomplete system.
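Real autocomplete boxes often accept mixed queries in which some syllables are fully typed and others are given only as initials. The following is a minimal sketch of that idea under the assumption that input is already NFC; the names `initial_of`, `mixed_match`, and `suggest` are hypothetical.

```python
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"

def initial_of(char: str) -> str:
    """Return the initial consonant of a syllable, or the char itself."""
    index = ord(char) - 0xAC00
    if 0 <= index < 11172:
        return CHOSEONG[index // (21 * 28)]
    return char

def char_matches(q: str, t: str) -> bool:
    """A bare initial matches any syllable starting with it; else exact match."""
    if q in CHOSEONG:
        return initial_of(t) == q
    return q == t

def mixed_match(query: str, target: str) -> bool:
    """Check whether query matches target at any offset."""
    n = len(query)
    return any(
        all(char_matches(q, t) for q, t in zip(query, target[start:start + n]))
        for start in range(len(target) - n + 1)
    )

def suggest(query: str, candidates: list[str]) -> list[str]:
    """Filter candidates, keeping their original order."""
    return [c for c in candidates if mixed_match(query, c)]

names = ["한글", "한국", "대한민국", "학교"]
print(suggest("ㅎㄱ", names))   # ['한글', '한국', '학교']
print(suggest("한ㄱ", names))   # ['한글', '한국']
```

This linear scan is fine for short lists; a production system would more likely index the initial-consonant string of each entry ahead of time.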

Sorting Korean Text

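Korean collation sorts by syllable block in dictionary order: first by choseong, then jungseong, then jongseong. Because precomposed Hangul syllables (U+AC00–U+D7A3) are arranged in exactly this order, simple code point sorting of NFC text produces the South Korean dictionary order, a direct benefit of the algorithmic encoding. The shortcut holds only for precomposed syllables; NFD strings, standalone jamo, or text mixed with other scripts need normalization or locale-aware collation first: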

words = ["바나나", "가나다", "사과", "나무"]
sorted_words = sorted(words)
print(sorted_words)  # ['가나다', '나무', '바나나', '사과'] — correct!
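Code point sorting breaks as soon as a decomposed string slips into the list, because conjoining jamo (U+1100 onward) sort before every precomposed syllable. A hedged sketch of a safer sort, using an NFC sort key, follows; `sort_korean` is an illustrative name, and it still does not replace full locale-aware collation.

```python
import unicodedata

def sort_korean(words: list[str]) -> list[str]:
    """Sort by NFC form so that NFD input does not disturb the order."""
    return sorted(words, key=lambda w: unicodedata.normalize("NFC", w))

nfd_sagwa = unicodedata.normalize("NFD", "사과")   # decomposed input
words = ["바나나", nfd_sagwa, "가나다"]

naive = sorted(words)
print(naive[0] == nfd_sagwa)   # True: the decomposed word jumps to the front

fixed = [unicodedata.normalize("NFC", w) for w in sort_korean(words)]
print(fixed)                   # ['가나다', '바나나', '사과']
```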

JavaScript Considerations

// Hangul syllable detection
function isHangulSyllable(char) {
  const code = char.codePointAt(0);
  return code >= 0xAC00 && code <= 0xD7A3;
}

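// Decompose a precomposed syllable into LVT indices
function decomposeHangul(syllable) {
  const code = syllable.codePointAt(0) - 0xAC00;
  if (code < 0 || code >= 11172) {
    throw new RangeError("Not a Hangul syllable");
  }
  const trail = code % 28;
  const vowel = Math.floor(code / 28) % 21;
  const lead = Math.floor(code / (28 * 21));
  return { lead, vowel, trail };
}

// In NFC, each precomposed syllable is one UTF-16 code unit
console.log("한글".length); // 2
// The same text in NFD is six conjoining jamo
console.log("한글".normalize("NFD").length); // 6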
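The same length question can be examined from Python, where `len` counts code points rather than UTF-16 code units. The sketch below reports three common measures side by side; `length_report` is an illustrative helper, and the UTF-16 figure corresponds to what JavaScript's `.length` would return.

```python
import unicodedata

def length_report(text: str) -> dict[str, int]:
    """Measure a string in code points, UTF-8 bytes, and UTF-16 code units."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }

nfc = unicodedata.normalize("NFC", "한글")
nfd = unicodedata.normalize("NFD", nfc)

print(length_report(nfc))  # {'code_points': 2, 'utf8_bytes': 6, 'utf16_units': 2}
print(length_report(nfd))  # {'code_points': 6, 'utf8_bytes': 18, 'utf16_units': 6}
```

The threefold growth under NFD is one practical reason to store Korean text in NFC, especially where column widths or payload limits are counted in bytes.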

macOS NFD Problem

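Apple's older HFS+ file system stores filenames in a variant of NFD normalization, so a file named "한글.txt" created there is stored as a sequence of conjoining jamo, not precomposed syllables. APFS keeps whichever form it is given but treats the NFC and NFD spellings as the same name, so decomposed names still circulate on macOS. When such a filename is transferred to Windows or Linux, where names are compared without normalization and NFC is the norm, comparison and lookup can fail: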

import os
import unicodedata

# Filenames from macOS may be in NFD
for name in os.listdir("."):
    normalized = unicodedata.normalize("NFC", name)
    if name != normalized:
        print(f"NFD filename detected: {name!r}")
        print(f"  NFC equivalent: {normalized!r}")

Always normalize filenames to NFC when processing Korean text across platforms.
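A lookup that tolerates either form can be sketched as a pure function over a directory listing. The name `find_name` is hypothetical; it takes any iterable of names so it can be used with `os.listdir` or tested without touching the file system.

```python
import unicodedata

def find_name(names, wanted: str):
    """Return the first entry equal to `wanted` after NFC normalization, else None."""
    target = unicodedata.normalize("NFC", wanted)
    for name in names:
        if unicodedata.normalize("NFC", name) == target:
            return name   # the original spelling, usable for open() or rename()
    return None

# Simulated listing containing a decomposed (NFD) filename.
listing = [unicodedata.normalize("NFD", "한글.txt"), "readme.md"]

hit = find_name(listing, "한글.txt")
print(hit is not None)   # True
print(len(hit))          # 10: six jamo plus ".txt"
```

In real use the first argument would typically be `os.listdir(path)`. Returning the entry exactly as listed matters, because that spelling is the one the file system will accept on platforms that compare names without normalization.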

Key Takeaways

  • Hangul is a featural alphabet invented in 1443, where letter shapes encode phonetic features and jamo combine into syllable blocks.
  • Unicode provides three encodings: precomposed syllables (11,172 at U+AC00–U+D7A3), conjoining jamo (U+1100–U+11FF), and compatibility jamo (U+3130–U+318F).
  • The precomposed block is algorithmically structured: syllable = 0xAC00 + (lead * 21 + vowel) * 28 + trail, enabling decomposition/composition without lookup tables.
  • NFC normalization is essential for Korean text: filenames that originate on macOS are often in NFD, causing cross-platform comparison issues.
  • Choseong search (초성 검색) is a distinctly Korean text feature that relies on extracting initial consonants from the algorithmic encoding.
  • For NFC text made of precomposed syllables, simple code point sorting produces correct Korean dictionary order, thanks to the mathematical arrangement of the precomposed syllable block.
