EUC-KR
A Korean character encoding based on KS X 1001 that maps Hangul syllables and Hanja to two-byte sequences.
What is EUC-KR?
EUC-KR (Extended Unix Code for Korean) is a character encoding for the Korean language based on the KS X 1001 standard (formerly KS C 5601). It uses a double-byte encoding for Hangul syllables and Hanja (Chinese characters used in Korean) while keeping ASCII characters as single bytes.
EUC-KR was the dominant Korean encoding on Unix systems and the internet from the 1990s through the early 2000s. While UTF-8 has largely replaced it for new content, EUC-KR remains essential for reading legacy Korean text, older Korean websites, and data exported from Korean government and enterprise systems.
Korean Writing System
Modern Korean primarily uses:
- Hangul (한글): The Korean alphabet, organized into syllabic blocks. Each syllable block combines 2–3 letters (initial consonant, vowel, optional final consonant). There are 11,172 possible Hangul syllable blocks.
- Hanja (한자): Chinese-origin characters used in formal writing, names, and classical texts.
- Latin alphabet and digits: For technical content, brand names, and international terms.
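The figure of 11,172 follows from the structure of a syllable block. The sketch below works it out and splits a syllable into its three positions. The per-position counts (19 initials, 21 vowels, 28 final slots including "no final") and the base code point U+AC00 come from the Unicode Hangul syllable layout, not from the text above, and the helper names `syllable_count` and `decompose` are made up for illustration.

```python
# Counts of modern Hangul letters per position (taken from the Unicode
# Hangul syllable layout; an assumption relative to the text above).
INITIALS = 19   # leading consonants
VOWELS = 21     # medial vowels
FINALS = 28     # 27 final consonants plus "no final"

SYLLABLE_BASE = 0xAC00  # code point of '가', the first precomposed syllable


def syllable_count() -> int:
    """Number of possible modern Hangul syllable blocks."""
    return INITIALS * VOWELS * FINALS


def decompose(syllable: str) -> tuple[int, int, int]:
    """Return (initial, vowel, final) indices for one precomposed syllable."""
    offset = ord(syllable) - SYLLABLE_BASE
    if not 0 <= offset < syllable_count():
        raise ValueError(f"{syllable!r} is not a precomposed Hangul syllable")
    initial, rest = divmod(offset, VOWELS * FINALS)
    vowel, final = divmod(rest, FINALS)
    return initial, vowel, final


print(syllable_count())   # 11172
print(decompose('한'))    # (18, 0, 4)
```

A final index of 0 means the block has no final consonant, which is why the third factor is 28 rather than 27.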
How EUC-KR Works
Of the 11,172 possible Hangul syllable blocks, EUC-KR encodes only the 2,350 that KS X 1001 defines as frequently used (the "2,350 frequently used Hangul" subset). This is a critical limitation: EUC-KR cannot represent the remaining syllable blocks. The byte layout is:
- 0x00–0x7F: ASCII (single bytes)
- 0xA1–0xFE, 0xA1–0xFE: Two-byte sequences for KS X 1001 characters (Hangul, Hanja, symbols)
In a two-byte sequence, the first byte selects the row and the second byte selects the column of the 94×94 KS X 1001 table. Both bytes lie in the range 0xA1–0xFE, the same convention that EUC-CN and EUC-JP follow.
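The byte ranges above are enough to split an EUC-KR byte string into units without a lookup table. The following is a minimal sketch, not a full decoder: it checks only the structural ranges, not whether a given pair is actually assigned in KS X 1001. The function names `split_euc_kr` and `row_col` are hypothetical.

```python
def split_euc_kr(data: bytes) -> list[bytes]:
    """Split EUC-KR bytes into 1-byte ASCII and 2-byte KS X 1001 units."""
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0x7F:
            # ASCII: one byte
            units.append(data[i:i + 1])
            i += 1
        elif (0xA1 <= b <= 0xFE and i + 1 < len(data)
              and 0xA1 <= data[i + 1] <= 0xFE):
            # KS X 1001: lead byte and trail byte both in 0xA1-0xFE
            units.append(data[i:i + 2])
            i += 2
        else:
            raise ValueError(f"invalid EUC-KR byte 0x{b:02X} at offset {i}")
    return units


def row_col(pair: bytes) -> tuple[int, int]:
    """Map a two-byte unit to its 1-based (row, column) in the 94x94 table."""
    return pair[0] - 0xA0, pair[1] - 0xA0


units = split_euc_kr(b'A\xc7\xcf')   # 'A' followed by the EUC-KR bytes for '하'
print(units)                # [b'A', b'\xc7\xcf']
print(row_col(units[1]))    # (39, 47)
```

Subtracting 0xA0 turns the byte range 0xA1–0xFE into row and column numbers 1–94.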
EUC-KR vs. CP949 (MS949)
Microsoft extended EUC-KR with CP949 (also called MS949 or UHC — Unified Hangul Code) to cover all 11,172 Hangul syllabic blocks. CP949 adds the remaining 8,822 Hangul syllables that EUC-KR cannot represent.
The difference matters: a Korean name written in less common syllables will fail to encode in strict EUC-KR but will work in CP949. Windows Korean systems default to CP949, while Unix/Linux Korean systems have historically used EUC-KR.
# Python encoding example
text = '안녕하세요'  # "Hello" in Korean

# EUC-KR encoding
encoded = text.encode('euc_kr')
print(encoded)       # b'\xbe\xc8\xb3\xe7\xc7\xcf\xbc\xbc\xbf\xe4'
print(len(encoded))  # 10 bytes for 5 syllables

# CP949 is a superset: same bytes for syllables in the KS X 1001 set
assert text.encode('cp949') == encoded

# A less common syllable that a strict EUC-KR encoder rejects.
# The exception is not guaranteed: some codec implementations fall back to
# an 8-byte KS X 1001 make-up sequence instead of raising.
try:
    '믜'.encode('euc_kr')    # uncommon syllable
except UnicodeEncodeError:
    pass                     # strict EUC-KR cannot represent it
'믜'.encode('cp949')         # CP949 can handle it
'믜'.encode('utf-8')         # UTF-8 definitely can

# Round-trip check
original = '한국어'
assert original == original.encode('euc_kr').decode('euc_kr')
The 2,350 Hangul Limitation
The KS X 1001:1987 standard chose 2,350 Hangul syllables based on frequency analysis of Korean text. This seemed sufficient at the time but proved inadequate for proper names, dialect words, and historical texts. The limitation caused real problems:
- Personal names with uncommon syllables could not be stored in EUC-KR databases.
- Government ID systems had to use alternative representations or Latin transliterations.
- Korean addresses with rare characters were corrupted in EUC-KR systems.
The 1992 revision of the standard added the Johab encoding in an annex, which covers all 11,172 syllables (the complete modern Hangul syllabary), but the two-byte code table that EUC-KR uses still contains only the 2,350 syllables of the 1987 standard. CP949 addresses this through its extension scheme.
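One way to find out whether a string stays within the 2,350-syllable set is to look at its CP949 bytes. The sketch below assumes the usual CP949 layout, which is not spelled out above: KS X 1001 characters keep their EUC-KR bytes (both in 0xA1–0xFE), while the added syllables use pairs in which at least one byte is below 0xA1. The function name `outside_ks_x_1001` is hypothetical, and '똠' is used as an example of a syllable outside the set.

```python
def outside_ks_x_1001(text: str) -> list[str]:
    """Return the characters whose CP949 encoding lies in the extension area."""
    extended = []
    for ch in text:
        # Raises UnicodeEncodeError if the character is not in CP949 at all.
        raw = ch.encode('cp949')
        # Extension pairs have a lead or trail byte below 0xA1.
        if len(raw) == 2 and (raw[0] < 0xA1 or raw[1] < 0xA1):
            extended.append(ch)
    return extended


print(outside_ks_x_1001('안녕하세요'))   # []
print(outside_ks_x_1001('똠방각하'))     # ['똠']
```

Running a check like this over a name column before choosing an encoding shows whether strict EUC-KR would be enough for the data at hand.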
Quick Facts
| Property | Value |
|---|---|
| Full Name | Extended Unix Code for Korean |
| Based on | KS X 1001 (formerly KS C 5601) |
| Bytes per character | 1 (ASCII) or 2 (Korean) |
| Hangul coverage | 2,350 syllables (KS X 1001:1987) |
| Extension | CP949 / MS949 (all 11,172 syllables) |
| Windows code page | CP949 |
| IANA charset | EUC-KR |
| Platform | Unix/Linux (historically) |
Common Pitfalls
Using EUC-KR when CP949 is needed. Korean text that may contain personal names, rare vocabulary, or any syllable outside the 2,350-syllable set needs CP949 (or UTF-8) rather than strict EUC-KR. In Python, use 'cp949' or 'ms949' rather than 'euc_kr' for files from Korean Windows systems.
Korean government and enterprise data. Older Korean database exports (especially from Oracle and legacy mainframe systems) are often in EUC-KR. Modern systems increasingly use UTF-8, but a migration has to account for the 2,350-syllable limitation: decode the source as CP949 rather than strict EUC-KR, so that any syllables stored with CP949 extension bytes survive the conversion.
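A conversion step for such exports might look like the sketch below. It assumes the source bytes are EUC-KR or CP949 and relies on CP949 being a superset, so one decoder handles both. The function name `legacy_to_utf8` is hypothetical, and real exports may need additional handling for corrupted bytes.

```python
def legacy_to_utf8(raw: bytes) -> bytes:
    """Convert EUC-KR or CP949 bytes to UTF-8.

    Decoding is strict, so invalid input raises UnicodeDecodeError instead of
    silently producing replacement characters.
    """
    text = raw.decode('cp949')
    return text.encode('utf-8')


sample = '한국어'.encode('euc_kr')            # stand-in for bytes read from an export
print(legacy_to_utf8(sample).decode('utf-8'))  # 한국어
```

Strict decoding is a deliberate choice here: a failed conversion is easier to notice and fix than text that was quietly damaged during migration.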
The ISO 2022-KR encoding. Some Korean email systems use ISO 2022-KR, which is an escape-sequence based encoding (7-bit safe) rather than EUC-KR's 8-bit encoding. They encode the same characters but with different byte representations. Don't confuse them.
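The difference is easy to see by encoding the same string both ways. This sketch uses Python's built-in `euc_kr` and `iso2022_kr` codecs and checks only properties described above (same characters, different bytes, 7-bit versus 8-bit), without relying on specific byte values.

```python
text = '한국어'

euc = text.encode('euc_kr')
iso = text.encode('iso2022_kr')

print(euc)
print(iso)

# Same characters, different byte representations
assert euc != iso
assert iso.decode('iso2022_kr') == euc.decode('euc_kr') == text

# ISO-2022-KR stays within 7 bits; EUC-KR sets the high bit on Korean bytes
assert all(b < 0x80 for b in iso)
assert all(b >= 0x80 for b in euc)
```

Decoding bytes with the wrong one of the two codecs will not recover the text, so the charset label on a message has to be honored.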