
EUC-KR

A Korean character encoding based on KS X 1001 that maps Hangul syllables and Hanja to double-byte sequences.

What is EUC-KR?

EUC-KR (Extended Unix Code for Korean) is a character encoding for the Korean language based on the KS X 1001 standard (formerly KS C 5601). It uses a double-byte encoding for Hangul syllables and Hanja (Chinese characters used in Korean) while keeping ASCII characters as single bytes.

EUC-KR was the dominant Korean encoding on Unix systems and the internet from the 1990s through the early 2000s. While UTF-8 has largely replaced it for new content, EUC-KR remains essential for reading legacy Korean text, older Korean websites, and data exported from Korean government and enterprise systems.

Korean Writing System

Modern Korean primarily uses:

  • Hangul (한글): The Korean alphabet, organized into syllabic blocks. Each block combines an initial consonant, a vowel, and an optional final consonant or consonant cluster. There are 11,172 possible modern Hangul syllable blocks (19 initials × 21 vowels × 28 final positions, counting "no final" as one).
  • Hanja (한자): Chinese-origin characters used in formal writing, names, and classical texts.
  • Latin alphabet and digits: For technical content, brand names, and international terms.
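The 11,172 figure can be checked with a little arithmetic. The sketch below is illustrative only: it relies on the Unicode Hangul Syllables block, which starts at U+AC00 and orders syllables by initial, vowel, and final, and the helper names `decompose` and `compose` are invented for this example.

```python
HANGUL_BASE = 0xAC00                    # first Hangul syllable in Unicode: '가'
INITIALS, VOWELS, FINALS = 19, 21, 28   # FINALS counts "no final consonant" as index 0
TOTAL = INITIALS * VOWELS * FINALS


def decompose(syllable):
    """Return (initial, vowel, final) indices for one precomposed syllable."""
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < TOTAL:
        raise ValueError(f"{syllable!r} is not a precomposed Hangul syllable")
    initial, rest = divmod(index, VOWELS * FINALS)
    vowel, final = divmod(rest, FINALS)
    return initial, vowel, final


def compose(initial, vowel, final=0):
    """Inverse of decompose: build the syllable from its three indices."""
    return chr(HANGUL_BASE + (initial * VOWELS + vowel) * FINALS + final)


print(TOTAL)               # 11172
print(decompose("한"))      # (18, 0, 4)
print(compose(18, 0, 4))   # 한
```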

How EUC-KR Works

EUC-KR encodes the KS X 1001 repertoire, which contains only 2,350 precomposed Hangul syllables (the "2,350 frequently used Hangul" subset) alongside Hanja and symbols. This is a critical limitation: EUC-KR has no two-byte code for most of the 11,172 possible Hangul syllabic blocks.

  • 0x00–0x7F: ASCII (single bytes)
  • 0xA1–0xFE followed by 0xA1–0xFE: Two-byte sequences for KS X 1001 characters (Hangul, Hanja, symbols)

In a two-byte sequence, the first byte selects the row and the second byte selects the column (cell) of the 94 × 94 KS X 1001 grid. Both bytes lie in the range 0xA1–0xFE, the same convention that EUC-CN and EUC-JP use for their main two-byte sets.
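The byte ranges above are enough to walk an EUC-KR byte stream by hand. The following is a minimal sketch, not a replacement for a real codec; `split_euc_kr` and `row_and_cell` are hypothetical helper names, and the row/cell numbers are 1-based positions in the 94 × 94 grid.

```python
def split_euc_kr(data):
    """Split EUC-KR bytes into one chunk per character (1 or 2 bytes each)."""
    chunks = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead <= 0x7F:
            # ASCII: a single byte
            chunks.append(data[i:i + 1])
            i += 1
        elif (0xA1 <= lead <= 0xFE and i + 1 < len(data)
              and 0xA1 <= data[i + 1] <= 0xFE):
            # KS X 1001: lead byte and trail byte both in 0xA1-0xFE
            chunks.append(data[i:i + 2])
            i += 2
        else:
            raise ValueError(f"byte 0x{lead:02X} at offset {i} is not valid EUC-KR")
    return chunks


def row_and_cell(pair):
    """Map a two-byte sequence to its 1-based KS X 1001 row and cell."""
    return pair[0] - 0xA0, pair[1] - 0xA0


data = "A가".encode("euc_kr")
print(split_euc_kr(data))          # [b'A', b'\xb0\xa1']
print(row_and_cell(b"\xb0\xa1"))   # (16, 1)
```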

EUC-KR vs. CP949 (MS949)

Microsoft extended EUC-KR into CP949, also called MS949 or UHC (Unified Hangul Code), to cover all 11,172 Hangul syllabic blocks. CP949 adds the 8,822 syllables that EUC-KR lacks (11,172 − 2,350) by assigning them to byte pairs that EUC-KR leaves unused.

The difference matters: a Korean name written in less common syllables will fail to encode in strict EUC-KR but will work in CP949. Windows Korean systems default to CP949, while Unix/Linux Korean systems have historically used EUC-KR.

# Python encoding example
text = '안녕하세요'  # "Hello" in Korean

# EUC-KR encoding
encoded = text.encode('euc_kr')
print(encoded)        # b'\xbe\xc8\xb3\xe7\xc7\xcf\xbc\xbc\xbf\xe4'
print(len(encoded))   # 10 bytes for 5 syllables

# CP949 superset — handles all Hangul syllables
text.encode('cp949')  # Same bytes for common syllables

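# A less common syllable that might fail in strict EUC-KR
rare = '믜'
try:
    # A strict encoder raises here. Some codecs (CPython's may be one) instead
    # emit a longer KS X 1001 jamo sequence that other decoders may not accept.
    print(rare.encode('euc_kr'))
except UnicodeEncodeError:
    print('no EUC-KR code for', rare)
print(rare.encode('cp949'))   # CP949 can handle it
print(rare.encode('utf-8'))   # UTF-8 definitely can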

# Round-trip check
original = '한국어'
assert original == original.encode('euc_kr').decode('euc_kr')

The 2,350 Hangul Limitation

The KS X 1001:1987 standard chose 2,350 Hangul syllables based on frequency analysis of Korean text. This seemed sufficient at the time but proved inadequate for proper names, dialect words, and historical texts. The limitation caused real problems:

  • Personal names with uncommon syllables could not be stored in EUC-KR databases.
  • Government ID systems had to use alternative representations or Latin transliterations.
  • Korean addresses with rare characters were corrupted in EUC-KR systems.

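The 1992 revision of the standard (then still KS C 5601) described an alternative encoding, Johab, that can represent all 11,172 modern syllables, but EUC-KR itself still carries only the 2,350 precomposed syllables chosen in 1987. CP949 addresses this through its extension scheme, which places the missing syllables in byte ranges that EUC-KR does not use.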
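One way to find the syllables that fall outside the 2,350 set is to encode through CP949 and inspect the bytes, since KS X 1001 characters keep both bytes in 0xA1–0xFE while CP949's extension syllables do not. The sketch below assumes Python's `cp949` codec; the function names are hypothetical. It avoids relying on the `euc_kr` codec raising an error, because codec behaviour for out-of-set syllables can differ between implementations.

```python
def has_ksx1001_code(ch):
    """True if ch is ASCII or has a two-byte code with both bytes in 0xA1-0xFE."""
    try:
        encoded = ch.encode("cp949")
    except UnicodeEncodeError:
        return False               # not representable in CP949 at all
    if len(encoded) == 1:
        return True                # plain ASCII
    return all(0xA1 <= b <= 0xFE for b in encoded)


def syllables_outside_euc_kr(text):
    """List the Hangul syllables in text that lack a two-byte EUC-KR code."""
    return [ch for ch in text
            if "\uac00" <= ch <= "\ud7a3" and not has_ksx1001_code(ch)]


covered = sum(has_ksx1001_code(chr(cp)) for cp in range(0xAC00, 0xD7A4))
print(covered, 11172 - covered)               # 2350 8822
print(syllables_outside_euc_kr("똠방각하"))    # ['똠']
```

Running a check like this over names or addresses before writing to an EUC-KR column shows which records would need CP949 or UTF-8 storage.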

Quick Facts

Property             Value
Full name            Extended Unix Code for Korean
Based on             KS X 1001 (formerly KS C 5601)
Bytes per character  1 (ASCII) or 2 (Korean)
Hangul coverage      2,350 syllables (KS X 1001:1987)
Extension            CP949 / MS949 (all 11,172 syllables)
Windows code page    CP949
IANA charset         EUC-KR
Platform             Unix/Linux (historically)

Common Pitfalls

Using EUC-KR when CP949 is needed. Any Korean text involving personal names, rare vocabulary, or complete Hangul coverage requires CP949. In Python, use 'cp949' or 'ms949' rather than 'euc_kr' for files from Korean Windows systems.
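As a small illustration of that advice, the sketch below decodes bytes with `cp949` and shows that `ms949` resolves to the same Python codec. The sample bytes are generated in place as a stand-in for a file read in binary mode, and `decode_windows_korean` is a hypothetical helper name.

```python
import codecs


def decode_windows_korean(data):
    """Decode bytes from a Korean Windows system; CP949 also reads plain EUC-KR."""
    return data.decode("cp949")


# 'ms949' is an alias that resolves to the cp949 codec.
print(codecs.lookup("ms949").name)            # cp949

windows_bytes = "똠방각하".encode("cp949")      # stand-in for open(path, 'rb').read()
print(decode_windows_korean(windows_bytes))   # 똠방각하

# The same bytes handed to the stricter codec are expected to be rejected.
try:
    windows_bytes.decode("euc_kr")
except UnicodeDecodeError as exc:
    print("euc_kr rejected the data:", exc.reason)
```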

Korean government and enterprise data. Older Korean database exports (especially from Oracle and legacy mainframe systems) are often labelled EUC-KR. Modern systems increasingly use UTF-8, but a migration has to account for the 2,350-syllable limitation: data labelled EUC-KR may contain CP949-only syllables, so decoding it as CP949 (a superset) before re-encoding to UTF-8 avoids losing them.
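A migration step along these lines might look like the sketch below. It assumes the export is a text file that decodes cleanly as CP949; `transcode_to_utf8` and the file names are invented for the example, and the demo writes a throwaway file rather than touching real data.

```python
import os
import tempfile


def transcode_to_utf8(src_path, dst_path, src_encoding="cp949"):
    """Rewrite a legacy Korean text file as UTF-8, line by line."""
    # newline="" keeps the original line endings untouched on both sides.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Demo with a temporary file standing in for a real export.
sample = "이름,똠방각하\r\n"
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "export.csv")
    dst = os.path.join(tmp, "export.utf8.csv")
    with open(src, "wb") as f:
        f.write(sample.encode("cp949"))
    transcode_to_utf8(src, dst)
    with open(dst, "rb") as f:
        result = f.read()

print(result.decode("utf-8"))
```

Decoding is strict by default, so a file that is not actually CP949 raises `UnicodeDecodeError` instead of silently producing corrupted text.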

The ISO-2022-KR encoding. Some Korean email systems use ISO-2022-KR, an escape-sequence-based, 7-bit-safe encoding, rather than 8-bit EUC-KR. Both encode the same KS X 1001 characters but with different byte representations, so a decoder for one cannot read the other. Don't confuse them.
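The difference is easy to see by encoding the same string both ways. This sketch uses Python's built-in `euc_kr` and `iso2022_kr` codecs; the exact escape and shift bytes printed depend on the codec, so no output is listed here.

```python
text = "한국어"

euc = text.encode("euc_kr")       # 8-bit: Korean bytes have the high bit set
iso = text.encode("iso2022_kr")   # 7-bit: escape and shift codes switch character sets

print(euc)
print(iso)

# Same characters, different bytes.
print(any(b >= 0x80 for b in euc))   # True: EUC-KR needs an 8-bit channel
print(all(b < 0x80 for b in iso))    # True: ISO-2022-KR stays within 7 bits
print(iso.decode("iso2022_kr") == euc.decode("euc_kr"))
```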
