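What is character encoding?

A system that maps characters to byte sequences for digital storage and transmission. Every text file has an encoding, and what matters is whether it is declared correctly.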

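What is CJK (Han characters, Kana, Hangul)?

Chinese, Japanese, and Korean: a collective term for the unified Han ideograph blocks and related scripts in Unicode. CJK Unified Ideographs comprise more than 20,992 characters.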

UTF-8: a variable-length Unicode encoding that uses 1 to 4 bytes per character. It is the dominant encoding on the Web (over 98%) and is fully backward compatible with ASCII.

Encoding

EUC-KR

A Korean character encoding based on KS X 1001 that maps Hangul syllables and Hanja to 2-byte sequences.

2021-04-14 · Updated 2024-05-27

What is EUC-KR?

EUC-KR (Extended Unix Code for Korean) is a character encoding for the Korean language based on the KS X 1001 standard (formerly KS C 5601). It uses a double-byte encoding for Hangul syllables and Hanja (Chinese characters used in Korean) while keeping ASCII characters as single bytes.

EUC-KR was the dominant Korean encoding on Unix systems and the internet from the 1990s through the early 2000s. While UTF-8 has largely replaced it for new content, EUC-KR remains essential for reading legacy Korean text, older Korean websites, and data exported from Korean government and enterprise systems.

Korean Writing System

Modern Korean primarily uses:

Hangul (한글): The Korean alphabet, organized into syllabic blocks. Each syllable block combines 2–3 letters (initial consonant, vowel, optional final consonant). There are 11,172 possible Hangul syllable blocks.
Hanja (한자): Chinese-origin characters used in formal writing, names, and classical texts.
Latin alphabet and digits: For technical content, brand names, and international terms.
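
The figure of 11,172 follows from how Unicode arranges precomposed Hangul syllables: 19 initial consonants × 21 vowels × 28 finals (27 final consonants plus "no final"). The sketch below illustrates that arithmetic; the function name `compose_syllable` and the zero-based index convention are assumptions made for this example, not part of any standard API.

```python
# Unicode Hangul Syllables block starts at U+AC00 ("가").
HANGUL_BASE = 0xAC00
NUM_INITIALS = 19   # leading consonants
NUM_VOWELS = 21     # medial vowels
NUM_FINALS = 28     # 27 final consonants + "no final"


def compose_syllable(initial: int, vowel: int, final: int = 0) -> str:
    """Build a precomposed Hangul syllable from zero-based jamo indices."""
    if not (0 <= initial < NUM_INITIALS
            and 0 <= vowel < NUM_VOWELS
            and 0 <= final < NUM_FINALS):
        raise ValueError("jamo index out of range")
    offset = (initial * NUM_VOWELS + vowel) * NUM_FINALS + final
    return chr(HANGUL_BASE + offset)


total_syllables = NUM_INITIALS * NUM_VOWELS * NUM_FINALS
print(total_syllables)              # 11172
print(compose_syllable(0, 0))       # 가
print(compose_syllable(18, 0, 4))   # 한 (ㅎ + ㅏ + ㄴ)
```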

How EUC-KR Works

EUC-KR encodes the 2,350 most common Hangul syllables defined in KS X 1001 (the "2,350 frequently used Hangul" subset). This is a critical limitation: EUC-KR cannot represent all 11,172 possible Hangul syllabic blocks.

0x00–0x7F: ASCII (single bytes)
0xA1–0xFE, 0xA1–0xFE: Two-byte sequences for KS X 1001 characters (Hangul, Hanja, symbols)

The 2-byte structure: first byte (row indicator) and second byte (column indicator) both in the range 0xA1–0xFE, following the same convention as EUC-CN and EUC-JP.
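
As a rough illustration of this layout, the following sketch recovers the 1-based row and column from a two-byte sequence and splits a byte string into single-byte and double-byte units. Both helper names are hypothetical, and the splitter assumes well-formed input (it does not validate trail bytes).

```python
def euc_kr_row_col(pair: bytes) -> tuple[int, int]:
    """Return the 1-based (row, column) of a two-byte EUC-KR sequence."""
    if len(pair) != 2 or not all(0xA1 <= b <= 0xFE for b in pair):
        raise ValueError("not a two-byte EUC-KR sequence")
    # Each byte is the row/column number plus 0xA0.
    return pair[0] - 0xA0, pair[1] - 0xA0


def split_euc_kr(data: bytes) -> list[bytes]:
    """Split EUC-KR bytes into 1-byte ASCII and 2-byte units (no validation)."""
    units = []
    i = 0
    while i < len(data):
        width = 1 if data[i] <= 0x7F else 2
        units.append(data[i:i + width])
        i += width
    return units


print(euc_kr_row_col('가'.encode('euc_kr')))   # (16, 1)
print(split_euc_kr('A가'.encode('euc_kr')))    # [b'A', b'\xb0\xa1']
```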

EUC-KR vs. CP949 (MS949)

Microsoft extended EUC-KR as CP949 (also called MS949 or UHC, short for Unified Hangul Code) to cover all 11,172 Hangul syllabic blocks. CP949 adds the 8,822 Hangul syllables that EUC-KR cannot represent.

The difference matters: a Korean name written in less common syllables will fail to encode in strict EUC-KR but will work in CP949. Windows Korean systems default to CP949, while Unix/Linux Korean systems have historically used EUC-KR.

# Python encoding example
text = '안녕하세요'  # "Hello" in Korean

# EUC-KR encoding
encoded = text.encode('euc_kr')
print(encoded)        # b'\xbe\xc8\xb3\xe7\xc7\xcf\xbc\xbc\xbf\xe4'
print(len(encoded))   # 10 bytes for 5 syllables

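# CP949 is a superset: common syllables produce the same bytes
assert text.encode('cp949') == encoded

# A less common syllable that strict EUC-KR may not cover.
# Codec behavior differs here: a strict encoder raises, while some
# implementations (CPython's euc_kr may be one) instead emit an
# 8-byte KS X 1001:1998 make-up sequence.
try:
    rare = '믜'.encode('euc_kr')
except UnicodeEncodeError:
    rare = '믜'.encode('cp949')   # CP949 can handle it
print(rare)
print('믜'.encode('utf-8'))       # UTF-8 definitely can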

# Round-trip check
original = '한국어'
assert original == original.encode('euc_kr').decode('euc_kr')

The 2,350 Hangul Limitation

The KS X 1001:1987 standard chose 2,350 Hangul syllables based on frequency analysis of Korean text. This seemed sufficient at the time but proved inadequate for proper names, dialect words, and historical texts. The limitation caused real problems:

Personal names with uncommon syllables could not be stored in EUC-KR databases.
Government ID systems had to use alternative representations or Latin transliterations.
Korean addresses with rare characters were corrupted in EUC-KR systems.

The 1992 revision of the standard added an annex defining Johab, an alternative encoding that can represent all 11,172 modern Hangul syllables, but the code table that EUC-KR uses still contains only the original 2,350 precomposed syllables. CP949 closes the gap by placing the remaining syllables in byte ranges that EUC-KR leaves unused.
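
The 2,350 versus 11,172 split can be checked empirically. The sketch below is one way to do it, assuming Python's built-in `cp949` codec: a syllable counts as part of the KS X 1001 table when both of its CP949 bytes fall in 0xA1–0xFE, since CP949's added syllables use bytes outside that square. It deliberately avoids relying on errors from an EUC-KR encoder, which is not guaranteed to reject such syllables. The helper name `in_ksx1001` is made up for this example.

```python
def in_ksx1001(syllable: str) -> bool:
    """True if the syllable's CP949 code has both bytes in 0xA1-0xFE."""
    raw = syllable.encode('cp949')
    return len(raw) == 2 and all(0xA1 <= b <= 0xFE for b in raw)


# All precomposed modern Hangul syllables: U+AC00 through U+D7A3.
all_syllables = [chr(cp) for cp in range(0xAC00, 0xD7A4)]
covered = sum(in_ksx1001(s) for s in all_syllables)

print(len(all_syllables))             # 11172
print(covered)                        # 2350
print(len(all_syllables) - covered)   # 8822
print(in_ksx1001('한'))               # True
print(in_ksx1001('똠'))               # False
```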

Quick Facts

Property	Value
Full Name	Extended Unix Code for Korean
Based on	KS X 1001 (formerly KS C 5601)
Bytes per character	1 (ASCII) or 2 (Korean)
Hangul coverage	2,350 syllables (KS X 1001:1987)
Extension	CP949 / MS949 (all 11,172 syllables)
Windows code page	CP949
IANA charset	EUC-KR
Platform	Unix/Linux (historically)

Common Pitfalls

Using EUC-KR when CP949 is needed. Korean text that contains personal names or rare vocabulary may include syllables outside the 2,350-syllable set, and such text needs CP949 (or UTF-8). In Python, use 'cp949' or its alias 'ms949' rather than 'euc_kr' for files from Korean Windows systems.

Korean government and enterprise data. Older Korean database exports (especially from Oracle and legacy mainframe systems) are often in EUC-KR. Modern systems increasingly use UTF-8. When migrating, decoding the legacy data as CP949 (a superset of EUC-KR) is the safer choice, because it also accepts syllables outside the 2,350-syllable set that strict EUC-KR would reject.
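
A migration of this kind might look like the following minimal sketch, which re-encodes a legacy text file as UTF-8. The function name `convert_to_utf8`, the default of `cp949`, and the file names are assumptions for illustration; a real export may need a different source encoding or error policy.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(src: Path, dst: Path, source_encoding: str = 'cp949') -> None:
    """Re-encode a legacy Korean text file as UTF-8.

    Decoding is strict, so bytes that are invalid in the source encoding
    raise UnicodeDecodeError instead of being silently replaced.
    """
    # newline='' keeps the original line endings untouched.
    with src.open('r', encoding=source_encoding, newline='') as fin, \
         dst.open('w', encoding='utf-8', newline='') as fout:
        for line in fin:
            fout.write(line)


# Usage with a throwaway file (hypothetical names and content).
with tempfile.TemporaryDirectory() as tmp:
    legacy = Path(tmp) / 'export.txt'
    legacy.write_bytes('이름: 김똠\n'.encode('cp949'))
    converted = Path(tmp) / 'export.utf8.txt'
    convert_to_utf8(legacy, converted)
    print(converted.read_text(encoding='utf-8'))
```

Strict decoding is a deliberate choice here: a failed conversion is usually easier to investigate than text that was silently altered.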

The ISO-2022-KR encoding. Some Korean email systems use ISO-2022-KR, an escape-sequence based, 7-bit-safe encoding, in contrast to EUC-KR's 8-bit encoding. Both encode the same characters but with different byte representations, so they should not be confused.
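
To make the difference concrete, the sketch below encodes the same string with Python's `euc_kr` and `iso2022_kr` codecs and compares the results. It only checks properties of the output (8-bit versus 7-bit bytes, presence of the designator sequence) and is an illustration, not a full description of either format.

```python
text = '한국어'

euc = text.encode('euc_kr')
iso = text.encode('iso2022_kr')

print(euc.hex(' '))
print(iso.hex(' '))

# EUC-KR sets the high bit on both bytes of every Korean character.
print(all(b >= 0x80 for b in euc))    # True

# ISO-2022-KR stays within 7 bits and announces KS X 1001 with the
# designator ESC $ ) C, then switches in and out with control bytes.
print(all(b < 0x80 for b in iso))     # True
print(b'\x1b$)C' in iso)              # True

# Both decode back to the same text.
print(iso.decode('iso2022_kr') == euc.decode('euc_kr'))   # True
```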

Other Terms in Encoding

ASCII

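American Standard Code for Information Interchange. A 7-bit encoding covering 128 characters (0–127), including control characters, digits, Latin letters, and basic symbols.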

ASCII Art

Visual art created from text characters, originally limited to the 95 printable …

Base64

Binary-to-text encoding that represents binary data using 64 ASCII characters (A–Z, a–z, …

Big5

A Traditional Chinese character encoding used mainly in Taiwan and Hong Kong, encoding about 13,000 CJK characters.

EBCDIC

Extended Binary Coded Decimal Interchange Code. An IBM mainframe encoding with non-contiguous letter ranges, still used on financial and enterprise mainframes.

GB2312 / GB18030

A family of Simplified Chinese character encodings: GB2312 (6,763 characters) evolved through GBK into GB18030, China's national standard, which is compatible with Unicode.

IANA Charset

The official registry of character encoding names maintained by IANA, used in HTTP Content-Type headers and MIME (for example, charset=utf-8).

ISO 8859

A family of 8-bit single-byte encodings for different language groups. ISO 8859-1 (Latin-1) became the basis for the first 256 Unicode code points.

Shift JIS

A Japanese character encoding that combines single-byte ASCII/JIS Roman with double-byte JIS X 0208 kanji. It is still used in legacy Japanese systems.

UCS-2

An obsolete fixed-width 2-byte encoding that covers only the BMP (U+0000–U+FFFF). It is the predecessor of UTF-16 and cannot represent supplementary characters.
