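What is Character Encoding?

A system that maps characters to byte sequences for digital storage and transmission. Every text file has an encoding; what matters is whether it is declared correctly.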

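What is CJK (Chinese, Japanese, Korean)?

Chinese, Japanese, Korean: a collective term in Unicode covering the unified ideograph blocks and related scripts. CJK Unified Ideographs comprise more than 20,992 characters.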

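What is UTF-8?

A variable-length Unicode encoding that uses 1 to 4 bytes per character. It is the dominant encoding on the web (over 98% of websites) and is fully backward compatible with ASCII.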

Encoding

Shift JIS

A Japanese character encoding that combines single-byte ASCII/JIS Roman with double-byte JIS X 0208 kanji. It is still used in legacy Japanese systems.

2021-04-05 · Updated 2024-03-19

What is Shift-JIS?

Shift-JIS (Shift Japanese Industrial Standards, also written Shift_JIS or SJIS) is a character encoding for Japanese text developed in 1982 by ASCII Corporation and later adopted, with extensions, by Microsoft as Windows code page 932 (CP932). It uses a mixed single-byte and double-byte approach: ASCII-compatible characters occupy single bytes, while the full-width Japanese syllabic scripts (hiragana, katakana) and kanji occupy 2-byte sequences.

Shift-JIS remains significant because it is embedded in millions of legacy files, web pages (particularly Japanese websites from the 1990s and 2000s), game ROMs, and software applications. It is also the encoding of choice for some embedded systems and industrial equipment in Japan.

The Three Scripts of Japanese

Japanese writing uses three interleaved scripts:

Hiragana (ひらがな): 46 syllabic characters for native Japanese words and grammatical particles
Katakana (カタカナ): 46 syllabic characters primarily for foreign loan words
Kanji (漢字): Thousands of Chinese-origin logographic characters for content words

Shift-JIS encodes all three, plus:

- Half-width katakana (single-byte, 0xA1–0xDF): a compact form used on older terminals
- Full-width ASCII and punctuation
- Special symbols, box-drawing characters, and more

How Shift-JIS Encoding Works

The "shift" in the name describes how byte ranges are partitioned:

0x00–0x7F: ASCII / JIS X 0201 Roman (same as US-ASCII, except that the JIS standard assigns 0x5C to the yen sign and 0x7E to the overline)
0xA1–0xDF: Half-width katakana (single bytes)
0x81–0x9F and 0xE0–0xFC: First bytes of 2-byte kanji/hiragana/katakana sequences
Second byte: 0x40–0x7E and 0x80–0xFC (note: 0x40–0x7E overlaps the ASCII printable range!)

The shift algorithm converts JIS X 0208 row-column pairs to Shift-JIS byte pairs using a mathematical transformation involving modulo arithmetic — this is what distinguishes it from simpler 2-byte encodings and what makes it harder to parse.
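The transformation can be sketched in a few lines. This is a minimal illustration of the commonly documented row-cell ("kuten") to Shift-JIS arithmetic, not code from any particular library; the function name `kuten_to_sjis` is hypothetical, and only the JIS X 0208 range (rows and cells 1 to 94) is handled.

```python
def kuten_to_sjis(row: int, cell: int) -> bytes:
    """Convert a JIS X 0208 row/cell pair (each 1..94) to Shift-JIS bytes.

    Hypothetical helper for illustration; vendor extensions are not handled.
    """
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must both be in 1..94")

    # JIS X 0208 bytes are the row/cell numbers offset by 0x20 (0x21..0x7E).
    j1, j2 = row + 0x20, cell + 0x20

    # Two JIS rows share one Shift-JIS lead byte. Lead bytes skip over
    # 0xA0..0xDF, which is reserved for half-width katakana.
    s1 = (j1 + 1) // 2 + (0x70 if j1 <= 0x5E else 0xB0)

    if j1 % 2 == 1:
        # Odd rows use the lower trail range 0x40..0x9E, skipping 0x7F.
        s2 = j2 + 0x1F
        if s2 >= 0x7F:
            s2 += 1
    else:
        # Even rows use the upper trail range 0x9F..0xFC.
        s2 = j2 + 0x7E

    return bytes([s1, s2])


# Hiragana あ sits at row 4, cell 2; kanji 表 sits at row 41, cell 29.
print(kuten_to_sjis(4, 2).hex())    # 82a0
print(kuten_to_sjis(41, 29).hex())  # 955c
```

Packing two rows into each lead byte is why the trail byte has to span both 0x40–0x7E and 0x80–0xFC, and therefore why trail bytes collide with printable ASCII.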

The Security Hazard: Second Bytes in ASCII Range

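The most dangerous feature of Shift-JIS is that the second byte of a two-byte sequence can take values 0x40–0x7E, which overlaps with printable ASCII characters including @, A–Z, a–z, and many others. The backslash \ (0x5C) is particularly hazardous. The single quote ' (0x27) lies below 0x40 and can never be a second byte itself, but a second-byte 0x5C that is misread as an escape character can change how an adjacent quote is interpreted: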

# 0x95 0x5C is a valid Shift-JIS two-byte character: the kanji 表 (U+8868)
b'\x95\x5c'.decode('shift_jis')  # '表'

# Naive byte-level search for backslash in SJIS text is DANGEROUS
sjis_text = b'\x95\x5c something'
# The 0x5C is NOT a backslash here; it is the second byte of a kanji
sjis_text.find(b'\\')  # Returns 1: a false positive inside a Japanese character

This property caused numerous security vulnerabilities in early Japanese web applications, particularly SQL injection and path traversal attacks, because server-side security filters scanned for \ and ' without understanding the multi-byte structure.
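One way to avoid the false positive is to walk the bytes with the lead-byte ranges listed earlier and skip each two-byte character as a unit. The sketch below is an illustration only; `is_lead_byte` and `find_ascii_byte` are hypothetical names, and the input is assumed to be well-formed Shift-JIS/CP932.

```python
def is_lead_byte(b: int) -> bool:
    """True if b starts a two-byte Shift-JIS/CP932 sequence."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def find_ascii_byte(data: bytes, target: int) -> int:
    """Return the index of the first *real* single-byte `target`, or -1.

    Trail bytes of two-byte characters are skipped, so a 0x5C inside
    a kanji is not reported as a backslash.
    """
    i = 0
    while i < len(data):
        b = data[i]
        if is_lead_byte(b) and i + 1 < len(data):
            i += 2  # skip lead byte and its trail byte together
            continue
        if b == target:
            return i
        i += 1
    return -1


kanji_only = b'\x95\x5c something'   # 表 followed by ASCII text
with_slash = b'\x95\x5c\\path'       # 表 followed by a real backslash

print(kanji_only.find(b'\\'))               # 1  (naive, false positive)
print(find_ascii_byte(kanji_only, 0x5C))    # -1 (no real backslash)
print(find_ascii_byte(with_slash, 0x5C))    # 2  (the real backslash)
```

In most applications the simpler route is to decode the bytes to a Unicode string first and do all searching and escaping on the decoded text.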

Code Examples

# Python: Shift-JIS encoding
text = 'こんにちは'  # "Hello" (hiragana)

encoded = text.encode('shift_jis')
print(encoded)         # b'\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd'
print(len(encoded))    # 10 bytes for 5 characters

# Windows variant (CP932) has additional characters
'①'.encode('cp932')    # Circled digit (bytes 0x87 0x40): in CP932 but not standard SJIS
# '①'.encode('shift_jis')  # Raises UnicodeEncodeError

# Mixed content
mixed = 'Hello こんにちは'
mixed.encode('shift_jis')  # ASCII stays single-byte, Japanese becomes 2-byte

# Detecting Shift-JIS vs UTF-8 vs EUC-JP (chardet is a third-party package: pip install chardet)
import chardet
with open('japanese_file.txt', 'rb') as f:
    raw = f.read()
detected = chardet.detect(raw)
print(detected['encoding'])  # 'SHIFT_JIS', 'EUC-JP', 'UTF-8', etc.

Quick Facts

Property	Value
Full Name	Shift Japanese Industrial Standards
Also Known As	Shift_JIS, SJIS, CP932
Developed	1982 (ASCII Corporation)
Bytes per character	1 (ASCII/half-width katakana) or 2 (kanji/hiragana/full-width katakana)
Windows code page	CP932 (Microsoft extension)
IANA name	Shift_JIS
Coverage	6,355 kanji (JIS X 0208) + hiragana + katakana + symbols
Security hazard	ASCII-range second bytes (backslash, quote)

Common Pitfalls

Using shift_jis vs cp932. Python's shift_jis codec implements the JIS standard strictly. cp932 (or ms932) is Microsoft's extension with additional characters (like circled numbers, special symbols). Japanese Windows files are almost always CP932, not strict Shift-JIS. Use cp932 unless you specifically need standard conformance.
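The difference between the two codecs is easy to probe directly. The following sketch uses only the standard library; `encodable` is a hypothetical helper name, and the sample characters are chosen for illustration.

```python
def encodable(ch: str, codec: str) -> bool:
    """Return True if `ch` can be encoded with the named codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# 'あ' is in JIS X 0208; '①' and '㈱' are vendor extension characters.
for ch in ['あ', '①', '㈱']:
    print(ch, 'shift_jis:', encodable(ch, 'shift_jis'),
          'cp932:', encodable(ch, 'cp932'))
```

A file that decodes with cp932 but fails with shift_jis very likely originated on Japanese Windows.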

Detecting encoding automatically. Many byte sequences are valid in both Shift-JIS and EUC-JP, so automatic encoding detection (chardet, cchardet) can only give probabilistic results. For Japanese legacy content, always ask the data source which encoding was used.
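When the source cannot be asked, a strict trial decode at least narrows the field by ruling out encodings in which the bytes are invalid. This is a rough sketch using only the standard library; the function name `candidate_encodings` and the default candidate list are assumptions made for illustration.

```python
def candidate_encodings(raw: bytes,
                        candidates=('utf-8', 'cp932', 'euc_jp')) -> list:
    """Return the candidate encodings under which `raw` decodes without error.

    More than one result means the bytes alone cannot settle the question.
    """
    valid = []
    for name in candidates:
        try:
            raw.decode(name)  # strict mode: any invalid sequence raises
        except UnicodeDecodeError:
            continue
        valid.append(name)
    return valid


sjis_sample = 'こんにちは'.encode('cp932')
print(candidate_encodings(sjis_sample))   # 'utf-8' is ruled out
print(candidate_encodings(b'hello'))      # pure ASCII is valid in all three
```

A successful decode only shows that the bytes are structurally valid, not that the resulting text is what the author wrote, so the result should be treated as a shortlist rather than an answer.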

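The mojibake trap. Shift-JIS text read as UTF-8 usually raises a UnicodeDecodeError in Python, because Shift-JIS byte pairs rarely form valid UTF-8 sequences. Read as Latin-1, it decodes without error but produces mojibake (文字化け, "garbled characters"). The characteristic pattern for Shift-JIS-as-Latin-1 is a run of characters in the 0x80–0xA0 range, since lead bytes 0x81–0x9F land on Latin-1's C1 control codes.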
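Both failure modes, and the repair for the Latin-1 case, can be reproduced in a few lines. This is a small demonstration using only the standard library; the variable names are illustrative.

```python
original = 'こんにちは'
raw = original.encode('cp932')

# Reading as UTF-8 fails outright for this sample.
try:
    raw.decode('utf-8')
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Reading as Latin-1 never fails: every byte becomes one character.
garbled = raw.decode('latin-1')

# Latin-1 is lossless, so the original bytes can be recovered
# and decoded again with the correct codec.
repaired = garbled.encode('latin-1').decode('cp932')

print(utf8_failed)           # True
print(len(garbled))          # 10 (one character per byte)
print(repaired == original)  # True
```

The round trip works only because Latin-1 maps all 256 byte values one-to-one; text that was mis-decoded with a lossy setting such as errors='replace' cannot be repaired this way.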