Shift JIS
A Japanese character encoding that combines single-byte ASCII/JIS Roman with double-byte JIS X 0208 kanji; still used in legacy Japanese systems.
What is Shift-JIS?
Shift-JIS (Shift Japanese Industrial Standards, also written Shift_JIS or SJIS) is a character encoding for Japanese text developed in 1982 by ASCII Corporation and later extended by Microsoft as Windows code page 932 (CP932). It uses a mixed single-byte and double-byte approach: ASCII-compatible characters and half-width katakana occupy single bytes, while full-width hiragana, katakana, and kanji occupy 2-byte sequences.
Shift-JIS remains significant because it is embedded in millions of legacy files, web pages (particularly Japanese websites from the 1990s and 2000s), game ROMs, and software applications. It is also the encoding of choice for some embedded systems and industrial equipment in Japan.
The Three Scripts of Japanese
Japanese writing uses three interleaved scripts:
- Hiragana (ひらがな): 46 syllabic characters for native Japanese words and grammatical particles
- Katakana (カタカナ): 46 syllabic characters primarily for foreign loan words
- Kanji (漢字): Thousands of Chinese-origin logographic characters for content words
Shift-JIS encodes all three, plus:
- Half-width katakana (single-byte, 0xA1–0xDF): a compact form used on older terminals
- Full-width ASCII and punctuation
- Special symbols, box-drawing characters, and more
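To make the byte widths concrete, here is a minimal sketch that encodes one sample character from each class with Python's built-in `shift_jis` codec and records how many bytes it takes. The sample characters and the helper name `sjis_length` are chosen for illustration only.

```python
# One illustrative sample per character class (arbitrary choices for this sketch).
samples = {
    "hiragana": "ひ",
    "katakana": "カ",
    "kanji": "漢",
    "half-width katakana": "ｶ",
    "full-width Latin": "Ａ",
    "ASCII": "A",
}


def sjis_length(ch):
    """Return the number of bytes `ch` occupies in Shift-JIS."""
    return len(ch.encode("shift_jis"))


byte_lengths = {label: sjis_length(ch) for label, ch in samples.items()}
for label, n in byte_lengths.items():
    print(f"{label}: {n} byte(s)")
```

Only ASCII and half-width katakana come out as single bytes; everything else, including full-width Latin letters, takes two.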
How Shift-JIS Encoding Works
The "shift" in the name describes how byte ranges are partitioned:
- 0x00–0x7F: ASCII-compatible single bytes (JIS X 0201 Roman: identical to US-ASCII except that 0x5C is the yen sign and 0x7E is the overline)
- 0xA1–0xDF: Half-width katakana (single bytes)
- 0x81–0x9F and 0xE0–0xFC: First bytes of 2-byte kanji/hiragana/katakana sequences
- Second byte: 0x40–0x7E and 0x80–0xFC (note: includes ASCII printable range!)
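The byte ranges above are enough to split a Shift-JIS byte string into characters without a mapping table. The following is a rough sketch of such a splitter; the function name `split_sjis` is hypothetical, and bytes outside the listed ranges (0x80, 0xA0, 0xFD–0xFF) are simply passed through as single bytes rather than rejected.

```python
def split_sjis(data):
    """Split Shift-JIS bytes into per-character chunks of 1 or 2 bytes."""
    chunks = []
    i = 0
    while i < len(data):
        b = data[i]
        is_lead = 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC
        if is_lead:
            if i + 1 >= len(data):
                raise ValueError(f"truncated 2-byte sequence at offset {i}")
            trail = data[i + 1]
            if not (0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFC):
                raise ValueError(f"invalid second byte 0x{trail:02X} at offset {i + 1}")
            chunks.append(data[i:i + 2])
            i += 2
        else:
            # ASCII/JIS Roman (0x00-0x7F) or half-width katakana (0xA1-0xDF)
            chunks.append(data[i:i + 1])
            i += 1
    return chunks


print(split_sjis("Aｶ表".encode("shift_jis")))
```

Because a second byte can look like ASCII, the scan has to start at a known character boundary and move forward; there is no reliable way to resynchronise from an arbitrary offset.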
The shift algorithm converts JIS X 0208 row-column pairs to Shift-JIS byte pairs using a mathematical transformation involving modulo arithmetic — this is what distinguishes it from simpler 2-byte encodings and what makes it harder to parse.
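The transformation can be sketched as follows, assuming the commonly documented mapping from a JIS X 0208 row and cell (each 1–94) to a Shift-JIS byte pair. The function name `kuten_to_sjis` is made up for this example, and the sketch covers only the standard 94×94 grid, not vendor extensions such as CP932's extra rows.

```python
def kuten_to_sjis(row, cell):
    """Convert a JIS X 0208 row/cell pair (1-94 each) to two Shift-JIS bytes."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must both be in 1..94")

    # Two JIS rows share one lead byte; rows 63+ jump over the
    # half-width katakana block (0xA1-0xDF) to land at 0xE0 and above.
    lead = (row + 1) // 2 + (0x80 if row <= 62 else 0xC0)

    if row % 2 == 1:
        # Odd rows use the lower half of the second-byte range, skipping 0x7F.
        trail = cell + 0x3F
        if cell >= 64:
            trail += 1
    else:
        # Even rows use the upper half: 0x9F-0xFC.
        trail = cell + 0x9E

    return bytes([lead, trail])


# Row 41, cell 29 is the kanji 表.
print(kuten_to_sjis(41, 29).hex())
```

The row parity test (`row % 2`) is the modulo step: it decides which half of the second-byte range a character falls into, which is why odd rows can produce second bytes in the ASCII range.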
The Security Hazard: Second Bytes in ASCII Range
The most dangerous feature of Shift-JIS is that the second byte of a two-byte sequence can take values 0x40–0x7E, which overlaps with printable ASCII characters including @, A–Z, a–z, and many others. The backslash \ (0x5C) is particularly hazardous. The single quote ' (0x27) lies below 0x40, so it can never be a second byte itself; it becomes dangerous only when a filter escapes it with a backslash that a preceding lead byte then absorbs. For example:
# 0x95 0x5C is a valid Shift-JIS two-byte character: the kanji 表 (U+8868)
b'\x95\x5c'.decode('shift_jis') # '表'
# Naive byte-level search for backslash in SJIS text is DANGEROUS
sjis_text = b'\x95\x5c something'
# The 0x5C is NOT a backslash here — it's the second byte of a kanji
sjis_text.find(b'\\') # Returns 1 — WRONG! It's inside a Japanese character
This property caused numerous security vulnerabilities in early Japanese web applications, particularly SQL injection and path traversal attacks, because server-side security filters scanned for or escaped \ and ' byte by byte, without understanding the multi-byte structure.
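As an illustration of how such a filter fails, the sketch below shows a byte-level quote escaper being defeated by a dangling lead byte, followed by a boundary-aware search that does not mistake second bytes for ASCII. Both `naive_escape` and `find_ascii_byte` are hypothetical helpers written for this example, not functions from any real framework.

```python
def naive_escape(raw):
    """Escape single quotes byte by byte, ignoring Shift-JIS structure (unsafe)."""
    return raw.replace(b"'", b"\\'")


# The attacker sends a lone lead byte 0x95 directly before the quote.
payload = b"\x95' OR 1=1 --"
escaped = naive_escape(payload)
# The inserted backslash (0x5C) pairs with 0x95 to form the kanji 表,
# so the quote is left unescaped once the text is interpreted as Shift-JIS.
decoded = escaped.decode("shift_jis")
print(decoded)


def find_ascii_byte(data, target):
    """Return the offset of `target` where it is a real single byte, else -1."""
    i = 0
    while i < len(data):
        b = data[i]
        if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:
            i += 2  # skip the lead byte together with its second byte
            continue
        if b == target:
            return i
        i += 1
    return -1


print(find_ascii_byte(b"\x95\x5c something", 0x5C))
```

In practice the simpler defence is usually to decode the bytes to a Unicode string first and do all searching, escaping, and validation on the decoded text.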
Code Examples
# Python: Shift-JIS encoding
text = 'こんにちは' # "Hello" (hiragana)
encoded = text.encode('shift_jis')
print(encoded) # b'\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd'
print(len(encoded)) # 10 bytes for 5 characters
# Windows variant (CP932) has additional characters
'①'.encode('cp932') # Circled digit — in CP932 but not standard SJIS
# '①'.encode('shift_jis') # Raises UnicodeEncodeError in Python's strict codec
# Mixed content
mixed = 'Hello こんにちは'
mixed.encode('shift_jis') # ASCII stays single-byte, Japanese becomes 2-byte
# Detecting Shift-JIS vs UTF-8 vs EUC-JP (chardet is a third-party package)
import chardet
with open('japanese_file.txt', 'rb') as f:
    raw = f.read()
detected = chardet.detect(raw) # A probabilistic guess, not a guarantee
print(detected['encoding']) # 'SHIFT_JIS', 'EUC-JP', 'UTF-8', etc.
Quick Facts
| Property | Value |
|---|---|
| Full Name | Shift Japanese Industrial Standards |
| Also Known As | Shift_JIS, SJIS, CP932 |
| Developed | 1982 (ASCII Corporation) |
| Bytes per character | 1 (ASCII/half-kana) or 2 (kanji/hiragana/katakana) |
| Windows code page | CP932 (Microsoft extension) |
| IANA name | Shift_JIS |
| Coverage | 6,355 kanji (JIS X 0208) + hiragana + katakana + symbols |
| Security hazard | ASCII-range second bytes (0x40–0x7E, including backslash 0x5C) |
Common Pitfalls
Using shift_jis vs cp932. Python's shift_jis codec implements the JIS standard strictly. cp932 (or ms932) is Microsoft's extension with additional characters (like circled numbers, special symbols). Japanese Windows files are almost always CP932, not strict Shift-JIS. Use cp932 unless you specifically need standard conformance.
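A short sketch of the difference, using only Python's built-in codecs: the circled digit encodes under `cp932` but not under `shift_jis`, and the same byte pair can even decode to different code points. The variable names are illustrative.

```python
circled_one = "①"
cp932_bytes = circled_one.encode("cp932")

try:
    circled_one.encode("shift_jis")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The same two bytes decode differently under the two codecs:
# WAVE DASH (U+301C) in shift_jis, FULLWIDTH TILDE (U+FF5E) in cp932.
wave_strict = b"\x81\x60".decode("shift_jis")
wave_ms = b"\x81\x60".decode("cp932")

print(cp932_bytes.hex(), strict_ok)
print(f"U+{ord(wave_strict):04X} vs U+{ord(wave_ms):04X}")
```

The second case matters for round-tripping: text decoded with one codec and re-encoded with the other can fail or change bytes even when every character looks identical on screen.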
Detecting encoding automatically. Many byte sequences are valid in both Shift-JIS and EUC-JP, so the two cannot always be told apart from the bytes alone. Automatic encoding detection (chardet, cchardet) gives probabilistic results. For Japanese legacy content, ask the data source which encoding was used whenever possible.
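When no metadata is available, one dependency-free fallback is to try strict decoding with each candidate and keep the ones that succeed. This is only a sketch: the function name `candidate_encodings` and the default candidate list are assumptions, and more than one candidate may survive, which is exactly the ambiguity described above.

```python
def candidate_encodings(raw, candidates=("utf-8", "cp932", "euc_jp")):
    """Return the candidate encodings under which `raw` decodes without error."""
    ok = []
    for name in candidates:
        try:
            raw.decode(name)  # strict error handling is the default
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


text = "こんにちは"
for enc in ("utf-8", "cp932", "euc_jp"):
    print(enc, "->", candidate_encodings(text.encode(enc)))
```

A successful decode only means the bytes are structurally valid in that encoding, not that the result is the intended text, so any surviving candidates still need a human or statistical check.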
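The Mojibake trap. Shift-JIS text containing Japanese characters, read as UTF-8, almost always raises a UnicodeDecodeError. Read as Latin-1, it decodes without error but produces mojibake (文字化け, "garbled characters"). The characteristic pattern for Shift-JIS-as-Latin-1 is a run of bytes in the 0x81–0x9F range, which Latin-1 maps to invisible C1 control characters, interleaved with accented letters and symbols.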
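The two failure modes, and the usual repair, can be reproduced in a few lines. This sketch assumes the garbled string came from a lossless Latin-1 decode; if the text passed through a lossy step (for example, replacement characters or a different 8-bit code page), the round trip will not recover it.

```python
original = "文字化け"
sjis_bytes = original.encode("shift_jis")

# Failure mode 1: decoding as UTF-8 raises an error.
try:
    sjis_bytes.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Failure mode 2: decoding as Latin-1 never fails, but yields mojibake.
garbled = sjis_bytes.decode("latin-1")

# Repair: Latin-1 maps every byte to one code point, so the original
# bytes can be recovered and decoded with the correct codec.
repaired = garbled.encode("latin-1").decode("shift_jis")

print(utf8_failed, repr(garbled), repaired)
```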