📜 Script Stories

Japanese Writing Systems

Japanese mixes three scripts, Hiragana, Katakana, and Kanji (CJK ideographs), alongside Latin text, making it one of the most complex writing systems to support in Unicode.

·

Japanese is arguably the most complex writing system in active use today. Where most languages use a single script, Japanese routinely mixes three — Hiragana, Katakana, and Kanji — plus Latin characters (romaji) and Arabic numerals, often within a single sentence. This guide explains how Unicode encodes the Japanese writing system, how the three scripts interact, and what developers need to know about Japanese text processing.

The Three Scripts of Japanese

Kanji (漢字)

Kanji are logographic characters borrowed from Chinese, where each character represents a word or morpheme. Japanese adopted Chinese characters beginning around the 5th century CE, adapting them to fit Japanese grammar and pronunciation.

Key characteristics of Kanji in Japanese:

Property                    | Detail
Origin                      | Chinese characters (hanzi)
Count in regular use        | ~2,136 (Joyo Kanji list)
Total in Unicode CJK blocks | 97,000+
Readings per character      | Typically 2+ (on'yomi: Chinese-derived; kun'yomi: native Japanese)
Usage                       | Content words (nouns, verb stems, adjective stems)

A single Kanji character can have multiple pronunciations depending on context. For example, 生 can be read as せい (sei), しょう (shou), い (i), う (u), なま (nama), き (ki), は (ha), or お (o) — over a dozen readings in total.

Hiragana (ひらがな)

Hiragana is a phonetic syllabary of 46 base characters (plus diacritical variants), each representing a mora (a phonological unit). It developed from cursive forms of Kanji during the Heian period (794–1185).

Usage                               | Example
Grammatical particles               | は (wa), が (ga), を (wo)
Verb and adjective endings          | 食べる (taberu): べる is hiragana
Native Japanese words without Kanji | きれい (kirei, "beautiful")
Furigana (reading aids above Kanji) | 漢字(かんじ)

Katakana (カタカナ)

Katakana is a second phonetic syllabary with the same number of characters as Hiragana, but with angular, distinct forms. It developed from fragments of Kanji used in Buddhist monastery annotations.

Usage                      | Example
Foreign loanwords          | コンピュータ (konpyuuta, "computer")
Scientific/technical terms | エネルギー (enerugii, "energy")
Emphasis (like italics)    | Used for effect in manga and ads
Onomatopoeia               | ドキドキ (dokidoki, heartbeat sound)

How They Mix

A typical Japanese sentence freely mixes all three scripts:

私はコーヒーを飲みました。

  私       → Kanji (watashi, "I")
  は       → Hiragana (topic particle)
  コーヒー → Katakana (loanword, "coffee")
  を       → Hiragana (object particle)
  飲       → Kanji (verb stem of 飲む, "to drink")
  みました → Hiragana (verb ending, polite past tense)

Translation: "I drank coffee." The subject (私, watashi) is Kanji, particles (は, を) are Hiragana, the loanword (コーヒー, coffee) is Katakana, and the verb ending (みました) is Hiragana attached to a Kanji stem (飲).

Unicode Encoding of Japanese

Hiragana Block

Block    | Range           | Characters
Hiragana | U+3040 – U+309F | 93 assigned

The block contains the 46 base hiragana, their voiced/semi-voiced variants (が, ぱ, etc.), small kana (ぁ, ぃ, etc.), and special marks like the iteration mark (ゝ) and voiced iteration mark (ゞ).
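These code points can be inspected directly with Python's standard unicodedata module; a quick sketch (nothing here is specific to this guide):

```python
import unicodedata

# The Hiragana block starts at U+3040; あ sits at U+3042.
a = "\u3042"
print(hex(ord(a)))          # 0x3042
print(unicodedata.name(a))  # HIRAGANA LETTER A

# Voiced kana like が are separate precomposed code points, but a
# combining voiced sound mark (U+3099) also exists; NFC composes it:
ga = unicodedata.normalize("NFC", "か\u3099")
print(ga, hex(ord(ga)))     # が 0x304c
```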

Katakana Blocks

Block                        | Range             | Characters
Katakana                     | U+30A0 – U+30FF   | 96 assigned
Katakana Phonetic Extensions | U+31F0 – U+31FF   | 16 characters
Kana Extended-A              | U+1B100 – U+1B12F | Historical kana
Kana Extended-B              | U+1AFF0 – U+1AFFF | Additional kana
Kana Supplement              | U+1B000 – U+1B0FF | Historical kana

The main Katakana block mirrors Hiragana's structure. The phonetic extensions contain small katakana used in Ainu language transcription.
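Because the two blocks are laid out in parallel, corresponding kana sit exactly 0x60 code points apart, which permits a simple conversion over the aligned portion of the blocks (a sketch; shared characters like the prolonged sound mark ー need no shift and are left alone):

```python
KANA_OFFSET = 0x60  # distance between a hiragana and its katakana twin

def hira_to_kata(text: str) -> str:
    # Shift only characters inside the main hiragana letter range (ぁ–ゖ);
    # everything else passes through unchanged.
    return "".join(
        chr(ord(ch) + KANA_OFFSET) if 0x3041 <= ord(ch) <= 0x3096 else ch
        for ch in text
    )

print(hira_to_kata("にほんご"))  # ニホンゴ
```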

CJK Unified Ideographs (Kanji)

This is where the complexity — and controversy — lies. Unicode uses Han Unification to encode Chinese, Japanese, and Korean ideographic characters in shared blocks:

Block                        | Range             | Count  | Description
CJK Unified Ideographs       | U+4E00 – U+9FFF   | 20,992 | Core set
CJK Extension A              | U+3400 – U+4DBF   | 6,592  | Rare characters
CJK Extension B              | U+20000 – U+2A6DF | 42,720 | Very rare
CJK Extension C              | U+2A700 – U+2B73F | 4,153  | Very rare
CJK Extension D              | U+2B740 – U+2B81F | 222    | Very rare
CJK Extension E              | U+2B820 – U+2CEAF | 5,762  | Very rare
CJK Extension F              | U+2CEB0 – U+2EBEF | 7,473  | Very rare
CJK Extension G              | U+30000 – U+3134F | 4,939  | Very rare
CJK Extension H              | U+31350 – U+323AF | 4,192  | Very rare
CJK Extension I              | U+2EBF0 – U+2EE5F | 622    | Recently added
CJK Compatibility Ideographs | U+F900 – U+FAFF   | 472    | Duplicates for compatibility

The Han Unification Controversy

Unicode's most debated design decision is Han Unification — encoding characters that are semantically equivalent across Chinese, Japanese, and Korean as a single code point, even when their visual forms differ by region. For example:

The character meaning "bone" (骨) has slightly different standard forms in Chinese (PRC), Chinese (Taiwan), Japanese, and Korean typography. Unicode assigns one code point (U+9AA8) for all four, relying on fonts and locale settings to render the regionally appropriate glyph.

Critics — especially in Japan — argue that this conflation loses culturally important distinctions. The Unicode Consortium maintains that these are glyph variants (like different typefaces), not different characters. Ideographic Variation Sequences (IVS), in which a base character is followed by a variation selector from the range U+E0100–U+E01EF (VS17–VS256), provide a mechanism to request a specific glyph variant when needed.
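At the string level an IVS is just a base character followed by a variation selector; whether the requested glyph actually appears depends on font support. A minimal illustration (the choice of 葛, which has registered variants in the Adobe-Japan1 collection, is an assumption made for the example):

```python
# U+845B (葛) followed by VS17 (U+E0100) requests a specific glyph variant.
base = "\u845B"
ivs = base + "\U000E0100"

print(len(ivs))     # 2 code points, but one "character" to the reader
print(ivs == base)  # False
# Plain string comparison treats the two as different, so search and
# deduplication logic may need to strip variation selectors first.
```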

Halfwidth and Fullwidth Forms

Unicode includes a compatibility block for fullwidth and halfwidth variants:

Block                         | Range           | Purpose
Halfwidth and Fullwidth Forms | U+FF00 – U+FFEF | Compatibility with legacy encodings

Japanese text traditionally uses fullwidth characters for Latin letters and digits (Ａ, Ｂ, Ｃ, １, ２, ３) to match the width of CJK characters in fixed-width text layouts. Unicode encodes these at U+FF01–U+FF5E. Conversely, halfwidth katakana (U+FF65–U+FF9F) were used in legacy systems with limited display space.

For normalization, NFKC and NFKD map fullwidth Latin characters to their regular ASCII equivalents and halfwidth katakana to fullwidth katakana.
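These mappings are visible directly through Python's unicodedata.normalize; a quick sketch:

```python
import unicodedata

# Fullwidth Latin letters and digits normalize to plain ASCII:
print(unicodedata.normalize("NFKC", "ＡＢＣ１２３"))  # ABC123

# Halfwidth katakana (with a halfwidth voiced mark) normalize to
# composed fullwidth katakana:
print(unicodedata.normalize("NFKC", "ｶﾞｷﾞ"))  # ガギ

# NFC/NFD leave compatibility variants untouched:
print(unicodedata.normalize("NFC", "Ａ") == "Ａ")  # True
```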

Japanese Text Processing

Character Width and Layout

Japanese is traditionally written vertically (top-to-bottom, columns right-to-left), though horizontal (left-to-right) writing is also common, especially on the web. CSS supports vertical Japanese with:

.vertical-japanese {
  writing-mode: vertical-rl;
  text-orientation: mixed; /* Kanji upright, Latin sideways */
}

Kanji, Hiragana, and Katakana are all fullwidth: each occupies a square cell of equal (em) width. This creates the clean grid-like appearance of Japanese text.
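Layout code can query this width class through the Unicode East_Asian_Width property, exposed in Python as unicodedata.east_asian_width; a sketch of terminal-style column counting:

```python
import unicodedata

def display_width(text: str) -> int:
    """Approximate column width in a fixed-width (terminal) layout."""
    # 'W' (Wide) and 'F' (Fullwidth) characters occupy two columns.
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in text)

print(unicodedata.east_asian_width("漢"))  # W  (Wide)
print(unicodedata.east_asian_width("Ａ"))  # F  (Fullwidth)
print(unicodedata.east_asian_width("A"))   # Na (Narrow)
print(display_width("日本語ABC"))          # 9 = 3*2 + 3*1
```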

Detecting Script Boundaries

Identifying which script a character belongs to is essential for Japanese text processing:

def japanese_script(char: str) -> str:
    cp = ord(char)
    if 0x3040 <= cp <= 0x309F:
        return "Hiragana"
    elif 0x30A0 <= cp <= 0x30FF or 0x31F0 <= cp <= 0x31FF or 0xFF65 <= cp <= 0xFF9F:
        return "Katakana"
    elif (0x4E00 <= cp <= 0x9FFF        # CJK Unified Ideographs
          or 0x3400 <= cp <= 0x4DBF     # Extension A
          or 0x20000 <= cp <= 0x2FA1F   # Extensions B–F, I; compatibility supplement
          or 0x30000 <= cp <= 0x323AF   # Extensions G–H
          or 0xF900 <= cp <= 0xFAFF):   # Compatibility Ideographs
        return "Kanji"
    elif 0x0020 <= cp <= 0x007E:
        return "ASCII"
    elif 0xFF01 <= cp <= 0xFF5E:
        return "Fullwidth Latin"
    return "Other"

text = "私はコーヒーを飲みました"
for ch in text:
    print(f"{ch} → {japanese_script(ch)}")

Input Methods (IME)

Typing Japanese on a computer requires an Input Method Editor (IME) because there is no practical way to have a key for every Kanji. The typical workflow:

  1. Type the pronunciation in romaji (e.g., "nihongo")
  2. The IME converts to hiragana: にほんご
  3. Press space to see Kanji candidates: 日本語, 日本後, etc.
  4. Select the correct conversion and press Enter

This means Japanese strings are composed through a multi-step process that applications must support via the IME API (TSF on Windows, IBus/Fcitx on Linux, macOS Input Sources).
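Step 2 of that workflow, romaji to kana, can be sketched as a greedy longest-match lookup. The table below covers only the syllables needed for the example; a real IME carries full conversion tables plus dictionary-backed Kanji candidates:

```python
# Toy romaji → hiragana table (illustrative subset only).
ROMAJI = {"ni": "に", "ho": "ほ", "n": "ん", "go": "ご", "ka": "か"}

def to_hiragana(romaji: str) -> str:
    out, i = [], 0
    while i < len(romaji):
        # Try the longest match first (2 letters, then 1).
        for size in (2, 1):
            chunk = romaji[i:i + size]
            if chunk in ROMAJI:
                out.append(ROMAJI[chunk])
                i += size
                break
        else:
            raise ValueError(f"no kana mapping at position {i}")
    return "".join(out)

print(to_hiragana("nihongo"))  # にほんご
```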

Line Breaking and Word Segmentation

Japanese does not use spaces between words, similar to Thai. However, the rules for line breaking differ — Japanese allows breaks between most character pairs, with specific prohibitions (e.g., never break before a closing bracket or period, never break after an opening bracket). The Unicode Line Breaking Algorithm (UAX #14) encodes these rules.
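The prohibition rules (kinsoku shori) can be sketched with small character classes. The sets below are illustrative examples only, not the full UAX #14 line-break classes:

```python
# Characters that must not begin a line (closing punctuation, small kana,
# the long vowel mark) and characters that must not end a line (opening
# punctuation). Illustrative subsets, not complete classes.
NO_BREAK_BEFORE = set("、。！？」』）ゃゅょっャュョッー")
NO_BREAK_AFTER = set("「『（")

def can_break_between(prev: str, nxt: str) -> bool:
    return nxt not in NO_BREAK_BEFORE and prev not in NO_BREAK_AFTER

print(can_break_between("す", "ま"))  # True: break allowed mid-text
print(can_break_between("た", "。"))  # False: 。 cannot start a line
print(can_break_between("「", "こ"))  # False: 「 cannot end a line
```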

For word segmentation, MeCab is the most widely used Japanese morphological analyzer:

import MeCab  # requires the mecab-python3 package and a dictionary (e.g., ipadic)

tagger = MeCab.Tagger("-Owakati")  # -Owakati: space-delimited (wakati-gaki) output
result = tagger.parse("東京都に住んでいます")
print(result)  # e.g. "東京 都 に 住ん で い ます\n" (segmentation depends on the dictionary)

Encoding History

Before Unicode, Japanese computing relied on several encoding standards:

Encoding    | Era    | Description
JIS X 0208  | 1978+  | ~6,900 characters, the foundation
Shift_JIS   | 1980s  | Microsoft/PC encoding, variable-width
EUC-JP      | 1980s  | Unix encoding
ISO-2022-JP | 1990s  | Email encoding (7-bit, escape sequences)
UTF-8       | 2000s+ | Now dominant on the web and in modern systems

Legacy data in Shift_JIS or EUC-JP is still encountered in older Japanese systems, and conversion to Unicode can expose mojibake if the source encoding is misidentified.
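Mojibake is easy to reproduce: decode bytes with the wrong codec and the text degrades. A quick sketch:

```python
text = "日本語"

# Round-tripping through the correct codec is lossless:
roundtrip = text.encode("shift_jis").decode("shift_jis")
print(roundtrip)  # 日本語

# Decoding UTF-8 bytes as Shift_JIS produces mojibake:
garbled = text.encode("utf-8").decode("shift_jis", errors="replace")
print(garbled)  # nonsense characters, not 日本語
```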

The Yen Sign Problem

One notorious legacy issue: in Shift_JIS, the byte 0x5C maps to the yen sign (¥), not the backslash. This means file paths on Japanese Windows systems historically displayed as C:¥Users¥名前 instead of C:\Users\名前. Unicode separates these: U+005C is REVERSE SOLIDUS (backslash) and U+00A5 is YEN SIGN. Converting Shift_JIS text to Unicode requires awareness of this mapping.
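Python's Shift_JIS codec follows the Unicode mapping and decodes 0x5C as a backslash; the yen-sign appearance was a font/locale convention on Japanese systems. A quick check (the path string is an illustrative example):

```python
raw = b"C:\x5cUsers"

# Unicode-era codecs decode 0x5C as REVERSE SOLIDUS (backslash)...
decoded = raw.decode("shift_jis")
print(decoded)  # C:\Users

# ...while the real yen sign is a separate code point, U+00A5:
print(hex(ord("¥")))  # 0xa5
```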

Ruby Annotations (Furigana)

Japanese text frequently includes furigana — small hiragana above (or beside, in vertical text) Kanji to indicate pronunciation. HTML supports this with the <ruby> element:

<ruby>漢字<rp>(</rp><rt>かんじ</rt><rp>)</rp></ruby>

Unicode also includes Interlinear Annotation characters (U+FFF9–U+FFFB), though HTML ruby is far more practical for web content.

Key Takeaways

  • Japanese uses three scripts simultaneously: Hiragana (phonetic, grammatical), Katakana (loanwords, emphasis), and Kanji (logographic, content words).
  • Unicode encodes Hiragana (U+3040–U+309F) and Katakana (U+30A0–U+30FF) in dedicated blocks, while Kanji share the CJK Unified Ideographs blocks with Chinese and Korean.
  • Han Unification merges semantically equivalent characters from CJK languages into single code points, relying on fonts and locale for regional glyph variants.
  • Fullwidth/halfwidth variants exist for compatibility with legacy Japanese encodings and are normalized away by NFKC/NFKD.
  • Japanese text processing requires an IME for input, MeCab or similar tools for word segmentation, and awareness of CJK-specific line-breaking rules.
  • The encoding history (Shift_JIS, EUC-JP, ISO-2022-JP) means legacy Japanese data requires careful conversion to avoid mojibake.
