Writing Systems of the World · Chapter 6

Japanese: Three Scripts in One

Japanese uniquely combines three script systems: Hiragana, Katakana, and Kanji. This chapter explores mixed-script usage, fullwidth forms, ruby annotations, and the technical challenges of Japanese text processing.

~4500 words · ~18 min read

Open any Japanese newspaper, novel, or website and you will encounter something unique in the world of writing: three fundamentally different scripts operating simultaneously in the same sentence, each with its own character set, its own historical origins, and its own functional role in contemporary Japanese. The mixture of kanji (Chinese-derived ideographs), hiragana (a phonetic syllabary for native Japanese words and grammar), and katakana (a phonetic syllabary for foreign loanwords and emphasis) is not a system in disarray — it is a richly expressive, highly functional writing system that encodes meaning, origin, and tone simultaneously through script choice.

Three Scripts, Three Stories

Kanji (漢字) arrived in Japan from China, probably via Korea, around the 4th–5th centuries CE. Initially, Japanese was written using Classical Chinese — much as medieval European scholars wrote in Latin regardless of their native language. Over centuries, Japanese scribes adapted kanji to represent the sounds of Japanese, eventually giving birth to the two phonetic scripts.

Hiragana (ひらがな) emerged in the Heian period (794–1185 CE) as a cursive simplification of kanji used phonetically. Court ladies, largely excluded from formal Chinese education, used and refined hiragana for poetry and correspondence; the Genji Monogatari (The Tale of Genji), arguably the world's first novel, was written largely in hiragana by Lady Murasaki Shikibu around 1000 CE. Today hiragana is the default script for grammatical elements (verb endings, particles, conjunctions) and for words without standard kanji.

Katakana (カタカナ) developed similarly but from angular portions of kanji, originally used by Buddhist monks as phonetic glosses on Chinese texts. Today katakana serves distinctly different functions: foreign loanwords (koohii コーヒー for "coffee," terebi テレビ for "television"), emphasis (analogous to italics), scientific names, onomatopoeia, and sometimes as a stylistic choice for brand names or foreign-sounding effect.

Unicode Blocks for Japanese

| Block | Range | Assigned characters | Content |
|---|---|---|---|
| Hiragana | U+3040–U+309F | 93 | 46 basic + dakuten/handakuten variants, small forms |
| Katakana | U+30A0–U+30FF | 96 | 46 basic + variants, prolonged sound mark, iteration marks |
| Halfwidth/Fullwidth Forms | U+FF00–U+FFEF | 225 | Compatibility forms (halfwidth katakana, fullwidth Latin) |
| CJK Unified Ideographs | U+4E00–U+9FFF+ | 20,902+ | Kanji (shared with Chinese, Korean) |
| Katakana Phonetic Extensions | U+31F0–U+31FF | 16 | Rare phonetic extensions for Ainu, etc. |
| Small Kana Extension | U+1B130–U+1B16F | 9 (as of Unicode 15.0) | Small versions for historical and minority language use |

The hiragana syllabary encodes the 46 basic syllables of Japanese phonology, plus:

- Voiced variants marked with dakuten (spacing form ゛ U+309B; combining form U+3099): か→が, き→ぎ, etc.
- P-variants marked with handakuten (spacing form ゜ U+309C; combining form U+309A): は→ぱ, ひ→ぴ, etc.
- Small forms for compound sounds: っ, ぁ, ぃ, ぅ, ぇ, ぉ, ゃ, ゅ, ょ
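As a small sketch of how the combining marks relate to the precomposed voiced kana, the snippet below attaches a combining dakuten or handakuten and lets NFC normalization compose the pair into a single code point where one exists. The helper name `add_mark` is made up for this illustration; only Python's standard `unicodedata` module is assumed.

```python
import unicodedata

COMBINING_DAKUTEN = "\u3099"     # COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
COMBINING_HANDAKUTEN = "\u309A"  # COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK


def add_mark(kana: str, mark: str) -> str:
    """Append a combining mark, then compose with NFC when a precomposed form exists."""
    return unicodedata.normalize("NFC", kana + mark)


print(add_mark("か", COMBINING_DAKUTEN))       # が (U+304C)
print(add_mark("は", COMBINING_HANDAKUTEN))    # ぱ (U+3071)
# あ has no precomposed voiced form, so the result stays two code points long.
print(len(add_mark("あ", COMBINING_DAKUTEN)))  # 2
```

Text that arrives in decomposed form (base kana plus combining mark) will not compare equal to the precomposed form unless both sides are normalized first.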

Fullwidth and Halfwidth: A Historical Artifact

Japanese computing carries the legacy of hardware constraints in two variant character sets:

Fullwidth forms (全角) are characters that occupy the same width as a CJK character: a square cell. The fullwidth Latin letters Ａ–Ｚ (U+FF21–U+FF3A) and digits ０–９ (U+FF10–U+FF19) are used in contexts where consistent column width is needed (monospace CJK text, forms, code in Japanese documentation).

Halfwidth katakana (半角カタカナ, U+FF61–U+FF9F) were introduced for early Japanese terminal hardware that could only display characters in a single fixed-width grid. These 63 characters (59 katakana letters and marks, plus four halfwidth punctuation marks at U+FF61–U+FF64) encode katakana in half the horizontal space of normal kanji. They are now considered legacy but still appear in old databases, mainframe systems, and some specialized contexts.

Unicode normalization (NFKC) maps halfwidth katakana to their fullwidth equivalents and fullwidth Latin letters and digits to their ASCII equivalents. This is essential for search and comparison operations: matching halfwidth ｶﾞｰﾄﾞ with fullwidth ガード requires normalization.
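A minimal sketch of width-insensitive comparison, assuming only Python's standard `unicodedata` module; the function names `fold_width` and `same_text` are illustrative, not part of any library. The strings are written as escapes so the halfwidth and fullwidth forms cannot be confused in transit.

```python
import unicodedata


def fold_width(text: str) -> str:
    """Apply NFKC, which folds halfwidth kana and fullwidth Latin to their standard forms."""
    return unicodedata.normalize("NFKC", text)


def same_text(a: str, b: str) -> bool:
    """Compare two strings while ignoring width (and other compatibility) differences."""
    return fold_width(a) == fold_width(b)


halfwidth = "\uFF76\uFF9E\uFF70\uFF84\uFF9E"  # halfwidth ka + voiced mark, long mark, to + voiced mark
standard = "\u30AC\u30FC\u30C9"               # ガード

print(halfwidth == standard)             # False: different code points
print(same_text(halfwidth, standard))    # True after NFKC
print(fold_width("\uFF21\uFF22\uFF23\uFF11\uFF12\uFF13"))  # ABC123
```

NFKC also folds other compatibility characters (ligatures, squared abbreviations, and so on), so it is usually applied to a search key rather than to the stored text itself.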

Ruby and Furigana

One of the most distinctive features of typeset Japanese is ruby (ルビ) — small phonetic annotations placed above or beside kanji to indicate their reading. Also called furigana (振り仮名), these annotations are essential in children's books, newspapers (for rare characters), and any context where readers may not know a character's reading.

In HTML, the <ruby> element provides semantic markup for ruby text:

<ruby>漢字<rt>かんじ</rt></ruby>

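Unicode plain text has no standard way to represent ruby; it is a layout and presentation concept handled by markup. (The interlinear annotation characters U+FFF9–U+FFFB exist, but they are meant for internal processing rather than interchange.) The small kana forms (っ, ゃ, ゅ, ょ, and the rarer forms in the Small Kana Extension block) can appear in furigana to represent compound sounds compactly.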
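Because ruby lives in markup rather than in the text itself, an application typically stores base text and readings separately and generates the `<ruby>` element at render time. The sketch below is one possible way to do that; `ruby` and `annotate` are hypothetical helper names, and the `<rp>` fallback parentheses are an addition not shown in the example above (they are standard HTML, displayed only by renderers without ruby support).

```python
from html import escape


def ruby(base: str, reading: str) -> str:
    """Wrap base text and its reading in <ruby> markup with <rp> fallback parentheses."""
    return (
        f"<ruby>{escape(base)}"
        f"<rp>(</rp><rt>{escape(reading)}</rt><rp>)</rp></ruby>"
    )


def annotate(pairs) -> str:
    """Render (text, reading) pairs; a reading of None means plain, unannotated text."""
    return "".join(ruby(t, r) if r else escape(t) for t, r in pairs)


print(annotate([("漢字", "かんじ"), ("を", None), ("読", "よ"), ("む", None)]))
# <ruby>漢字<rp>(</rp><rt>かんじ</rt><rp>)</rp></ruby>を<ruby>読<rp>(</rp><rt>よ</rt><rp>)</rp></ruby>む
```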

Japanese IME: The Most Complex Input Problem

Entering Japanese text on a keyboard is a multi-stage process:

  1. Romaji input: The user types a romanized representation (e.g., "nihon" for 日本)
  2. Kana conversion: The IME converts romaji to hiragana (にほん)
  3. Kanji conversion: The IME presents kanji candidates based on context (日本 "Japan" vs. 二本 "two long, thin objects")
  4. Confirmation: The user selects the correct candidate and presses Enter

Modern Japanese IMEs rely on statistical language models trained on massive corpora to achieve high conversion accuracy, since Japanese is rife with homophones. The reading にほん (nihon) could be written 日本, 二本, or with various other kanji depending on meaning. Conversion accuracy is a product quality differentiator: Microsoft IME, Google Japanese Input, and ATOK compete vigorously on it.

The IME state machine must also track partial syllables: typing "k" alone is not yet a valid syllable, but typing "ka" gives か. This creates a cursor behavior unique to Japanese input where the "preedit string" (uncommitted text) behaves differently from confirmed text.
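The partial-syllable tracking described above can be sketched as a tiny state machine. This is a toy under stated assumptions, not how any shipping IME works: the `Preedit` class and the deliberately tiny `ROMAJI_TO_KANA` table are invented for illustration, and kanji conversion, the sokuon (っ), and most syllables are left out.

```python
# Tiny illustrative romaji table; a real IME covers the full syllabary.
ROMAJI_TO_KANA = {
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お",
    "ka": "か", "ki": "き", "ku": "く", "ke": "け", "ko": "こ",
    "na": "な", "ni": "に", "nu": "ぬ", "ne": "ね", "no": "の",
    "ha": "は", "hi": "ひ", "hu": "ふ", "he": "へ", "ho": "ほ",
    "nn": "ん",
}
# Every proper prefix of a key is a valid "partial syllable" state.
PREFIXES = {key[:i] for key in ROMAJI_TO_KANA for i in range(1, len(key))}


class Preedit:
    """Holds uncommitted text: converted kana plus pending romaji."""

    def __init__(self):
        self.kana = ""     # kana already converted but not yet committed
        self.pending = ""  # romaji typed so far that is not yet a syllable

    def feed(self, ch: str) -> None:
        buf = self.pending + ch
        if buf in ROMAJI_TO_KANA:        # complete syllable
            self.kana += ROMAJI_TO_KANA[buf]
            self.pending = ""
        elif buf in PREFIXES:            # still a partial syllable
            self.pending = buf
        elif self.pending:               # dead end: resolve pending, retry ch
            self.kana += "ん" if self.pending == "n" else self.pending
            self.pending = ""
            self.feed(ch)
        else:                            # unknown key: pass through unchanged
            self.kana += ch

    @property
    def display(self) -> str:
        """What the user sees in the preedit area."""
        return self.kana + self.pending

    def commit(self) -> str:
        """Finalize the preedit string; a trailing lone 'n' becomes ん."""
        text = self.kana + ("ん" if self.pending == "n" else self.pending)
        self.kana = self.pending = ""
        return text


p = Preedit()
for key in "nihon":
    p.feed(key)
print(p.display)   # にほn  (the final "n" is still pending)
print(p.commit())  # にほん
```

The split between `display` and `commit` mirrors the distinction between the preedit string and confirmed text: the application only receives text once `commit` is called.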

Text Segmentation Without Spaces

Japanese prose is typically written without spaces between words. This creates a fundamental challenge for any application that needs to break text into words or lines:

Line breaking cannot simply break at spaces (there are none). Instead, Japanese line breaking rules (kinsoku shori, 禁則処理) specify which characters may not start a line (closing punctuation such as 。、」), which may not end a line (opening brackets such as 「), and how to handle long katakana words.
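A rough sketch of kinsoku-aware wrapping follows. It assumes every character is one column wide and uses a small illustrative subset of prohibited characters (closing punctuation may not start a line, opening brackets may not end one). The `wrap` function and both character sets are hypothetical; production layout engines follow fuller rule sets such as those in JIS X 4051 or the Unicode line breaking algorithm.

```python
# Illustrative subsets only; real rule sets are considerably longer.
NO_LINE_START = set("。、」』）？！ー")  # closing punctuation and similar
NO_LINE_END = set("「『（")              # opening brackets


def break_ok(text: str, i: int) -> bool:
    """True if a line may end just before index i (requires 0 < i < len(text))."""
    return text[i] not in NO_LINE_START and text[i - 1] not in NO_LINE_END


def wrap(text: str, width: int) -> list[str]:
    """Greedy wrap: when a break is prohibited, push characters to the next line."""
    lines, start = [], 0
    while len(text) - start > width:
        end = start + width
        # Back up until the break position satisfies both rules.
        while end > start + 1 and not break_ok(text, end):
            end -= 1
        if not break_ok(text, end):
            end = start + width  # no legal break found: fall back to a hard break
        lines.append(text[start:end])
        start = end
    lines.append(text[start:])
    return lines


for line in wrap("彼は「こんにちは」と言った。", 8):
    print(line)
# 彼は「こんにち
# は」と言った。
```

A naive break at column 8 would have put 」 at the start of the second line; the sketch instead ends the first line one character early.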

Word segmentation requires morphological analysis. MeCab, JUMAN, and Janome are popular Japanese morphological analyzers that segment text into words based on dictionaries and statistical models (KNP, often mentioned alongside them, is a dependency parser that runs on top of JUMAN output). These are prerequisites for search engines, text-to-speech, and many NLP applications.
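To make the dictionary-based idea concrete, here is a toy forward maximum-matching segmenter. It is not how MeCab or the other analyzers work (MeCab, for instance, builds a lattice of candidate words and picks the lowest-cost path); it only illustrates why a lexicon is needed at all. The `segment` function and the tiny `LEXICON` are invented for this example.

```python
def segment(text: str, lexicon: set[str]) -> list[str]:
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    max_len = max(map(len, lexicon))
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a lexicon hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical miniature lexicon for the demo sentence.
LEXICON = {"東京", "東京都", "都", "に", "住ん", "で", "い", "ます"}

print(segment("東京都に住んでいます", LEXICON))
# ['東京都', 'に', '住ん', 'で', 'い', 'ます']
```

Greedy matching fails on ambiguous inputs where the longest first word is the wrong one, which is exactly the gap that statistical models fill.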

Vertical writing (縦書き, tategaki) is common in traditional Japanese publishing. Unicode and CSS support vertical text orientation; characters may need to rotate 90° (Latin letters rotate, CJK does not), and some punctuation characters have specific vertical forms.

Mixed-Script Text in Practice

A single Japanese sentence might employ all three scripts plus Arabic numerals and Latin letters. Consider:

2024年のiPhone発売日はいつ？

This sentence uses Arabic numerals (2024), kanji (年, 発売日), Latin letters (iPhone), hiragana (の, は, いつ), and the fullwidth question mark ？ (U+FF1F). Each segment follows different rendering rules, collation behavior, and input method behavior, yet they must flow together seamlessly.
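The breakdown of this sentence can be reproduced with a simple classifier over the block ranges listed earlier in the chapter. This is a sketch with assumed names (`script_of`, `script_runs`): it checks only the main blocks, treats everything else as "other", and ignores the finer distinctions of the Unicode Script property.

```python
from itertools import groupby


def script_of(ch: str) -> str:
    """Classify one character by code point range (main blocks only)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF or 0x31F0 <= cp <= 0x31FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    if ch.isascii() and ch.isdigit():
        return "digit"
    if ch.isascii() and ch.isalpha():
        return "latin"
    return "other"


def script_runs(text: str) -> list[tuple[str, str]]:
    """Group consecutive characters that share a script into (script, run) pairs."""
    return [(name, "".join(chars)) for name, chars in groupby(text, key=script_of)]


for name, run in script_runs("2024年のiPhone発売日はいつ？"):
    print(f"{name:9}{run}")
# digit    2024
# kanji    年
# hiragana の
# latin    iPhone
# kanji    発売日
# hiragana はいつ
# other    ？
```

Run boundaries like these are a common first step for font fallback and for choosing per-script rendering rules, though they do not correspond to word boundaries.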

The complexity of Japanese text handling (three scripts, two of them phonetic syllabaries, plus a complex IME, ruby annotation, vertical layout, and word segmentation) has made Japanese a forcing function for some of the most sophisticated text layout features built on Unicode. Many capabilities that benefit scripts worldwide were first developed to support Japanese.