Unicode Standard

Unicode

A universal character encoding standard that assigns a unique number (a code point) to every character in every writing system. Version 16.0 contains 154,998 assigned characters.

What is Unicode?

Unicode is the universal character encoding standard that assigns a unique number — called a code point — to every character in every writing system on Earth. Before Unicode existed, computers relied on hundreds of incompatible encoding systems: Windows-1252 for Western Europe, Shift-JIS for Japanese, GB2312 for Simplified Chinese. Moving text between these systems produced mojibake (文字化け), garbled output caused by each system interpreting the same byte sequence differently.
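The mismatch described above can be reproduced in a few lines. This is a minimal sketch that assumes Python's built-in `shift_jis` and `cp1252` codecs as stand-ins for the legacy systems named in the text; the variable names are illustrative only.

```python
# Encode Japanese text as Shift-JIS, then misread the same bytes as Windows-1252.
text = "文字化け"
raw = text.encode("shift_jis")

# errors="replace" guards against byte values that Windows-1252 leaves undefined.
garbled = raw.decode("cp1252", errors="replace")

# Decoding with the codec that actually produced the bytes recovers the text.
restored = raw.decode("shift_jis")

print(garbled)   # Western European symbols instead of Japanese: mojibake
print(restored)  # the original string
```

The bytes never change; only the interpretation does, which is why the damage is reversible as long as the original encoding is known.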

Unicode solved this by establishing a single, shared namespace: one number, one character, no ambiguity. The standard covers scripts from Latin to Arabic, emoji, mathematical symbols, ancient scripts like Linear B, and even private-use areas for custom characters.

How Unicode Works

Unicode separates two concerns that older encodings conflated:

  1. The character repertoire — which characters exist and what their code points are
  2. The encoding form — how those code points are serialized into bytes (UTF-8, UTF-16, UTF-32)

This separation means you can transmit Unicode text in the encoding best suited to your context. UTF-8 is dominant on the web; UTF-16 is used internally by Java, JavaScript, and Windows; UTF-32 offers fixed-width simplicity for internal processing.
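To make the split between repertoire and encoding form concrete, the sketch below serializes one code point three ways using Python's standard codecs. The big-endian variants are chosen here only so that no byte order mark is prepended; that choice is an assumption of the example, not something the text prescribes.

```python
ch = "\U0001F30D"  # U+1F30D, a code point outside the Basic Multilingual Plane
code_point = ord(ch)

utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")  # big-endian, no byte order mark
utf32 = ch.encode("utf-32-be")  # big-endian, no byte order mark

print(f"U+{code_point:04X}")
print(utf8.hex(" "))   # f0 9f 8c 8d  (4 bytes)
print(utf16.hex(" "))  # d8 3c df 0d  (one surrogate pair)
print(utf32.hex(" "))  # 00 01 f3 0d  (the code point itself, zero-padded)
```

The code point is the same in all three cases; only the byte-level representation differs.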

The Unicode Standard

The Unicode Standard is a living specification maintained by the Unicode Consortium. Each version adds new characters, scripts, and emoji. Version 16.0 (September 2024) contains 154,998 assigned characters across 168 scripts. The standard defines not just code points, but also:

  • Character properties: General category (letter, digit, punctuation...), bidirectional class, combining class, case mappings, and dozens more
  • Algorithms: Unicode Bidirectional Algorithm (UBA) for mixed-direction text, Unicode Collation Algorithm (UCA) for sorting, line-breaking rules, normalization forms
  • Named sequences: Pre-defined sequences of code points with official names
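Several of these properties can be inspected from Python's standard `unicodedata` module. The sketch below is illustrative only: the sample characters are arbitrary, and the values returned reflect whichever Unicode version ships with the interpreter, which may lag behind the latest release.

```python
import unicodedata

samples = ["A", "5", "\u00e9", "\u0301", "\u05d0"]  # A, 5, é, combining acute, Hebrew alef

for ch in samples:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),       # general category, e.g. Lu, Nd, Mn
        unicodedata.bidirectional(ch),  # bidirectional class, e.g. L, R, NSM
        unicodedata.combining(ch),      # canonical combining class (0 = not combining)
    )

# Normalization: "e" + COMBINING ACUTE ACCENT composes to a single "é" under NFC,
# and NFD decomposes it again.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
round_trip = unicodedata.normalize("NFD", composed)
print(len(decomposed), len(composed), round_trip == decomposed)
```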

Concrete Examples

# Python: every string is Unicode by default (Python 3)
s = "Hello, 世界! 🌍"
print(len(s))        # 12 code points
print(s[8])          # 界
print(ord(s[8]))     # 30028 (decimal) = U+754C
print(f"U+{ord(s[8]):04X}")  # U+754C

// JavaScript: strings are UTF-16 internally
const s = "Hello, 世界! 🌍";
console.log(s.length);          // 13 (🌍 counts as 2 UTF-16 code units)
console.log([...s].length);     // 12 (the string iterator yields whole code points)

Common Misconceptions

"Unicode is an encoding" — Unicode is a character set standard; UTF-8, UTF-16, and UTF-32 are the encodings that serialize Unicode code points into bytes.

"Unicode only covers modern scripts" — Unicode includes dozens of historic scripts (Egyptian Hieroglyphs, Cuneiform, Old Persian) and even some invented scripts such as Shavian and Deseret (proposals for Tengwar exist but have not been accepted).

"All Unicode characters fit in 2 bytes" — Only the Basic Multilingual Plane (U+0000–U+FFFF) fits in 16 bits. Characters above U+FFFF require 4 bytes in UTF-8 and a surrogate pair (two 16-bit code units) in UTF-16.
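The size boundaries are easy to check empirically. The following sketch measures encoded lengths with Python's built-in codecs and also computes a UTF-16 surrogate pair by hand; the helper names `utf8_len`, `utf16_units`, and `surrogate_pair` are made up for this illustration.

```python
def utf8_len(cp: int) -> int:
    """Number of bytes the code point occupies in UTF-8."""
    return len(chr(cp).encode("utf-8"))

def utf16_units(cp: int) -> int:
    """Number of 16-bit code units the code point occupies in UTF-16."""
    return len(chr(cp).encode("utf-16-be")) // 2

def surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into its high and low surrogates."""
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

for cp in (0x41, 0xE9, 0x4E16, 0xFFFF, 0x10000, 0x1F30D):
    print(f"U+{cp:04X}: {utf8_len(cp)} UTF-8 bytes, {utf16_units(cp)} UTF-16 code units")

high, low = surrogate_pair(0x1F30D)
print(f"U+1F30D -> {high:04X} {low:04X}")  # D83C DF0D
```

Note that even inside the Basic Multilingual Plane, UTF-8 uses 3 bytes for U+0800 through U+FFFF, so "fits in 16 bits" does not mean "fits in 2 bytes" in every encoding.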

Quick Facts

Property                      Value
First version                 Unicode 1.0 (1991)
Current version               16.0 (September 2024)
Total code space              1,114,112 code points (U+0000–U+10FFFF)
Assigned characters (v16.0)   154,998
Number of scripts             168
Maintained by                 Unicode Consortium
Synchronized standard         ISO/IEC 10646
Dominant web encoding         UTF-8 (98%+ of websites)

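Related Terms

More terms in Unicode Standard

CJK (Han, Kana, Hangul)

Chinese, Japanese, and Korean: a collective term for the unified Han ideograph blocks and related scripts in Unicode. The CJK Unified Ideographs include at least 20,992 characters.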

Han Unification

The process of mapping Chinese, Japanese, and Korean ideographs that share a …

Hangul Jamo

The individual consonant and vowel components (jamo) of the Korean Hangul writing …

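ISO 10646 / Universal Character Set

The international standard (ISO/IEC 10646) that is synchronized with Unicode. It defines the same character repertoire and code points but does not include Unicode's additional algorithms and properties.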

Unicode Standard Annex (UAX)

Normative or informative documents that are integral parts of the Unicode Standard. …

Unicode Technical Report (UTR)

Informational documents published by the Unicode Consortium covering specific topics like security …

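Unicode Consortium

The non-profit organization that develops and maintains the Unicode Standard. Its members include many companies, such as Apple, Google, Microsoft, and Meta.

Unicode Scalar Value

Any code point except the surrogate code points (U+D800–U+DFFF). This is the set of valid values that can represent actual characters, 1,112,064 in total.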
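The total follows directly from the size of the code space minus the surrogate range. The sketch below works through the arithmetic and shows how Python's UTF-8 codec treats a lone surrogate; `is_scalar_value` is a hypothetical helper written for this example.

```python
CODE_SPACE = 0x10FFFF + 1         # 17 planes x 65,536 code points = 1,114,112
SURROGATES = 0xDFFF - 0xD800 + 1  # 2,048 surrogate code points
SCALARS = CODE_SPACE - SURROGATES # 1,112,064 scalar values

def is_scalar_value(cp: int) -> bool:
    """True if cp is a code point that is not a surrogate."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)

# A lone surrogate is a code point but not a scalar value, so UTF-8 refuses it.
try:
    chr(0xD800).encode("utf-8")
    lone_surrogate_encodes = True
except UnicodeEncodeError:
    lone_surrogate_encodes = False

print(SCALARS, is_scalar_value(0xD800), lone_surrogate_encodes)
```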

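Unicode Version

A major release of the Unicode Standard that adds new characters, scripts, and features. The current version is Unicode 16.0 (September 2024).

Unicode Stability Policy

A policy guaranteeing that the code point and name of a character, once assigned, will never change. Properties may be revised, but assignments are permanent.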