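Security

Confusable Character

The official Unicode term for pairs of visually confusable characters, as defined in confusables.txt (one of the UTS #39 security data files). It is a broader concept than homoglyph and also covers characters that merely look similar.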


What is a Confusable Character?

In Unicode security terminology, a confusable is any character that a human reader might mistake for a different character because the two look alike. The Unicode Consortium formally defines and tracks confusables in Unicode Technical Standard #39 (UTS #39, Unicode Security Mechanisms) and publishes the data in a publicly available file, confusables.txt, which is released with that standard's security data rather than inside the Unicode Character Database itself.

Unlike strict homoglyphs, which are nearly pixel-perfect matches, confusables form a broader category. It includes characters that look similar under normal reading conditions even if a careful side-by-side comparison would reveal differences. How easily two characters are confused depends heavily on the font, the rendering size, and the reader's familiarity with the scripts involved.

The Unicode Confusables Dataset

The confusables.txt file maps thousands of Unicode code points to a prototype: the character (or short character sequence) chosen to represent everything that looks like it. For example:

  • Cyrillic а (U+0430) maps to Latin a (U+0061)
  • Greek ο (U+03BF) maps to Latin o (U+006F)
  • Fullwidth Latin Ａ (U+FF21) maps to Latin A (U+0041)
  • Mathematical bold 𝐚 (U+1D41A) maps to Latin a (U+0061)

In the raw data the mapping is one-way: each line points from a source character to its prototype, and the reverse "looks like" direction is not listed. Security implementations typically build a bidirectional index from this data so that any two characters that share a prototype are considered mutually confusable.
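As a rough illustration of the prototype mapping and the reverse index described above, the sketch below parses a few lines written in the confusables.txt layout (source ; target ; type, followed by a # comment) and groups characters by prototype. The embedded SAMPLE string and the helper names parse_confusables, build_groups, and mutually_confusable are assumptions made for this example; a real tool would read the full published file instead.

```python
from collections import defaultdict

# Illustrative sample in the confusables.txt line layout. This is NOT the
# real file, only the four mappings mentioned in the text.
SAMPLE = """\
0430 ; 0061 ; MA # ( а → a ) CYRILLIC SMALL LETTER A → LATIN SMALL LETTER A
03BF ; 006F ; MA # ( ο → o ) GREEK SMALL LETTER OMICRON → LATIN SMALL LETTER O
FF21 ; 0041 ; MA # ( Ａ → A ) FULLWIDTH LATIN CAPITAL LETTER A → LATIN CAPITAL LETTER A
1D41A ; 0061 ; MA # ( 𝐚 → a ) MATHEMATICAL BOLD SMALL A → LATIN SMALL LETTER A
"""


def parse_confusables(text):
    """Return {source string: prototype string} from confusables.txt-style text."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        # Each field is one or more space-separated hexadecimal code points.
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        mapping[source] = target
    return mapping


def build_groups(mapping):
    """Reverse index: prototype -> set of every string that shares it."""
    groups = defaultdict(set)
    for source, prototype in mapping.items():
        groups[prototype].add(prototype)
        groups[prototype].add(source)
    return groups


def mutually_confusable(a, b, mapping):
    """Two characters are treated as confusable if they share a prototype."""
    return mapping.get(a, a) == mapping.get(b, b)


mapping = parse_confusables(SAMPLE)
groups = build_groups(mapping)
print(sorted(f"U+{ord(ch):04X}" for ch in groups["a"]))
print(mutually_confusable("\u0430", "\U0001D41A", mapping))  # True
```

Cyrillic а and mathematical bold 𝐚 never point at each other in the data, yet the shared prototype "a" places them in the same group.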

Categories of Confusables

Confusables arise from several distinct sources:

  1. Cross-script lookalikes: Characters from different scripts that share a common ancestry or otherwise ended up with the same glyph shape (Latin vs. Cyrillic vs. Greek)
  2. Letterlike symbols: Mathematical, technical, and letterlike Unicode blocks contain styled versions of Latin letters (bold, italic, script, fraktur) that are visually indistinguishable from plain letters in some contexts
  3. Fullwidth and halfwidth forms: The Halfwidth and Fullwidth Forms block (U+FF00–U+FFEF) duplicates ASCII characters for CJK compatibility
  4. Digraphs and ligatures: Characters like ﬁ (U+FB01, fi ligature) can be confused with the two-character sequence "fi"
  5. Digit lookalikes: The digit 0 (U+0030) and uppercase letter O (U+004F) are confusable; so are 1 (U+0031), lowercase l (U+006C), and uppercase I (U+0049)
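One practical difference between these categories can be seen with Python's standard unicodedata module. In the sketch below, NFKC compatibility normalization folds the letterlike, fullwidth, and ligature examples back to plain ASCII, while the cross-script example is left unchanged, so it still needs confusables data. The SAMPLES table and the describe helper are names invented for this illustration.

```python
import unicodedata

# One representative character for four of the categories listed above.
SAMPLES = {
    "cross-script (Cyrillic a)": "\u0430",
    "letterlike (math bold a)": "\U0001D41A",
    "fullwidth (A)": "\uFF21",
    "ligature (fi)": "\uFB01",
}


def describe(ch):
    """Return (formal Unicode name, NFKC-normalized form) for one character."""
    return unicodedata.name(ch), unicodedata.normalize("NFKC", ch)


for label, ch in SAMPLES.items():
    name, folded = describe(ch)
    changed = "folded" if folded != ch else "unchanged"
    print(f"{label}: U+{ord(ch):04X} {name} -> {folded!r} ({changed})")
```

Normalization alone is therefore not a confusables check: it leaves Cyrillic а untouched, and it also does nothing for digit lookalikes such as 0 and O.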

Security Implications

Confusables are the technical foundation of several attack classes:

  • Phishing domains: Registering a domain with confusable characters to impersonate a legitimate site
  • Username squatting: Creating social media accounts that appear identical to celebrity or brand accounts
  • Code injection via source files: Inserting confusable characters into identifiers so that malicious code passes visual inspection
  • Credential stuffing assistance: Generating variant spellings of passwords or usernames
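To make the impersonation risk concrete, the following sketch compares a genuine name with a spoofed one that differs in a single code point. The strings and the helper names differing_positions and to_ascii are hypothetical, chosen only for this example; the ASCII form is produced with Python's built-in idna codec, which implements the older IDNA 2003 rules.

```python
import unicodedata

genuine = "paypal"
spoofed = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A (U+0430)


def differing_positions(a, b):
    """List (index, char_a, char_b) wherever two equal-length strings differ."""
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


def to_ascii(domain):
    """ASCII-compatible (Punycode) form of a domain, via the stdlib idna codec."""
    return domain.encode("idna").decode("ascii")


print(genuine == spoofed)  # False, although both usually render alike
for index, x, y in differing_positions(genuine, spoofed):
    print(index, unicodedata.name(x), "vs", unicodedata.name(y))

# The ASCII form exposes the difference that the rendered text hides.
print(to_ascii(genuine + ".com"))
print(to_ascii(spoofed + ".com"))
```

Comparing code points or the ASCII-compatible form, rather than the rendered text, is what lets software notice a substitution that a human reviewer would miss.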

Defensive Strategies

UTS #39 classifies confusable strings by comparing their skeletons (the result of replacing each character with its prototype) and by looking at the scripts involved:

  • Single-script confusables: Two distinct strings in the same script that share a skeleton, such as "pay1" and "payl" in Latin
  • Mixed-script confusables: Two strings that share a skeleton but are not both written in the same single script, such as "paypal" and "pаypal" with a Cyrillic а
  • Whole-script confusables: The subset of mixed-script confusables in which each string is written entirely in one script, such as Latin "scope" and Cyrillic "ѕсоре"

Modern web frameworks, domain registrars, and identity platforms increasingly apply these checks automatically.
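A minimal sketch of such a check follows, under two loud assumptions. First, PROTOTYPES is a tiny hand-picked table standing in for the full confusables.txt data. Second, Python's standard library does not expose the Unicode Script property, so the scripts helper guesses a script from the first word of each letter's character name, which is only a heuristic. The skeleton function follows the general shape used in UTS #39 (normalize to NFD, substitute prototypes, normalize again) but omits other details of the specification.

```python
import unicodedata

# Hand-picked sample mappings for illustration only, not the real data set.
PROTOTYPES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043E": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0455": "s",  # CYRILLIC SMALL LETTER DZE
    "1": "l",       # DIGIT ONE
    "I": "l",       # LATIN CAPITAL LETTER I
}


def skeleton(s):
    """Simplified skeleton: NFD, replace by prototype, NFD again."""
    s = unicodedata.normalize("NFD", s)
    s = "".join(PROTOTYPES.get(ch, ch) for ch in s)
    return unicodedata.normalize("NFD", s)


def scripts(s):
    """Heuristic script set: first word of each letter's Unicode name."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0] for ch in s if ch.isalpha()}


def classify(x, y):
    """Classify a pair of strings using the simplified skeleton above."""
    if x == y or skeleton(x) != skeleton(y):
        return "not confusable"
    sx, sy = scripts(x), scripts(y)
    if len(sx) <= 1 and len(sy) <= 1:
        # Strings without letters (digits only) are compatible with any script.
        same_script = not sx or not sy or sx == sy
        return "single-script" if same_script else "whole-script"
    return "mixed-script"


print(classify("pay1", "payl"))                             # single-script
print(classify("paypal", "p\u0430ypal"))                    # mixed-script
print(classify("scope", "\u0455\u0441\u043E\u0440\u0435"))  # whole-script
```

A production implementation would rely on real script data and the complete mapping file, for example through an ICU binding, rather than on name prefixes and a handful of entries.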

Quick Facts

Property                         Value
Governing standard               UTS #39, Unicode Security Mechanisms
Data file                        confusables.txt (UTS #39 security data files)
Number of confusable mappings    3,000+ code points mapped
Update frequency                 With each Unicode release
Prototype concept                Characters map to a single "representative" lookalike
Related attacks                  IDN homograph, phishing, code obfuscation
Key distinction from homoglyph   Broader: includes near-matches, not just identical glyphs

Related Terms

More terms in Security

Bidi Text Attack

Exploiting Unicode bidirectional control characters to disguise malicious code or filenames. The …

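Bidi Override Attack

An attack that uses Unicode bidirectional control characters (U+202A–U+202E and U+2066–U+2069) to disguise malicious filenames or code. For example, 'readme' + U+202E + 'fdp.exe' is displayed as 'readmeexe.pdf'.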

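IDN Homograph Attack

An attack that impersonates a legitimate site by using visually similar Unicode characters in a domain name. аpple.com (with a Cyrillic а) looks like apple.com. Browsers defend against it with rules for when to display the Punycode form.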

Normalization Attack

Exploiting Unicode normalization to bypass security filters. Input validated before normalization may …

Unicode Spoofing

Using Unicode features to deceive users: homoglyphs for fake domains, Bidi overrides for fake file extensions, and invisible characters for hidden text.

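Zero Width Joiner (ZWJ)

U+200D. Requests that adjacent characters be joined. It is essential to emoji sequences (👩 + ZWJ + 💻 = 👩‍💻), and in Indic scripts it requests conjunct forms. It is also used to hide text boundaries.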

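Zero Width Non-Joiner (ZWNJ)

U+200C. Prevents adjacent characters from joining. It is required for correct letter forms in Persian and Arabic, and it is also used in Devanagari to prevent conjuncts.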

Homoglyph

Characters from different scripts that look identical or very similar, for example Latin 'a' and Cyrillic 'а'. Used in phishing, spoofing, and social engineering attacks.

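Mixed-Script Detection

Identifies text that mixes characters from different scripts (for example, Latin + Cyrillic). It is a primary defense against homoglyph attacks, and browsers use it to trigger Punycode display.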