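What is a Script?

The writing system a character belongs to (e.g., Latin, Cyrillic, Han). Unicode 16.0 defines 168 scripts, and the Script property is important for security and mixed-script detection.

Security

Mixed-Script Detection

Identifies text that mixes characters from different scripts (e.g., Latin + Cyrillic). It is a key defense against homoglyph attacks, and browsers use it to trigger Punycode display.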

2025-02-28 · Updated 2025-12-08

What is Mixed-Script Detection?

Mixed-script detection is a security technique that identifies text strings containing characters from more than one Unicode script, flagging them as potentially deceptive. Because legitimate text in most contexts is written in a single script — English in Latin, Russian in Cyrillic, Arabic in the Arabic script — the presence of multiple scripts in a single identifier, domain name, or username is a strong signal of a spoofing attempt.

Unicode Technical Standard #39 (Unicode Security Mechanisms), commonly cited as TR39, formalizes mixed-script detection as one of the primary defenses against homoglyph and confusable attacks.

Unicode Scripts

Unicode organizes characters into scripts: collections of letters and other signs used to write one or more languages. Every code point has exactly one Script property value. Most characters take the value of a specific script, characters shared across writing systems take the special values Common or Inherited, and unassigned code points default to Unknown. The Unicode Character Database includes a data file, Scripts.txt, that assigns each code point range to its script.

Examples of scripts: Latin, Cyrillic, Greek, Armenian, Hebrew, Arabic, Devanagari, Bengali, Han (CJK ideographs), Hiragana, Katakana, Thai, Georgian.

Digits (0–9), punctuation such as ., -, and @, and other characters shared across writing systems have the Script value Common. Combining marks that take on the script of their base character have the value Inherited. Both are allowed in any script context without triggering mixed-script detection.
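To make the Scripts.txt layout concrete, here is a minimal Python sketch that parses a few lines in that layout and looks up a character's script. The embedded sample is only a short excerpt with shortened comments, not the full file, and the names parse_scripts and script_of are made up for this illustration.

```python
import bisect

# A few lines in the Scripts.txt layout (trailing comments shortened).
# This is an illustrative excerpt, not the complete data file.
SAMPLE = """\
# code point or range ; Script value # comment
0030..0039    ; Common # Nd  DIGIT ZERO..DIGIT NINE
0041..005A    ; Latin # L&  LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; Latin # L&  LATIN SMALL LETTER A..LATIN SMALL LETTER Z
00AA          ; Latin # Lo  FEMININE ORDINAL INDICATOR
0300..036F    ; Inherited # Mn  COMBINING GRAVE ACCENT..COMBINING LATIN SMALL LETTER X
0400..0481    ; Cyrillic # L&  CYRILLIC CAPITAL LETTER IE WITH GRAVE..CYRILLIC SMALL LETTER KOPPA
3041..3096    ; Hiragana # Lo  HIRAGANA LETTER SMALL A..HIRAGANA LETTER SMALL KE
"""


def parse_scripts(text: str) -> list[tuple[int, int, str]]:
    """Return sorted (start, end, script) ranges from Scripts.txt-style text."""
    ranges = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue  # blank line or comment-only line
        cp_field, script = (part.strip() for part in data.split(";"))
        first, _, last = cp_field.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # single code point line
        ranges.append((start, end, script))
    ranges.sort()
    return ranges


def script_of(ch: str, ranges: list[tuple[int, int, str]]) -> str:
    """Look up the script of one character; unlisted code points are Unknown."""
    cp = ord(ch)
    # Find the last range whose start is <= cp.
    i = bisect.bisect_right(ranges, (cp, 0x110000, "")) - 1
    if i >= 0 and ranges[i][0] <= cp <= ranges[i][1]:
        return ranges[i][2]
    return "Unknown"


RANGES = parse_scripts(SAMPLE)
print(script_of("A", RANGES))       # Latin
print(script_of("7", RANGES))       # Common
print(script_of("\u0430", RANGES))  # Cyrillic
```

With the full Scripts.txt loaded in place of SAMPLE, the same lookup would cover every assigned code point. Anything outside the excerpt falls back to Unknown here.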

How Mixed-Script Detection Works

The algorithm examines all characters in a string and collects the set of scripts represented:

Characters with script Common or Inherited are ignored for mixing purposes
Characters with a specific script (Latin, Cyrillic, etc.) are added to the script set
If the resulting set contains more than one script, the string is mixed-script

For example:

paypal: all Latin. Single script, clean.
раураl: Cyrillic р, а, у, р, а plus Latin l. Mixed script, flagged.
münchen: all Latin (ü is a Latin letter; if decomposed, its combining diaeresis is Inherited). Single script, clean.
аррle: Cyrillic а, р, р plus Latin l, e. Mixed script, flagged.
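The three steps above can be sketched in Python. One loud assumption: the standard unicodedata module has no Script lookup, so this sketch approximates a letter's script from the first word of its character name, treats combining marks as Inherited, and treats every other non-letter as Common. That heuristic is good enough for the examples here but is not the real Script property; the function names are also invented for this illustration.

```python
import unicodedata


def approx_script(ch: str) -> str:
    """Heuristic stand-in for the Script property (an assumption, not UCD data)."""
    category = unicodedata.category(ch)
    if category.startswith("M"):
        return "Inherited"  # combining marks follow their base character
    if not category.startswith("L"):
        return "Common"     # digits, punctuation, symbols, spaces
    name = unicodedata.name(ch, "")
    if not name:
        return "Unknown"
    first = name.split(" ")[0]  # e.g. "LATIN", "CYRILLIC", "CJK"
    return "Han" if first == "CJK" else first.capitalize()


def script_set(text: str) -> set[str]:
    """Collect the specific scripts in text, ignoring Common and Inherited."""
    return {
        script
        for script in map(approx_script, text)
        if script not in ("Common", "Inherited")
    }


def is_mixed_script(text: str) -> bool:
    """A string is mixed-script when more than one specific script remains."""
    return len(script_set(text)) > 1


# Escapes are used so the Cyrillic letters are visible in the source.
spoofed = "\u0440\u0430\u0443\u0440\u0430l"  # Cyrillic r, a, u, r, a lookalikes + Latin l
print(is_mixed_script("paypal"))         # False
print(is_mixed_script(spoofed))          # True
print(is_mixed_script("m\u00fcnchen"))   # False
```

A production check would read scripts from Scripts.txt (or Script_Extensions) instead of character names, since a name prefix does not always match the assigned script.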

Augmented Script Sets

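Unicode TR39 defines augmented script sets so that writing systems which legitimately combine several scripts are not flagged. A character's augmented script set starts from its Script_Extensions value and adds codes for combined writing systems: Han characters also get Hanb (Han with Bopomofo), Jpan (Japanese), and Kore (Korean); Hiragana and Katakana also get Jpan; Hangul gets Kore; Bopomofo gets Hanb. Common and Inherited characters match every script. A string's resolved script set is the intersection of its characters' augmented sets, and the string counts as single-script when that intersection is non-empty.

This means Japanese text containing Hiragana, Katakana, and Han characters is not flagged as mixed-script, because the augmented script sets of all three contain Jpan and the intersection is therefore non-empty. Combinations with no shared entry, such as Latin mixed with Cyrillic, resolve to the empty set and trigger the detection.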
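The intersection idea behind augmented script sets can be sketched as follows. To stay self-contained, this simplified Python version takes one script name per character as input rather than looking up Script_Extensions, and it represents "all scripts" (the value for Common and Inherited) with None. The function names are chosen for this illustration only.

```python
ALL = None  # stands in for "the set of all scripts"

# Extra writing-system codes added on top of a character's own script.
AUGMENT = {
    "Han": {"Hanb", "Jpan", "Kore"},
    "Hiragana": {"Jpan"},
    "Katakana": {"Jpan"},
    "Hangul": {"Kore"},
    "Bopomofo": {"Hanb"},
}


def augmented(script: str):
    """Augmented script set for a single script value."""
    if script in ("Common", "Inherited"):
        return ALL  # compatible with every script
    return {script} | AUGMENT.get(script, set())


def resolved_script_set(scripts):
    """Intersect the augmented sets of all characters in a string."""
    result = ALL
    for script in scripts:
        aug = augmented(script)
        if aug is ALL:
            continue  # intersecting with ALL changes nothing
        result = set(aug) if result is ALL else result & aug
    return result


def is_single_script(scripts) -> bool:
    """Single-script means the resolved script set is non-empty."""
    resolved = resolved_script_set(scripts)
    return resolved is ALL or len(resolved) > 0


print(resolved_script_set(["Han", "Hiragana", "Katakana"]))  # {'Jpan'}
print(is_single_script(["Latin", "Cyrillic"]))               # False
```

Representing ALL as None avoids enumerating every script name; a string made only of Common characters then resolves to ALL and counts as single-script.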

Spoof Checks Defined in TR39

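Unicode TR39 defines three classes of confusable strings. Two strings are confusable when their skeletons (the result of mapping each character to a prototype from confusables.txt) are identical:

Single-script confusable: Two confusable strings that share a script (e.g., Latin "ljeto" and "ǉeto", where the second starts with the single letter ǉ)
Mixed-script confusable: Two confusable strings that are not single-script confusables (e.g., "paypal" and "pаypаl", where the second uses Cyrillic а)
Whole-script confusable: Mixed-script confusables in which each string is itself single-script (e.g., Latin "scope" and all-Cyrillic "ѕсоре")

Any-case confusable is not a separate TR39 class. Some implementations offer it as an option that applies the checks above across letter case; ICU's spoof checker, for example, has a USPOOF_ANY_CASE flag.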
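Confusable checks of this kind rest on comparing skeletons. The Python sketch below shows the idea under stated assumptions: the PROTOTYPES table is a tiny hand-picked set of Cyrillic-to-Latin mappings for illustration, whereas a real implementation loads the full confusables.txt, and the skeleton function is simplified (it omits steps such as removing default-ignorable code points).

```python
import unicodedata

# Hand-picked prototype mappings for illustration only.
# A real implementation would load the full confusables.txt table.
PROTOTYPES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
    "\u0455": "s",  # CYRILLIC SMALL LETTER DZE
}


def skeleton(text: str) -> str:
    """Simplified skeleton: NFD, map each character to its prototype, NFD again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def confusable(x: str, y: str) -> bool:
    """Two strings are confusable when their skeletons are identical."""
    return skeleton(x) == skeleton(y)


mixed = "\u0440\u0430\u0443\u0440\u0430l"        # Cyrillic lookalikes + Latin l
whole = "\u0455\u0441\u043e\u0440\u0435"         # all-Cyrillic lookalike of "scope"
print(confusable("paypal", mixed))   # True
print(confusable("scope", whole))    # True
print(confusable("paypal", "scope")) # False
```

Deciding which class a confusable pair falls into then comes down to comparing the script sets of the two strings.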

Implementation in Practice

Browser implementations use mixed-script detection to decide whether to display an internationalized domain name as Unicode or fall back to Punycode. Domain registrars apply it to block registration of mixed-script domains. Programming language toolchains use it to warn about suspicious identifiers.

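Python 3 does not apply a mixed-script check to source code identifiers; it accepts non-ASCII identifiers and normalizes them to NFKC (PEP 3131). The standard unicodedata module does not expose the Script property either, so script lookups need Scripts.txt or a third-party package. Third-party libraries like confusable-homoglyphs implement checks built on the Unicode script and confusables data. Among compilers, rustc ships a mixed_script_confusables lint for identifiers.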
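As a rough illustration of the browser-style decision, the following Python sketch shows a domain in its Punycode form when any label mixes scripts and leaves it in Unicode otherwise. It is a deliberately simple policy, not what any particular browser does. It uses Python's built-in idna codec (which implements the older IDNA 2003 rules), and it approximates a letter's script from the first word of its character name because the standard library has no Script lookup. The names label_scripts and display_form are hypothetical.

```python
import unicodedata


def label_scripts(label: str) -> set[str]:
    """Approximate the scripts of the letters in one domain label (heuristic)."""
    scripts = set()
    for ch in label:
        if not unicodedata.category(ch).startswith("L"):
            continue  # digits, hyphens, marks: treated as Common/Inherited
        scripts.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return scripts


def display_form(domain: str) -> str:
    """Return the Punycode form if any label is mixed-script, else the Unicode form."""
    labels = domain.split(".")
    if any(len(label_scripts(label)) > 1 for label in labels):
        # The built-in idna codec converts each non-ASCII label to xn-- form.
        return domain.encode("idna").decode("ascii")
    return domain


print(display_form("\u0430pple.com"))     # xn--pple-43d.com (Cyrillic a + Latin)
print(display_form("m\u00fcnchen.de"))    # unchanged: single-script Latin label
print(display_form("example.com"))        # example.com
```

A mixed-script check alone does not catch a label written entirely in one lookalike script, which is why implementations typically combine it with confusable checks.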

Quick Facts

Property	Value
Governing standard	Unicode TR39 — Unicode Security Mechanisms
Script data file	`Scripts.txt` in Unicode Character Database
Special script values	Common, Inherited (excluded from mixing checks)
Japanese exception	Han + Hiragana + Katakana allowed via augmented sets
Primary defense against	IDN homograph attacks, username spoofing
Browser application	Determines Unicode vs. Punycode URL rendering
Related concept	Whole-script confusable, confusables dataset

More Security Terms

Bidi Text Attack

Exploiting Unicode bidirectional control characters to disguise malicious code or filenames. The …

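Bidi Override Attack

An attack that uses Unicode bidirectional formatting characters (U+202A–U+202E and U+2066–U+2069) to disguise malicious filenames or code. 'readme' + U+202E + 'fdp.exe' displays as 'readmeexe.pdf'.

IDN Homograph Attack

An attack that impersonates a legitimate site by using visually similar Unicode characters in a domain name. аpple.com (with Cyrillic а) looks like apple.com. Browsers defend against it with rules for when to display Punycode.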

Normalization Attack

Exploiting Unicode normalization to bypass security filters. Input validated before normalization may …

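Unicode Spoofing

Using Unicode features to deceive users: homoglyphs for fake domains, Bidi overrides for fake file extensions, and invisible characters for hidden text.

Zero Width Joiner (ZWJ)

U+200D. Requests that adjacent characters be joined. Essential to emoji sequences (👩+ZWJ+💻=👩‍💻). In Indic scripts it requests conjunct formation. It can also be used to hide text boundaries.

Zero Width Non-Joiner (ZWNJ)

U+200C. Prevents adjacent characters from joining. Required in Persian and Arabic for correct letter forms, and also used in Devanagari to prevent conjuncts.

Homoglyph

Characters from different scripts that look identical or very similar. Example: Latin 'a' and Cyrillic 'а'. Used in phishing, spoofing, and social engineering attacks.

Confusable Characters

Unicode's official term for visually confusable character pairs, defined in confusables.txt (part of the TR39 security data files). A broader concept than homoglyphs: it also includes characters that are merely similar.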
