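What is Mixed-Script Detection?

Identifies text that mixes characters from different scripts (for example, Latin and Cyrillic). It is a key defense against homoglyph attacks, and browsers use it to decide when to display a domain name in Punycode.

Properties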
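As a small illustration of why this matters, the sketch below compares a Latin domain name with a look-alike whose first letter is Cyrillic. The domain names are made up for the example, and the standard-library `idna` codec (which implements the older IDNA 2003 rules) stands in for whatever a browser actually does.

```python
# Two strings that can render almost identically but differ in one code point.
latin = "apple.com"
spoofed = "\u0430pple.com"  # first letter is CYRILLIC SMALL LETTER A (U+0430)

print(latin == spoofed)  # the strings are different despite looking alike

# The ASCII-compatible (Punycode) form makes the difference visible.
latin_ascii = latin.encode("idna")
spoofed_ascii = spoofed.encode("idna")
print(latin_ascii)
print(spoofed_ascii)
```

Showing the `xn--` form instead of the Unicode form is how a mixed-script label is made obvious to the user.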

Script Extensions

Unicode property listing all scripts that use a character, broader than the single-valued Script property. Shared characters such as the Devanagari danda (U+0964) have Script=Common but a Script_Extensions set naming the many Indic scripts that use them.

What is the Script_Extensions Property?

In Unicode, the Script_Extensions (scx) property is a refinement of the basic Script (sc) property that more accurately describes which scripts a character belongs to when that character is legitimately used in multiple scripts. While Script assigns each character to exactly one script, Script_Extensions assigns a set of scripts — reflecting the reality that many Unicode characters are shared across multiple writing systems.

Script vs. Script_Extensions

The basic Script property assigns every character to exactly one script value (or to Common or Inherited for characters not specific to any single script). Common includes punctuation, digits, and mathematical symbols. Inherited includes combining marks that take the script of their base character.

This single-assignment model created a problem: some characters are unambiguously used in a small, specific set of scripts, but Script can record only one value. For example, the katakana-hiragana prolonged sound mark ー (U+30FC) is used in both Hiragana and Katakana text. With Script alone it must either be assigned to one of the two, making it look "foreign" in the other context, or be classed as Common, which loses the fact that it belongs to exactly these two scripts.

Script_Extensions was introduced to solve this. A character with scx={Hiragana, Katakana} belongs legitimately to both scripts. Implementations that care about script fidelity should check whether a character's Script_Extensions set intersects the expected script, not just whether the single Script value matches.

# Using the 'regex' package (supports Script_Extensions)
import regex

# Match any character whose Script_Extensions includes Arabic
arabic_pattern = regex.compile(r'\p{Script_Extensions=Arabic}')
# Also matches characters shared with Arabic like Arabic comma (U+060C)
# which is used in Arabic, Thaana, Hanifi Rohingya, and others
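
The intersection check described above can also be sketched without any third-party package. The two dictionaries below are a tiny hand-written excerpt (an assumption for illustration, not the real Unicode Character Database), and `script_extensions` and `fits_script` are hypothetical helper names.

```python
# Hand-written excerpt for illustration only; real data lives in
# Scripts.txt and ScriptExtensions.txt of the Unicode Character Database.
SCRIPT = {
    "\u3042": "Hiragana",  # HIRAGANA LETTER A
    "\u30A2": "Katakana",  # KATAKANA LETTER A
    "\u30FC": "Common",    # KATAKANA-HIRAGANA PROLONGED SOUND MARK
}
SCRIPT_EXTENSIONS = {
    "\u30FC": {"Hiragana", "Katakana"},
}


def script_extensions(ch):
    """Return the scx set, falling back to the single Script value."""
    return SCRIPT_EXTENSIONS.get(ch, {SCRIPT[ch]})


def fits_script(ch, expected):
    """True if the expected script is in the character's scx set."""
    return expected in script_extensions(ch)


mark = "\u30FC"
print(SCRIPT[mark] == "Hiragana")     # comparing the Script value alone
print(fits_script(mark, "Hiragana"))  # checking Script_Extensions
print(fits_script(mark, "Katakana"))
```

Comparing the Script value rejects the mark in both contexts, while the set-membership test accepts it in both.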

How Digits Are Shared

Decimal digits illustrate the difference between Common and shared characters. The ASCII digits 0–9 (U+0030–U+0039) have Script=Common and no narrower Script_Extensions value, because they are used with a great many scripts rather than a short, specific list. Various scripts also have their own digit forms (Arabic-Indic digits ٠١٢, Devanagari digits ०१२, etc.), and some of these are shared across related scripts.

For instance, the Arabic-Indic digits (U+0660–U+0669) have Script=Arabic but are also used in text written in other scripts. Their Script_Extensions value lists {Arabic, Thaana} or similar, depending on the Unicode version, reflecting authentic multi-script use. (Persian and Urdu, which are also written in the Arabic script, typically use the Extended Arabic-Indic digits, U+06F0–U+06F9.)
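
Python's standard `unicodedata` module does not expose Script or Script_Extensions, but it can show that these digit forms are distinct code points carrying the same decimal values. The sample strings below are chosen for illustration.

```python
import unicodedata

# Three digit forms for the sequence 0, 1, 2.
samples = {
    "ASCII": "012",
    "Arabic-Indic": "\u0660\u0661\u0662",
    "Devanagari": "\u0966\u0967\u0968",
}

for label, digits in samples.items():
    values = [unicodedata.decimal(d) for d in digits]  # numeric value of each digit
    first_name = unicodedata.name(digits[0])           # official character name
    # int() accepts any Unicode decimal digits, not only ASCII ones
    print(label, values, int(digits), first_name)
```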

Security Implications for Mixed-Script Detection

The Script_Extensions property is essential for correct mixed-script detection as specified in Unicode Security Mechanisms (UTS #39), which builds on the identifier rules of UAX #31. When checking whether a string is "mixed-script" (potentially suspicious), relying on Script alone flags legitimate combinations as suspicious. For example, Arabic-Indic digits (Script=Arabic) inside Thaana text look mixed-script under Script, but their Script_Extensions set includes Thaana, so the combination is legitimate.

The restriction levels defined in UTS #39 (including "Highly Restrictive" and "Moderately Restrictive") build on a single-script test that uses Script_Extensions. In simplified form: a string is single-script if there exists some script S such that every character in the string either has S in its Script_Extensions set or has a Script_Extensions value of Common or Inherited.
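
The single-script test can be sketched as a set intersection. This is a simplified sketch, not the full UTS #39 algorithm (it omits, for example, the special handling of Han with Japanese and Korean), and the `SCX` table is a small hand-written assumption rather than real Unicode data; `resolved_script_set` and `is_single_script` are hypothetical names.

```python
# Hand-written excerpt of Script_Extensions values, for illustration only.
SCX = {
    "p": {"Latin"},
    "a": {"Latin"},
    "\u0430": {"Cyrillic"},          # CYRILLIC SMALL LETTER A
    "\u0631": {"Arabic"},            # ARABIC LETTER REH
    "\u060C": {"Arabic", "Thaana"},  # ARABIC COMMA (the real set is longer)
    "1": {"Common"},
}

WILDCARDS = {"Common", "Inherited"}


def resolved_script_set(text):
    """Intersect the scx sets of all characters.

    Characters whose scx is Common or Inherited match every script and are
    skipped. Returns None if the text contains only such characters.
    """
    resolved = None
    for ch in text:
        scx = SCX[ch]
        if scx & WILDCARDS:
            continue
        resolved = set(scx) if resolved is None else resolved & scx
    return resolved


def is_single_script(text):
    """True if at least one script S covers every character."""
    resolved = resolved_script_set(text)
    return resolved is None or len(resolved) > 0


print(is_single_script("pa1"))            # Latin letters plus a Common digit
print(is_single_script("p\u0430"))        # Latin p followed by Cyrillic а
print(is_single_script("\u0631\u060C1"))  # Arabic letter, Arabic comma, digit
```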

Quick Facts

Property	Value
Property name	Script_Extensions (scx)
Data type	Set of script values (vs. single value for Script)
Special Script values	Common (sc=Zyyy), Inherited (sc=Zinh)
Data file	`ScriptExtensions.txt` in Unicode Character Database
Security use	UTS #39 mixed-script detection (building on UAX #31 identifiers)
Python support	`regex` package (`\p{Script_Extensions=Arabic}`)
Introduced	Unicode 6.0 (2010)
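
The data file uses the usual Unicode Character Database layout: a code point or range, a semicolon, a space-separated list of four-letter script codes, and a `#` comment. A minimal parsing sketch follows; the `SAMPLE` lines imitate that layout and are not a verbatim copy of any Unicode version, and `parse_script_extensions` is a hypothetical helper.

```python
# Illustrative lines in the ScriptExtensions.txt layout (not a verbatim copy).
SAMPLE = """\
# code point(s) ; script codes # comment
0660..0669    ; Arab Thaa # Nd  [10] ARABIC-INDIC DIGIT ZERO..ARABIC-INDIC DIGIT NINE
30FC          ; Hira Kana # Lm       KATAKANA-HIRAGANA PROLONGED SOUND MARK
"""


def parse_script_extensions(text):
    """Map each listed code point to a frozenset of script codes."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        cp_field, scripts_field = (part.strip() for part in line.split(";", 1))
        first, _, last = cp_field.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        scripts = frozenset(scripts_field.split())
        for cp in range(start, end + 1):
            table[cp] = scripts
    return table


table = parse_script_extensions(SAMPLE)
print(sorted(table[0x0665]))
print(sorted(table[0x30FC]))
```

Code points that do not appear in the file take their Script value as their only Script_Extensions entry, so a lookup should fall back to Script when a key is missing.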

More terms in Properties

Age Property

The Unicode version in which a character was first assigned. Useful for judging character support across systems and software versions.

East Asian Width

Unicode property (UAX#11) classifying characters as Narrow, Wide, Fullwidth, Halfwidth, Ambiguous, or …

Joining Type

Unicode property controlling how Arabic and Syriac characters connect to adjacent characters. …

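Case Mapping

Rules for converting characters to uppercase, lowercase, and titlecase. Mappings can be locale-dependent (the Turkish I problem) and can be one-to-many (ß → SS).

Script

The writing system a character belongs to (for example, Latin, Cyrillic, Han). Unicode 16.0 defines 168 scripts, and the Script property is important for security and mixed-script detection.

Default Ignorable

Characters that a process which does not support them can ignore with no visible effect, including variation selectors, zero-width characters, and language tags.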

Block

A named, contiguous range of code points (for example, Basic Latin = U+0000–U+007F). Unicode 16.0 defines 338 blocks. Blocks never overlap, so every code point belongs to at most one block.

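Mirrored Property

Characters whose glyphs should be flipped horizontally in RTL contexts. Examples: ( → ), [ → ], { → }, « → ».

General Category

A scheme that classifies every code point into one of 30 categories (Lu, Ll, Nd, So, etc.), grouped into 7 major classes (Letter, Mark, Number, Punctuation, Symbol, Separator, Other).

Compatibility Equivalence

Two character sequences that represent the same abstract content but may differ in appearance. A broader notion than canonical equivalence. Examples: ﬁ ≈ fi, ² ≈ 2.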
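
Several of the properties listed above can be inspected with Python's standard `unicodedata` module and built-in string methods. The sample characters below are chosen for illustration, and the results reflect the Unicode version bundled with the interpreter.

```python
import unicodedata

# General Category: two-letter codes such as Lu, Ll, Nd, So
categories = {ch: unicodedata.category(ch) for ch in "Aa5\u00A9"}

# East Asian Width: ASCII A, FULLWIDTH LATIN CAPITAL LETTER A, HIRAGANA LETTER A
widths = {ch: unicodedata.east_asian_width(ch) for ch in ("A", "\uFF21", "\u3042")}

# Mirrored property: True for characters mirrored in RTL text
mirrored = {ch: bool(unicodedata.mirrored(ch)) for ch in "(a"}

# Case mapping: a one-to-many uppercase mapping
upper_sharp_s = "\u00DF".upper()

# Compatibility equivalence: NFKC folds the fi ligature and superscript two,
# while NFC (canonical equivalence only) leaves them unchanged
nfkc = unicodedata.normalize("NFKC", "\uFB01\u00B2")
nfc = unicodedata.normalize("NFC", "\uFB01\u00B2")

print(categories, widths, mirrored, upper_sharp_s, nfkc, nfc)
```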
