What is 文字系统?

字符所属的书写系统（如拉丁文、西里尔文、汉字）。Unicode 16.0定义了168种文字系统，Script属性是安全检测和混合文字系统检测的关键。

What is 混合文字系统检测?

识别混合不同文字系统字符的文本（如拉丁文+西里尔文），是防御同形字攻击的主要手段，浏览器据此触发Punycode显示。

字符属性

Script Extensions

Unicode property listing all scripts that use a character, broader than the single-valued Script property. Common characters like digits have Script=Common but Script_Extensions={many scripts}.

What is the Script_Extensions Property?

In Unicode, the Script_Extensions (scx) property is a refinement of the basic Script (sc) property that more accurately describes which scripts a character belongs to when that character is legitimately used in multiple scripts. While Script assigns each character to exactly one script, Script_Extensions assigns a set of scripts — reflecting the reality that many Unicode characters are shared across multiple writing systems.

Script vs. Script_Extensions

The basic Script property assigns every character to exactly one script value (or to Common or Inherited for characters not specific to any single script). Common includes punctuation, digits, and mathematical symbols. Inherited includes combining marks that take the script of their base character.

This single-assignment model created a problem: many characters are unambiguously part of multiple specific scripts, but Script forced an arbitrary choice. For example, the Hiragana iteration mark (U+3005) is used in both Japanese Hiragana and Katakana contexts. With Script alone, it must be assigned to one, making it look "foreign" in the other context.

Script_Extensions was introduced to solve this. A character with scx={Hiragana, Katakana} belongs legitimately to both scripts. Implementations that care about script fidelity should check whether a character's Script_Extensions set intersects the expected script, not just whether the single Script value matches.

# Using the 'regex' package (supports Script_Extensions)
import regex

# Match any character whose Script_Extensions includes Arabic
arabic_pattern = regex.compile(r'\p{Script_Extensions=Arabic}')
# Also matches characters shared with Arabic like Arabic comma (U+060C)
# which is used in Arabic, Thaana, Hanifi Rohingya, and others

How Digits Are Shared

Decimal digits illustrate the shared-character model perfectly. The ASCII digits 0–9 (U+0030–U+0039) have Script=Common, meaning they belong to no specific script. But they are used in hundreds of scripts. Various scripts also have their own digit forms (Arabic-Indic digits ٠١٢, Devanagari digits ०१२, etc.), and some of these are shared across related scripts.

For instance, the Arabic-Indic digits (U+0660–U+0669) are used in Arabic script texts but also appear in Persian and Urdu texts that use Arabic-script variants. Their Script_Extensions lists {Arabic, Thaana} or similar, reflecting authentic multi-script use.

Security Implications for Mixed-Script Detection

The Script_Extensions property is essential for correct implementation of Unicode Identifier Security (UAX #31, UTR #39). When checking whether a string is "mixed-script" (potentially suspicious), naive use of Script over-counts legitimate combinations as suspicious. A string containing a character with Script=Common next to Arabic characters might appear mixed-script, but if the Common character genuinely belongs to Arabic typography, the combination is legitimate.

The UAX #31 "Moderately Restrictive" and "Highly Restrictive" identifier profiles use Script_Extensions for the key check: a string is single-script if there exists some script S such that every character in the string either has S in its Script_Extensions set, has Script=Common, or has Script=Inherited.

Quick Facts

Property	Value
Property name	Script_Extensions (scx)
Data type	Set of script values (vs. single value for Script)
Special Script values	Common (sc=Zyyy), Inherited (sc=Zinh)
Data file	`ScriptExtensions.txt` in Unicode Character Database
Security use	UAX #31 and UTR #39 mixed-script detection
Python support	`regex` package (`\p{Script_Extensions=Arabic}`)
Introduced	Unicode 6.0 (2010)