Script Extensions
Unicode property listing all scripts that use a character, broader than the single-valued Script property. A shared character can have Script=Common or Script=Inherited while its Script_Extensions value names the specific scripts that use it.
What is the Script_Extensions Property?
In Unicode, the Script_Extensions (scx) property is a refinement of the basic Script (sc) property that more accurately describes which scripts a character belongs to when that character is legitimately used in multiple scripts. While Script assigns each character to exactly one script, Script_Extensions assigns a set of scripts — reflecting the reality that many Unicode characters are shared across multiple writing systems.
Script vs. Script_Extensions
The basic Script property assigns every character to exactly one script value (or to Common or Inherited for characters not specific to any single script). Common includes punctuation, digits, and mathematical symbols. Inherited includes combining marks that take the script of their base character.
This single-assignment model created a problem: many characters are genuinely used in more than one specific script, but Script can record only one value, so such characters end up either in one arbitrarily chosen script or in Common or Inherited. For example, the Katakana-Hiragana prolonged sound mark (U+30FC, ー) is used in both Hiragana and Katakana text. With Script alone it is simply Common, which says nothing about where it actually belongs and makes it indistinguishable from characters that are truly script-neutral.
Script_Extensions was introduced to solve this. A character with scx={Hiragana, Katakana} belongs legitimately to both scripts. Implementations that care about script fidelity should check whether a character's Script_Extensions set intersects the expected script, not just whether the single Script value matches.
```python
# Requires the third-party 'regex' package, which supports Script_Extensions
# (the standard-library 're' module has no \p{...} support)
import regex

# Match any character whose Script_Extensions set includes Arabic
arabic_pattern = regex.compile(r'\p{Script_Extensions=Arabic}')

# This also matches characters shared with Arabic, such as the Arabic comma
# (U+060C), which is used in Arabic, Thaana, Hanifi Rohingya, and other scripts
print(bool(arabic_pattern.match('\u060C')))
```
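The difference between a Script check and a Script_Extensions check can also be sketched with the standard library alone. The table below is a tiny hand-copied excerpt rather than the real database, and the names `SAMPLE_DATA`, `matches_by_script`, and `matches_by_scx` are hypothetical; real code would load the values from the Unicode Character Database.

```python
# Illustrative excerpt only: (Script, Script_Extensions) for a few code points.
# Script codes are the four-letter short names (Zyyy = Common).
SAMPLE_DATA = {
    0x30FC: ("Zyyy", frozenset({"Hira", "Kana"})),  # prolonged sound mark
    0x3042: ("Hira", frozenset({"Hira"})),          # HIRAGANA LETTER A
    0x30A2: ("Kana", frozenset({"Kana"})),          # KATAKANA LETTER A
}


def matches_by_script(ch, expected):
    """Naive check: compare the single Script value."""
    sc, _ = SAMPLE_DATA[ord(ch)]
    return sc == expected


def matches_by_scx(ch, expected):
    """Preferred check: is the expected script in the Script_Extensions set?"""
    _, scx = SAMPLE_DATA[ord(ch)]
    return expected in scx


mark = "\u30fc"
print(matches_by_script(mark, "Kana"))  # False
print(matches_by_scx(mark, "Kana"))     # True
```

The naive check rejects the prolonged sound mark in Katakana text because its Script value is Common, while the set-membership check accepts it for both Hiragana and Katakana.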
How Digits Are Shared
Decimal digits illustrate the shared-character model well. The ASCII digits 0–9 (U+0030–U+0039) have Script=Common, meaning they belong to no specific script, and they are used with a great many scripts; because no narrower set applies, their Script_Extensions value is also just {Common}. Various scripts also have their own digit forms (Arabic-Indic digits ٠١٢, Devanagari digits ०१२, etc.), and some of these are shared across related scripts.
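For instance, the Arabic-Indic digits (U+0660–U+0669) are used in Arabic-script text but also in Thaana text. Their Script_Extensions set therefore includes both Arabic and Thaana (the exact membership can vary by Unicode version), reflecting authentic multi-script use. Persian and Urdu, which are also written in the Arabic script, normally use the separate Extended Arabic-Indic digits (U+06F0–U+06F9).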
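Sets like these are published in ScriptExtensions.txt, which uses the usual Unicode Character Database layout: a code point or range, a semicolon, a space-separated list of script codes, and a trailing `#` comment. A minimal parser sketch follows; the function name `parse_script_extensions` is hypothetical, and `SAMPLE_LINES` only imitates the file's format, so its script lists may be shorter than those in a current release.

```python
def parse_script_extensions(lines):
    """Map each code point to a frozenset of short script codes."""
    table = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        cp_field, scripts_field = (part.strip() for part in line.split(";", 1))
        scripts = frozenset(scripts_field.split())
        if ".." in cp_field:
            # A range such as 0660..0669 covers both endpoints.
            first, last = (int(x, 16) for x in cp_field.split(".."))
        else:
            first = last = int(cp_field, 16)
        for cp in range(first, last + 1):
            table[cp] = scripts
    return table


# Sample lines in the data file's format (illustrative, not a full copy).
SAMPLE_LINES = [
    "# Script_Extensions sample",
    "0660..0669    ; Arab Thaa # Nd  [10] ARABIC-INDIC DIGIT ZERO..ARABIC-INDIC DIGIT NINE",
    "30FC          ; Hira Kana # Lm       KATAKANA-HIRAGANA PROLONGED SOUND MARK",
]

table = parse_script_extensions(SAMPLE_LINES)
print(sorted(table[0x0665]))  # ['Arab', 'Thaa']
print(len(table))             # 11
```

Code points that do not appear in the file take their Script value as a one-element Script_Extensions set, so a complete lookup would also need Scripts.txt as a fallback.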
Security Implications for Mixed-Script Detection
The Script_Extensions property is essential for correct mixed-script detection as specified in Unicode Security Mechanisms (UTS #39), which builds on the identifier rules of UAX #31. When checking whether a string is "mixed-script" (potentially suspicious), naive use of Script flags legitimate combinations as suspicious. A string containing a character with Script=Common next to Arabic characters might appear mixed-script, but if the Common character genuinely belongs to Arabic typography, the combination is legitimate.
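The restriction levels defined in UTS #39, including "Highly Restrictive" and "Moderately Restrictive", build on a single-script test that uses Script_Extensions: a string is single-script if there exists some script S such that every character in the string either has S in its Script_Extensions set or has a Script_Extensions set of just {Common} or {Inherited}. A Common or Inherited character that has an explicit Script_Extensions set is limited to the scripts in that set.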
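The single-script test can be sketched as a running intersection of Script_Extensions sets. This is a simplified illustration, not the full UTS #39 algorithm: it omits the extra combined-script handling that UTS #39 applies to Han, Hiragana, Katakana, and related scripts. The lookup table `SAMPLE_SCX` is a hypothetical hand-made excerpt, and the function names are assumptions.

```python
ALL = None  # stands for "every script"; used for plain Common/Inherited

# Hypothetical excerpt of Script_Extensions data, using short script codes.
SAMPLE_SCX = {
    "a": {"Latn"},
    "b": {"Latn"},
    "1": {"Zyyy"},               # ASCII digit: Common, no explicit extensions
    "\u0430": {"Cyrl"},          # CYRILLIC SMALL LETTER A
    "\u30fc": {"Hira", "Kana"},  # prolonged sound mark
    "\u3042": {"Hira"},          # HIRAGANA LETTER A
    "\u30a2": {"Kana"},          # KATAKANA LETTER A
}


def augmented(ch):
    """Return the character's script set, or ALL for Common/Inherited."""
    scx = SAMPLE_SCX[ch]
    if scx & {"Zyyy", "Zinh"}:
        return ALL
    return set(scx)


def resolved_script_set(text):
    """Intersect the script sets of all characters in the string."""
    result = ALL
    for ch in text:
        scripts = augmented(ch)
        if scripts is ALL:
            continue  # compatible with every script
        result = scripts if result is ALL else result & scripts
    return result


def is_single_script(text):
    resolved = resolved_script_set(text)
    return resolved is ALL or len(resolved) > 0


print(is_single_script("ab1"))           # True
print(is_single_script("a\u0430"))       # False
print(is_single_script("\u30a2\u30fc"))  # True
```

Latin "a" followed by Cyrillic "а" resolves to an empty set and is reported as mixed-script, while the prolonged sound mark stays compatible with Katakana because Katakana is in its Script_Extensions set.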
Quick Facts
| Property | Value |
|---|---|
| Property name | Script_Extensions (scx) |
| Data type | Set of script values (vs. single value for Script) |
| Special Script values | Common (sc=Zyyy), Inherited (sc=Zinh) |
| Data file | ScriptExtensions.txt in Unicode Character Database |
| Security use | UAX #31 and UTS #39 mixed-script detection |
| Python support | regex package (\p{Script_Extensions=Arabic}) |
| Introduced | Unicode 6.0 (2010) |