Script Extensions
Unicode property listing all scripts that use a character, broader than the single-valued Script property. Shared characters such as the Arabic comma (U+060C) have Script=Common but a Script_Extensions set that lists several scripts.
What is the Script_Extensions Property?
In Unicode, the Script_Extensions (scx) property is a refinement of the basic Script (sc) property that more accurately describes which scripts a character belongs to when that character is legitimately used in multiple scripts. While Script assigns each character to exactly one script, Script_Extensions assigns a set of scripts — reflecting the reality that many Unicode characters are shared across multiple writing systems.
Script vs. Script_Extensions
The basic Script property assigns every character to exactly one script value (or to Common or Inherited for characters not specific to any single script). Common includes punctuation, digits, and mathematical symbols. Inherited includes combining marks that take the script of their base character.
This single-assignment model created a problem: many characters are unambiguously part of several specific scripts, but Script can record only one value. For example, the Katakana-Hiragana prolonged sound mark ー (U+30FC) is used in both Hiragana and Katakana text. With Script alone it has to be either assigned to one of the two, making it look "foreign" in the other, or assigned to Common (its actual Script value), which says nothing about where it belongs.
Script_Extensions was introduced to solve this. A character with scx={Hiragana, Katakana} belongs legitimately to both scripts. Implementations that care about script fidelity should check whether a character's Script_Extensions set intersects the expected script, not just whether the single Script value matches.
```python
# Using the third-party 'regex' package (supports Script_Extensions)
import regex

# Match any character whose Script_Extensions includes Arabic
arabic_pattern = regex.compile(r'\p{Script_Extensions=Arabic}')

# Also matches characters shared with Arabic, such as the Arabic comma (U+060C),
# which is used in Arabic, Thaana, Hanifi Rohingya, and other scripts
```
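The intersection check described above can also be sketched with the standard library alone. The two small tables below are hand-copied values for three code points, an assumption made for this example rather than a complete data source; a real implementation would load them from the Unicode Character Database. The function names `fits_script_naive` and `fits_script` are hypothetical.

```python
# Illustrative subset only: Script and Script_Extensions values for three
# code points. Real code would load these from the Unicode Character Database.
SCRIPT = {
    "\u30FC": "Common",    # KATAKANA-HIRAGANA PROLONGED SOUND MARK
    "\u3042": "Hiragana",  # HIRAGANA LETTER A
    "\u30A2": "Katakana",  # KATAKANA LETTER A
}
SCRIPT_EXTENSIONS = {
    "\u30FC": {"Hiragana", "Katakana"},
    "\u3042": {"Hiragana"},
    "\u30A2": {"Katakana"},
}


def fits_script_naive(ch: str, expected: str) -> bool:
    """Single-valued check: only the Script property is consulted."""
    return SCRIPT[ch] == expected


def fits_script(ch: str, expected: str) -> bool:
    """Set-valued check: is the expected script in Script_Extensions?"""
    return expected in SCRIPT_EXTENSIONS[ch]


# Compare both checks for the prolonged sound mark in each context.
for script in ("Hiragana", "Katakana"):
    print(script, fits_script_naive("\u30FC", script), fits_script("\u30FC", script))
```

With these sample values the naive check rejects the prolonged sound mark in both contexts, while the set-based check accepts it in both.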
How Digits Are Shared
Decimal digits illustrate the shared-character model well. The ASCII digits 0–9 (U+0030–U+0039) have Script=Common, and their Script_Extensions value is likewise just {Common}: they are used with so many scripts that they are not tied to any specific one. Various scripts also have their own digit forms (Arabic-Indic digits ٠١٢, Devanagari digits ०१२, etc.), and some of these are shared across related scripts.
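For instance, the Arabic-Indic digits (U+0660–U+0669) have Script=Arabic, but they are also used with other scripts such as Thaana. Their Script_Extensions value lists {Arabic, Thaana} and, in recent Unicode versions, Yezidi as well, reflecting authentic multi-script use. (Persian and Urdu are languages written in the Arabic script rather than separate scripts, and they generally use the Extended Arabic-Indic digits U+06F0–U+06F9.)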
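Because the exact sets change between Unicode versions, it can help to read them from ScriptExtensions.txt directly. The following is a minimal parsing sketch. The embedded sample lines follow the file's layout (code point or range, a semicolon, space-separated four-letter script codes, then a `#` comment), but their values are an illustrative excerpt and should be checked against the file for the version in use. The function name `parse_script_extensions` is hypothetical.

```python
# A few lines in the layout used by ScriptExtensions.txt. The values are an
# illustrative excerpt; the exact sets depend on the Unicode version.
SAMPLE = """\
# comment lines and blank lines are skipped

0660..0669    ; Arab Thaa Yezi # Nd  [10] ARABIC-INDIC DIGIT ZERO..ARABIC-INDIC DIGIT NINE
30FC          ; Hira Kana # Lm       KATAKANA-HIRAGANA PROLONGED SOUND MARK
"""


def parse_script_extensions(text: str) -> dict[int, frozenset[str]]:
    """Map each listed code point to its set of script codes."""
    table: dict[int, frozenset[str]] = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop trailing comment
        if not data:
            continue
        cp_field, scripts_field = (part.strip() for part in data.split(";", 1))
        first, _, last = cp_field.partition("..")  # single value or range
        start = int(first, 16)
        end = int(last, 16) if last else start
        scripts = frozenset(scripts_field.split())
        for cp in range(start, end + 1):
            table[cp] = scripts
    return table


table = parse_script_extensions(SAMPLE)
print(sorted(table[0x0663]))  # script codes listed for ARABIC-INDIC DIGIT THREE
```

Code points that do not appear in the file are not in the resulting table; for those, Script_Extensions defaults to the character's single Script value.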
Security Implications for Mixed-Script Detection
The Script_Extensions property is essential for correct implementation of Unicode identifier security (UAX #31 for identifier syntax, UTS #39 for security mechanisms). When checking whether a string is "mixed-script" (potentially suspicious), naive use of Script flags legitimate combinations as suspicious. For example, the Arabic-Indic digits have Script=Arabic, so a Thaana string containing them looks mixed-script under Script alone, even though their Script_Extensions set includes Thaana and the combination is legitimate.
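UTS #39 defines its restriction levels, including "Highly Restrictive" and "Moderately Restrictive", in terms of Script_Extensions. The key check is: a string is single-script if there exists some script S such that every character in the string either has S in its Script_Extensions set or has a Script_Extensions set of just {Common} or {Inherited}. A Common character whose Script_Extensions names specific scripts is therefore constrained to those scripts.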
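The single-script test can be sketched as a running set intersection. This is a simplified illustration, not the full UTS #39 procedure: it omits the extra handling that specification applies to Han-based writing systems, and the `SCX` table is a hypothetical, hand-written subset whose values may differ between Unicode versions. The names `resolved_scripts` and `is_single_script` are also hypothetical.

```python
# Hypothetical, hand-written Script_Extensions data for a few code points
# (four-letter script codes). Real code would load the Unicode data files.
SCX = {
    "a": {"Latn"},
    "1": {"Zyyy"},                       # ASCII digit: Common
    "\u200D": {"Zinh"},                  # ZERO WIDTH JOINER: Inherited
    "\u0628": {"Arab"},                  # ARABIC LETTER BEH
    "\u0660": {"Arab", "Thaa", "Yezi"},  # ARABIC-INDIC DIGIT ZERO
    "\u0780": {"Thaa"},                  # THAANA LETTER HAA
}


def resolved_scripts(text):
    """Return the scripts shared by every character in text.

    None means "compatible with any script": every character seen so far
    was purely Common or Inherited.
    """
    resolved = None
    for ch in text:
        scx = SCX[ch]
        if scx <= {"Zyyy", "Zinh"}:
            continue  # Common/Inherited places no constraint
        resolved = set(scx) if resolved is None else resolved & scx
    return resolved


def is_single_script(text):
    """True if some script S is acceptable for every character."""
    resolved = resolved_scripts(text)
    return resolved is None or len(resolved) > 0


print(is_single_script("\u0780\u0660"))  # Thaana letter + Arabic-Indic digit
print(is_single_script("a\u0628"))       # Latin letter + Arabic letter
```

With this sample data the first string resolves to {Thaa} and passes, while the second has an empty intersection and is reported as mixed-script.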
Quick Facts
| Property | Value |
|---|---|
| Property name | Script_Extensions (scx) |
| Data type | Set of script values (vs. single value for Script) |
| Special Script values | Common (sc=Zyyy), Inherited (sc=Zinh) |
| Data file | ScriptExtensions.txt in Unicode Character Database |
| Security use | UTS #39 mixed-script detection; UAX #31 identifier profiles |
| Python support | regex package (\p{Script_Extensions=Arabic}) |
| Introduced | Unicode 6.0 (2010) |