What is Script?

The writing system a character belongs to (for example, Latin, Cyrillic, Han). Unicode 16.0 defines 168 scripts; the Script property matters for security and for mixed-script detection.

What is General Category?

The classification of every code point into one of 30 categories (Lu, Ll, Nd, So, and so on), grouped into 7 major classes: Letter, Mark, Number, Punctuation, Symbol, Separator, Other.

What is Mixed-Script Detection?

Identifying text that mixes characters from different scripts (for example, Latin + Cyrillic). It is the main defense against homoglyph attacks; browsers use it to decide when to display Punycode.

Properties

Script Extensions

Unicode property listing all scripts that use a character, broader than the single-valued Script property. Shared characters such as the Arabic comma (U+060C) have Script=Common but a Script_Extensions set that names several specific scripts.

What is the Script_Extensions Property?

In Unicode, the Script_Extensions (scx) property is a refinement of the basic Script (sc) property that more accurately describes which scripts a character belongs to when that character is legitimately used in multiple scripts. While Script assigns each character to exactly one script, Script_Extensions assigns a set of scripts — reflecting the reality that many Unicode characters are shared across multiple writing systems.

Script vs. Script_Extensions

The basic Script property assigns every character to exactly one script value (or to Common or Inherited for characters not specific to any single script). Common includes punctuation, digits, and mathematical symbols. Inherited includes combining marks that take the script of their base character.

This single-assignment model created a problem: many characters are unambiguously part of several specific scripts, but Script allows only one value. For example, the Katakana-Hiragana prolonged sound mark (U+30FC) is used in both Hiragana and Katakana text. With Script alone, it has to be assigned either to one of the two scripts, making it look "foreign" in the other, or to Common, which says nothing about where it is actually used.

Script_Extensions was introduced to solve this. A character with scx={Hiragana, Katakana} belongs legitimately to both scripts. Implementations that care about script fidelity should check whether a character's Script_Extensions set intersects the expected script, not just whether the single Script value matches.

# Using the third-party 'regex' package (pip install regex),
# which supports Script_Extensions
import regex

# Match any character whose Script_Extensions set includes Arabic
arabic_pattern = regex.compile(r'\p{Script_Extensions=Arabic}')

# This also matches characters shared with Arabic, such as the Arabic comma
# (U+060C), which is used in Arabic, Thaana, Hanifi Rohingya, and others
print(bool(arabic_pattern.fullmatch('\u060C')))
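
The set-membership check itself does not depend on any particular library. The sketch below uses a tiny hand-written table for three characters, including the Katakana-Hiragana prolonged sound mark (U+30FC). The table `SCX_SAMPLE` and the two helper functions are hypothetical names chosen for illustration; a real implementation would load the full data from `ScriptExtensions.txt`.

```python
# Illustrative sketch: a tiny hand-copied sample, not the full Unicode data.
# Each entry maps a character to (Script, Script_Extensions).
SCX_SAMPLE = {
    "\u30FC": ("Common", {"Hiragana", "Katakana"}),  # prolonged sound mark
    "\u3042": ("Hiragana", {"Hiragana"}),            # HIRAGANA LETTER A
    "\u30A2": ("Katakana", {"Katakana"}),            # KATAKANA LETTER A
}


def matches_by_script(ch: str, expected: str) -> bool:
    """Naive test: compare the single-valued Script property."""
    return SCX_SAMPLE[ch][0] == expected


def matches_by_scx(ch: str, expected: str) -> bool:
    """Set-based test: is the expected script in Script_Extensions?"""
    return expected in SCX_SAMPLE[ch][1]


mark = "\u30FC"
print(matches_by_script(mark, "Katakana"))  # False: Script is Common
print(matches_by_scx(mark, "Katakana"))     # True
print(matches_by_scx(mark, "Hiragana"))     # True
```

The naive comparison rejects the mark in a Katakana context, while the set-based test accepts it in both kana scripts.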

How Digits Are Shared

Decimal digits illustrate the shared-character model well. The ASCII digits 0–9 (U+0030–U+0039) have Script=Common, meaning they belong to no specific script, even though they appear in text written in most scripts. Many scripts also have their own digit forms (Arabic-Indic digits ٠١٢, Devanagari digits ०१२, etc.), and some of these are shared across related scripts.

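For instance, the Arabic-Indic digits (U+0660–U+0669) are used in Arabic-script text and also in Thaana text. Their Script_Extensions set therefore includes both Arabic and Thaana (the exact list depends on the Unicode version), reflecting authentic multi-script use. Persian and Urdu, which are also written in the Arabic script, typically use the Extended Arabic-Indic digits (U+06F0–U+06F9) instead.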
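
Python's standard `unicodedata` module is enough to see that these digit forms are separate code points that share the same General Category and numeric value. The following is a small sketch; the variable name `threes` is just an illustrative choice.

```python
import unicodedata

# The digit three in ASCII, Arabic-Indic, and Devanagari forms.
threes = ["\u0033", "\u0663", "\u0969"]

for ch in threes:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),   # 'Nd' (decimal digit) for all three
        unicodedata.decimal(ch),    # 3 for all three
    )

# int() accepts any Unicode decimal digit, not only ASCII.
print(int("\u0663"))  # 3
```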

Security Implications for Mixed-Script Detection

The Script_Extensions property is essential for correct implementation of the identifier security mechanisms in UTS #39 (Unicode Security Mechanisms), which build on the identifier syntax defined in UAX #31. When checking whether a string is "mixed-script" (potentially suspicious), naive use of Script over-counts legitimate combinations as suspicious. A string containing a character with Script=Common next to Arabic characters might appear mixed-script, but if the Common character genuinely belongs to Arabic typography, the combination is legitimate.

The UTS #39 restriction levels, including "Highly Restrictive" and "Moderately Restrictive", rest on a single-script check built on Script_Extensions: a string is single-script if there exists some script S such that every character in the string either has S in its Script_Extensions set or has a Script_Extensions set consisting only of Common or Inherited.
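
One way to sketch this check is to intersect the Script_Extensions sets of all characters, skipping those whose set is just Common or Inherited. The table `SCX_SAMPLE`, the constant `WILDCARDS`, and both function names below are hypothetical and cover only a handful of characters; the sketch also leaves out the further refinements that UTS #39 applies, so it is a simplified illustration rather than a conforming implementation.

```python
# Illustrative sketch: a tiny hand-written sample of Script_Extensions data.
# A real implementation would load the full sets from ScriptExtensions.txt.
SCX_SAMPLE = {
    "a": {"Latin"},
    "\u0430": {"Cyrillic"},          # CYRILLIC SMALL LETTER A
    "\u0628": {"Arabic"},            # ARABIC LETTER BEH
    "\u0660": {"Arabic", "Thaana"},  # ARABIC-INDIC DIGIT ZERO (abridged set)
    "1": {"Common"},                 # ASCII digit
    "\u200D": {"Inherited"},         # ZERO WIDTH JOINER
}

# Characters whose set is Common or Inherited are compatible with any script.
WILDCARDS = {"Common", "Inherited"}


def resolved_scripts(text):
    """Return the scripts shared by all characters, or None if unconstrained."""
    shared = None
    for ch in text:
        scx = SCX_SAMPLE[ch]
        if scx & WILDCARDS:
            continue  # places no constraint on the script
        shared = set(scx) if shared is None else shared & scx
    return shared


def is_single_script(text):
    """True if some script S is acceptable for every character in text."""
    shared = resolved_scripts(text)
    return shared is None or len(shared) > 0


print(is_single_script("a1"))            # True: Latin plus a Common digit
print(is_single_script("a\u0430"))       # False: Latin mixed with Cyrillic
print(is_single_script("\u0628\u0660"))  # True: both sets contain Arabic
print(is_single_script("a\u0660"))       # False: no script in common
```

Computing the intersection once avoids testing every candidate script S separately: the string is single-script exactly when the intersection is non-empty, or when no character constrains the script at all.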

Quick Facts

Property	Value
Property name	Script_Extensions (scx)
Data type	Set of script values (vs. single value for Script)
Special Script values	Common (sc=Zyyy), Inherited (sc=Zinh)
Data file	`ScriptExtensions.txt` in Unicode Character Database
Security use	UTS #39 mixed-script detection (identifier syntax from UAX #31)
Python support	`regex` package (`\p{Script_Extensions=Arabic}`)
Introduced	Unicode 6.0 (2010)
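
The data file listed above is plain text: each data line gives a code point or a range, a semicolon, and a space-separated list of script codes, followed by a `#` comment. A minimal parsing sketch follows. The sample lines are written in the file's layout but abridged, and the exact script lists differ between Unicode versions; the names `SAMPLE`, `LINE`, and `parse_scx` are hypothetical.

```python
import re

# Sample lines in the layout of ScriptExtensions.txt (abridged illustration;
# consult the real file for the script lists of a given Unicode version).
SAMPLE = """\
# Comment lines and blank lines are ignored.

0660..0669    ; Arab Thaa Yezi # Nd  [10] ARABIC-INDIC DIGIT ZERO..ARABIC-INDIC DIGIT NINE
30FC          ; Hira Kana # Lm       KATAKANA-HIRAGANA PROLONGED SOUND MARK
"""

# A code point, an optional "..end" range, then the script list up to "#".
LINE = re.compile(r"^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*([^#]+)")


def parse_scx(text):
    """Map each listed code point to a frozenset of script codes."""
    table = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # comment or blank line
        first = int(m.group(1), 16)
        last = int(m.group(2) or m.group(1), 16)
        scripts = frozenset(m.group(3).split())
        for cp in range(first, last + 1):
            table[cp] = scripts
    return table


table = parse_scx(SAMPLE)
print(sorted(table[0x0663]))  # ['Arab', 'Thaa', 'Yezi']
print(sorted(table[0x30FC]))  # ['Hira', 'Kana']
```

Code points that are absent from the file take their single Script value as their Script_Extensions set, so a complete lookup would fall back to the Script data for those.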

Related terms

Script, General Category, Mixed-Script Detection

