
Unicode Scripts: How Writing Systems are Organized

Unicode assigns every character to a script property that identifies the writing system it belongs to, such as Latin, Arabic, or Han. This guide explains the Unicode script system, how it differs from blocks, and how scripts are used in internationalization and security.

Published 2021-09-27 · Updated 2025-02-10

When you read text on a screen, you instinctively recognize writing systems: Latin letters in an English sentence, Arabic script flowing right to left, Chinese characters stacked in dense columns. Unicode formalizes this intuition through the Script property: every character is assigned to the writing system it belongs to. This guide explains how the Unicode Script system works, surveys the scripts defined in the current standard, shows how Script Extensions handle characters shared between writing systems, and demonstrates how to use script information for internationalization, security, and text analysis.

What Is a Unicode Script?

A Unicode script is a collection of characters used to represent one or more writing systems. The Script property (abbreviated sc) is defined in the Unicode Character Database file Scripts.txt and assigns every code point to exactly one script.

As of Unicode 16.0, there are 168 scripts — ranging from widely used modern scripts like Latin and Arabic to historical scripts like Egyptian Hieroglyphs and Linear B.

Each script has two identifiers:

Form	Example	Where Used
Long name	`Latin`	Documentation, verbose APIs
ISO 15924 code	`Latn`	Four-letter code, BCP 47 language tags

The Special Scripts: Common and Inherited

Two scripts are not true writing systems but serve crucial organizational roles:

Common

Characters with Script=Common are used across multiple writing systems. Examples:

Character	Name	Why Common
0–9	ASCII digits	Used by virtually all modern scripts
. , ; :	Basic punctuation	Shared across scripts
$ € ¥	Currency symbols	Not tied to one writing system
+ = < >	Math operators	Universal
@ # &	Commercial symbols	Universal

More than 8,000 characters are classified as Common.

Inherited

Characters with Script=Inherited inherit their effective script from the preceding base character. These are almost exclusively combining marks:

Character	Name	Behavior
U+0300	Combining Grave Accent	Inherits script of base letter
U+0308	Combining Diaeresis	Inherits script of base letter
U+0302	Combining Circumflex Accent	Inherits script of base letter
U+20E3	Combining Enclosing Keycap	Used in emoji keycap sequences

When ë is stored as e + U+0308, the diaeresis (U+0308, Script=Inherited) takes on the Latin script of the base e. If the same diaeresis appears after a Cyrillic е, it becomes effectively Cyrillic.
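The resolution rule above can be sketched in a few lines. This is only an illustration: `TOY_SCRIPTS` is a hypothetical three-entry table invented for the example, and a real implementation would look up the full Script property (for instance with fontTools).

```python
# Hypothetical lookup covering only the characters used below.
# A real implementation would consult the full Script property.
TOY_SCRIPTS = {
    "e": "Latn",       # U+0065 LATIN SMALL LETTER E
    "\u0435": "Cyrl",  # U+0435 CYRILLIC SMALL LETTER IE
    "\u0308": "Zinh",  # U+0308 COMBINING DIAERESIS (Inherited)
}

def effective_scripts(text: str) -> list[tuple[str, str]]:
    """Pair each code point with its script, resolving Inherited to the preceding base."""
    result = []
    base_script = "Zzzz"  # Unknown until a base character has been seen
    for ch in text:
        sc = TOY_SCRIPTS.get(ch, "Zzzz")
        if sc == "Zinh":
            sc = base_script      # the mark takes the script of its base
        else:
            base_script = sc      # remember the most recent base character
        result.append((f"U+{ord(ch):04X}", sc))
    return result

latin = effective_scripts("e\u0308")
cyrillic = effective_scripts("\u0435\u0308")
print(latin)     # [('U+0065', 'Latn'), ('U+0308', 'Latn')]
print(cyrillic)  # [('U+0435', 'Cyrl'), ('U+0308', 'Cyrl')]
```

The same U+0308 code point comes out as Latin in the first string and Cyrillic in the second, which is why Inherited characters are usually skipped or resolved before scripts are compared.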

Major Modern Scripts

Here is a selection of the most widely used scripts by number of native users:

Script	Code	Characters	Major Languages	Direction
Latin	Latn	~1,500	English, Spanish, French, German, Turkish, Vietnamese	LTR
Han	Hani	~97,000+	Chinese, Japanese (kanji), Korean (hanja)	LTR / TTB
Arabic	Arab	~1,300	Arabic, Farsi, Urdu, Pashto, Malay (Jawi)	RTL
Devanagari	Deva	~164	Hindi, Marathi, Sanskrit, Nepali	LTR
Bengali	Beng	~96	Bengali, Assamese	LTR
Cyrillic	Cyrl	~506	Russian, Ukrainian, Bulgarian, Serbian, Kazakh	LTR
Hangul	Hang	~11,739	Korean	LTR
Tamil	Taml	~123	Tamil	LTR
Telugu	Telu	~96	Telugu	LTR
Thai	Thai	~87	Thai	LTR
Katakana	Kana	~300	Japanese	LTR
Hiragana	Hira	~376	Japanese	LTR
Greek	Grek	~518	Greek	LTR
Hebrew	Hebr	~134	Hebrew, Yiddish	RTL
Georgian	Geor	~173	Georgian	LTR
Armenian	Armn	~96	Armenian	LTR
Ethiopic	Ethi	~523	Amharic, Tigrinya, Oromo	LTR

Historical Scripts

Unicode also encodes scripts no longer in everyday use:

Script	Code	Era	Characters
Egyptian Hieroglyphs	Egyp	~3200 BCE – 400 CE	~5,000+
Cuneiform	Xsux	~3400 BCE – 75 CE	~1,234
Linear B	Linb	~1450 – 1200 BCE	~211
Phoenician	Phnx	~1050 – 150 BCE	~29
Old Persian	Xpeo	~525 – 330 BCE	~50
Gothic	Goth	~350 – 600 CE	~27
Coptic	Copt	~100 CE – present (liturgical)	~137
Old Italic	Ital	~700 – 100 BCE	~39

Scripts vs. Blocks

This distinction trips up many developers. Here is the key difference:

Aspect	Script	Block
Definition	Writing system a character belongs to	Contiguous code point range
Basis	Linguistic function	Code point location
Overlap	A character has exactly one Script value	A character is in exactly one Block
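Multi-block	A script can span many blocks (Latin spans 15+)	Each block is one contiguous range
Multi-script	Not applicable: a script is a single value	A block can contain characters of several scripts (Basic Latin holds both Latin and Common)
Stability	Assignments rarely change, but individual characters are occasionally reassigned between Unicode versions	Block boundaries are stable once created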

Example: The code point U+0041 (LATIN CAPITAL LETTER A):

Script: Latin
Block: Basic Latin

The code point U+0030 (DIGIT ZERO):

Script: Common (ASCII digits are shared by many scripts)
Block: Basic Latin

Both are in the same block but have different scripts.
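To make the block/script split concrete, the sketch below counts the scripts found inside the Basic Latin block. The helper `toy_script` is a hypothetical stand-in that is only valid for U+0000..U+007F, where letters A to Z and a to z are Latin and everything else is Common.

```python
from collections import Counter

BASIC_LATIN = range(0x0000, 0x0080)  # the Basic Latin block: U+0000..U+007F

def toy_script(cp: int) -> str:
    """Script of a Basic Latin code point (assumption: only valid inside this block)."""
    ch = chr(cp)
    if "A" <= ch <= "Z" or "a" <= ch <= "z":
        return "Latin"
    return "Common"

scripts_in_block = Counter(toy_script(cp) for cp in BASIC_LATIN)
print(scripts_in_block)  # Counter({'Common': 76, 'Latin': 52})
```

One block, two scripts: only 52 of the 128 code points in Basic Latin are actually Latin.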

Script Extensions

Some characters are legitimately used by multiple scripts. The basic Script property forces a single assignment, but the Script_Extensions property (scx) allows listing all scripts that use the character.

Example: U+0660 — ARABIC-INDIC DIGIT ZERO (٠)

Script = Arabic
Script_Extensions = Arabic, Thaana, Yezidi

This digit is used in Arabic, in Thaana (the script of Dhivehi/Maldivian), and in Yezidi, so Script_Extensions lists all three. Older Unicode data lists only Arabic and Thaana.

Example: U+3001 — IDEOGRAPHIC COMMA (、)

Script = Common
Script_Extensions = Bopomofo, Hangul, Han, Hiragana, Katakana, Yi

This punctuation character is shared across multiple East Asian scripts.

Why Script Extensions Matter

Mixed-script detection algorithms use Script_Extensions rather than the base Script property. In Unicode Technical Standard #39 (Unicode Security Mechanisms), a string passes the Single Script test when at least one script appears in the Script_Extensions set of every character, with Common and Inherited characters treated as compatible with any script. This produces fewer false positives than comparing base Script values: U+0660 inside Thaana text, for example, is no longer flagged as a stray Arabic character.
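The coverage check can be sketched as a set intersection, in the spirit of the "resolved script set" from UTS #39. The table `TOY_SCX` is hypothetical and hand-written for a few characters, and the sketch leaves out the augmented script sets (Hanb, Jpan, Kore) that the standard adds for CJK text.

```python
from typing import Optional

# Hypothetical Script_Extensions data for a handful of characters only.
TOY_SCX = {
    "a": {"Latn"},
    "b": {"Latn"},
    "\u0430": {"Cyrl"},  # CYRILLIC SMALL LETTER A
    "\u4E2D": {"Hani"},  # CJK ideograph 中
    "\u3001": {"Bopo", "Hang", "Hani", "Hira", "Kana", "Yiii"},  # IDEOGRAPHIC COMMA
    "1": {"Zyyy"},       # ASCII digit: Common
}

def resolved_script_set(text: str) -> Optional[set[str]]:
    """Intersect the Script_Extensions sets of all characters.

    Returns None when nothing constrains the result (only Common/Inherited seen),
    and an empty set when no single script covers the whole string.
    """
    resolved = None  # None stands for "all scripts"
    for ch in text:
        scx = TOY_SCX[ch]
        if scx & {"Zyyy", "Zinh"}:
            continue  # Common and Inherited place no constraint
        resolved = set(scx) if resolved is None else resolved & scx
    return resolved

print(resolved_script_set("\u4E2D\u3001"))  # {'Hani'}: the comma is compatible with Han
print(resolved_script_set("a\u0430"))       # set(): Latin and Cyrillic do not intersect
```

An empty result signals mixed-script text; a non-empty result names the scripts that could plausibly own the whole string.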

Querying Script Information in Code

Python

Python's standard-library unicodedata module does not expose the Script property. Use a third-party package such as fontTools or unicodescripts:

from fontTools.unicodedata import script, script_extension

script("A")         # 'Latn' (Latin)
script("\u0410")       # 'Cyrl' (Cyrillic — А)
script("\u4E2D")       # 'Hani' (Han — 中)
script("0")         # 'Zyyy' (Common)
script("\u0300")       # 'Zinh' (Inherited — combining grave)

# Script Extensions
script_extension("\u0660")  # e.g. {'Arab', 'Thaa', 'Yezi'} for Arabic-Indic digit zero; the exact set depends on the Unicode version

Note: Zyyy is the ISO 15924 code for Common, and Zinh is the code for Inherited.

JavaScript

JavaScript regular expressions support script matching through Unicode property escapes (ES2018 and later), which require the u flag:

// Test for specific scripts
/^\p{Script=Latin}$/u.test("A");       // true
/^\p{Script=Cyrillic}$/u.test("Д");    // true
/^\p{Script=Han}$/u.test("中");        // true
/^\p{Script=Common}$/u.test("3");      // true

// Using Script_Extensions
/^\p{Script_Extensions=Latin}$/u.test("3");  // false (3 is Common)
/^\p{Script_Extensions=Arabic}$/u.test("\u0660"); // true

Java

Character.UnicodeScript script = Character.UnicodeScript.of('A');
// script == Character.UnicodeScript.LATIN

Character.UnicodeScript han = Character.UnicodeScript.of(0x4E2D);
// han == Character.UnicodeScript.HAN

Regular Expressions (PCRE, ICU)

\p{Latin}       — any Latin script character
\p{Cyrillic}    — any Cyrillic character
\p{Han}         — any CJK ideograph
\p{Arabic}      — any Arabic character
\p{Common}      — any Common character (digits, punctuation, symbols)

Practical Applications

1. Script Detection — What Language Family Is This Text?

Detecting which scripts appear in a string is the first step toward language identification:

from fontTools.unicodedata import script
from collections import Counter

def detect_scripts(text: str) -> dict[str, int]:
    counts: Counter[str] = Counter()
    for ch in text:
        sc = script(ch)
        if sc not in ("Zyyy", "Zinh", "Zzzz"):  # Skip Common, Inherited, Unknown
            counts[sc] += 1
    return dict(counts.most_common())

detect_scripts("Hello 世界!")
# {'Latn': 5, 'Hani': 2}

detect_scripts("Привет мир")
# {'Cyrl': 9}

2. Mixed-Script Detection — Security

One of the most important security applications of script detection is identifying homograph attacks — where an attacker registers a domain name using characters from multiple scripts that visually resemble ASCII. For example, mixing Cyrillic а (U+0430) with Latin a (U+0061).

Unicode Technical Standard #39 defines restriction levels:

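Level	Rule	Example
ASCII-Only	Only ASCII characters	`example.com`
Single Script	All characters share at least one script, judged by Script_Extensions (Common and Inherited match any script)	`münchen.de`
Highly Restrictive	Single Script, or covered by Latin + Han + Hiragana + Katakana, Latin + Han + Bopomofo, or Latin + Han + Hangul	Japanese text containing Latin words
Moderately Restrictive	Also allows Latin plus one other Recommended script, except Cyrillic and Greek	Latin + Thai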

from fontTools.unicodedata import script

def is_single_script(text: str) -> bool:
    scripts = set()
    for ch in text:
        sc = script(ch)
        if sc not in ("Zyyy", "Zinh"):
            scripts.add(sc)
    return len(scripts) <= 1

is_single_script("Hello")         # True (all Latin)
is_single_script("Hеllo")         # False! (Latin H + Cyrillic е + Latin llo)
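When a string fails the check, it helps to report where the odd characters sit. The sketch below groups positions by script using `TOY_RANGES`, a deliberately tiny, hypothetical range table (ASCII letters and the basic Russian letters only); a production check would use the full Script data instead.

```python
from collections import defaultdict

# Hypothetical, deliberately tiny range table; a real check needs the full Script data.
TOY_RANGES = [
    (0x0041, 0x005A, "Latn"),  # A-Z
    (0x0061, 0x007A, "Latn"),  # a-z
    (0x0410, 0x044F, "Cyrl"),  # basic Russian letters
]

def toy_script(ch: str) -> str:
    cp = ord(ch)
    for start, end, sc in TOY_RANGES:
        if start <= cp <= end:
            return sc
    return "Zyyy"  # everything else is treated as Common in this sketch

def scripts_by_position(text: str) -> dict[str, list[tuple[int, str]]]:
    """Map each script to the (index, code point) pairs where it occurs."""
    found = defaultdict(list)
    for i, ch in enumerate(text):
        sc = toy_script(ch)
        if sc != "Zyyy":
            found[sc].append((i, f"U+{ord(ch):04X}"))
    return dict(found)

report = scripts_by_position("H\u0435llo")  # the second letter is Cyrillic
print(report["Cyrl"])  # [(1, 'U+0435')]
```

Reporting the code point rather than the glyph matters here, because the spoofed letter is visually identical to its Latin counterpart.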

3. Text Segmentation — Japanese

Japanese text mixes three scripts (Hiragana, Katakana, Han) with no spaces between words. Script boundaries provide valuable segmentation hints:

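東京タワーの高さは333mです
Han: 東京 (Tokyo)
Katakana: タワー (tower; the length mark ー is Script=Common with Script_Extensions Hiragana and Katakana)
Hiragana: の (particle)
Han: 高 (first character of 高さ, "height")
Hiragana: さは (the end of 高さ followed by the particle は)
Common: 333
Latin: m
Hiragana: です (copula)

The word 高さ straddles a Han/Hiragana boundary, and the run さは joins parts of two words, so script runs are hints rather than word boundaries.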

Word-break algorithms for Japanese use script transitions as one of many signals for identifying word boundaries.
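Splitting the sample sentence into script runs might look like the sketch below. The classifier `toy_script` is hypothetical and approximates scripts with a few code point ranges, which is only good enough for this sentence; the kana length mark ー (Script=Common) is attached to a preceding kana run by hand.

```python
def toy_script(ch: str) -> str:
    """Rough script guess from code point ranges (assumption: adequate for this sample only)."""
    cp = ord(ch)
    if 0x3041 <= cp <= 0x3096:
        return "Hiragana"
    if 0x30A1 <= cp <= 0x30FA:
        return "Katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "Han"
    if ch.isascii() and ch.isalpha():
        return "Latin"
    return "Common"

def script_runs(text: str) -> list[tuple[str, str]]:
    """Group consecutive characters that share a script."""
    runs: list[tuple[str, str]] = []
    for ch in text:
        sc = toy_script(ch)
        # Let the kana length mark (U+30FC) join a preceding kana run.
        if ch == "\u30FC" and runs and runs[-1][0] in ("Hiragana", "Katakana"):
            sc = runs[-1][0]
        if runs and runs[-1][0] == sc:
            runs[-1] = (sc, runs[-1][1] + ch)
        else:
            runs.append((sc, ch))
    return runs

for sc, chunk in script_runs("東京タワーの高さは333mです"):
    print(f"{sc}: {chunk}")
```

The run boundaries fall between 高 and さ and merge さ with は, which shows why a segmenter treats script transitions as one signal among several.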

4. Font Selection

Text rendering engines use the Script property to select appropriate fonts. When a document contains Latin, Arabic, and CJK text, the renderer needs three different fonts (or a single font with coverage for all three scripts). The Script property tells the engine which font to apply to each run of text.
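As a rough illustration of per-run font choice, the sketch below maps script runs to font families. The table `FONT_FOR_SCRIPT` and its font names are assumptions chosen for the example, and the input is presumed to be already split into (script code, text) runs.

```python
# Hypothetical script-to-font table; the font names are only examples.
FONT_FOR_SCRIPT = {
    "Latn": "Noto Sans",
    "Arab": "Noto Naskh Arabic",
    "Hani": "Noto Sans CJK SC",
}
FALLBACK_FONT = "Noto Sans"

def assign_fonts(runs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return (font, text) pairs; Common and Inherited runs keep the previous font."""
    styled = []
    current = FALLBACK_FONT
    for sc, text in runs:
        if sc not in ("Zyyy", "Zinh"):
            current = FONT_FOR_SCRIPT.get(sc, FALLBACK_FONT)
        styled.append((current, text))
    return styled

styled = assign_fonts([
    ("Latn", "Hello"),
    ("Zyyy", ", "),
    ("Arab", "\u0645\u0631\u062D\u0628\u0627"),
    ("Zyyy", " "),
    ("Hani", "\u4E16\u754C"),
])
for font, text in styled:
    print(f"{font!r}: {text!r}")
```

Letting Common runs reuse the preceding font keeps punctuation and spaces visually consistent with the text they follow; real engines apply more elaborate rules.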

5. Bidirectional Text Layout

The Unicode Bidirectional Algorithm (UBA) does not read the Script property directly; it works from each character's Bidi_Class property, which closely tracks script. Hebrew letters have the strong right-to-left class R and Arabic letters have AL, while Latin and CJK characters have the strong left-to-right class L. Digits are "weak" and most punctuation and spaces are "neutral"; both take their direction from surrounding strong characters.
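The strong, weak, and neutral categories are recorded in the Bidi_Class property, which Python's standard `unicodedata.bidirectional()` exposes. A small sketch that prints the class of a few sample characters:

```python
import unicodedata

samples = ["A", "\u05D0", "\u0628", "\u4E2D", "1", ",", "!", " "]
classes = {ch: unicodedata.bidirectional(ch) for ch in samples}

for ch, bidi in classes.items():
    print(f"U+{ord(ch):04X} {bidi}")
# L = strong left-to-right, R and AL = strong right-to-left,
# EN and CS = weak (European number, common separator),
# ON and WS = neutral (other neutral, whitespace)
```

Here the Latin letter and the Han ideograph report L, the Hebrew letter R, the Arabic letter AL, the digit EN, the comma CS, the exclamation mark ON, and the space WS.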

The Scripts.txt Data File

The authoritative source for script assignments is the UCD file Scripts.txt:

# Scripts-16.0.0.txt

0000..0040    ; Common     # Cc, Sm, Po, ...
0041..005A    ; Latin      # Lu  [26] LATIN CAPITAL LETTER A..Z
005B..0060    ; Common     # Ps, Po, Sk, ...
0061..007A    ; Latin      # Ll  [26] LATIN SMALL LETTER A..Z
007B..009F    ; Common     # Pe, Cc, ...
...
0600..0604    ; Arabic     # Cf  [5] ARABIC NUMBER SIGN..

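Each line assigns a code point or a range of code points to a script; the comment after # records the General Category, the number of code points in the range, and the character names. The excerpt above is abridged, and the real file groups its entries by script rather than listing them in one code point sequence. Download from: https://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt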

The companion file ScriptExtensions.txt provides the Script_Extensions data.
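Both files share the same simple line format, so a small parser is enough to load them. The sketch below is a minimal reader, assuming the `start..end ; Value # comment` layout; `SAMPLE` holds three illustrative lines rather than the real file, and the function names are made up for this example.

```python
import re
from typing import Iterable, Iterator, Optional

# One data line: a code point or a range, a semicolon, then the script name.
LINE_RE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")

def parse_scripts(lines: Iterable[str]) -> Iterator[tuple[int, int, str]]:
    """Yield (start, end, script) for each data line in Scripts.txt format."""
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue                          # blank or comment-only line
        m = LINE_RE.match(data)
        if m:
            start = int(m.group(1), 16)
            end = int(m.group(2) or m.group(1), 16)
            yield start, end, m.group(3)

def script_of(cp: int, ranges: list[tuple[int, int, str]]) -> Optional[str]:
    """Linear lookup; returns None if the parsed lines do not cover cp."""
    for start, end, sc in ranges:
        if start <= cp <= end:
            return sc
    return None

SAMPLE = """\
# Illustrative lines in Scripts.txt format
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; Latin # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR
"""

ranges = list(parse_scripts(SAMPLE.splitlines()))
print(ranges)                       # [(65, 90, 'Latin'), (97, 122, 'Latin'), (170, 170, 'Latin')]
print(script_of(ord("z"), ranges))  # Latin
```

For the full file, sorting the ranges and using a binary search (for example with the `bisect` module) would be a more practical lookup than the linear scan shown here.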

All 168 Scripts at a Glance

The 168 scripts in Unicode 16.0 break down roughly as:

Category	Count	Examples
Modern, widely used	~30	Latin, Arabic, Han, Cyrillic, Devanagari, Bengali, Thai
Modern, regional	~50	Georgian, Armenian, Ethiopic, Tibetan, Sinhala, Myanmar
Historical	~60	Egyptian Hieroglyphs, Cuneiform, Linear A, Phoenician, Gothic
Notation systems	~10	Braille, SignWriting, Duployan shorthand
Special values	3	Common, Inherited, Unknown

The "Unknown" script (Zzzz) is assigned to unassigned, private-use, noncharacter, and surrogate code points.

Edge Cases

Han unification: CJK Unified Ideographs are assigned Script=Han, even though they are used by Chinese, Japanese, and Korean — three very different languages. The Script property does not distinguish between these uses; that requires language-level metadata.

Emoji: Most emoji have Script=Common because they are not tied to any writing system; Mahjong tiles and playing cards, for example, are Common. A few emoji-like symbols derived from a script keep that script, such as U+1F200 SQUARE HIRAGANA HOKA (Script=Hiragana).

Digits from other scripts: While ASCII digits 0–9 are Script=Common, digits from other scripts are assigned to those scripts: Devanagari digits (०–९) are Script=Devanagari, Thai digits (๐–๙) are Script=Thai.
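Although their Script values differ, these digits share the General Category Nd and carry the same numeric values, which the standard `unicodedata` module can confirm. A short sketch:

```python
import unicodedata

zeros = {
    "ASCII": "0",
    "Devanagari": "\u0966",
    "Thai": "\u0E50",
    "Arabic-Indic": "\u0660",
}
for label, ch in zeros.items():
    # Every one of these is a decimal digit (Nd) with the value 0.
    print(label, unicodedata.category(ch), unicodedata.decimal(ch))

value = int("\u0967\u0968\u0969")  # Devanagari digits one, two, three
print(value)  # 123
```

Because `int()` and `str.isdigit()` accept digits from any script, code that must allow only ASCII digits needs an explicit check rather than relying on those built-ins.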

Latin vs. Common overlap: Many characters that look "Latin" are actually Script=Common because they are used beyond Latin contexts. Examples include the section sign (§), the pilcrow (¶), and most mathematical operators.

Summary

The Unicode Script property assigns every character to a writing system (168 scripts in Unicode 16.0).
Common and Inherited are special scripts for shared characters and combining marks.
Script_Extensions lists all scripts that legitimately use a character — essential for accurate mixed-script detection.
Scripts differ from blocks (code point ranges) and General Categories (character types): a script is a linguistic classification.
Use script detection for language identification, security (homograph attacks), text segmentation, font selection, and bidirectional layout.
Query scripts with fontTools.unicodedata.script() in Python, \p{Script=Latin} in JavaScript regex, or Character.UnicodeScript.of() in Java.
The authoritative source is the UCD Scripts.txt file, supplemented by ScriptExtensions.txt.