
Unicode Scripts: How Writing Systems are Organized

Unicode assigns every character to a script property that identifies the writing system it belongs to, such as Latin, Arabic, or Han. This guide explains the Unicode script system, how it differs from blocks, and how scripts are used in internationalization and security.


When you read text on a screen, you instinctively recognize writing systems — Latin letters in an English sentence, Arabic script flowing right to left, Chinese characters stacked in dense columns. Unicode formalizes this intuition through the Script property: every character is assigned to the writing system it belongs to. This guide explains how the Unicode Script system works, lists the scripts defined in the current standard, shows how Script Extensions handle characters shared between writing systems, and demonstrates how to use script information for internationalization, security, and text analysis.

What Is a Unicode Script?

A Unicode script is a collection of characters used to represent one or more writing systems. The Script property (abbreviated sc) is defined in the Unicode Character Database file Scripts.txt and assigns every code point to exactly one script.

As of Unicode 16.0, there are 168 scripts — ranging from widely used modern scripts like Latin and Arabic to historical scripts like Egyptian Hieroglyphs and Linear B.

Each script has two identifiers:

Form | Example | Where Used
Long name | Latin | Documentation, verbose APIs
ISO 15924 code | Latn | Four-letter code, BCP 47 language tags

The Special Scripts: Common and Inherited

Two scripts are not true writing systems but serve crucial organizational roles:

Common

Characters with Script=Common are used across multiple writing systems. Examples:

Character | Name | Why Common
0–9 | ASCII digits | Used by virtually all modern scripts
. , ; : | Basic punctuation | Shared across scripts
$ € ¥ | Currency symbols | Not tied to one writing system
+ = < > | Math operators | Universal
@ # & | Commercial symbols | Universal

More than 8,000 characters are classified as Common.

Inherited

Characters with Script=Inherited inherit their effective script from the preceding base character. These are almost exclusively combining marks:

Character | Name | Behavior
U+0300 | Combining Grave Accent | Inherits script of base letter
U+0308 | Combining Diaeresis | Inherits script of base letter
U+0302 | Combining Circumflex Accent | Inherits script of base letter
U+20E3 | Combining Enclosing Keycap | Used in emoji keycap sequences

When ë is stored as e + U+0308, the diaeresis (U+0308, Script=Inherited) takes on the Latin script of the base e. If the same diaeresis appears after a Cyrillic е, it becomes effectively Cyrillic.
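The inheritance rule can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes you already have one ISO 15924 code per character (for example from a script lookup library), and the function name `resolve_inherited` and the fallback to Common for a mark with no base are choices made for this example.

```python
from typing import Iterable


def resolve_inherited(scripts: Iterable[str]) -> list[str]:
    """Replace each 'Zinh' (Inherited) with the script of the nearest
    preceding character that is not Inherited."""
    resolved: list[str] = []
    # Assumption for this sketch: a mark with no base before it
    # is treated as Common ('Zyyy').
    current = "Zyyy"
    for sc in scripts:
        if sc != "Zinh":
            current = sc  # a new base character sets the effective script
        resolved.append(current)
    return resolved


# e + U+0308 (Latin base), then Cyrillic е + U+0308
print(resolve_inherited(["Latn", "Zinh", "Cyrl", "Zinh"]))
# ['Latn', 'Latn', 'Cyrl', 'Cyrl']
```

Working on a list of script codes rather than on raw characters keeps the sketch independent of any particular lookup library.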

Major Modern Scripts

Here is a selection of the most widely used scripts by number of native users:

Script | Code | Characters | Major Languages | Direction
Latin | Latn | ~1,500 | English, Spanish, French, German, Turkish, Vietnamese | LTR
Han | Hani | ~97,000+ | Chinese, Japanese (kanji), Korean (hanja) | LTR / TTB
Arabic | Arab | ~1,300 | Arabic, Farsi, Urdu, Pashto, Malay (Jawi) | RTL
Devanagari | Deva | ~164 | Hindi, Marathi, Sanskrit, Nepali | LTR
Bengali | Beng | ~96 | Bengali, Assamese | LTR
Cyrillic | Cyrl | ~506 | Russian, Ukrainian, Bulgarian, Serbian, Kazakh | LTR
Hangul | Hang | ~11,739 | Korean | LTR
Tamil | Taml | ~72 | Tamil | LTR
Telugu | Telu | ~96 | Telugu | LTR
Thai | Thai | ~87 | Thai | LTR
Katakana | Kana | ~300 | Japanese | LTR
Hiragana | Hira | ~376 | Japanese | LTR
Greek | Grek | ~518 | Greek | LTR
Hebrew | Hebr | ~134 | Hebrew, Yiddish | RTL
Georgian | Geor | ~173 | Georgian | LTR
Armenian | Armn | ~96 | Armenian | LTR
Ethiopic | Ethi | ~523 | Amharic, Tigrinya, Oromo | LTR

Historical Scripts

Unicode also encodes scripts no longer in everyday use:

Script | Code | Era | Characters
Egyptian Hieroglyphs | Egyp | ~3200 BCE – 400 CE | ~5,000+
Cuneiform | Xsux | ~3400 BCE – 75 CE | ~1,234
Linear B | Linb | ~1450 – 1200 BCE | ~211
Phoenician | Phnx | ~1050 – 150 BCE | ~29
Old Persian | Xpeo | ~525 – 330 BCE | ~50
Gothic | Goth | ~350 – 600 CE | ~27
Coptic | Copt | ~100 CE – present (liturgical) | ~137
Old Italic | Ital | ~700 – 100 BCE | ~39
Old Italic Ital ~700 – 100 BCE ~39

Scripts vs. Blocks

This distinction trips up many developers. Here is the key difference:

Aspect | Script | Block
Definition | Writing system a character belongs to | Contiguous code point range
Basis | Linguistic function | Code point location
Overlap | A character has exactly one Script value | A character is in exactly one Block
Span | A script can span many blocks (Latin spans 15+) | Each block is one range, and it can contain characters from several scripts
Stability | Script values are occasionally corrected between Unicode versions | Block boundaries rarely change once created

Example: The code point U+0041 (LATIN CAPITAL LETTER A):

  • Script: Latin
  • Block: Basic Latin

The code point U+0030 (DIGIT ZERO):

  • Script: Common (digits are shared by all scripts)
  • Block: Basic Latin

Both are in the same block but have different scripts.

Script Extensions

Some characters are legitimately used by multiple scripts. The basic Script property forces a single assignment, but the Script_Extensions property (scx) allows listing all scripts that use the character.

Example: U+0660 — ARABIC-INDIC DIGIT ZERO (٠)

  • Script = Arabic
  • Script_Extensions = Arabic, Thaana, Yezidi

This digit is used in Arabic, in Thaana (the script of Dhivehi/Maldivian), and in Yezidi, so Script_Extensions lists all three.

Example: U+3001 — IDEOGRAPHIC COMMA (、)

  • Script = Common
  • Script_Extensions = Bopomofo, Hangul, Han, Hiragana, Katakana, Yi

This punctuation character is shared across multiple East Asian scripts.

Why Script Extensions Matter

Mixed-script detection algorithms use Script_Extensions rather than the base Script property. The Single Script restriction level in Unicode Technical Standard #39 (Unicode Security Mechanisms) checks whether at least one script appears in the Script_Extensions of every character in a string, ignoring Common and Inherited. This is more permissive, and more accurate, than comparing base Script values, because a shared character such as U+0660 no longer counts as a script change.
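A rough sketch of that check is to intersect the Script_Extensions sets of all characters and see whether anything survives. The table `TOY_SCX` below is hand-written for illustration only; real code would load ScriptExtensions.txt or call a library. The names `resolved_script_set` and `is_single_script_scx` are invented for this example, and the sketch omits the extra Japanese/Korean/Chinese augmentation that UTS #39 applies.

```python
from typing import Callable, Optional

IGNORED = {"Zyyy", "Zinh"}  # Common and Inherited never restrict the result

# Hypothetical, hand-written data covering only the characters used below.
TOY_SCX = {
    "a": {"Latn"},
    "\u0430": {"Cyrl"},                  # Cyrillic small letter a
    "\u0780": {"Thaa"},                  # Thaana letter haa
    "\u0660": {"Arab", "Thaa", "Yezi"},  # Arabic-Indic digit zero
    "1": {"Zyyy"},
    "2": {"Zyyy"},
}


def resolved_script_set(
    text: str, scx_of: Callable[[str], set[str]]
) -> Optional[set[str]]:
    """Intersect the Script_Extensions of every character.
    None means the text contained only Common/Inherited characters."""
    resolved: Optional[set[str]] = None
    for ch in text:
        scx = scx_of(ch)
        if scx & IGNORED:
            continue
        resolved = set(scx) if resolved is None else resolved & scx
    return resolved


def is_single_script_scx(text: str, scx_of: Callable[[str], set[str]]) -> bool:
    resolved = resolved_script_set(text, scx_of)
    return resolved is None or bool(resolved)


toy = TOY_SCX.__getitem__
print(is_single_script_scx("\u0780\u0660", toy))  # True: both can be Thaana
print(is_single_script_scx("a\u0430", toy))       # False: Latin vs. Cyrillic
```

A check based on the base Script property would flag the first string as mixed (Thaana plus Arabic), which is the false positive Script_Extensions is meant to avoid.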

Querying Script Information in Code

Python

The unicodedata module does not provide script information directly. Use fontTools or the unicodescripts package:

from fontTools.unicodedata import script, script_extension

script("A")         # 'Latn' (Latin)
script("\u0410")       # 'Cyrl' (Cyrillic — А)
script("\u4E2D")       # 'Hani' (Han — 中)
script("0")         # 'Zyyy' (Common)
script("\u0300")       # 'Zinh' (Inherited — combining grave)

# Script Extensions
script_extension("\u0660")  # {'Arab', 'Thaa', 'Yezi'} for Arabic-Indic digit zero

Note: Zyyy is the ISO 15924 code for Common, and Zinh is the code for Inherited.

JavaScript

JavaScript regex supports script matching via Unicode property escapes:

// Test for specific scripts
/^\p{Script=Latin}$/u.test("A");       // true
/^\p{Script=Cyrillic}$/u.test("Д");    // true
/^\p{Script=Han}$/u.test("中");        // true
/^\p{Script=Common}$/u.test("3");      // true

// Using Script_Extensions
/^\p{Script_Extensions=Latin}$/u.test("3");  // false (3 is Common)
/^\p{Script_Extensions=Arabic}$/u.test("\u0660"); // true

Java

Character.UnicodeScript script = Character.UnicodeScript.of('A');
// script == Character.UnicodeScript.LATIN

Character.UnicodeScript han = Character.UnicodeScript.of(0x4E2D);
// han == Character.UnicodeScript.HAN

Regular Expressions (PCRE, ICU)

\p{Latin}       — any Latin script character
\p{Cyrillic}    — any Cyrillic character
\p{Han}         — any CJK ideograph
\p{Arabic}      — any Arabic character
\p{Common}      — any Common character (digits, punctuation, symbols)

Practical Applications

1. Script Detection — What Language Family Is This Text?

Detecting which scripts appear in a string is the first step toward language identification:

from fontTools.unicodedata import script
from collections import Counter

def detect_scripts(text: str) -> dict[str, int]:
    counts: Counter[str] = Counter()
    for ch in text:
        sc = script(ch)
        if sc not in ("Zyyy", "Zinh", "Zzzz"):  # Skip Common, Inherited, Unknown
            counts[sc] += 1
    return dict(counts.most_common())

detect_scripts("Hello 世界!")
# {'Latn': 5, 'Hani': 2}

detect_scripts("Привет мир")
# {'Cyrl': 9}

2. Mixed-Script Detection — Security

One of the most important security applications of script detection is identifying homograph attacks — where an attacker registers a domain name using characters from multiple scripts that visually resemble ASCII. For example, mixing Cyrillic а (U+0430) with Latin a (U+0061).

Unicode Technical Standard #39 defines restriction levels:

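Level | Rule | Example
ASCII-Only | Only ASCII characters | example.com
Single Script | All characters share at least one script, counting Script_Extensions and ignoring Common/Inherited | münchen.de
Highly Restrictive | Single Script, or one of Latin + Han + Hiragana + Katakana, Latin + Han + Bopomofo, Latin + Han + Hangul | Latin + Han + Hiragana
Moderately Restrictive | Highly Restrictive, or Latin plus one other Recommended script except Cyrillic or Greek | Latin + Thai

The base Script property alone is enough for a simple first-pass check:

from fontTools.unicodedata import script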

def is_single_script(text: str) -> bool:
    scripts = set()
    for ch in text:
        sc = script(ch)
        if sc not in ("Zyyy", "Zinh"):
            scripts.add(sc)
    return len(scripts) <= 1

is_single_script("Hello")         # True (all Latin)
is_single_script("Hеllo")         # False! (Latin H + Cyrillic е + Latin llo)
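The Highly Restrictive level in UTS #39 goes one step further and permits three fixed East Asian combinations with Latin. A simplified sketch of that rule, operating on a set of script codes that has already been collected (with Common and Inherited removed), might look like this. The function name and the set-based input are choices made for this example, and the sketch ignores Script_Extensions.

```python
# Script combinations that UTS #39 allows at the Highly Restrictive level.
ALLOWED_COMBINATIONS = (
    frozenset({"Latn", "Hani", "Hira", "Kana"}),  # Latin + Japanese
    frozenset({"Latn", "Hani", "Bopo"}),          # Latin + Chinese with Bopomofo
    frozenset({"Latn", "Hani", "Hang"}),          # Latin + Korean
)


def is_highly_restrictive(scripts: set[str]) -> bool:
    """scripts: ISO 15924 codes found in the text, excluding Zyyy and Zinh."""
    if len(scripts) <= 1:
        return True  # ASCII-only or single-script text always qualifies
    return any(scripts <= combo for combo in ALLOWED_COMBINATIONS)


print(is_highly_restrictive({"Latn", "Hani", "Hira"}))  # True
print(is_highly_restrictive({"Latn", "Cyrl"}))          # False
```

Latin mixed with Cyrillic fails at this level, which is exactly the combination used in the homograph example above.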

3. Text Segmentation — Japanese

Japanese text mixes three scripts (Hiragana, Katakana, Han) with no spaces between words. Script boundaries provide valuable segmentation hints:

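東京タワーの高さは333mです
Han: 東京 (Tokyo)
Katakana: タワー (tower; the long-vowel mark ー is itself Script=Common, shared by Hiragana and Katakana through Script_Extensions)
Hiragana: の (particle)
Han: 高 (first character of 高さ, "height")
Hiragana: さ (suffix completing 高さ), は (particle)
Common: 333
Latin: m
Hiragana: です (copula)

Script boundaries do not always coincide with word boundaries: 高さ is one word that crosses from Han into Hiragana. Word-break algorithms for Japanese therefore use script transitions as one of many signals for identifying word boundaries.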
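Splitting a string into runs of equal script value takes only a few lines once a per-character lookup is available. The sketch below uses `itertools.groupby` with a deliberately tiny classifier, `toy_script`, which is an assumption made for this example: it covers only the code point ranges needed for the sample sentence and is not a substitute for real Script data.

```python
from itertools import groupby
from typing import Callable


def toy_script(ch: str) -> str:
    """Very rough stand-in for a real Script lookup (illustration only)."""
    cp = ord(ch)
    if ch == "\u30FC":            # prolonged sound mark: Script=Common
        return "Zyyy"
    if 0x3041 <= cp <= 0x3096:    # core Hiragana letters
        return "Hira"
    if 0x30A1 <= cp <= 0x30FA:    # core Katakana letters
        return "Kana"
    if 0x4E00 <= cp <= 0x9FFF:    # CJK Unified Ideographs block
        return "Hani"
    if ch.isascii() and ch.isalpha():
        return "Latn"
    return "Zyyy"                 # everything else treated as Common here


def script_runs(text: str, script_of: Callable[[str], str]) -> list[tuple[str, str]]:
    """Group consecutive characters that share the same script value."""
    return [(sc, "".join(chars)) for sc, chars in groupby(text, key=script_of)]


for sc, run in script_runs("東京タワーの高さは333mです", toy_script):
    print(sc, run)
```

The raw runs split wherever the script value changes, so 高さ is divided between Han and Hiragana and ー comes out as its own Common run. A practical segmenter would usually merge Common and Inherited characters into a neighbouring run before using the boundaries as hints.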

4. Font Selection

Text rendering engines use the Script property to select appropriate fonts. When a document contains Latin, Arabic, and CJK text, the renderer needs three different fonts (or a single font with coverage for all three scripts). The Script property tells the engine which font to apply to each run of text.

5. Bidirectional Text Layout

The Unicode Bidirectional Algorithm (UBA) does not consult the Script property directly. It works from a separate property, Bidi_Class, whose values track script closely: Arabic and Hebrew letters have a strong right-to-left (RTL) class, while Latin and CJK characters have a strong left-to-right (LTR) class. Common characters such as digits and punctuation are "weak" or "neutral" and take their direction from surrounding strong characters.
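The direction classes themselves come from the Bidi_Class property, which Python's standard library exposes through `unicodedata.bidirectional()`. The helper name `bidi_classes` below is invented for this example; the sketch only inspects the classes and does not run the bidirectional algorithm.

```python
import unicodedata


def bidi_classes(text: str) -> list[str]:
    """Return the Bidi_Class abbreviation of each character."""
    return [unicodedata.bidirectional(ch) for ch in text]


# Latin A, Hebrew alef, Arabic alef, digit one, comma, space
print(bidi_classes("A\u05D0\u06271, "))
# ['L', 'R', 'AL', 'EN', 'CS', 'WS']
```

L, R, and AL are the strong classes; EN (European number) and CS (common separator) are weak; WS (whitespace) is neutral.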

The Scripts.txt Data File

The authoritative source for script assignments is the UCD file Scripts.txt:

# Scripts-16.0.0.txt

0000..0040    ; Common     # Cc, Sm, Po, ...
0041..005A    ; Latin      # Lu  [26] LATIN CAPITAL LETTER A..Z
005B..0060    ; Common     # Ps, Po, Sk, ...
0061..007A    ; Latin      # Ll  [26] LATIN SMALL LETTER A..Z
007B..009F    ; Common     # Pe, Cc, ...
...
0600..0604    ; Arabic     # Cf  [5] ARABIC NUMBER SIGN..ARABIC SIGN SAMVAT

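Each line assigns a code point or a range of code points to a script. The excerpt above is abridged: the actual file splits ranges further by General Category, so it contains many more lines. Download from: https://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt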

The companion file ScriptExtensions.txt provides the Script_Extensions data.
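Because the format is line-oriented, a small parser is enough to build your own lookup table. The following is a sketch under stated assumptions: the class name `ScriptTable` and the embedded `SAMPLE` text are made up for this example (the sample is a handful of lines in Scripts.txt syntax, not the real file), and code points missing from the data fall back to "Unknown".

```python
import bisect
import re

# One data line: "XXXX" or "XXXX..YYYY", then ";", then the script name.
LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")

SAMPLE = """\
# Illustrative lines in Scripts.txt syntax (not the full file)
0030..0039    ; Common     # Nd  [10] DIGIT ZERO..DIGIT NINE
0041..005A    ; Latin      # Lu  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; Latin      # Ll  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
00AA          ; Latin      # Lo       FEMININE ORDINAL INDICATOR
0410..044F    ; Cyrillic   # L&  [64] CYRILLIC CAPITAL LETTER A..CYRILLIC SMALL LETTER YA
"""


class ScriptTable:
    def __init__(self, lines):
        ranges = []
        for line in lines:
            m = LINE.match(line)
            if m:  # comment and blank lines do not match
                start = int(m.group(1), 16)
                end = int(m.group(2) or m.group(1), 16)
                ranges.append((start, end, m.group(3)))
        ranges.sort()
        self._ranges = ranges
        self._starts = [r[0] for r in ranges]

    def script(self, ch: str) -> str:
        cp = ord(ch)
        # Find the last range starting at or before cp, then check its end.
        i = bisect.bisect_right(self._starts, cp) - 1
        if i >= 0 and cp <= self._ranges[i][1]:
            return self._ranges[i][2]
        return "Unknown"


table = ScriptTable(SAMPLE.splitlines())
print(table.script("A"), table.script("5"), table.script("Д"))
# Latin Common Cyrillic
```

To use real data, pass the lines of a downloaded Scripts.txt to `ScriptTable` instead of `SAMPLE.splitlines()`. The same line pattern should also work for ScriptExtensions.txt if the last group is widened to capture several space-separated codes.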

All 168 Scripts at a Glance

The 168 scripts in Unicode 16.0 break down roughly as:

Category | Count | Examples
Modern, widely used | ~30 | Latin, Arabic, Han, Cyrillic, Devanagari, Bengali, Thai
Modern, regional | ~50 | Georgian, Armenian, Ethiopic, Tibetan, Sinhala, Myanmar
Historical | ~60 | Egyptian Hieroglyphs, Cuneiform, Linear A, Phoenician, Gothic
Notation systems | ~10 | Braille, Duployan shorthand, SignWriting
Special values | 3 | Common, Inherited, Unknown

Musical symbols are not a script of their own; those characters are Script=Common. The "Unknown" script (Zzzz) is assigned to unassigned, private-use, surrogate, and noncharacter code points.

Edge Cases

Han unification: CJK Unified Ideographs are assigned Script=Han, even though they are used by Chinese, Japanese, and Korean — three very different languages. The Script property does not distinguish between these uses; that requires language-level metadata.

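Emoji: Most emoji have Script=Common because they are not tied to any writing system, and the same holds for pictographic sets such as Mahjong tiles. A few symbols built from a script's letters keep that script instead, for example U+1F200 SQUARE HIRAGANA HOKA (Script=Hiragana).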

Digits from other scripts: While ASCII digits 0–9 are Script=Common, digits from other scripts are assigned to those scripts: Devanagari digits (०–९) are Script=Devanagari, Thai digits (๐–๙) are Script=Thai.

Latin vs. Common overlap: Many characters that look "Latin" are actually Script=Common because they are used beyond Latin contexts. Examples include the section sign (§), the pilcrow (¶), and most mathematical operators.

Summary

  • The Unicode Script property assigns every character to a writing system (168 scripts in Unicode 16.0).
  • Common and Inherited are special scripts for shared characters and combining marks.
  • Script_Extensions lists all scripts that legitimately use a character — essential for accurate mixed-script detection.
  • Scripts differ from blocks (code point ranges) and General Categories (character types): a script is a linguistic classification.
  • Use script detection for language identification, security (homograph attacks), text segmentation, font selection, and bidirectional layout.
  • Query scripts with fontTools.unicodedata.script() in Python, \p{Script=Latin} in JavaScript regex, or Character.UnicodeScript.of() in Java.
  • The authoritative source is the UCD Scripts.txt file, supplemented by ScriptExtensions.txt.
