Unicode Scripts: How Writing Systems are Organized
Unicode assigns every character to a script property that identifies the writing system it belongs to, such as Latin, Arabic, or Han. This guide explains the Unicode script system, how it differs from blocks, and how scripts are used in internationalization and security.
When you read text on a screen, you instinctively recognize writing systems — Latin letters in an English sentence, Arabic script flowing right to left, Chinese characters stacked in dense columns. Unicode formalizes this intuition through the Script property: every character is assigned to the writing system it belongs to. This guide explains how the Unicode Script system works, lists the scripts defined in the current standard, shows how Script Extensions handle characters shared between writing systems, and demonstrates how to use script information for internationalization, security, and text analysis.
What Is a Unicode Script?
A Unicode script is a collection of characters used to represent one or more writing
systems. The Script property (abbreviated sc) is defined in the Unicode Character Database
file Scripts.txt and assigns every code point to exactly one script.
As of Unicode 16.0, there are 168 scripts — ranging from widely used modern scripts like Latin and Arabic to historical scripts like Egyptian Hieroglyphs and Linear B.
Each script has two identifiers:
| Form | Example | Where Used |
|---|---|---|
| Long name | Latin | Documentation, verbose APIs |
| ISO 15924 code | Latn | Four-letter code, BCP 47 language tags |
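ISO 15924 codes appear as the script subtag of BCP 47 language tags such as sr-Latn-RS. The following is a minimal sketch of pulling that subtag out of a tag; the function name script_subtag is hypothetical, and the parsing is deliberately simplified (it does not validate the tag or handle private-use tags that start with x-).

```python
from typing import Optional

def script_subtag(tag: str) -> Optional[str]:
    """Return the four-letter script subtag of a BCP 47 tag, or None if absent."""
    subtags = tag.split("-")
    # The script subtag, when present, follows the language subtag and any
    # three-letter extended-language subtags, and is exactly four ASCII letters.
    for sub in subtags[1:]:
        if len(sub) == 4 and sub.isascii() and sub.isalpha():
            return sub.title()  # conventional casing, e.g. "Latn"
        if not (len(sub) == 3 and sub.isalpha()):
            break  # a region, variant, or extension: no script subtag follows
    return None

print(script_subtag("sr-Latn-RS"))  # Latn
print(script_subtag("en-US"))       # None
```

Note that BCP 47 tags may carry ISO 15924 codes such as Hant or Hans that are not values of the Unicode Script property, so the result is not always something a Script lookup will return.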
The Special Scripts: Common and Inherited
Two scripts are not true writing systems but serve crucial organizational roles:
Common
Characters with Script=Common are used across multiple writing systems. Examples:
| Character | Name | Why Common |
|---|---|---|
| 0–9 | ASCII digits | Used by virtually all modern scripts |
| . , ; : | Basic punctuation | Shared across scripts |
| $ € ¥ | Currency symbols | Not tied to one writing system |
| + = < > | Math operators | Universal |
| @ # & | Commercial symbols | Universal |
More than 8,000 characters are classified as Common.
Inherited
Characters with Script=Inherited inherit their effective script from the preceding base
character. These are almost exclusively combining marks:
| Character | Name | Behavior |
|---|---|---|
| U+0300 | Combining Grave Accent | Inherits script of base letter |
| U+0308 | Combining Diaeresis | Inherits script of base letter |
| U+0302 | Combining Circumflex Accent | Inherits script of base letter |
| U+20E3 | Combining Enclosing Keycap | Used in emoji keycap sequences |
When ë is stored as e + U+0308, the diaeresis (U+0308, Script=Inherited) takes on the
Latin script of the base e. If the same diaeresis appears after a Cyrillic е, it becomes
effectively Cyrillic.
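The inheritance rule can be sketched in a few lines. This is an illustration under stated assumptions: toy_script is a hypothetical stand-in that knows only ASCII letters, the basic Russian alphabet, and the combining marks U+0300 to U+036F; a real implementation would look values up in the Unicode Character Database.

```python
from typing import Callable, List

def toy_script(ch: str) -> str:
    """Illustrative stand-in for a real Script lookup (covers only a few ranges)."""
    cp = ord(ch)
    if 0x0300 <= cp <= 0x036F:
        return "Zinh"  # combining diacritical marks: Inherited
    if ch.isascii() and ch.isalpha():
        return "Latn"
    if 0x0410 <= cp <= 0x044F:
        return "Cyrl"  # basic Russian alphabet only
    return "Zyyy"      # everything else is treated as Common in this toy

def effective_scripts(text: str,
                      script_of: Callable[[str], str] = toy_script) -> List[str]:
    """Replace each Inherited value with the script of the preceding character."""
    result: List[str] = []
    previous = "Zyyy"  # a leading mark with no base stays Common here
    for ch in text:
        sc = script_of(ch)
        if sc == "Zinh":
            sc = previous
        result.append(sc)
        previous = sc
    return result

print(effective_scripts("e\u0308"))       # ['Latn', 'Latn']
print(effective_scripts("\u0435\u0308"))  # ['Cyrl', 'Cyrl']
```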
Major Modern Scripts
Here is a selection of the most widely used scripts by number of native users:
| Script | Code | Characters | Major Languages | Direction |
|---|---|---|---|---|
| Latin | Latn | ~1,500 | English, Spanish, French, German, Turkish, Vietnamese | LTR |
| Han | Hani | ~97,000+ | Chinese, Japanese (kanji), Korean (hanja) | LTR / TTB |
| Arabic | Arab | ~1,300 | Arabic, Farsi, Urdu, Pashto, Malay (Jawi) | RTL |
| Devanagari | Deva | ~164 | Hindi, Marathi, Sanskrit, Nepali | LTR |
| Bengali | Beng | ~96 | Bengali, Assamese | LTR |
| Cyrillic | Cyrl | ~506 | Russian, Ukrainian, Bulgarian, Serbian, Kazakh | LTR |
| Hangul | Hang | ~11,739 | Korean | LTR |
| Tamil | Taml | ~72 | Tamil | LTR |
| Telugu | Telu | ~96 | Telugu | LTR |
| Thai | Thai | ~87 | Thai | LTR |
| Katakana | Kana | ~300 | Japanese | LTR |
| Hiragana | Hira | ~376 | Japanese | LTR |
| Greek | Grek | ~518 | Greek | LTR |
| Hebrew | Hebr | ~134 | Hebrew, Yiddish | RTL |
| Georgian | Geor | ~173 | Georgian | LTR |
| Armenian | Armn | ~96 | Armenian | LTR |
| Ethiopic | Ethi | ~523 | Amharic, Tigrinya, Oromo | LTR |
Historical Scripts
Unicode also encodes scripts no longer in everyday use:
| Script | Code | Era | Characters |
|---|---|---|---|
| Egyptian Hieroglyphs | Egyp | ~3200 BCE – 400 CE | ~5,000+ |
| Cuneiform | Xsux | ~3400 BCE – 75 CE | ~1,234 |
| Linear B | Linb | ~1450 – 1200 BCE | ~211 |
| Phoenician | Phnx | ~1050 – 150 BCE | ~29 |
| Old Persian | Xpeo | ~525 – 330 BCE | ~50 |
| Gothic | Goth | ~350 – 600 CE | ~27 |
| Coptic | Copt | ~100 CE – present (liturgical) | ~137 |
| Old Italic | Ital | ~700 – 100 BCE | ~39 |
Scripts vs. Blocks
This distinction trips up many developers. Here is the key difference:
| Aspect | Script | Block |
|---|---|---|
| Definition | Writing system a character belongs to | Contiguous code point range |
| Basis | Linguistic function | Code point location |
| Overlap | A character has exactly one Script value | A character is in exactly one Block |
| Multi-block | Latin spans 15+ blocks | Each block is one range |
| Multi-script | A block can contain multiple scripts | A script can span multiple blocks |
| Stability | Script value is stable once assigned | Block boundaries are stable once created |
Example: The code point U+0041 (LATIN CAPITAL LETTER A):
- Script: Latin
- Block: Basic Latin
The code point U+0030 (DIGIT ZERO):
- Script: Common (digits are shared by all scripts)
- Block: Basic Latin
Both are in the same block but have different scripts.
Script Extensions
Some characters are legitimately used by multiple scripts. The basic Script property forces
a single assignment, but the Script_Extensions property (scx) allows listing all
scripts that use the character.
Example: U+0660 — ARABIC-INDIC DIGIT ZERO (٠)
Script = Arabic
Script_Extensions = Arabic, Thaana
This digit is used in both Arabic and Thaana (the script of Dhivehi/Maldivian), so Script_Extensions lists both.
Example: U+3001 — IDEOGRAPHIC COMMA (、)
Script = Common
Script_Extensions = Bopomofo, Hangul, Han, Hiragana, Katakana, Yi
This punctuation character is shared across multiple East Asian scripts.
Why Script Extensions Matter
Mixed-script detection algorithms use Script_Extensions rather than the base Script property. The Single Script restriction level from Unicode Technical Standard #39 (Unicode Security Mechanisms) checks whether all characters in a string share at least one script once their Script_Extensions sets are intersected, with Common and Inherited characters matching any script. This is more permissive (and more accurate) than comparing base Script values, which would wrongly flag a Thaana word containing U+0660 as mixed-script.
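As a rough sketch of that intersection idea, the code below computes the set of scripts shared by every character in a string. SAMPLE_SCX is a hand-copied, hypothetical table covering only six characters, with the U+0660 entry taken from the example above (newer Unicode data may list additional scripts); a real checker would load Scripts.txt and ScriptExtensions.txt and follow UTS #39 in full.

```python
from typing import Dict, FrozenSet, Optional

# Illustrative sample data only; values are ISO 15924 codes.
SAMPLE_SCX: Dict[str, FrozenSet[str]] = {
    "a": frozenset({"Latn"}),
    "\u0430": frozenset({"Cyrl"}),          # CYRILLIC SMALL LETTER A
    "\u0627": frozenset({"Arab"}),          # ARABIC LETTER ALEF
    "\u0660": frozenset({"Arab", "Thaa"}),  # ARABIC-INDIC DIGIT ZERO
    "\u0780": frozenset({"Thaa"}),          # THAANA LETTER HAA
    "1": frozenset({"Zyyy"}),               # Common, no extensions
}

def resolved_script_set(text: str,
                        scx: Dict[str, FrozenSet[str]] = SAMPLE_SCX
                        ) -> Optional[FrozenSet[str]]:
    """Intersect the Script_Extensions sets of all characters in text.

    Returns None when every character is Common or Inherited, meaning the
    string is compatible with any script.
    """
    resolved: Optional[FrozenSet[str]] = None
    for ch in text:
        scripts = scx[ch]
        if scripts <= {"Zyyy", "Zinh"}:
            continue  # Common/Inherited characters do not narrow the set
        resolved = scripts if resolved is None else resolved & scripts
    return resolved

def is_single_script(text: str) -> bool:
    resolved = resolved_script_set(text)
    return resolved is None or len(resolved) > 0

print(is_single_script("\u0780\u0660"))  # True: Thaana letter + shared digit
print(is_single_script("a\u0430"))       # False: Latin a + Cyrillic а
```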
Querying Script Information in Code
Python
The unicodedata module does not provide script information directly. Use fontTools or
the unicodescripts package:
```python
from fontTools.unicodedata import script, script_extension

script("A")        # 'Latn' (Latin)
script("\u0410")   # 'Cyrl' (Cyrillic А)
script("\u4E2D")   # 'Hani' (Han 中)
script("0")        # 'Zyyy' (Common)
script("\u0300")   # 'Zinh' (Inherited, combining grave accent)

# Script Extensions: returns a set of ISO 15924 codes. For Arabic-Indic digit
# zero this includes 'Arab' and 'Thaa'; the exact set depends on the Unicode
# data version bundled with the library.
script_extension("\u0660")
```
Note: Zyyy is the ISO 15924 code for Common, and Zinh is the code for Inherited.
JavaScript
JavaScript regex supports script matching via Unicode property escapes:
```javascript
// Test for specific scripts
/^\p{Script=Latin}$/u.test("A");      // true
/^\p{Script=Cyrillic}$/u.test("Д");   // true
/^\p{Script=Han}$/u.test("中");        // true
/^\p{Script=Common}$/u.test("3");     // true

// Using Script_Extensions
/^\p{Script_Extensions=Latin}$/u.test("3");        // false (3 is Common)
/^\p{Script_Extensions=Arabic}$/u.test("\u0660");  // true
```
Java
```java
Character.UnicodeScript script = Character.UnicodeScript.of('A');
// script == Character.UnicodeScript.LATIN

Character.UnicodeScript han = Character.UnicodeScript.of(0x4E2D);
// han == Character.UnicodeScript.HAN
```
Regular Expressions (PCRE, ICU)
\p{Latin} — any Latin script character
\p{Cyrillic} — any Cyrillic character
\p{Han} — any CJK ideograph
\p{Arabic} — any Arabic character
\p{Common} — any Common character (digits, punctuation, symbols)
Practical Applications
1. Script Detection — What Language Family Is This Text?
Detecting which scripts appear in a string is the first step toward language identification:
```python
from collections import Counter

from fontTools.unicodedata import script


def detect_scripts(text: str) -> dict[str, int]:
    """Count characters per script, most frequent first."""
    counts: Counter[str] = Counter()
    for ch in text:
        sc = script(ch)
        if sc not in ("Zyyy", "Zinh", "Zzzz"):  # Skip Common, Inherited, Unknown
            counts[sc] += 1
    return dict(counts.most_common())


detect_scripts("Hello 世界!")
# {'Latn': 5, 'Hani': 2}
detect_scripts("Привет мир")
# {'Cyrl': 9}
```
2. Mixed-Script Detection — Security
One of the most important security applications of script detection is identifying
homograph attacks — where an attacker registers a domain name using characters from
multiple scripts that visually resemble ASCII. For example, mixing Cyrillic а (U+0430)
with Latin a (U+0061).
Unicode Technical Standard #39 defines restriction levels:
| Level | Rule | Example |
|---|---|---|
| ASCII Only | Only ASCII characters | example.com |
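| Single Script | All characters share at least one script once Script_Extensions are considered (Common and Inherited match any script) | münchen.de |
| Highly Restrictive | Single Script, or covered by Latin + Han + Hiragana + Katakana, Latin + Han + Bopomofo, or Latin + Han + Hangul | Latin + Han + Hiragana |
| Moderately Restrictive | Highly Restrictive, or Latin plus one other Recommended script, excluding Cyrillic and Greek | Latin + Thai |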
```python
from fontTools.unicodedata import script


def is_single_script(text: str) -> bool:
    """Simplified check based on the base Script property only."""
    scripts = set()
    for ch in text:
        sc = script(ch)
        if sc not in ("Zyyy", "Zinh"):  # Common and Inherited fit any script
            scripts.add(sc)
    return len(scripts) <= 1


is_single_script("Hello")        # True (all Latin)
is_single_script("H\u0435llo")   # False! (Latin H + Cyrillic е, U+0435, + Latin llo)
```
3. Text Segmentation — Japanese
Japanese text mixes three scripts (Hiragana, Katakana, Han) with no spaces between words. Script boundaries provide valuable segmentation hints:
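東京タワーの高さは333mです
Han: 東京 (Tokyo)
Katakana: タワー (tower; the long-vowel mark ー is Script=Common, with Hiragana and Katakana in its Script_Extensions)
Hiragana: の (particle)
Han: 高 (stem of 高さ, "height")
Hiragana: さは (suffix さ + particle は)
Common: 333
Latin: m
Hiragana: です (copula)
Note that script boundaries do not always coincide with word boundaries: the word 高さ spans a Han-to-Hiragana transition, and the run さは joins the end of one word to the following particle.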
Word-break algorithms for Japanese use script transitions as one of many signals for identifying word boundaries.
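Splitting a string into script runs can be sketched as follows. This is an illustration, not a word breaker: toy_scx is a hypothetical lookup that covers only the characters in the sample sentence, and it returns Script_Extensions-style sets so that the long-vowel mark ー (Script=Common, shared by Hiragana and Katakana) can join a neighbouring kana run.

```python
from typing import FrozenSet, List, Tuple

def toy_scx(ch: str) -> FrozenSet[str]:
    """Rough Script_Extensions lookup covering only the sample's characters."""
    cp = ord(ch)
    if cp == 0x30FC:               # prolonged sound mark, shared by both kana
        return frozenset({"Hira", "Kana"})
    if 0x3041 <= cp <= 0x3096:
        return frozenset({"Hira"})
    if 0x30A1 <= cp <= 0x30FA:
        return frozenset({"Kana"})
    if 0x4E00 <= cp <= 0x9FFF:
        return frozenset({"Hani"})
    if ch.isascii() and ch.isalpha():
        return frozenset({"Latn"})
    return frozenset({"Zyyy"})     # treat everything else as Common

def script_runs(text: str) -> List[Tuple[str, str]]:
    """Group consecutive characters whose script sets overlap."""
    runs: List[Tuple[FrozenSet[str], str]] = []
    for ch in text:
        scx = toy_scx(ch)
        if runs and runs[-1][0] & scx:
            current, chunk = runs[-1]
            runs[-1] = (current & scx, chunk + ch)  # narrow the run's script set
        else:
            runs.append((scx, ch))
    return [("/".join(sorted(scripts)), chunk) for scripts, chunk in runs]

for label, chunk in script_runs("東京タワーの高さは333mです"):
    print(label, chunk)
# Hani 東京
# Kana タワー
# Hira の
# Hani 高
# Hira さは
# Zyyy 333
# Latn m
# Hira です
```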
4. Font Selection
Text rendering engines use the Script property to select appropriate fonts. When a document contains Latin, Arabic, and CJK text, the renderer needs three different fonts (or a single font with coverage for all three scripts). The Script property tells the engine which font to apply to each run of text.
5. Bidirectional Text Layout
The Unicode Bidirectional Algorithm (UBA) does not read the Script property directly; it works from each character's Bidi_Class property, which correlates closely with script. Arabic and Hebrew letters have a strong right-to-left (RTL) class, while Latin and CJK characters have a strong left-to-right (LTR) class. Most Common characters are either "weak" (digits) or "neutral" (most punctuation and spaces) and take their direction from the surrounding strong characters.
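The strong, weak, and neutral classes mentioned above are recorded in the Bidi_Class property, which Python's standard unicodedata module exposes. A small sketch (the helper name bidi_classes is just for illustration):

```python
import unicodedata
from typing import List

def bidi_classes(text: str) -> List[str]:
    """Return the Bidi_Class abbreviation of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]

# Latin A, Hebrew alef, Arabic alef, ASCII digit, exclamation mark
print(bidi_classes("A\u05D0\u06277!"))
# ['L', 'R', 'AL', 'EN', 'ON']
```

Here L, R, and AL are strong classes, EN (European Number) is weak, and ON (Other Neutral) is neutral.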
The Scripts.txt Data File
The authoritative source for script assignments is the UCD file Scripts.txt:
```text
# Scripts-16.0.0.txt
0000..0040 ; Common # Cc, Sm, Po, ...
0041..005A ; Latin # Lu [26] LATIN CAPITAL LETTER A..Z
005B..0060 ; Common # Ps, Po, Sk, ...
0061..007A ; Latin # Ll [26] LATIN SMALL LETTER A..Z
007B..009F ; Common # Pe, Cc, ...
...
0600..0604 ; Arabic # Cf [5] ARABIC NUMBER SIGN..
```
Each line assigns a range of code points to a script. Download from:
https://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt
The companion file ScriptExtensions.txt provides the Script_Extensions data.
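Both files use the same simple line format (a code point or range, a semicolon, the value, then a # comment), so a lookup table is easy to build. The sketch below parses a short embedded sample rather than the real file; parse_scripts and make_lookup are hypothetical helper names, and code points missing from the data fall back to Unknown, which is the property's default value.

```python
import bisect
from typing import Callable, List, Tuple

SAMPLE = """\
# Abridged lines in the Scripts.txt format (illustrative sample)
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; Latin # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR
0030..0039    ; Common # Nd  [10] DIGIT ZERO..DIGIT NINE
"""

def parse_scripts(data: str) -> List[Tuple[int, int, str]]:
    """Parse lines into sorted (first, last, script) tuples."""
    ranges: List[Tuple[int, int, str]] = []
    for line in data.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        codepoints, script = (field.strip() for field in line.split(";"))
        first, _, last = codepoints.partition("..")
        ranges.append((int(first, 16), int(last or first, 16), script))
    ranges.sort()
    return ranges

def make_lookup(data: str) -> Callable[[str], str]:
    """Build a function mapping a character to its script long name."""
    ranges = parse_scripts(data)
    starts = [first for first, _, _ in ranges]

    def script_of(ch: str) -> str:
        cp = ord(ch)
        i = bisect.bisect_right(starts, cp) - 1  # last range starting <= cp
        if i >= 0 and cp <= ranges[i][1]:
            return ranges[i][2]
        return "Unknown"  # not listed in the data

    return script_of

script_of = make_lookup(SAMPLE)
print(script_of("A"), script_of("5"), script_of("\u00AA"))  # Latin Common Latin
```

With the full Scripts.txt passed in place of SAMPLE, the same binary search covers the whole code space.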
All 168 Scripts at a Glance
The 168 scripts in Unicode 16.0 break down roughly as:
| Category | Count | Examples |
|---|---|---|
| Modern, widely used | ~30 | Latin, Arabic, Han, Cyrillic, Devanagari, Bengali, Thai |
| Modern, regional | ~50 | Georgian, Armenian, Ethiopic, Tibetan, Sinhala, Myanmar |
| Historical | ~60 | Egyptian Hieroglyphs, Cuneiform, Linear A, Phoenician, Gothic |
| Notation systems | ~10 | Braille, Duployan shorthand, SignWriting |
| Special values | 3 | Common, Inherited, Unknown |
The "Unknown" script (Zzzz) is the value for unassigned, private-use, noncharacter, and surrogate code points.
Edge Cases
Han unification: CJK Unified Ideographs are assigned Script=Han, even though they are
used by Chinese, Japanese, and Korean — three very different languages. The Script property
does not distinguish between these uses; that requires language-level metadata.
Emoji: Most emoji have Script=Common because they are not tied to any writing system;
pictographic sets such as Mahjong tiles are Common as well. A few symbol-like characters
do carry a specific script, for example U+1F200 SQUARE HIRAGANA HOKA (Script=Hiragana).
Digits from other scripts: While ASCII digits 0–9 are Script=Common, digits from other
scripts are assigned to those scripts: Devanagari digits (०–९) are Script=Devanagari,
Thai digits (๐–๙) are Script=Thai.
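Although their Script values differ, these digits share the same General Category (Nd) and carry the same numeric values, which is one reason script and category are separate properties. A quick check using only Python's standard library:

```python
import unicodedata

# ASCII, Devanagari, and Thai digit seven
for ch in ("7", "\u096D", "\u0E57"):
    print(unicodedata.name(ch), unicodedata.category(ch), unicodedata.digit(ch))
# DIGIT SEVEN Nd 7
# DEVANAGARI DIGIT SEVEN Nd 7
# THAI DIGIT SEVEN Nd 7

# Python's int() accepts any Unicode decimal digits, whatever their script.
print(int("\u0967\u0968\u0969"))  # Devanagari १२३ -> 123
```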
Latin vs. Common overlap: Many characters that look "Latin" are actually Script=Common
because they are used beyond Latin contexts. Examples include the section sign (§), the
pilcrow (¶), and most mathematical operators.
Summary
- The Unicode Script property assigns every character to a writing system (168 scripts in Unicode 16.0).
- Common and Inherited are special scripts for shared characters and combining marks.
- Script_Extensions lists all scripts that legitimately use a character — essential for accurate mixed-script detection.
- Scripts differ from blocks (code point ranges) and General Categories (character types): a script is a linguistic classification.
- Use script detection for language identification, security (homograph attacks), text segmentation, font selection, and bidirectional layout.
- Query scripts with fontTools.unicodedata.script() in Python, \p{Script=Latin} in JavaScript regex, or Character.UnicodeScript.of() in Java.
- The authoritative source is the UCD Scripts.txt file, supplemented by ScriptExtensions.txt.