文字系统
字符所属的书写系统(如拉丁文、西里尔文、汉字)。Unicode 16.0定义了168种文字系统,Script属性是安全检测和混合文字系统检测的关键。
What Is a Unicode Script?
A Unicode Script is a collection of characters used to write one or more human languages. Unlike blocks (which are contiguous code-point ranges), a script groups characters by their cultural and historical writing system: Latin, Arabic, Han, Devanagari, Georgian, and so on. Unicode 15.1 defines 161 scripts.
Every assigned character carries a Script property value. Characters not associated with any specific writing system receive the value Common (punctuation, digits, emoji) or Inherited (combining marks that inherit the script of their base character, such as combining diacritical marks).
Script vs. Block
The distinction is important in practice:
- The Latin script spans dozens of blocks: Basic Latin, Latin-1 Supplement, Latin Extended-A through Latin Extended-G, IPA Extensions, and more.
- The CJK Unified Ideographs block contains characters that belong to multiple scripts (Han, and historically Bopomofo components).
- The Letterlike Symbols block is
Script=Commonbecause those symbols are used across many writing systems.
# Python 3.14+ exposes Script via unicodedata
import unicodedata
# unicodedata.script() — available in Python 3.14
for char in ["A", "α", "ب", "あ", "中"]:
try:
script = unicodedata.script(char)
except AttributeError:
script = "(requires Python 3.14)"
print(f"{char} Script={script}")
# A Script=Latin
# α Script=Greek
# ب Script=Arabic
# あ Script=Hiragana
# 中 Script=Han
# On older Python, use the 'regex' package:
import regex
print(bool(regex.match(r'\p{Script=Latin}', 'A'))) # True
print(bool(regex.match(r'\p{Script=Arabic}', 'ب'))) # True
Script Extensions
Some characters are legitimately used in more than one script. The Script_Extensions property lists all scripts that use a given character. For example, U+0951 DEVANAGARI STRESS SIGN UDATTA appears in Devanagari, Bengali, Gujarati, and a dozen other Indic scripts—its Script is Inherited, but its Script_Extensions lists all the scripts that employ it. Implementations that need precise script-segmentation should consult Script_Extensions rather than Script alone.
# regex package supports Script_Extensions:
import regex
# Match a character used in the Devanagari OR Bengali script
pattern = regex.compile(r'[\p{Script_Extensions=Devanagari}\p{Script_Extensions=Bengali}]')
Quick Facts
| Property | Value |
|---|---|
| Unicode property name | Script |
| Short alias | sc |
| Number of scripts (Unicode 15.1) | 161 |
| Special values | Common, Inherited, Unknown |
| Python 3.14 | unicodedata.script(char) |
| Older Python | regex package, \p{Script=Latin} |
| Companion property | Script_Extensions (scx) |
| Spec reference | Unicode Standard Annex #24 (UAX #24) |
相关术语
字符属性 中的更多内容
字符首次被分配时所在的Unicode版本,有助于判断各系统和软件版本的字符支持情况。
Unicode property (UAX#11) classifying characters as Narrow, Wide, Fullwidth, Halfwidth, Ambiguous, or …
Unicode property controlling how Arabic and Syriac characters connect to adjacent characters. …
Unicode property listing all scripts that use a character, broader than the …
将每个码位归入30个类别(Lu、Ll、Nd、So等)之一的分类体系,分为7大类:字母、标记、数字、标点、符号、分隔符和其他。
具有相同抽象内容但外观可能不同的两个字符序列,比规范等价更宽泛,例如fi ≈ fi,² ≈ 2。
将字符映射为其组成部分的过程。规范分解保留语义(é → e + ◌́),兼容分解可能改变语义(fi → fi)。
命名的连续码位范围(如基本拉丁文 = U+0000–U+007F)。Unicode 16.0定义了336个区块,每个码位恰好属于一个区块。
决定字符在双向文本中(LTR、RTL、弱、中性)行为方式的属性,由Unicode双向算法用于确定显示顺序。
由于稳定性策略规定Unicode名称不可更改,因此提供字符的备用名称,用于更正、缩写和别名。