Named contiguous range of code points (e.g. Basic Latin = U+0000–U+007F). Unicode 16.0 defines 338 blocks; blocks never overlap, so a code point belongs to at most one block, and code points outside every block have the value No_Block.

What is General Category?

Classification of every code point into one of 30 categories (Lu, Ll, Nd, So, etc.), grouped into 7 major classes: Letter, Mark, Number, Punctuation, Symbol, Separator, Other.
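As a minimal sketch of how the two-letter categories roll up into the 7 major classes, the first letter of the value returned by the standard-library `unicodedata.category()` identifies the class. The `MAJOR_CLASSES` table and the `describe_category` helper are illustrative names, not part of any library.

```python
import unicodedata

# First letter of a General Category value -> major class (illustrative table)
MAJOR_CLASSES = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "P": "Punctuation",
    "S": "Symbol",
    "Z": "Separator",
    "C": "Other",
}


def describe_category(char: str) -> tuple[str, str]:
    """Return (general category, major class) for a single character."""
    gc = unicodedata.category(char)
    return gc, MAJOR_CLASSES[gc[0]]


for ch in ["A", "a", "5", "!", "©", " ", "\u0301"]:
    gc, major = describe_category(ch)
    print(f"U+{ord(ch):04X}  {gc}  {major}")
```

The reported values depend on the Unicode version bundled with the interpreter, which `unicodedata.unidata_version` exposes.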

What is a confusable character?

Unicode's official term for characters that can be visually mistaken for one another, defined in confusables.txt (part of the UTS #39 security data). Broader than homoglyphs: it includes characters that are merely similar, not only identical.
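A rough sketch of how confusable data is typically applied: each string is reduced to a "skeleton" by mapping characters to a prototype, and two strings are treated as confusable when their skeletons match. The four-entry `CONFUSABLE_PROTOTYPES` table below is a hand-picked excerpt for illustration only; a real implementation would load the full confusables.txt. The function names are hypothetical.

```python
import unicodedata

# Tiny illustrative excerpt; the real mapping in confusables.txt has thousands of entries
CONFUSABLE_PROTOTYPES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}


def skeleton(text: str) -> str:
    """Decompose, map each character to its prototype, then decompose again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(CONFUSABLE_PROTOTYPES.get(c, c) for c in decomposed)
    return unicodedata.normalize("NFD", mapped)


def are_confusable(a: str, b: str) -> bool:
    """True when both strings reduce to the same skeleton."""
    return skeleton(a) == skeleton(b)


print(are_confusable("paypal", "p\u0430yp\u0430l"))  # Cyrillic а in place of Latin a
print(are_confusable("paypal", "payroll"))
```

With such a small table, many real confusable pairs go undetected; the sketch only shows the shape of the comparison.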

Properties

Script

The writing system a character belongs to (e.g. Latin, Cyrillic, Han). Unicode 16.0 defines 168 scripts; the Script property is essential for security checks and mixed-script detection.

2022-01-19 · Updated 2024-05-06

What Is a Unicode Script?

A Unicode Script is a collection of characters used to write one or more human languages. Unlike blocks (which are contiguous code-point ranges), a script groups characters by their cultural and historical writing system: Latin, Arabic, Han, Devanagari, Georgian, and so on. Unicode 15.1 defines 161 scripts.

Every assigned character carries a Script property value. Characters not associated with any specific writing system receive the value Common (punctuation, digits, emoji) or Inherited (combining marks that inherit the script of their base character, such as combining diacritical marks).
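To illustrate why Common and Inherited matter, here is a minimal mixed-script check that ignores both values, so digits, punctuation, and combining marks never count as a script change. The `SCRIPT_RANGES` table is a hand-picked excerpt chosen for this sketch, not the full Scripts.txt data, and `script_of` and `explicit_scripts` are hypothetical helper names.

```python
# Illustrative excerpt only; the authoritative ranges are in the UCD file Scripts.txt
SCRIPT_RANGES = [
    (0x0000, 0x0040, "Common"),     # controls, space, ASCII punctuation, digits
    (0x0041, 0x005A, "Latin"),      # A-Z
    (0x0061, 0x007A, "Latin"),      # a-z
    (0x0300, 0x036F, "Inherited"),  # combining diacritical marks
    (0x03B1, 0x03C9, "Greek"),      # lowercase alpha through omega
    (0x0410, 0x044F, "Cyrillic"),   # basic Cyrillic letters
]


def script_of(char: str) -> str:
    """Look up the script in the excerpt table above."""
    cp = ord(char)
    for start, end, name in SCRIPT_RANGES:
        if start <= cp <= end:
            return name
    return "(not in table)"  # placeholder, not a real Script value


def explicit_scripts(text: str) -> set[str]:
    """Scripts used in text, ignoring Common and Inherited characters."""
    return {script_of(c) for c in text} - {"Common", "Inherited"}


print(explicit_scripts("paypal 2024"))    # single script
print(explicit_scripts("p\u0430ypal"))    # Latin mixed with Cyrillic а
print(explicit_scripts("e\u0301"))        # combining acute does not add a script
```

A string whose `explicit_scripts` result contains more than one entry is a candidate for closer inspection, which is the basic idea behind mixed-script detection.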

Script vs. Block

The distinction is important in practice:

The Latin script spans dozens of blocks: Basic Latin, Latin-1 Supplement, Latin Extended-A through Latin Extended-G, IPA Extensions, and more.
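A single block can contain characters from more than one script: the Greek and Coptic block (U+0370–U+03FF) holds Greek letters alongside the Coptic letters U+03E2–U+03EF, which have Script=Coptic.
The Letterlike Symbols block is mostly Script=Common because those symbols are used across many writing systems, although a few of its characters, such as U+212A KELVIN SIGN (Latin) and U+2126 OHM SIGN (Greek), belong to a specific script.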

# unicodedata.script() is not present in every Python version, so look it up defensively
import unicodedata

# Resolve the function once; None if this interpreter does not provide it
get_script = getattr(unicodedata, "script", None)

for char in ["A", "α", "ب", "あ", "中"]:
    script = get_script(char) if get_script else "(unicodedata.script() unavailable)"
    print(f"{char}  Script={script}")

# Where unicodedata.script() is available:
# A  Script=Latin
# α  Script=Greek
# ب  Script=Arabic
# あ Script=Hiragana
# 中 Script=Han

# The third-party 'regex' package (pip install regex) supports \p{Script=...} directly:
import regex
print(bool(regex.match(r'\p{Script=Latin}', 'A')))     # True
print(bool(regex.match(r'\p{Script=Arabic}', 'ب')))    # True

Script Extensions

Some characters are legitimately used in more than one script. The Script_Extensions property lists all scripts that use a given character. For example, U+0951 DEVANAGARI STRESS SIGN UDATTA appears in Devanagari, Bengali, Gujarati, and a dozen other Indic scripts—its Script is Inherited, but its Script_Extensions lists all the scripts that employ it. Implementations that need precise script-segmentation should consult Script_Extensions rather than Script alone.

# regex package supports Script_Extensions:
import regex
# Match a character used in the Devanagari OR Bengali script
pattern = regex.compile(r'[\p{Script_Extensions=Devanagari}\p{Script_Extensions=Bengali}]')
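The segmentation advice can be sketched as follows: a character whose Script_Extensions set includes the script of the current run stays in that run instead of starting a new one. The example uses U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK, whose Script is Common and whose Script_Extensions are Hiragana and Katakana. Both tables are small hand-written excerpts for this sketch, and the function names are hypothetical.

```python
# Illustrative excerpts only; real data lives in Scripts.txt and ScriptExtensions.txt
SCRIPT_RANGES = [
    (0x3041, 0x3096, "Hiragana"),
    (0x30A1, 0x30FA, "Katakana"),
]
SCRIPT_EXTENSIONS = {
    0x30FC: {"Hiragana", "Katakana"},  # prolonged sound mark, Script=Common
}


def candidate_scripts(char: str) -> set[str]:
    """Scripts a character may belong to, preferring Script_Extensions."""
    cp = ord(char)
    if cp in SCRIPT_EXTENSIONS:
        return set(SCRIPT_EXTENSIONS[cp])
    for start, end, name in SCRIPT_RANGES:
        if start <= cp <= end:
            return {name}
    return set()  # not covered by this excerpt


def script_runs(text: str) -> list[tuple[str, list[str]]]:
    """Split text into runs whose characters share at least one script."""
    runs: list[list] = []  # each entry: [substring, set of still-possible scripts]
    for char in text:
        scripts = candidate_scripts(char)
        if runs and runs[-1][1] & scripts:
            runs[-1][0] += char
            runs[-1][1] = runs[-1][1] & scripts  # narrow to the shared scripts
        else:
            runs.append([char, scripts])
    return [(segment, sorted(scripts)) for segment, scripts in runs]


print(script_runs("ラーメン"))  # one Katakana run, including the sound mark
print(script_runs("ラーめ"))    # Katakana run, then a Hiragana run
```

Using Script alone, the sound mark would be Common and would need separate handling; with Script_Extensions it resolves naturally to whichever of its scripts the surrounding run uses.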

Quick Facts

Property	Value
Unicode property name	`Script`
Short alias	`sc`
Number of scripts (Unicode 15.1)	161
Special values	`Common`, `Inherited`, `Unknown`
Python standard library	`unicodedata.script(char)`, where available (not in every Python version)
Any Python version	third-party `regex` package, `\p{Script=Latin}`
Companion property	`Script_Extensions` (`scx`)
Spec reference	Unicode Standard Annex #24 (UAX #24)

Related terms

Block, General Category, Confusable character
