A named contiguous range of code points (for example, Basic Latin = U+0000–U+007F). Unicode 16.0 defines 336 blocks; blocks never overlap, so a code point belongs to at most one block.

What is General Category?

The classification of every code point into one of 30 categories (Lu, Ll, Nd, So, etc.), grouped into 7 major classes: Letter, Mark, Number, Punctuation, Symbol, Separator, Other.

What is a Confusable character?

Unicode's official term for pairs of characters that can be visually confused, defined in confusables.txt (part of the Unicode security data, UTS #39). The term is broader than homoglyphs: it includes characters that are merely similar, not only identical ones.

Properties

Script

The writing system a character belongs to (for example, Latin, Cyrillic, Han). Unicode 16.0 defines 168 scripts; the Script property is central to security checks and mixed-script detection.

2022-01-19 · Updated 2024-05-06

What Is a Unicode Script?

A Unicode Script is a collection of characters used to write one or more human languages. Unlike blocks (which are contiguous code-point ranges), a script groups characters by their cultural and historical writing system: Latin, Arabic, Han, Devanagari, Georgian, and so on. Unicode 15.1 defines 161 scripts.

Every assigned character carries a Script property value. Characters not tied to a single writing system receive Common (punctuation, digits, emoji) or Inherited (combining marks, such as the combining diacritical marks, which take on the script of their base character). Unassigned code points default to Unknown.

Script vs. Block

The distinction is important in practice:

The Latin script spans dozens of blocks: Basic Latin, Latin-1 Supplement, Latin Extended-A through Latin Extended-G, IPA Extensions, and more.
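A single block can hold several Script values: the CJK Symbols and Punctuation block contains Han characters such as U+3005 IDEOGRAPHIC ITERATION MARK alongside Common punctuation such as U+3001 IDEOGRAPHIC COMMA.
Most of the Letterlike Symbols block is Script=Common because those symbols are used across many writing systems, although a few members keep a specific script, such as U+212A KELVIN SIGN (Latin) and U+2126 OHM SIGN (Greek).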
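The script/block distinction can be sketched with the standard library alone. The block ranges below are the published ranges for three Latin-related blocks; the `SAMPLE_SCRIPTS` table and the helper names (`block_of`, `blocks_used_by`, `scripts_in_block`) are illustrative assumptions, with Script values hand-copied for four sample characters rather than read from the Unicode Character Database.

```python
# A block is one contiguous range, so a (first, last, name) table is enough.
# Only three blocks are listed; this is not the full Blocks.txt data.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
]

# Script values hand-copied for the sample characters only (illustrative data).
SAMPLE_SCRIPTS = {
    "A": "Latin",        # U+0041 LATIN CAPITAL LETTER A
    "7": "Common",       # U+0037 DIGIT SEVEN
    "\u00e9": "Latin",   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    "\u0101": "Latin",   # U+0101 LATIN SMALL LETTER A WITH MACRON
}


def block_of(char):
    """Return the block name for char, or 'No_Block' if outside this small table."""
    cp = ord(char)
    for first, last, name in BLOCKS:
        if first <= cp <= last:
            return name
    return "No_Block"


def blocks_used_by(script):
    """Blocks that contain at least one sample character of the given script."""
    return {block_of(c) for c, s in SAMPLE_SCRIPTS.items() if s == script}


def scripts_in_block(block):
    """Script values found among the sample characters of the given block."""
    return {s for c, s in SAMPLE_SCRIPTS.items() if block_of(c) == block}


print(sorted(blocks_used_by("Latin")))
print(sorted(scripts_in_block("Basic Latin")))
```

Even with four characters, the Latin script already reaches three blocks, while Basic Latin alone mixes Latin letters with Common digits.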

# This entry targets Python 3.14+, where it expects unicodedata.script().
# The try/except keeps the loop runnable on interpreters that lack the function.
import unicodedata

for char in ["A", "α", "ب", "あ", "中"]:
    try:
        script = unicodedata.script(char)
    except AttributeError:
        script = "(unicodedata.script() not available)"
    print(f"{char}  Script={script}")

# Expected when unicodedata.script() is available:
# A  Script=Latin
# α  Script=Greek
# ب  Script=Arabic
# あ  Script=Hiragana
# 中  Script=Han

# On older Python, use the third-party 'regex' package (pip install regex):
import regex
print(bool(regex.match(r'\p{Script=Latin}', 'A')))     # True
print(bool(regex.match(r'\p{Script=Arabic}', 'ب')))    # True
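Where neither option is available, Script values can also be derived from the UCD data file Scripts.txt, which lists code-point ranges and their script. The following is a minimal sketch under stated assumptions: `SAMPLE` is a hand-abbreviated excerpt in the Scripts.txt line format (not the full file), and `parse_ranges` and `script_of` are hypothetical helper names.

```python
from bisect import bisect_right

# Hand-abbreviated excerpt in the Scripts.txt line format (assumption: the real
# file contains many more ranges; load it from the UCD for actual use).
SAMPLE = """\
0030..0039    ; Common # Nd  [10] DIGIT ZERO..DIGIT NINE
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR
0300..036F    ; Inherited # Mn [112] COMBINING GRAVE ACCENT..COMBINING LATIN SMALL LETTER X
3041..3096    ; Hiragana # Lo  [86] HIRAGANA LETTER SMALL A..HIRAGANA LETTER SMALL KE
4E00..9FFF    ; Han # Lo [20992] CJK UNIFIED IDEOGRAPH-4E00..CJK UNIFIED IDEOGRAPH-9FFF
"""


def parse_ranges(text):
    """Parse 'XXXX..YYYY ; Script # comment' lines into sorted (first, last, script)."""
    ranges = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        span, script = (part.strip() for part in data.split(";"))
        first, _, last = span.partition("..")
        # A line with a single code point has no '..', so last falls back to first.
        ranges.append((int(first, 16), int(last or first, 16), script))
    ranges.sort()
    return ranges


RANGES = parse_ranges(SAMPLE)
STARTS = [first for first, _, _ in RANGES]


def script_of(char):
    """Binary-search the range table; code points not listed map to 'Unknown'."""
    cp = ord(char)
    i = bisect_right(STARTS, cp) - 1
    if i >= 0 and RANGES[i][0] <= cp <= RANGES[i][1]:
        return RANGES[i][2]
    return "Unknown"


for char in ["A", "7", "\u0301", "あ", "中"]:
    print(f"U+{ord(char):04X}  Script={script_of(char)}")
```

Because the sample is abbreviated, any character outside its six ranges reports Unknown here even though the full file assigns it a script.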

Script Extensions

Some characters are legitimately used in more than one script. The Script_Extensions property lists all scripts that use a given character. For example, U+0951 DEVANAGARI STRESS SIGN UDATTA appears in Devanagari, Bengali, Gujarati, and a dozen other Indic scripts. Its Script value is Inherited, but its Script_Extensions value lists every script that employs it. Implementations that need precise script segmentation should consult Script_Extensions rather than Script alone.

# regex package supports Script_Extensions:
import regex
# Match a character used in the Devanagari OR Bengali script
pattern = regex.compile(r'[\p{Script_Extensions=Devanagari}\p{Script_Extensions=Bengali}]')

Quick Facts

Property	Value
Unicode property name	`Script`
Short alias	`sc`
Number of scripts (Unicode 15.1)	161
Special values	`Common`, `Inherited`, `Unknown`
Python 3.14	`unicodedata.script(char)`
Older Python	`regex` package, `\p{Script=Latin}`
Companion property	`Script_Extensions` (`scx`)
Spec reference	Unicode Standard Annex #24 (UAX #24)

Related terms

Block, General Category, Confusable character
