What is Script?

The writing system a character belongs to (e.g., Latin, Cyrillic, Han). Unicode 16.0 defines 168 scripts; the Script property is important for security and for detecting mixed-script text.

What is General Category?

The classification of every code point into one of 30 categories (Lu, Ll, Nd, So, etc.), grouped into 7 major classes: Letter, Mark, Number, Punctuation, Symbol, Separator, Other.

What is Unicode Character Database (UCD)?

A machine-readable collection of data files that defines all Unicode character properties, including UnicodeData.txt, Blocks.txt, Scripts.txt, and many others.

Properties

Block

A named, contiguous range of code points (e.g., Basic Latin = U+0000–U+007F). Unicode 16.0 defines 338 blocks; blocks never overlap, so each code point belongs to at most one block.

2022-01-10 · Updated 2024-07-29

What Is a Unicode Block?

A Unicode block is a named, contiguous range of code points assigned to a related group of characters. Unicode 16.0 defines 338 blocks across the code point space (U+0000 to U+10FFFF). Every block starts at a multiple of 16 and spans a multiple of 16 code points, and a code point that falls outside every block has the Block value No_Block. Block boundaries are stable in practice and rarely change between Unicode versions; new blocks are allocated in previously unallocated ranges.

Each block has a distinctive name that broadly describes its contents: Basic Latin (U+0000–U+007F), Greek and Coptic (U+0370–U+03FF), or CJK Unified Ideographs (U+4E00–U+9FFF). The name is informational rather than prescriptive, so a block may contain characters from multiple scripts, or even unassigned code points.
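The block definitions live in Blocks.txt, one "start..end; name" entry per line. The following is a minimal sketch of reading that format, assuming a small hand-copied excerpt in place of the real file; the names BLOCKS_TXT_EXCERPT and parse_blocks are illustrative, not part of any library.

```python
import re

# Hand-copied excerpt in the Blocks.txt line format ("start..end; name").
# Assumption: a real run would read the complete file from the UCD instead.
BLOCKS_TXT_EXCERPT = """\
# Blocks-style excerpt (illustrative, not the complete file)
0000..007F; Basic Latin
0370..03FF; Greek and Coptic
2100..214F; Letterlike Symbols
4E00..9FFF; CJK Unified Ideographs
FE00..FE0F; Variation Selectors
"""

LINE_RE = re.compile(r"^([0-9A-F]{4,6})\.\.([0-9A-F]{4,6});\s*(.+)$")


def parse_blocks(text):
    """Return a list of (start, end, name) tuples from Blocks.txt-style text."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        match = LINE_RE.match(line)
        if match:
            start, end, name = match.groups()
            blocks.append((int(start, 16), int(end, 16), name.strip()))
    return blocks


blocks = parse_blocks(BLOCKS_TXT_EXCERPT)
for start, end, name in blocks:
    size = end - start + 1
    # Each block starts on a multiple of 16 and spans a multiple of 16.
    assert start % 16 == 0 and size % 16 == 0
    print(f"U+{start:04X}..U+{end:04X}  {size:>6}  {name}")
```

The size column makes the 16-code-point granularity visible: Variation Selectors occupies a single 16-code-point row, while CJK Unified Ideographs spans 20,992 code points.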

Blocks vs. Scripts

Blocks are purely positional: a character's block is determined solely by its code point. Scripts, by contrast, reflect linguistic or cultural affiliation. The Latin Extended Additional block (U+1E00–U+1EFF) contains Latin characters, but the Letterlike Symbols block (U+2100–U+214F) mixes Script values: ℂ (DOUBLE-STRUCK CAPITAL C) and ℓ (SCRIPT SMALL L) are Common, KELVIN SIGN (U+212A) is Latin, and OHM SIGN (U+2126) is Greek. A single block can span multiple scripts, and one script can span multiple blocks.
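To make the distinction concrete, here is a small sketch that checks block membership by range and looks up Script separately. Python's standard library does not expose the Script property, so the SCRIPT_OF table below is an assumption: a handful of values copied by hand from Scripts.txt for illustration.

```python
import unicodedata

# Range of the Letterlike Symbols block, from Blocks.txt.
LETTERLIKE_START, LETTERLIKE_END = 0x2100, 0x214F

# Script values hand-copied from Scripts.txt (illustrative subset only).
SCRIPT_OF = {
    "\u2102": "Common",  # DOUBLE-STRUCK CAPITAL C
    "\u2113": "Common",  # SCRIPT SMALL L
    "\u2126": "Greek",   # OHM SIGN
    "\u212A": "Latin",   # KELVIN SIGN
    "\u212B": "Latin",   # ANGSTROM SIGN
}


def in_letterlike_symbols(char):
    """Block membership is a pure range test on the code point."""
    return LETTERLIKE_START <= ord(char) <= LETTERLIKE_END


scripts_in_block = set()
for char, script in SCRIPT_OF.items():
    assert in_letterlike_symbols(char)
    scripts_in_block.add(script)
    print(f"U+{ord(char):04X}  {unicodedata.name(char):<26}  {script}")

# One block, several Script values.
print(sorted(scripts_in_block))
```

All five characters pass the same range test, yet they carry three different Script values, which is why block membership is a poor stand-in for script detection.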

import unicodedata

# Python's unicodedata module does not expose the Block property.
# This loop prints each character's code point and name; finding the
# block takes a range lookup on ord(char) against Blocks.txt, or a
# third-party package such as 'unicodeblock'.

for char in ["A", "α", "中", "😀"]:
    name = unicodedata.name(char, "<unnamed>")
    cp = ord(char)
    print(f"U+{cp:04X}  {char}  {name}")

# Output, annotated by hand with each character's block from Blocks.txt:
# U+0041  A  LATIN CAPITAL LETTER A       → Basic Latin
# U+03B1  α  GREEK SMALL LETTER ALPHA     → Greek and Coptic
# U+4E2D  中  CJK UNIFIED IDEOGRAPH-4E2D   → CJK Unified Ideographs
# U+1F600  😀  GRINNING FACE               → Emoticons
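The range lookup itself can be sketched with a sorted table and a binary search. This is one possible approach, not a standard API: BLOCKS is a hand-copied subset of Blocks.txt, and block_of is a hypothetical helper name.

```python
from bisect import bisect_right

# Hand-copied subset of Blocks.txt, sorted by start code point.
# Assumption: a complete table would be loaded from the UCD file.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x1F600, 0x1F64F, "Emoticons"),
]
STARTS = [start for start, _, _ in BLOCKS]


def block_of(char):
    """Return the block name for a single character, or "No_Block"."""
    cp = ord(char)
    # Find the last block whose start is <= cp, then check its end.
    i = bisect_right(STARTS, cp) - 1
    if i >= 0:
        start, end, name = BLOCKS[i]
        if cp <= end:
            return name
    # Also returned for blocks that are missing from this subset.
    return "No_Block"


for char in ["A", "α", "中", "😀"]:
    print(f"U+{ord(char):04X}  {char}  {block_of(char)}")
```

Because blocks never overlap, a single binary search over the start values is enough; with the full table the lookup stays logarithmic in the number of blocks.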

Why Blocks Matter

Block ranges show up in CSS (the code-point ranges given to @font-face unicode-range often follow block boundaries), in regular expressions (\p{Block=CJK_Unified_Ideographs} in Perl or Java), and in font subsetting tools. Knowing a character's block helps font engineers decide which code-point ranges to include in a subset, reducing file size while preserving coverage.
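As a rough illustration of the subsetting use case, the sketch below finds which blocks a sample text touches and formats them as a CSS unicode-range value. The block table is a hand-copied subset of Blocks.txt, and blocks_used and unicode_range are hypothetical helper names; real subsetting tools usually work from exact code-point coverage rather than whole blocks.

```python
# Block ranges hand-copied from Blocks.txt; the selection is illustrative.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
]


def blocks_used(text):
    """Return the entries of BLOCKS that contain at least one character of text."""
    used = []
    for start, end, name in BLOCKS:
        if any(start <= ord(ch) <= end for ch in text):
            used.append((start, end, name))
    return used


def unicode_range(blocks):
    """Format block ranges as a CSS unicode-range descriptor value."""
    return ", ".join(f"U+{start:04X}-{end:04X}" for start, end, _ in blocks)


sample = "Caf\u00e9 \u03b1 \u4e2d\u6587"  # "Café α 中文"
print("unicode-range:", unicode_range(blocks_used(sample)) + ";")
```

Rounding up to whole blocks keeps the descriptor short at the cost of declaring code points the text never uses, which is a reasonable trade-off for a first pass.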

Block assignments are sometimes used as a rough heuristic for guessing a character's script, but the Script property is the reliable source. Shaping engines such as HarfBuzz, for example, guess a run's script from the Script property of its characters when no explicit script tag is available.

Quick Facts

Property	Value
Unicode property name	`Block`
Short property alias	`blk`
Number of defined blocks (Unicode 16.0)	338
Smallest block	16 code points (e.g., Variation Selectors, U+FE00–U+FE0F)
Largest block	Supplementary Private Use Area-A and -B, 65,536 code points each; CJK Unified Ideographs (U+4E00–U+9FFF) spans 20,992 code points
Python access	No built-in; use `ord()` with a range lookup over `Blocks.txt`, or the `unicodeblock` package
Regex syntax	`\p{Block=Basic_Latin}` (Perl/Java)
Spec reference	Unicode Standard Annex #44, `Blocks.txt`