What is Script?

The writing system to which a character belongs (for example, Latin, Cyrillic, Han). Unicode 16.0 defines 168 scripts; the Script property matters for security and for mixed-script detection.

What is General Category?

A classification of every code point into one of 30 categories (Lu, Ll, Nd, So, etc.), grouped into 7 major classes: Letter, Mark, Number, Punctuation, Symbol, Separator, Other.

What is Unicode Character Database (UCD)?

A collection of machine-readable data files that define all Unicode character properties, including UnicodeData.txt, Blocks.txt, Scripts.txt, and more.

Properties

Block

A named range of consecutive code points (for example, Basic Latin = U+0000–U+007F). Unicode 16.0 defines 338 blocks; every code point belongs to at most one block, and code points outside every block have the Block value No_Block.

2022-01-10 · Updated 2024-07-29

What Is a Unicode Block?

A Unicode block is a named, contiguous range of code points assigned to a related group of characters. Unicode 16.0 defines 338 blocks within the code point space (U+0000 to U+10FFFF). Each block starts on a multiple of 16 and contains a multiple of 16 code points, and code points that fall outside every block have the Block value No_Block. Block boundaries are stable in practice and rarely change between Unicode versions, though new blocks may be allocated in previously unassigned ranges.

Each block has a distinctive name that broadly describes its contents: Basic Latin (U+0000–U+007F), Greek and Coptic (U+0370–U+03FF), or CJK Unified Ideographs (U+4E00–U+9FFF). The name is informational rather than prescriptive, so a block may contain characters from multiple scripts, or even unassigned code points.
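As a quick check of the size rule, the sketch below computes the number of code points in the three example ranges quoted above. The `EXAMPLE_BLOCKS` table and the `block_size` helper are illustrative names, not part of any library, and the ranges are hard-coded rather than read from Blocks.txt.

```python
# Inclusive ranges copied from the examples in the text.
EXAMPLE_BLOCKS = {
    "Basic Latin": (0x0000, 0x007F),
    "Greek and Coptic": (0x0370, 0x03FF),
    "CJK Unified Ideographs": (0x4E00, 0x9FFF),
}


def block_size(start: int, end: int) -> int:
    """Number of code points in an inclusive range."""
    return end - start + 1


sizes = {name: block_size(*bounds) for name, bounds in EXAMPLE_BLOCKS.items()}
for name, size in sizes.items():
    print(f"{name}: {size} code points, multiple of 16: {size % 16 == 0}")
```

The three sizes come out as 128, 144, and 20,992 code points, each a multiple of 16.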

Blocks vs. Scripts

Blocks are purely positional: a character's block is determined solely by its numeric value. Scripts, by contrast, reflect linguistic or cultural affiliation. The Latin Extended Additional block (U+1E00–U+1EFF) contains Latin characters, but the Letterlike Symbols block (U+2100–U+214F) mixes Script values: most of its characters, such as ℂ (DOUBLE-STRUCK CAPITAL C) and ℓ (SCRIPT SMALL L), have the script Common, while OHM SIGN (U+2126) is Greek and KELVIN SIGN (U+212A) is Latin. A single block can span multiple scripts, and one script can span multiple blocks.
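To illustrate one script spanning several blocks, the sketch below places three Latin letters in their blocks. It rests on two assumptions: the small `LATIN_BLOCKS` table is hard-coded from Blocks.txt, and the `LATIN` prefix of the character name stands in for the Script property, which the standard library does not expose. The name prefix is only a rough heuristic, not a substitute for Scripts.txt.

```python
import unicodedata

# Inclusive ranges for three blocks that hold Latin letters (hard-coded sample).
LATIN_BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x1E00, 0x1EFF, "Latin Extended Additional"),
]


def sample_block_of(char: str) -> str:
    """Return the block name from the small table above."""
    cp = ord(char)
    for start, end, name in LATIN_BLOCKS:
        if start <= cp <= end:
            return name
    return "<not in sample table>"


# A, e with acute, and capital sharp s: three Latin letters in three blocks.
samples = ["A", "\u00e9", "\u1e9e"]
rows = [(c, sample_block_of(c), unicodedata.name(c)) for c in samples]
for char, block, name in rows:
    print(f"U+{ord(char):04X}  {block:<26}  {name}")
```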

import unicodedata

# Print the code point and name of a few characters.
# Python's unicodedata module does not expose the Block property,
# so the block cannot be read from it directly; computing it needs a
# range lookup against Blocks.txt or a third-party package.

for char in ["A", "α", "中", "😀"]:
    name = unicodedata.name(char, "<unnamed>")
    cp = ord(char)
    print(f"U+{cp:04X}  {char}  {name}")

# Printed output, with each character's block (per Blocks.txt) noted after the arrow:
# U+0041  A  LATIN CAPITAL LETTER A        → Basic Latin
# U+03B1  α  GREEK SMALL LETTER ALPHA      → Greek and Coptic
# U+4E2D  中  CJK UNIFIED IDEOGRAPH-4E2D   → CJK Unified Ideographs
# U+1F600  😀  GRINNING FACE               → Emoticons
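Since the loop above prints names only, here is a minimal sketch of the range lookup itself. It parses a few lines in the `START..END; Block Name` format used by Blocks.txt and binary-searches the sorted ranges. The embedded excerpt and the names `parse_blocks` and `block_of` are assumptions made for illustration; a real lookup would load the complete Blocks.txt from the UCD, and with this excerpt any code point outside the four listed blocks is reported as No_Block.

```python
from bisect import bisect_right

# Sample excerpt in Blocks.txt format; not the full file.
SAMPLE_BLOCKS_TXT = """\
# START..END; Block Name
0000..007F; Basic Latin
0370..03FF; Greek and Coptic
4E00..9FFF; CJK Unified Ideographs
1F600..1F64F; Emoticons
"""


def parse_blocks(text):
    """Return a list of (start, end, name) tuples sorted by start."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        range_part, name = line.split(";", 1)
        start_hex, end_hex = range_part.strip().split("..")
        blocks.append((int(start_hex, 16), int(end_hex, 16), name.strip()))
    return sorted(blocks)


PARSED_BLOCKS = parse_blocks(SAMPLE_BLOCKS_TXT)
BLOCK_STARTS = [start for start, _, _ in PARSED_BLOCKS]


def block_of(char):
    """Binary-search the sorted ranges; return "No_Block" if none matches."""
    cp = ord(char)
    i = bisect_right(BLOCK_STARTS, cp) - 1
    if i >= 0 and PARSED_BLOCKS[i][0] <= cp <= PARSED_BLOCKS[i][1]:
        return PARSED_BLOCKS[i][2]
    return "No_Block"


for char in ["A", "α", "中", "😀"]:
    print(f"U+{ord(char):04X}  {char}  {block_of(char)}")
```

Binary search keeps each lookup logarithmic in the number of blocks, which matters little for a few hundred ranges but keeps the code simple once the table is sorted.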

Why Blocks Matter

Block ranges are a common source for the code-point ranges written in CSS (@font-face unicode-range), and some regular-expression engines can match on the property directly (for example, \p{Block=CJK_Unified_Ideographs} in Perl or Java). Font subsetting tools accept the same kind of ranges. Knowing a character's block helps font engineers decide which code-point ranges to include in a subset, reducing file size while preserving coverage.
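A small sketch of turning block ranges into a CSS `unicode-range` value follows. The helper name `css_unicode_range` is hypothetical, and the two ranges (Basic Latin, Greek and Coptic) are hard-coded for illustration.

```python
def css_unicode_range(ranges):
    """Format inclusive (start, end) code point pairs as a unicode-range value."""
    parts = []
    for start, end in ranges:
        if start == end:
            parts.append(f"U+{start:04X}")
        else:
            parts.append(f"U+{start:04X}-{end:04X}")
    return ", ".join(parts)


subset = [(0x0000, 0x007F), (0x0370, 0x03FF)]
declaration = f"unicode-range: {css_unicode_range(subset)};"
print(declaration)
# unicode-range: U+0000-007F, U+0370-03FF;
```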

Block ranges can also serve as a coarse hint about which script a character probably belongs to. Text-processing code should rely on the Script property for that decision, however; shaping engines such as HarfBuzz consult Script, not Block, when no explicit script is set.

Quick Facts

Property	Value
Unicode property name	`Block`
Short property alias	`blk`
Number of defined blocks (Unicode 16.0)	338
Smallest block	16 code points (e.g., Variation Selectors, U+FE00–U+FE0F)
Largest block	Supplementary Private Use Area-A and -B, 65,536 code points each; largest block of assigned characters is CJK Unified Ideographs Extension B (U+20000–U+2A6DF, 42,720 code points)
Python access	No built-in; use a range lookup against `Blocks.txt` or a third-party package such as `unicodeblock`
Regex syntax	`\p{Block=Basic_Latin}` (Perl/Java)
Spec reference	Unicode Standard Annex #44, `Blocks.txt`
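Python's built-in `re` module has no `\p{Block=...}` syntax, so a block match has to be spelled out as an explicit character class. The sketch below shows one way to do that; `block_pattern` is a hypothetical helper, and the Greek and Coptic range is hard-coded.

```python
import re


def block_pattern(start, end):
    """Compile a character class matching one code point in [start, end]."""
    # \UXXXXXXXX escapes keep the pattern readable for any code point.
    return re.compile(f"[\\U{start:08X}-\\U{end:08X}]")


GREEK_AND_COPTIC = block_pattern(0x0370, 0x03FF)

text = "alpha = α, beta = β"
found = GREEK_AND_COPTIC.findall(text)
print(found)
```

Here `found` holds the two Greek letters and none of the ASCII characters.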


More in Properties

Name Alias

Alternative names for characters, since Unicode names cannot be changed under the policy …

Default Ignorable

Characters that have no visual effect and can be ignored by processes that …

Decomposition

A mapping of a character to its component parts. Canonical decomposition preserves meaning (é → e …

East Asian Width

Unicode property (UAX#11) classifying characters as Narrow, Wide, Fullwidth, Halfwidth, Ambiguous, or …

Joining Type

Unicode property controlling how Arabic and Syriac characters connect to adjacent characters. …

Bidirectional Category

A property that determines how characters behave in bidirectional text (LTR, RTL, …

General Category

A classification of every code point into one of 30 categories (Lu, …

Combining Class

A numeric value (0–254) that controls the ordering of combining marks during canonical decomposition, determining …

Canonical Equivalence

Two character sequences that are semantically identical and must be treated the same. Example: …

Compatibility Equivalence

Two character sequences with the same abstract content that may differ in …
