What is Script?

The writing system to which a character belongs (e.g. Latin, Cyrillic, Han). Unicode 16.0 defines 168 scripts. The Script property matters for security and for detecting mixed-script text.

What is General Category?

A classification of every code point into one of 30 categories (Lu, Ll, Nd, So, etc.), grouped into 7 major classes: Letter, Mark, Number, Punctuation, Symbol, Separator, and Other.

What is Unicode Character Database (UCD)?

A collection of machine-readable data files that define all Unicode character properties, including UnicodeData.txt, Blocks.txt, Scripts.txt, and others.

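Properties

Block

A named, contiguous range of code points (e.g. Basic Latin = U+0000–U+007F). Unicode 16.0 defines 338 blocks. Every code point belongs to at most one block; code points outside every allocated block have the Block value No_Block.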

2022-01-10 · Updated 2024-07-29

What Is a Unicode Block?

A Unicode block is a named, contiguous range of code points assigned to a related group of characters. The Unicode Standard allocates blocks across the code point space (U+0000 to U+10FFFF); Unicode 16.0 defines 338 of them. Each block starts at a multiple of 16 and spans a multiple of 16 code points, and code points that fall outside every block have the Block value No_Block. Block boundaries rarely change between Unicode versions; growth normally comes from new blocks allocated in previously unallocated ranges.

Each block has a distinctive name that broadly describes its contents: Basic Latin (U+0000–U+007F), Greek and Coptic (U+0370–U+03FF), or CJK Unified Ideographs (U+4E00–U+9FFF). The name is informational rather than prescriptive, so a block may contain characters from multiple scripts, or even unassigned code points.
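The range-and-name structure described above can be sketched by parsing a few lines in the format used by the UCD file Blocks.txt (`start..end; name`). This is a minimal illustration, not a full loader: `BLOCKS_EXCERPT` and `parse_blocks` are hypothetical names introduced here, and the excerpt holds only four hand-copied entries.

```python
import re

# A few lines in the format used by the UCD file Blocks.txt.
# Illustrative excerpt only; a real program would read the whole file.
BLOCKS_EXCERPT = """\
# Start Code..End Code; Block Name
0000..007F; Basic Latin
0370..03FF; Greek and Coptic
4E00..9FFF; CJK Unified Ideographs
1F600..1F64F; Emoticons
"""

LINE_RE = re.compile(r"^([0-9A-F]{4,6})\.\.([0-9A-F]{4,6});\s*(.+)$")


def parse_blocks(text):
    """Return (start, end, name) tuples, skipping comments and blank lines."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"unrecognized line: {line!r}")
        start, end, name = match.groups()
        blocks.append((int(start, 16), int(end, 16), name.strip()))
    return blocks


blocks = parse_blocks(BLOCKS_EXCERPT)
for start, end, name in blocks:
    size = end - start + 1
    # Both the start and the size should be multiples of 16.
    aligned = start % 16 == 0 and size % 16 == 0
    print(f"U+{start:04X}..U+{end:04X}  {size:>6}  aligned={aligned}  {name}")
```

Every entry in the excerpt passes the alignment check, which matches the rule that blocks begin and end on 16-code-point boundaries.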

Blocks vs. Scripts

Blocks are purely positional: a character's block is determined by its code point alone. Scripts, by contrast, reflect linguistic or cultural affiliation. The Latin Extended Additional block (U+1E00–U+1EFF) contains only Latin characters, but the Letterlike Symbols block (U+2100–U+214F) mixes scripts: ℂ (DOUBLE-STRUCK CAPITAL C) has Script=Common, K (KELVIN SIGN) has Script=Latin, and Ω (OHM SIGN) has Script=Greek. A single block can span multiple scripts, and one script can span multiple blocks.

import unicodedata

# Print the code point and name of a few characters.
# Python's unicodedata module does not expose the Block property,
# so the block has to come from a range lookup against Blocks.txt
# or from a third-party package such as 'unicodeblock'.

for char in ["A", "α", "中", "😀"]:
    name = unicodedata.name(char, "<unnamed>")
    cp = ord(char)
    print(f"U+{cp:04X}  {char}  {name}")

# Output (the block after each arrow is an annotation, not printed):
# U+0041  A  LATIN CAPITAL LETTER A       → Basic Latin
# U+03B1  α  GREEK SMALL LETTER ALPHA     → Greek and Coptic
# U+4E2D  中  CJK UNIFIED IDEOGRAPH-4E2D  → CJK Unified Ideographs
# U+1F600  😀  GRINNING FACE              → Emoticons
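Since the standard library does not report blocks, a range lookup is one way to get them. The sketch below assumes a small hand-copied table (`BLOCKS`) and a hypothetical helper `block_of`; a real lookup would load every range from Blocks.txt. It returns `None` for code points that this partial table does not cover.

```python
from bisect import bisect_right

# Hypothetical subset of block ranges, sorted by start code point.
# A complete table would come from the UCD file Blocks.txt.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x2100, 0x214F, "Letterlike Symbols"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x1F600, 0x1F64F, "Emoticons"),
]
STARTS = [start for start, _, _ in BLOCKS]


def block_of(char):
    """Return the block name for one character, or None if the code point
    is not covered by the (partial) table above."""
    cp = ord(char)
    # Find the last range whose start is <= cp, then check its end.
    i = bisect_right(STARTS, cp) - 1
    if i >= 0:
        start, end, name = BLOCKS[i]
        if start <= cp <= end:
            return name
    return None


for char in ["A", "α", "中", "😀", "é"]:
    print(f"U+{ord(char):04X}  {block_of(char)}")
```

Binary search keeps the lookup fast even with the full table of several hundred ranges, and it works because block ranges never overlap.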

Why Blocks Matter

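Block ranges show up wherever software works with sets of code points. CSS @font-face rules take unicode-range values, which are plain code-point ranges that authors often align with block boundaries. Regular expression engines such as Perl, Java, and ICU accept block properties like \p{Block=CJK_Unified_Ideographs}. Font subsetting tools work the same way: knowing a character's block helps font engineers decide which code-point ranges to include in a subset, reducing file size while preserving coverage.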
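As a rough illustration of the CSS case, the following sketch formats block ranges as a unicode-range value. The helper name `unicode_range` is hypothetical, and the two ranges are the Basic Latin and Greek and Coptic blocks quoted earlier.

```python
def unicode_range(ranges):
    """Format (start, end) code point pairs as a CSS unicode-range value."""
    parts = []
    for start, end in ranges:
        if start == end:
            parts.append(f"U+{start:X}")          # single code point
        else:
            parts.append(f"U+{start:X}-{end:X}")  # inclusive interval
    return ", ".join(parts)


# Basic Latin and Greek and Coptic, as listed in Blocks.txt.
subset = [(0x0000, 0x007F), (0x0370, 0x03FF)]
print(f"unicode-range: {unicode_range(subset)};")
# prints: unicode-range: U+0-7F, U+370-3FF;
```

The resulting declaration would go inside an @font-face rule so that the browser downloads the font only when the page uses a code point in those ranges.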

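Block assignments can also serve as a coarse heuristic in text rendering, for example when picking a fallback font for a code point. Shaping engines such as HarfBuzz, however, derive the script from the Script property when no explicit script is set, so block membership is a hint rather than a substitute for script data.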

Quick Facts

Property	Value
Unicode property name	`Block`
Short property alias	`blk`
Number of defined blocks (Unicode 16.0)	338
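Smallest block	16 code points (for example Variation Selectors, U+FE00–U+FE0F)
Largest block	Supplementary Private Use Area-A and -B, 65,536 code points each; the largest block of assigned characters is CJK Unified Ideographs Extension B (U+20000–U+2A6DF, 42,720 code points)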
Python access	No built-in; use `ord()` plus a range lookup over `Blocks.txt`, or the `unicodeblock` package
Regex syntax	`\p{Block=Basic_Latin}` (Perl/Java/ICU)
Spec reference	Unicode Standard Annex #44, `Blocks.txt`

Related Terms

Script, General Category, Unicode Character Database (UCD)

More in Properties

East Asian Width

Unicode property (UAX#11) classifying characters as Narrow, Wide, Fullwidth, Halfwidth, Ambiguous, or …

Joining Type

Unicode property controlling how Arabic and Syriac characters connect to adjacent characters. …

Script Extensions

Unicode property listing all scripts that use a character, broader than the …

Grapheme Cluster

A user-perceived character: something that feels like a single unit but may consist of several code points (a base plus combining marks, or an emoji ZWJ sequence). 👩‍💻 = …

Case Mapping

Rules for converting characters between uppercase, lowercase, and titlecase. May depend on locale (the Turkish I problem) and may be one-to-many (ß → SS)

Decomposition

A mapping from a character to its component parts. Canonical decomposition preserves meaning (é → e + ́), while compatibility decomposition may change meaning …

Combining Class

A numeric value (0–254) that controls the ordering of combining marks during canonical decomposition. It determines which combining marks can be reordered.

Compatibility Equivalence

Two character sequences that represent the same abstract content but may differ in appearance. Broader than canonical equivalence. Examples: ﬁ ≈ fi, ² ≈ 2

Canonical Equivalence

Two character sequences that have the same meaning and should be treated as equal. Example: é (U+00E9) ≡ e + ◌́ (U+0065 + U+0301)

Mirroring Property

Characters whose glyphs should be mirrored horizontally in RTL contexts. Examples: ( → ), [ → ], { → }, …
