What is a script?

The writing system a character belongs to (e.g. Latin, Cyrillic, Han). Unicode 16.0 defines 168 scripts, and the Script property is central to security checks and mixed-script detection.

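What is the General Category?

A scheme that classifies every code point into one of 30 categories (Lu, Ll, Nd, So, etc.), grouped into 7 major classes (Letter, Mark, Number, Punctuation, Symbol, Separator, Other).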
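As a quick illustration, Python's standard `unicodedata.category()` returns the two-letter category, and the first letter identifies the major class. The `MAJOR_CLASSES` mapping and the `describe` helper are written by hand for this sketch and are not part of any library.

```python
import unicodedata

# First letter of the two-letter category -> major class (hand-written mapping).
MAJOR_CLASSES = {
    "L": "Letter", "M": "Mark", "N": "Number", "P": "Punctuation",
    "S": "Symbol", "Z": "Separator", "C": "Other",
}


def describe(char):
    """Return (general category, major class) for a single character."""
    category = unicodedata.category(char)
    return category, MAJOR_CLASSES[category[0]]


for char in ["A", "a", "5", ",", "©", " "]:
    print(repr(char), *describe(char))
```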

What is the Unicode Character Database (UCD)?

A collection of machine-readable data files that define all Unicode character properties, including UnicodeData.txt, Blocks.txt, and Scripts.txt.

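Properties

Block

A named, contiguous range of code points (e.g. Basic Latin = U+0000–U+007F). Unicode 16.0 defines 338 blocks. Every code point belongs to at most one block, and code points outside every block have the Block value No_Block.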

2022-01-10 · Updated 2024-07-29

What Is a Unicode Block?

A Unicode block is a named, contiguous range of code points assigned to a related group of characters. Unicode 16.0 defines 338 blocks within the code point space (U+0000 to U+10FFFF). Each block starts on a multiple of 16 and spans a multiple of 16 code points, and code points not covered by any block have the Block value No_Block. Block boundaries are stable in practice (changes between Unicode versions are rare), and new blocks are allocated in previously unallocated ranges.
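The size rule is easy to check mechanically. The sketch below parses a few lines in the `Blocks.txt` line format (`start..end; name`) and checks that each block starts on a multiple of 16 and spans a multiple of 16 code points. The sample text is a hand-copied excerpt rather than the full file, and the helper names `parse_blocks` and `is_aligned` are illustrative.

```python
# Hand-copied excerpt in the Blocks.txt line format; the real file
# ships with the Unicode Character Database.
SAMPLE_BLOCKS_TXT = """\
# Blocks (excerpt)
0000..007F; Basic Latin
0370..03FF; Greek and Coptic
4E00..9FFF; CJK Unified Ideographs
FE00..FE0F; Variation Selectors
"""


def parse_blocks(text):
    """Return a list of (start, end, name) tuples from Blocks.txt-style text."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        range_part, name = line.split(";", 1)
        start_hex, end_hex = range_part.strip().split("..")
        blocks.append((int(start_hex, 16), int(end_hex, 16), name.strip()))
    return blocks


def is_aligned(start, end):
    """True if the range starts on a multiple of 16 and its size is a multiple of 16."""
    size = end - start + 1
    return start % 16 == 0 and size % 16 == 0


blocks = parse_blocks(SAMPLE_BLOCKS_TXT)
for start, end, name in blocks:
    size = end - start + 1
    print(f"U+{start:04X}..U+{end:04X}  {size:>6}  {name}  aligned={is_aligned(start, end)}")
```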

Each block has a distinctive name that broadly describes its contents: Basic Latin (U+0000–U+007F), Greek and Coptic (U+0370–U+03FF), or CJK Unified Ideographs (U+4E00–U+9FFF). The name is informational rather than prescriptive, so a block may contain characters from multiple scripts, or even unassigned code points.

Blocks vs. Scripts

Blocks are purely positional: a character's block is determined solely by its code point value. Scripts, by contrast, reflect linguistic or cultural affiliation. The Latin Extended Additional block (U+1E00–U+1EFF) contains Latin characters, but the Letterlike Symbols block (U+2100–U+214F) mixes scripts: ℂ (DOUBLE-STRUCK CAPITAL C) and ℓ (SCRIPT SMALL L) have the script Common, OHM SIGN (U+2126) is Greek, and KELVIN SIGN (U+212A) is Latin. A single block can span multiple scripts, and one script can span multiple blocks.

import unicodedata

# Print the code point and name of a few characters.
# Python's unicodedata module does not expose the Block property, and a
# block cannot be derived reliably from the character name; it has to be
# looked up by code point range (from Blocks.txt or a third-party package).

for char in ["A", "α", "中", "😀"]:
    name = unicodedata.name(char, "<unnamed>")
    cp = ord(char)
    print(f"U+{cp:04X}  {char}  {name}")

# Output (the block names after the arrows are annotations, not printed):
# U+0041  A  LATIN CAPITAL LETTER A       → Basic Latin
# U+03B1  α  GREEK SMALL LETTER ALPHA     → Greek and Coptic
# U+4E2D  中  CJK UNIFIED IDEOGRAPH-4E2D  → CJK Unified Ideographs
# U+1F600  😀  GRINNING FACE              → Emoticons
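Because the standard library has no block lookup, one workaround is a sorted range table searched with `bisect`. The following is a minimal sketch under stated assumptions: the `BLOCKS` table is a small hand-copied excerpt of Blocks.txt (a real implementation would load the whole file), and `block_of` is a hypothetical helper name. It returns `None` for code points the excerpt does not cover, which is not the same as the character having no block.

```python
from bisect import bisect_right

# (start, end, name) tuples sorted by start. Excerpt only, not the full Blocks.txt.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x1F600, 0x1F64F, "Emoticons"),
]
STARTS = [start for start, _, _ in BLOCKS]


def block_of(char):
    """Return the block name for char, or None if the excerpt does not cover it."""
    cp = ord(char)
    # Index of the last block whose start is <= cp (or -1 if there is none).
    i = bisect_right(STARTS, cp) - 1
    if i >= 0:
        start, end, name = BLOCKS[i]
        if cp <= end:
            return name
    return None


for char in ["A", "α", "中", "😀"]:
    print(f"U+{ord(char):04X}  {char}  {block_of(char)}")
```

A binary search keeps each lookup logarithmic in the number of blocks, which matters little for a few hundred ranges but keeps the helper simple to extend to the full table.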

Why Blocks Matter

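Block ranges show up wherever software works with code-point ranges. The unicode-range descriptor of CSS @font-face rules is often written block by block, regular expression engines such as Perl and Java accept \p{Block=CJK_Unified_Ideographs} (PCRE does not support block properties), and font subsetting tools take ranges that frequently follow block boundaries. Knowing a character's block helps font engineers decide which code-point ranges to include in a subset, reducing file size while preserving coverage.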
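A block's range translates directly into the notation each tool expects. The sketch below formats a range as a CSS `unicode-range` value and as a character class for Python's built-in `re` module, which does not support `\p{Block=...}`. The helper names `css_unicode_range` and `block_regex` are illustrative, and the Greek and Coptic range is the one quoted earlier in this article.

```python
import re


def css_unicode_range(start, end):
    """Format a code point range as a CSS unicode-range value."""
    return f"U+{start:04X}-{end:04X}"


def block_regex(start, end):
    """Compile a character class matching any code point in [start, end]."""
    # \UXXXXXXXX escapes avoid having to quote special characters in the class.
    return re.compile(f"[\\U{start:08X}-\\U{end:08X}]")


GREEK_AND_COPTIC = (0x0370, 0x03FF)

print(css_unicode_range(*GREEK_AND_COPTIC))
greek = block_regex(*GREEK_AND_COPTIC)
print(greek.findall("E = mc², α + β = γ"))
```

The printed `unicode-range` value can be pasted into an `@font-face` rule, and the compiled pattern picks out only the characters that fall inside the block.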

Block assignments can also inform rendering. For example, text layout code may use block membership as a coarse heuristic for font fallback, although shaping engines such as HarfBuzz infer the shaping script from the Script property, not from blocks, when no explicit script is set.

Quick Facts

Property	Value
Unicode property name	`Block`
Short property alias	`blk`
Number of defined blocks (Unicode 16.0)	338
Smallest block	16 code points (e.g. Variation Selectors, U+FE00–U+FE0F)
Largest block	Supplementary Private Use Area-A and -B, 65,536 code points each; for comparison, CJK Unified Ideographs (U+4E00–U+9FFF) spans 20,992 code points
Python access	No built-in; use `ord()` with a range lookup against `Blocks.txt`, or a third-party package such as `unicodeblock`
Regex syntax	`\p{Block=Basic_Latin}` (Perl, Java; not supported by PCRE)
Spec reference	Unicode Standard Annex #44, `Blocks.txt`