What is a block?

A named, contiguous range of code points (for example, Basic Latin = U+0000–U+007F). Unicode 16.0 defines 338 blocks. Blocks never overlap, so a code point belongs to at most one block.

What is a script?

The writing system to which a character belongs (for example, Latin, Cyrillic, Han). Unicode 16.0 defines 168 scripts. The Script property matters for security, particularly for detecting mixed-script text.

What is the General Category?

A classification of every code point into one of 30 categories (Lu, Ll, Nd, So, etc.), grouped into 7 major classes: Letter, Mark, Number, Punctuation, Symbol, Separator, and Other.
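As a rough illustration of the grouping, the first letter of each two-letter category code identifies its major class. The sketch below relies on Python's `unicodedata.category`; the `MAJOR_CLASSES` table and the `major_class` helper are names made up for this example.

```python
import unicodedata

# First letter of the two-letter General Category code -> major class.
# (Illustrative table; the names follow the seven classes listed above.)
MAJOR_CLASSES = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "P": "Punctuation",
    "S": "Symbol",
    "Z": "Separator",
    "C": "Other",
}

def major_class(ch: str) -> str:
    """Return the major class name for a single character (hypothetical helper)."""
    return MAJOR_CLASSES[unicodedata.category(ch)[0]]

for ch in ["A", "7", "©", " ", "\u0301", "\n"]:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {major_class(ch)}")
```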

Unicode Standard

Unicode Character Database (UCD)

A collection of machine-readable data files that define all Unicode character properties, including UnicodeData.txt, Blocks.txt, Scripts.txt, and others.

2021-07-12 · Updated 2024-03-25

What is the Unicode Character Database?

The Unicode Character Database (UCD) is the authoritative, machine-readable repository of all properties for every Unicode code point. Where the Unicode Standard describes characters in prose and tables, the UCD provides that same information in structured data files that software libraries can parse and implement automatically. Every Unicode library — ICU, Python's unicodedata, Java's Character class, .NET's CharUnicodeInfo — is built from the UCD.

The UCD is distributed as a collection of plain-text files published on unicode.org for each Unicode version. The files follow documented formats (simple tables, two-column mappings, or the comprehensive UnicodeData.txt) and are freely available for any use.

Core UCD Files

File	Description
`UnicodeData.txt`	One line per assigned code point (large ranges such as CJK ideographs appear as First/Last pairs): name, general category, combining class, bidi class, decomposition, numeric values, case mappings
`PropList.txt`	Boolean properties like `White_Space`, `Dash`, `Diacritic`, `Extender`
`DerivedCoreProperties.txt`	Derived properties like `Alphabetic`, `Math`, `ID_Start`, `ID_Continue`
`Blocks.txt`	Block name → code point range mapping
`Scripts.txt`	Script assignment for each code point (Latin, Arabic, Han, etc.)
`emoji-data.txt`	Emoji-specific properties: `Emoji`, `Emoji_Presentation`, `Emoji_Modifier` (located in the `emoji/` subdirectory)
`NameAliases.txt`	Formal aliases (corrections, abbreviations, alternate names)
`CaseFolding.txt`	Case-insensitive comparison mappings
`NormalizationTest.txt`	Test vectors for NFC/NFD/NFKC/NFKD implementations
`CompositionExclusions.txt`	Code points excluded from canonical composition
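Several of these files, including Blocks.txt and Scripts.txt, share a simple layout: a code point or a `start..end` range, a semicolon, a value, and optional `#` comments. The following is a minimal sketch of reading that layout, using a small hand-written excerpt rather than the real file. The names `SAMPLE`, `parse_ranges`, and `lookup` are invented for this example.

```python
import bisect

# A tiny excerpt in the Blocks.txt layout (not the full file).
SAMPLE = """\
# Start..End; Block_Name
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0400..04FF; Cyrillic
4E00..9FFF; CJK Unified Ideographs
"""

def parse_ranges(text):
    """Parse 'XXXX..YYYY; value' or 'XXXX; value' lines into sorted (start, end, value) tuples."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        cp_field, value = (part.strip() for part in line.split(";", 1))
        start, _, end = cp_field.partition("..")
        ranges.append((int(start, 16), int(end or start, 16), value))
    return sorted(ranges)

def lookup(ranges, ch, default="No_Block"):
    """Binary-search the sorted ranges for the one containing ch."""
    cp = ord(ch)
    starts = [r[0] for r in ranges]
    i = bisect.bisect_right(starts, cp) - 1
    if i >= 0 and ranges[i][0] <= cp <= ranges[i][1]:
        return ranges[i][2]
    return default

blocks = parse_ranges(SAMPLE)
for ch in "Aé中":
    print(f"U+{ord(ch):04X}", lookup(blocks, ch))
```

Because the excerpt is incomplete, a character outside the four sample ranges falls back to the default here, even if the real Blocks.txt assigns it a block.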

UnicodeData.txt Format

The most important UCD file is UnicodeData.txt. Each line has 15 semicolon-delimited fields:

0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
│    │                       │  │ │
│    │                       │  │ └── Bidi class (L = Left-to-right)
│    │                       │  └──── Canonical combining class (0 = not combining)
│    │                       └─────── General category (Lu = Uppercase Letter)
│    └─────────────────────────────── Character name
└──────────────────────────────────── Code point (hex)

The 15 fields, in order:
1. Code point (hex)
2. Character name
3. General category
4. Canonical combining class
5. Bidi class
6. Character decomposition mapping
7–9. Numeric values (decimal digit, digit, numeric)
10. Bidi mirrored flag (Y or N)
11. Unicode 1.0 name (legacy)
12. ISO comment (deprecated)
13–15. Uppercase, lowercase, and titlecase mappings
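To make the field order concrete, here is one possible way to split a line into named fields. The `UnicodeDataRecord` type, its attribute names, and `parse_line` are assumptions made for this sketch; they are not part of the UCD or of any library.

```python
from typing import NamedTuple

class UnicodeDataRecord(NamedTuple):
    """One UnicodeData.txt line; attribute names are illustrative."""
    code_point: int
    name: str
    general_category: str
    combining_class: int
    bidi_class: str
    decomposition: str
    decimal_value: str
    digit_value: str
    numeric_value: str
    bidi_mirrored: bool
    unicode_1_name: str
    iso_comment: str
    uppercase: str
    lowercase: str
    titlecase: str

def parse_line(line: str) -> UnicodeDataRecord:
    """Split one semicolon-delimited line into its 15 fields."""
    f = line.rstrip("\n").split(";")
    if len(f) != 15:
        raise ValueError(f"expected 15 fields, got {len(f)}")
    return UnicodeDataRecord(
        int(f[0], 16), f[1], f[2], int(f[3]), f[4], f[5],
        f[6], f[7], f[8], f[9] == "Y", f[10], f[11],
        f[12], f[13], f[14],
    )

record = parse_line("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
print(record.name, record.general_category, record.lowercase)
```

Empty fields are kept as empty strings, so a missing case mapping (the uppercase mapping of "A", for instance) shows up as `""` rather than being dropped.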

Using the UCD in Practice

import unicodedata

# Python's unicodedata module is a UCD interface
char = "A"
print(unicodedata.name(char))           # LATIN CAPITAL LETTER A
print(unicodedata.category(char))       # Lu (Uppercase Letter)
print(unicodedata.bidirectional(char))  # L
print(unicodedata.combining(char))      # 0 (non-combining)
print(unicodedata.normalize("NFD", "é"))  # "e" followed by U+0301 COMBINING ACUTE ACCENT

# Reading UnicodeData.txt directly
import urllib.request

url = "https://unicode.org/Public/UCD/latest/ucd/UnicodeData.txt"
with urllib.request.urlopen(url) as f:
    for line in f:
        fields = line.decode().strip().split(";")
        cp, name, category = fields[0], fields[1], fields[2]
        if category == "So":  # Other Symbol
            print(f"U+{cp}: {name}")

Common Pitfalls

Name vs. alias: UnicodeData.txt lists the normative name, which can never change once published, so naming errors are corrected through formal aliases instead. For example, U+FE18 is officially named PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET (note the typo), and NameAliases.txt provides the corrected spelling, PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET.
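The difference is visible from Python, whose `unicodedata.lookup` has accepted name aliases since version 3.3. A short sketch (the variable names are illustrative):

```python
import unicodedata

ch = "\uFE18"

# name() returns the normative name, typo included.
official = unicodedata.name(ch)
print(official)

# lookup() also resolves the corrected alias to the same character.
corrected = "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET"
same_char = unicodedata.lookup(corrected) == ch
print(same_char)
```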

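Ranges in UnicodeData.txt: large ranges such as CJK Unified Ideographs are not listed individually. Instead, two lines mark the start and end of the range, and a parser that treats every line as a single character will miss everything in between: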

4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
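One way to handle these pairs is to collapse each First/Last pair into a single range while passing ordinary lines through unchanged. The sketch below works on a three-line excerpt; `SAMPLE` and `iter_ranges` are hypothetical names chosen for this example.

```python
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
"""

def iter_ranges(lines):
    """Yield (start, end, label, category); First/Last pairs become one range."""
    pending = None  # (start, label) of an open "<..., First>" line
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        cp, name, category = int(fields[0], 16), fields[1], fields[2]
        if name.startswith("<") and name.endswith(", First>"):
            pending = (cp, name[1:-len(", First>")])
        elif name.startswith("<") and name.endswith(", Last>"):
            if pending is None:
                raise ValueError(f"'Last' line without 'First': {line}")
            start, label = pending
            pending = None
            yield start, cp, label, category
        else:
            # Ordinary line: a range of exactly one code point.
            yield cp, cp, name, category

ranges = list(iter_ranges(SAMPLE.splitlines()))
for start, end, label, category in ranges:
    print(f"U+{start:04X}..U+{end:04X} {category} {label} ({end - start + 1} code points)")
```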

Quick Facts

Property	Value
Primary file	`UnicodeData.txt`
Total files in UCD	~60 files
URL	unicode.org/Public/UCD/latest/ucd/
License	Unicode License (free, attribution required)
Update frequency	Each Unicode version
Key consumers	ICU, Python unicodedata, Java Character, .NET CharUnicodeInfo
Fields in UnicodeData.txt	15 per line

Related terms

Unicode, Block, Script, General Category

More in Unicode Standard

Basic Multilingual Plane (BMP)

Plane 0 (U+0000–U+FFFF), containing the most commonly used characters, including Latin, Greek, Cyrillic, CJK, Arabic, and most symbols. Characters in this plane fit in a single code unit …

CJK

Chinese, Japanese, and Korean. A collective term for the unified Han ideograph blocks and related scripts in Unicode. CJK Unified Ideographs contains more than 20,992 …

Han Unification

The process of mapping Chinese, Japanese, and Korean ideographs that share a …

Hangul Jamo

The individual consonant and vowel components (jamo) of the Korean Hangul writing …

ISO 10646 / Universal Character Set

The international standard (ISO/IEC 10646) that is synchronized with Unicode. It defines the same character repertoire and code points, but not Unicode's additional algorithms and properties.

Unicode

The universal character encoding standard that assigns a unique number (a code point) to every character in every writing system. Version 16.0 contains 154,998 assigned characters.

Unicode Standard Annex (UAX)

Normative or informative documents that are integral parts of the Unicode Standard. …

Unicode Technical Report (UTR)

Informational documents published by the Unicode Consortium covering specific topics like security …

Unicode scalar value

Any code point except the surrogate code points (U+D800–U+DFFF); the set of valid values that can represent actual characters, 1,112,064 values in total.

Code point

A numeric value in the Unicode codespace (U+0000 to U+10FFFF), written in the form U+XXXX. Not every code point is assigned to a character.
