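Name Aliases
Alternate names provided for characters, covering corrections, abbreviations, and alternate names, because the stability policy forbids changing a Unicode name once it is published.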
What Are Name Aliases?
A Name Alias is an alternate, officially recognized name for a Unicode character. Most assigned characters have a formal Name property value, either listed explicitly or derived by rule (for example, CJK UNIFIED IDEOGRAPH-4E2D). Some characters also have aliases, for several reasons:
- Correction: The formal name contains an error that cannot be fixed in place (Unicode names are immutable once published), so the corrected name is provided as a correction alias.
- Control code names: Characters in the C0 and C1 control ranges (U+0000–U+001F, U+007F–U+009F) have no formal name at all. Their standard names (NULL, LINE FEED, CARRIAGE RETURN, DELETE) are registered as control aliases, and their familiar short forms (NUL, LF, CR, DEL) as abbreviation aliases.
- Alternates: Widely used names that differ from the formal name, such as BYTE ORDER MARK for U+FEFF.
- Abbreviations: Widely used short names like ZWSP (for ZERO WIDTH SPACE) or BOM (for BYTE ORDER MARK).
- Figments: Labels documented for a few C1 control codes that were never actually approved in any standard, such as PADDING CHARACTER for U+0080.
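The correction and control cases can be observed from Python's standard unicodedata module. The following is a minimal sketch, assuming Python 3.3 or later (the version in which lookup() gained alias support); the helper name describe is hypothetical and exists only for illustration.

```python
import unicodedata


def describe(ch):
    """Return the formal Name of ch, or None when it has none."""
    # The second argument is a default returned instead of raising ValueError.
    return unicodedata.name(ch, None)


# U+FE18 keeps its misspelled formal name (note the ending) ...
print(describe("\uFE18"))

# ... but the corrected spelling resolves to the same code point via its alias.
fixed = unicodedata.lookup(
    "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET"
)
print(fixed == "\uFE18")

# Control codes have no formal Name, so only aliases can reach them by name.
print(describe("\u0000"))
print(unicodedata.lookup("NULL") == "\u0000")
```

Passing a default to unicodedata.name() avoids a try/except around every call when the input may contain control codes or unassigned code points.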
The BOM Case Study
One of the most instructive examples is U+FEFF:
- Formal name: ZERO WIDTH NO-BREAK SPACE
- Name alias (Abbreviation): ZWNBSP
- Name alias (Alternate): BYTE ORDER MARK
- Name alias (Abbreviation): BOM
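The name ZERO WIDTH NO-BREAK SPACE is the immutable formal name. In practice the character is used almost exclusively as a byte order mark, a signature at the start of a UTF-16 or UTF-32 stream that indicates byte order, while the zero-width no-break function has been assigned to U+2060 WORD JOINER since Unicode 3.2. Because the formal name cannot change, the alias BYTE ORDER MARK documents the actual common use.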
import unicodedata
# unicodedata.name() returns the formal name only
print(unicodedata.name("\uFEFF"))
# ZERO WIDTH NO-BREAK SPACE
# unicodedata.lookup() works with both formal names and aliases
bom_by_alias = unicodedata.lookup("BYTE ORDER MARK")
print(f"U+{ord(bom_by_alias):04X}")
# U+FEFF
# Control character aliases
nul = unicodedata.lookup("NUL") # U+0000
cr = unicodedata.lookup("CARRIAGE RETURN") # U+000D
lf = unicodedata.lookup("LINE FEED") # U+000A
print(ord(nul), ord(cr), ord(lf))
# 0 13 10
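To make the byte-order role of U+FEFF concrete, here is a hedged sketch of BOM sniffing built on the BOM constants in Python's standard codecs module. The function name sniff_bom and the table _BOMS are hypothetical names chosen for this example; real decoders apply additional heuristics.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding, bom_length) for a leading BOM, or (None, 0)."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


# Encoding U+FEFF explicitly produces the big-endian signature bytes FE FF.
sample = "\uFEFFhi".encode("utf-16-be")
encoding, skip = sniff_bom(sample)
print(encoding, sample[skip:].decode(encoding))
```

Checking the four-byte UTF-32 signatures before the two-byte UTF-16 ones matters because FF FE 00 00 would otherwise match UTF-16 LE. The same overlap means a UTF-16 LE stream whose first character after the BOM is U+0000 is misread as UTF-32 LE by this sketch.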
Alias Types
The Unicode Standard defines five alias types:
| Type | Description | Example |
|---|---|---|
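| correction | Fixes a published name error | U+FE18 → PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET (the formal name misspells BRACKET as BRAKCET) |
| control | Standard name of a C0/C1 control code | U+0009 → CHARACTER TABULATION |
| figment | Documented label that was never approved in any standard | U+0080 → PADDING CHARACTER |
| alternate | Alternative widely-used name | U+FEFF → BYTE ORDER MARK |
| abbreviation | Short form of the name | U+FEFF → BOM |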
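In NameAliases.txt, each alias is one semicolon-separated record of the form code point;alias;type, with # starting a comment. The sketch below parses a small excerpt in that format, embedded as a string so the example is self-contained; parse_name_aliases and SAMPLE are hypothetical names, and a real script would read the full data file instead.

```python
from collections import defaultdict

SAMPLE = """\
# code point;alias;type
0009;CHARACTER TABULATION;control
0009;TAB;abbreviation
FEFF;BYTE ORDER MARK;alternate
FEFF;BOM;abbreviation
FEFF;ZWNBSP;abbreviation
"""


def parse_name_aliases(text):
    """Map each code point to a list of (alias, type) pairs."""
    aliases = defaultdict(list)
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code, alias, kind = (field.strip() for field in line.split(";"))
        aliases[int(code, 16)].append((alias, kind))
    return dict(aliases)


table = parse_name_aliases(SAMPLE)
for alias, kind in table[0xFEFF]:
    print(f"U+FEFF {kind:<12} {alias}")
```

Keeping a list per code point preserves file order, which is useful because a single character can carry several aliases of the same type.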
Quick Facts
| Property | Value |
|---|---|
| Unicode property name | Name_Alias |
| Short alias | Name_Alias (no separate short form; na1 is the short alias of Unicode_1_Name) |
| Python unicodedata.lookup() | Supports aliases since Python 3.3 |
| Python unicodedata.name() | Returns formal name only |
| Immutability of formal Name | Names cannot change; aliases provide corrections |
| Spec reference | Unicode Standard Annex #44, NameAliases.txt |