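Default Ignorable Code Points
Characters that have no visible effect and can be ignored by processes that do not support them, including variation selectors, zero-width characters, and language tags.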
What Are Default Ignorable Code Points?
A Default Ignorable Code Point is a code point that, when a renderer does not support it, should produce no visible glyph and no advance width. These characters exist to convey invisible semantic information (joining behavior, direction control, variation selection) without disturbing the visual flow of text when a renderer does not support them.
The rule is: if a process does not recognize or support a default ignorable character, it should render nothing for it rather than display a replacement box (□) or a question mark. The character is ignored for display only; it should stay in the underlying text rather than be deleted from it. This allows documents using advanced Unicode features to degrade gracefully on older or simpler systems.
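A minimal sketch of that fallback, assuming a hypothetical renderer that supports none of these characters. `SAMPLE_IGNORABLES` is a small illustrative subset rather than the full property, and `display_form` is a name invented for this example. The stored string is left untouched; only the string handed to the display step omits the characters.

```python
# Illustrative subset only; the full set is defined in DerivedCoreProperties.txt.
SAMPLE_IGNORABLES = {
    "\u00AD",  # SOFT HYPHEN
    "\u200B",  # ZERO WIDTH SPACE
    "\u200C",  # ZERO WIDTH NON-JOINER
    "\u200D",  # ZERO WIDTH JOINER
    "\u2060",  # WORD JOINER
}

def display_form(text: str) -> str:
    """Return what a renderer with no support for these characters would draw."""
    return "".join(ch for ch in text if ch not in SAMPLE_IGNORABLES)

stored = "co\u00ADoperate\u200Bnow"   # the text as stored, with SHY and ZWSP
shown = display_form(stored)          # the text as drawn

print(shown)                    # cooperatenow
print(len(stored), len(shown))  # 14 12
```

Keeping `stored` intact matters because a later process that does support the characters (a line breaker, a shaping engine) can still act on them.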
Important Default Ignorable Characters
| Code Point | Name | Use |
|---|---|---|
| U+00AD | SOFT HYPHEN (SHY) | Line-break hint; invisible unless break occurs |
| U+034F | COMBINING GRAPHEME JOINER | Prevents canonical reordering |
| U+200B | ZERO WIDTH SPACE | Line-break opportunity with no width |
| U+200C | ZERO WIDTH NON-JOINER (ZWNJ) | Prevents cursive joining in Arabic/Persian |
| U+200D | ZERO WIDTH JOINER (ZWJ) | Forces cursive joining; used in emoji sequences |
| U+2060 | WORD JOINER | Prohibits a line break on either side; zero width |
| U+2061–U+2064 | Function Application, etc. | Mathematical invisible operators |
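| U+FE00–U+FE0F | Variation Selectors 1–16 | Select glyph variants; VS15/VS16 select text vs. emoji presentation |
| U+E0000–U+E007F | Tags | Language tagging (deprecated); tag characters are still used in emoji flag sequences |
| U+E0100–U+E01EF | Variation Selectors 17–256 | Ideographic variation sequences |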
# ZWJ is used to combine emoji into sequences
family_emoji = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
# MAN + ZWJ + WOMAN + ZWJ + GIRL: one family glyph where the sequence is supported
print(len(family_emoji)) # 5 code points (including 2 ZWJ)
print(family_emoji) # Renders as single family emoji on supported systems
# ZWNJ prevents Arabic ligature formation
# ك + ZWNJ + ا → kaf and alef do NOT join
# ك + ا → normal: join into ـكا
# Soft hyphen: invisible but marks a valid break point
word = "antidis\u00ADestablishment\u00ADarianism"
print(word) # Most renderers show the word with no visible hyphens
print(len(word)) # 30 code points: 28 letters plus 2 SHY
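Variation selectors from the table can be inspected the same way. The sketch below uses only the standard library; the choice of U+2764 as the base character is just an example, and how each sequence is drawn depends on the font and platform.

```python
import unicodedata

heart = "\u2764"                 # HEAVY BLACK HEART
text_style = heart + "\uFE0E"    # + VARIATION SELECTOR-15: request text presentation
emoji_style = heart + "\uFE0F"   # + VARIATION SELECTOR-16: request emoji presentation

for label, s in [("text", text_style), ("emoji", emoji_style)]:
    names = [unicodedata.name(ch) for ch in s]
    print(label, len(s), names)

# The selector is a real code point, so the two strings are different data
print(text_style == emoji_style)  # False

# NFC normalization leaves the selector in place
print(unicodedata.normalize("NFC", emoji_style) == emoji_style)  # True
```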
Testing for Default Ignorable
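The Unicode property Default_Ignorable_Code_Point (DI) is a derived property, listed in DerivedCoreProperties.txt. Code points with DI=Yes include not just format characters and variation selectors but also unassigned code points in ranges reserved for future ignorable characters, such as U+FFF0–U+FFF8 in the Specials block and most of U+E0000–U+E0FFF around the Tags block.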
# Using the 'regex' package for property-based matching
import regex
di_pattern = regex.compile(r'\p{Default_Ignorable_Code_Point}')
test = "Hello\u200BWorld" # contains ZWSP
matches = di_pattern.findall(test)
print(f"Found {len(matches)} default ignorable character(s)")
# Found 1 default ignorable character(s)
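Where the third-party regex package is not available, the property can be approximated with a hard-coded range table and the standard library. This is a sketch under an assumption: `DI_RANGES` is believed to match the Default_Ignorable_Code_Point ranges in recent versions of DerivedCoreProperties.txt and should be checked against the Unicode version in use. The names `DI_RANGES` and `is_default_ignorable` are invented for this example.

```python
from bisect import bisect_right

# (first, last) inclusive ranges, sorted by first code point.
# Assumption: mirrors Default_Ignorable_Code_Point in recent Unicode versions.
DI_RANGES = [
    (0x00AD, 0x00AD), (0x034F, 0x034F), (0x061C, 0x061C),
    (0x115F, 0x1160), (0x17B4, 0x17B5), (0x180B, 0x180F),
    (0x200B, 0x200F), (0x202A, 0x202E), (0x2060, 0x206F),
    (0x3164, 0x3164), (0xFE00, 0xFE0F), (0xFEFF, 0xFEFF),
    (0xFFA0, 0xFFA0), (0xFFF0, 0xFFF8), (0x1BCA0, 0x1BCA3),
    (0x1D173, 0x1D17A), (0xE0000, 0xE0FFF),
]
_STARTS = [first for first, _ in DI_RANGES]

def is_default_ignorable(ch: str) -> bool:
    """Binary-search the range table for the code point of a single character."""
    cp = ord(ch)
    i = bisect_right(_STARTS, cp) - 1   # last range starting at or before cp
    return i >= 0 and cp <= DI_RANGES[i][1]

test = "Hello\u200BWorld"  # contains ZWSP
found = [f"U+{ord(ch):04X}" for ch in test if is_default_ignorable(ch)]
print(found)  # ['U+200B']
```

A range table goes stale when new ignorable characters are assigned, so generating it from the data file for the target Unicode version is the safer choice for anything long-lived.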
Quick Facts
| Property | Value |
|---|---|
| Unicode property name | Default_Ignorable_Code_Point |
| Short alias | DI |
| Type | Boolean |
| Expected renderer behavior | Produce no glyph, no width |
| Key characters | ZWJ (U+200D), ZWNJ (U+200C), VS1–VS16, SHY |
| Python built-in | No direct support; use regex package |
| Spec reference | Unicode Standard Section 5.21, DerivedCoreProperties.txt |