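Combining Character
A character that attaches to the preceding base character and modifies it. Its General Category is Mn (Non-spacing Mark), Mc (Spacing Mark), or Me (Enclosing Mark); for example, ◌́ (U+0301 COMBINING ACUTE ACCENT).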
What is a Combining Character?
A combining character is a Unicode character that has no independent visual form of its own — instead, it attaches to and modifies the preceding character (called the base character). Combining characters implement diacritical marks, tone marks, vowel signs, and other modifier symbols that, in Unicode's model, are logically separate from the letter they modify.
The key insight is that Unicode separates the identity of a character from its rendering. A letter with a diacritic can be represented either as a single precomposed code point (é = U+00E9) or as a base letter followed by a combining mark (e + ◌́ = U+0065 U+0301). The two sequences are canonically equivalent: they represent the same abstract character and should render identically, even though they differ at the code point level.
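As a minimal sketch of this equivalence, Python's standard `unicodedata` module can report the decomposition mapping of the precomposed form and convert between the two representations. The variable names here are illustrative only.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# The decomposition mapping is a string of space-separated hex code points.
mapping = unicodedata.decomposition(precomposed)
print(mapping)

# NFD expands the precomposed form; NFC recomposes the sequence.
nfd_matches = unicodedata.normalize("NFD", precomposed) == decomposed
nfc_matches = unicodedata.normalize("NFC", decomposed) == precomposed
print(nfd_matches, nfc_matches)
```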
How Combining Characters Work
Combining characters have a General Category of Mn (Non-spacing Mark), Mc (Spacing Mark), or Me (Enclosing Mark). Non-spacing marks are the most common — they occupy zero advance width and position themselves relative to the base character's glyph using the font's anchor points.
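The three categories can be inspected with `unicodedata.category`. The sketch below picks one sample character per category; the helper `is_combining_mark` is a hypothetical name introduced for illustration, not part of any library.

```python
import unicodedata

# One sample per mark category (chosen for illustration).
SAMPLES = [
    "\u0301",  # COMBINING ACUTE ACCENT
    "\u0903",  # DEVANAGARI SIGN VISARGA
    "\u20dd",  # COMBINING ENCLOSING CIRCLE
]

def is_combining_mark(ch: str) -> bool:
    """Return True when the General Category is Mn, Mc, or Me."""
    return unicodedata.category(ch) in {"Mn", "Mc", "Me"}

for ch in SAMPLES:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {unicodedata.category(ch)}")
```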
When a text shaping engine encounters a base character followed by one or more combining marks, it:
1. Retrieves the base glyph from the font
2. Positions each combining glyph at the appropriate anchor (top, bottom, left, right)
3. Renders them as a single grapheme cluster
Multiple combining characters can stack on a single base:
a + ◌̂ + ◌̄ = â̄ (a with circumflex and macron)
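The stacked sequence above can be examined code point by code point. This is a small sketch using only the standard library; it also shows that NFC composes only the part of the sequence that has a precomposed form (â), leaving the macron as a separate combining mark.

```python
import unicodedata

stacked = "a\u0302\u0304"  # a + combining circumflex + combining macron

# List every code point in the sequence with its Unicode name.
for ch in stacked:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# NFC merges a + circumflex into U+00E2; the macron stays separate.
composed = unicodedata.normalize("NFC", stacked)
print(len(stacked), len(composed))
```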
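Unicode defines a canonical ordering for combining marks based on their Canonical Combining Class value (0–254). During normalization, adjacent marks with nonzero classes are sorted so that lower values (e.g., nuktas, class 7) come before higher values (e.g., above-base accents, class 230). Marks that share the same class keep their original relative order, because that order can affect rendering.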
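A short sketch of canonical ordering, assuming Python's `unicodedata` module: `unicodedata.combining` returns the combining class, and NFD normalization puts the marks into canonical order regardless of how they were typed. The constant and variable names are illustrative.

```python
import unicodedata

ACUTE = "\u0301"      # COMBINING ACUTE ACCENT (above the base)
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW (below the base)

# Look up the combining class of each mark.
acute_class = unicodedata.combining(ACUTE)
dot_class = unicodedata.combining(DOT_BELOW)
print(acute_class, dot_class)

# The same visual result typed in two different mark orders.
above_first = "a" + ACUTE + DOT_BELOW
below_first = "a" + DOT_BELOW + ACUTE

# NFD sorts the marks by combining class, so both inputs converge.
nfd_a = unicodedata.normalize("NFD", above_first)
nfd_b = unicodedata.normalize("NFD", below_first)
print(nfd_a == nfd_b)
```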
Important Combining Character Ranges
| Range | Name | Contents |
|---|---|---|
| U+0300–U+036F | Combining Diacritical Marks | Accents, umlauts, tildes |
| U+0591–U+05C7 | Hebrew Cantillation/Vowels | Nikud, cantillation marks |
| U+064B–U+065F | Arabic Diacritics | Harakat (vowel marks) |
| U+1AB0–U+1AFF | Combining Diacritical Marks Extended | Extended phonetic use |
| U+1DC0–U+1DFF | Combining Diacritical Marks Supplement | Additional marks |
| U+20D0–U+20FF | Combining Diacritical Marks for Symbols | Used with math symbols |
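The table can be turned into a simple range lookup. This is only a sketch: `COMBINING_RANGES` copies the rows above, and `combining_range` is a hypothetical helper name. Note that falling inside one of these ranges does not by itself make a code point a combining mark, so the General Category remains the reliable test.

```python
import unicodedata
from typing import Optional

# (start, end, name) rows copied from the table above.
COMBINING_RANGES = [
    (0x0300, 0x036F, "Combining Diacritical Marks"),
    (0x0591, 0x05C7, "Hebrew Cantillation/Vowels"),
    (0x064B, 0x065F, "Arabic Diacritics"),
    (0x1AB0, 0x1AFF, "Combining Diacritical Marks Extended"),
    (0x1DC0, 0x1DFF, "Combining Diacritical Marks Supplement"),
    (0x20D0, 0x20FF, "Combining Diacritical Marks for Symbols"),
]

def combining_range(ch: str) -> Optional[str]:
    """Return the table row name containing ch, or None if no row matches."""
    cp = ord(ch)
    for start, end, name in COMBINING_RANGES:
        if start <= cp <= end:
            return name
    return None

print(combining_range("\u0301"))
print(combining_range("a"))
# U+05BE (HEBREW PUNCTUATION MAQAF) sits inside the Hebrew range
# but is punctuation, not a mark.
print(combining_range("\u05be"), unicodedata.category("\u05be"))
```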
Grapheme Clusters
A base character plus all its combining marks form a grapheme cluster — the unit that users perceive as a single character. Programming languages must account for this:
import unicodedata
# "e" + combining acute = é
s = "e\u0301"
print(len(s)) # 2 (two code points)
print(s) # é (looks like 1 character)
# Precomposed é
s2 = "\u00e9"
print(len(s2)) # 1 (one code point)
print(s == s2) # False! Different code points
# Normalize both strings to the same form before comparing
print(unicodedata.normalize("NFC", s) == s2) # True
JavaScript's Intl.Segmenter and Swift's String.count operate on grapheme clusters; many other APIs count code points (Python's len) or UTF-16 code units (JavaScript's String.length) instead.
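For simple base-plus-marks text, clusters can be approximated with the standard library alone by attaching every mark (General Category starting with M) to the preceding character. This is a deliberately simplified sketch, not the full Unicode segmentation algorithm: it ignores emoji sequences, Hangul jamo, and other rules, and `approximate_clusters` is a hypothetical helper name.

```python
import unicodedata
from typing import List

def approximate_clusters(text: str) -> List[str]:
    """Group each character with the combining marks that follow it."""
    clusters: List[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            # A mark extends the cluster started by the previous character.
            clusters[-1] += ch
        else:
            # Anything else starts a new cluster.
            clusters.append(ch)
    return clusters

sample = "e\u0301a\u0302\u0304"  # é followed by a with circumflex and macron
clusters = approximate_clusters(sample)
print(len(sample), len(clusters))
```

For production use, a library that implements the full segmentation rules is the safer choice.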
Quick Facts
| Property | Value |
|---|---|
| Unicode category | Mn (Non-spacing Mark), Mc (Spacing Mark), Me (Enclosing Mark) |
| Main combining block | U+0300–U+036F (112 characters) |
| Combining class range | 0 (not reordered; includes base characters) to 254 (various positions) |
| Max stacking | No hard limit; practical fonts support 2–4 layers |
| Normalization | NFC = precomposed preferred; NFD = fully decomposed |
| Grapheme cluster API | Python: third-party regex module (`\X`); JS: Intl.Segmenter; Swift: String |
| Visual indicator in charts | Often shown as ◌ (dotted circle) placeholder |
Related Terms
More in Typography
CSS @font-face descriptor specifying which Unicode code points a font should cover. …
The mechanism by which a rendering engine substitutes glyphs from a secondary …
Modern font format developed by Microsoft and Adobe supporting up to 65,535 …
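A text direction in which characters flow from right to left, used for scripts such as Arabic, Hebrew, and Thaana. Correct display requires the bidirectional algorithm.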
Fonts downloaded by the browser to render text, declared via CSS @font-face. …
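U+00A0, a space that prevents a line break at its position. In HTML it is written `&nbsp;`. It is used between a number and its unit (100 km), inside proper names (Mr. Smith), and after abbreviations.
Em: a width equal to the font size. En: half of an em. They define the width of the em dash, the em space, the en space, and CSS units (1em, 0.5em).
A mark attached to a letter to change its pronunciation or meaning. It can be precomposed (é U+00E9) or combining (e + ◌́, U+0065 U+0301), and includes accents, umlauts, cedillas, and tildes.
An implementation of a typeface at a specific size, weight, and style. In digital typography, it is a font file (TTF, OTF, WOFF2) containing glyph definitions and metrics.
The visual representation of a character as rendered by a font. One character can have several glyphs (ligatures, contextual forms), and one glyph can represent several characters.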