The Unicode Odyssey · Chapter 4

Characters Are Not What You Think

What you see as a single character on screen might be multiple code points combined. This chapter explores combining marks, grapheme clusters, and emoji sequences — the gap between code points and visual characters.

~4,000 words · ~16 min read

Ask a non-technical person how many characters are in the word "café" and they'll say four without hesitation. Ask a Python programmer who hasn't thought carefully about Unicode and they might say the same: four. Ask the Python interpreter and you might get four, or you might get five, depending on how the string was constructed. Ask a JavaScript engine and .length likewise returns four or five, because it counts UTF-16 code units, and for text containing emoji the number grows further still. Welcome to one of the most conceptually rich, and practically treacherous, areas of Unicode: the distinction between bytes, codepoints, and grapheme clusters.

Three Levels of "Length"

For any string of text, there are at least three different things you might mean by "length":

  1. Byte length: How many bytes does the encoded representation occupy?
  2. Codepoint count: How many Unicode codepoints does it contain?
  3. Grapheme cluster count: How many user-perceived characters does it contain?

These three numbers are frequently different, and treating them as equivalent is the source of a vast family of Unicode bugs.

Consider the word "café" written with a combining character:

c  a  f  e  ́
U+0063 U+0061 U+0066 U+0065 U+0301

Here "é" is represented as two codepoints: U+0065 (LATIN SMALL LETTER E) followed by U+0301 (COMBINING ACUTE ACCENT). The visual result is identical to using U+00E9 (LATIN SMALL LETTER E WITH ACUTE) — but the codepoint count differs:

  • Using precomposed é (U+00E9): 4 codepoints, 5 UTF-8 bytes, 4 grapheme clusters
  • Using decomposed e + ́ (U+0065 U+0301): 5 codepoints, 6 UTF-8 bytes, 4 grapheme clusters

The user sees four characters. Unicode has two ways to represent one of them. This is why normalization (covered in a later chapter) exists.
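The counts above can be checked directly. What follows is a minimal Python sketch using only the standard library; the helper name `lengths` is made up for illustration, and the UTF-16 figure is included only because it is the unit JavaScript's .length reports.

```python
import unicodedata

precomposed = "caf\u00e9"   # é as a single codepoint, U+00E9
decomposed = "cafe\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT

def lengths(text):
    """Return the size of text measured in three different units."""
    return {
        "codepoints": len(text),                            # Python's len() counts codepoints
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # 2 bytes per UTF-16 code unit
    }

print(lengths(precomposed))  # {'codepoints': 4, 'utf8_bytes': 5, 'utf16_units': 4}
print(lengths(decomposed))   # {'codepoints': 5, 'utf8_bytes': 6, 'utf16_units': 5}

# The two spellings compare unequal until they are normalized to the same form.
print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

None of these numbers is the grapheme cluster count; Python's standard library has no function for that, which is why a dedicated library is usually needed.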

Combining Characters: Invisible Modifiers

Combining characters (General Category Mn, Mc, and Me: nonspacing, spacing, and enclosing marks) are codepoints that attach to the preceding character rather than standing independently. They are not meant to be displayed on their own; they only make sense when applied to a base character.

The combining acute accent (U+0301) is just one example. Unicode contains well over a thousand combining characters:

Codepoint  Name                            Effect
U+0301     COMBINING ACUTE ACCENT          base character + ́
U+0300     COMBINING GRAVE ACCENT          base character + ̀
U+0302     COMBINING CIRCUMFLEX ACCENT     base character + ̂
U+0327     COMBINING CEDILLA               base character + ̧
U+0308     COMBINING DIAERESIS             base character + ̈
U+20D7     COMBINING RIGHT ARROW ABOVE     base character + arrow overlay
U+0336     COMBINING LONG STROKE OVERLAY   base character + strikethrough

Multiple combining characters can stack on a single base. The character Ạ̵, for example, can be represented as A (U+0041) + COMBINING DOT BELOW (U+0323) + COMBINING SHORT STROKE OVERLAY (U+0335): three codepoints rendering as one visual unit.

The combining character mechanism is how Unicode handles the enormous variety of diacritics and modifiers found in world writing systems without needing to assign a separate codepoint for every possible base+modifier combination.
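Python's standard unicodedata module exposes the General Category of each codepoint, which makes combining marks easy to spot. A small sketch, with `describe` and `is_mark` as illustrative helper names rather than library functions:

```python
import unicodedata

def describe(text):
    """List each codepoint with its name and General Category."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch))
        for ch in text
    ]

def is_mark(ch):
    """True for combining marks: the categories Mn, Mc and Me all start with 'M'."""
    return unicodedata.category(ch).startswith("M")

for row in describe("e\u0301"):
    print(row)
# ('U+0065', 'LATIN SMALL LETTER E', 'Ll')
# ('U+0301', 'COMBINING ACUTE ACCENT', 'Mn')

stacked = "A\u0323\u0335"  # A + COMBINING DOT BELOW + COMBINING SHORT STROKE OVERLAY
print([is_mark(ch) for ch in stacked])  # [False, True, True]
```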

Grapheme Clusters: What Humans See

The Unicode Standard defines grapheme clusters in UAX #29 (Unicode Text Segmentation) as sequences of codepoints that form a single user-perceived character. The grapheme cluster algorithm defines the boundaries that users would intuitively call "character positions" — where cursor movement, backspace deletion, and character counting should operate.

A grapheme cluster typically consists of:

  • A base character (any character that can stand alone)
  • Followed by zero or more combining marks
  • Possibly followed by additional codepoints per specific rules (like ZWJ sequences, described below)

Correct text editing requires grapheme cluster awareness. If you press backspace at the end of "café" (with the combining accent), you should delete the entire "é" (base + combining accent) in one keystroke, not leave a dangling accent on the screen. Applications that don't implement grapheme cluster-aware deletion produce visually bizarre results.
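The backspace behavior described here can be sketched in a few lines of Python. This is a deliberate simplification and not the UAX #29 algorithm: it only glues combining marks (categories Mn, Mc, Me) onto the preceding codepoint, and it ignores ZWJ sequences, emoji modifiers, regional indicators, Hangul syllables, and the other rules a real implementation handles. The names `simple_clusters` and `backspace` are hypothetical.

```python
import unicodedata

def simple_clusters(text):
    """Group each base codepoint with the combining marks that follow it.

    Simplified on purpose: this covers base + marks only, not full UAX #29.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch      # attach the mark to the previous cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

def backspace(text):
    """Delete the last cluster rather than the last codepoint."""
    return "".join(simple_clusters(text)[:-1])

cafe = "cafe\u0301"                # café with a combining accent
print(len(simple_clusters(cafe)))  # 4
print(backspace(cafe))             # caf
print(cafe[:-1])                   # cafe (codepoint-level deletion removes only the accent)
```

For production text editing, a library that implements the full segmentation rules is the safer choice.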

ZWJ Sequences: Building Complex Emoji

The Zero Width Joiner (U+200D, ZWJ) is a formatting character that instructs rendering systems to combine adjacent characters into a single ligature where supported. In emoji, ZWJ sequences create complex, multi-person, multi-component emoji from simpler base emoji.

The classic example is the family emoji: 👨‍👩‍👧‍👦

This single rendered unit is actually a sequence of 7 codepoints:

U+1F468  MAN
U+200D   ZERO WIDTH JOINER
U+1F469  WOMAN
U+200D   ZERO WIDTH JOINER
U+1F467  GIRL
U+200D   ZERO WIDTH JOINER
U+1F466  BOY

In JavaScript:

const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
console.log(family.length);          // 11 (UTF-16 code units: each emoji = 2 surrogates)
console.log([...family].length);     // 7 (Unicode codepoints via spread)
// But visually: 1 character

The spread operator [...str] iterates over Unicode codepoints (correctly handling surrogate pairs), giving 7. But the visual result is a single family emoji. Getting the grapheme cluster count requires a grapheme-aware API such as the built-in Intl.Segmenter:

// Using Intl.Segmenter (part of the ECMA-402 Internationalization API)
const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
const graphemes = [...segmenter.segment(family)];
console.log(graphemes.length);  // 1
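The same measurements can be reproduced in Python, where len() counts codepoints rather than UTF-16 code units. A short sketch that builds the family sequence from its parts (the constant names are chosen here for readability):

```python
MAN, WOMAN, GIRL, BOY = "\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"
ZWJ = "\u200D"  # ZERO WIDTH JOINER

# Joining the four emoji with ZWJ produces the 7-codepoint family sequence.
family = ZWJ.join([MAN, WOMAN, GIRL, BOY])

print(len(family))                           # 7  codepoints
print(len(family.encode("utf-16-le")) // 2)  # 11 UTF-16 code units (what JS .length reports)
print(len(family.encode("utf-8")))           # 25 UTF-8 bytes

# Splitting on ZWJ recovers the component emoji.
print(family.split(ZWJ) == [MAN, WOMAN, GIRL, BOY])  # True
```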

Other ZWJ Sequence Examples

Sequence   Codepoints                         Result
👩‍💻         U+1F469 ZWJ U+1F4BB                Woman Technologist
🏳️‍🌈         U+1F3F3 U+FE0F ZWJ U+1F308         Rainbow Flag
👨‍🦳         U+1F468 ZWJ U+1F9B3                Man with White Hair

The variation selector U+FE0F (VS16) also appears frequently in emoji sequences, requesting emoji-style presentation rather than text-style rendering of characters that have both modes (like ❤ which can render as text or emoji).
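To see which codepoints hide inside a sequence, it helps to print them out. A minimal Python sketch, assuming a hypothetical helper named `codepoints`:

```python
import unicodedata

def codepoints(text):
    """Format every codepoint in text as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in text]

rainbow_flag = "\U0001F3F3\uFE0F\u200D\U0001F308"
print(codepoints(rainbow_flag))  # ['U+1F3F3', 'U+FE0F', 'U+200D', 'U+1F308']

heart_text = "\u2764"          # HEAVY BLACK HEART on its own
heart_emoji = "\u2764\uFE0F"   # the same character followed by VS16
print(unicodedata.name("\uFE0F"))         # VARIATION SELECTOR-16
print(len(heart_text), len(heart_emoji))  # 1 2
```

Both heart strings are one user-perceived character; only the requested presentation differs.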

Skin Tone Modifiers

Emoji skin tones add another layer. The five skin tone modifier codepoints (U+1F3FB through U+1F3FF, based on the Fitzpatrick scale) combine with human emoji to modify their appearance:

👋  (U+1F44B, WAVING HAND SIGN)
👋🏻 (U+1F44B U+1F3FB, light skin)
👋🏿 (U+1F44B U+1F3FF, dark skin)

Each modified form is two codepoints but one grapheme cluster (one user-perceived character); the unmodified hand is a single codepoint.
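Because a modifier is simply appended to the base emoji, adding or removing a skin tone is plain string manipulation. A sketch under that assumption, with hypothetical helper names; it does not check whether the base emoji actually accepts a modifier:

```python
import unicodedata

WAVE = "\U0001F44B"  # WAVING HAND SIGN
SKIN_TONE_MODIFIERS = [chr(cp) for cp in range(0x1F3FB, 0x1F3FF + 1)]

def with_skin_tone(base, index):
    """Append one of the five modifiers (index 0 is the lightest, 4 the darkest)."""
    return base + SKIN_TONE_MODIFIERS[index]

def strip_skin_tones(text):
    """Remove every skin tone modifier, leaving the base emoji."""
    return "".join(ch for ch in text if ch not in SKIN_TONE_MODIFIERS)

light = with_skin_tone(WAVE, 0)
dark = with_skin_tone(WAVE, 4)

print(len(light), len(dark))                       # 2 2
print(strip_skin_tones(light + dark) == WAVE * 2)  # True
print(unicodedata.name(SKIN_TONE_MODIFIERS[0]))    # EMOJI MODIFIER FITZPATRICK TYPE-1-2
```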

The Flag Problem

Country flag emoji are represented using Regional Indicator Symbol Letters (U+1F1E6–U+1F1FF). Each flag is two letters, each encoded as a Regional Indicator:

🇺🇸 = U+1F1FA (Regional Indicator U) + U+1F1F8 (Regional Indicator S)
🇬🇧 = U+1F1EC (Regional Indicator G) + U+1F1E7 (Regional Indicator B)

Two codepoints, one grapheme cluster, one flag (on supporting systems). On systems that don't support flag rendering, they appear as the two letters "US" or "GB" — a graceful degradation that preserves meaning.
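The mapping from letters to Regional Indicators is a fixed offset, so it can be computed. A minimal Python sketch; the function names are illustrative, and whether a given pair renders as a flag still depends on the platform's font:

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A

def flag(country_code):
    """Map a two-letter code such as 'US' to its pair of Regional Indicators."""
    code = country_code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError(f"expected two ASCII letters, got {country_code!r}")
    return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code)

def letters(flag_text):
    """Inverse mapping: recover the plain letters from Regional Indicators."""
    return "".join(chr(ord(c) - REGIONAL_INDICATOR_A + ord("A")) for c in flag_text)

us = flag("US")
print([f"U+{ord(c):04X}" for c in us])  # ['U+1F1FA', 'U+1F1F8']
print(len(us))                          # 2
print(letters(flag("GB")))              # GB
```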

Practical Consequences for Developers

String Length and Indexing

Never use byte length or codepoint count as a proxy for "character count" visible to users. Always use grapheme cluster count for user-facing length validation (like "maximum 280 characters" for a tweet).

# Python 3
s = "caf\u00e9"        # café with precomposed é
len(s)                 # 4 codepoints (matches what the user sees)

s2 = "cafe\u0301"      # café with combining accent
len(s2)                # 5 codepoints (one more than the user sees)

# Grapheme-aware count using the third-party grapheme package
# pip install grapheme
import grapheme
grapheme.length(s2)    # 4

Reversing Strings

Naive string reversal breaks combining sequences:

s = "caf\u0065\u0301"    # café with combining accent
wrong = s[::-1]           # "\u0301efac": the accent now comes first, with no base character before it

A correct Unicode string reversal must reverse grapheme clusters, not codepoints.
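A cluster-aware reversal might look like the following Python sketch. It is a simplification that keeps combining marks with their base character and nothing more; ZWJ sequences, flags, and skin tone modifiers would still be torn apart, so a full UAX #29 implementation is needed for general text. The function name is hypothetical.

```python
import unicodedata

def reverse_keeping_marks(text):
    """Reverse text cluster by cluster, where a cluster is a base plus its marks.

    Simplified: only combining marks (Mn, Mc, Me) are kept attached.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))

s = "cafe\u0301"
print(s[::-1] == "\u0301efac")                   # True: naive reversal strands the accent
print(reverse_keeping_marks(s) == "e\u0301fac")  # True: the accent stays on its e
```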

Regular Expressions

The dot . in most regex engines matches one codepoint (in JavaScript without the u flag, one UTF-16 code unit), not one grapheme cluster. Matching emoji sequences requires extended patterns:

// Doesn't match emoji with skin tone (2 codepoints)
/^.$/u.test("👋🏻")  // false

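// \p{Emoji_Presentation} matches only a single codepoint; for whole sequences use
// \p{RGI_Emoji} with the v flag (ES2024) or grapheme-aware segmentation such as Intl.Segmenter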
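Python's built-in re module behaves the same way: with str patterns, the dot matches exactly one codepoint. The sketch below uses only the standard library. The third-party regex package is documented as offering \X for grapheme clusters, but it is left out here so the example runs without extra installs.

```python
import re

wave_light = "\U0001F44B\U0001F3FB"  # waving hand + light skin tone modifier
cafe = "cafe\u0301"                  # café with a combining accent

# One dot cannot cover a two-codepoint sequence.
print(re.fullmatch(".", wave_light))               # None
print(re.fullmatch("..", wave_light) is not None)  # True

# The dot sees five separate codepoints in a four-character word.
print(len(re.findall(".", cafe)))                  # 5
```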

Database Storage

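VARCHAR(100) means different things in different databases. In PostgreSQL with UTF-8 encoding, varchar(100) counts characters (codepoints), not bytes. In MySQL, VARCHAR(100) also counts characters, but what can be stored depends on the column charset: the legacy utf8 (utf8mb3) charset cannot hold codepoints above U+FFFF, which includes most emoji, so utf8mb4 is needed, and the declared length is further capped by the row-size limit in bytes. Neither database counts grapheme clusters. Always verify what your database counts.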
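Before writing to a column, an application can measure a string in whichever unit the database enforces. A hedged Python sketch with hypothetical helpers `storage_profile` and `fits`; it does not talk to any database and only reports the numbers a limit would be checked against:

```python
def storage_profile(text):
    """Measure text in the units a database column limit may be expressed in."""
    return {
        "codepoints": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # Codepoints above U+FFFF take 4 bytes in UTF-8 and need a charset that allows them.
        "needs_4_byte_utf8": any(ord(ch) > 0xFFFF for ch in text),
    }

def fits(text, limit, unit="codepoints"):
    """Check text against a limit counted in 'codepoints' or 'utf8_bytes'."""
    return storage_profile(text)[unit] <= limit

print(storage_profile("caf\u00e9"))
# {'codepoints': 4, 'utf8_bytes': 5, 'needs_4_byte_utf8': False}

waves = "\U0001F44B" * 30              # 30 waving hands
print(fits(waves, 100))                # True:  30 codepoints
print(fits(waves, 100, "utf8_bytes"))  # False: 120 bytes
```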

Text Segmentation Algorithms (UAX #29)

The Unicode Standard provides precise, formal algorithms for:

  • Grapheme cluster segmentation: Where are the user-perceived character boundaries?
  • Word segmentation: Where are word boundaries? (This is language-dependent — Thai and Chinese have no word-separating spaces)
  • Sentence segmentation: Where do sentences end?
  • Line break opportunities: Where is it acceptable to wrap text to a new line? (Specified separately in UAX #14, the Unicode Line Breaking Algorithm)

These algorithms reference character properties (Grapheme_Cluster_Break, Word_Break, Line_Break) to make their decisions. Implementing them correctly is non-trivial, which is why mature Unicode libraries (ICU, libgrapheme) are preferred over hand-rolled implementations.

The apparent simplicity of "how many characters does this string have?" conceals a rich system of rules about human writing. In the Unicode world, a character is not a byte, not necessarily a codepoint, and not always what appears on screen as a single glyph. It's a semantic unit defined by the intersection of encoding, rendering, and linguistic convention — and getting it right matters for every application that handles text from the full breadth of human language.