Algorithms

String Comparison

Comparing Unicode strings requires normalization (NFC/NFD) and optionally collation (locale-aware sorting). Binary comparison of code points alone gives incorrect results for equivalent strings.

What is Unicode String Comparison?

Unicode string comparison is the process of determining whether two Unicode strings are equal, or which comes first in a sorted order. It sounds simple, but Unicode's encoding flexibility makes naive byte-by-byte comparison unreliable. The same text can be encoded in multiple valid ways, and language-specific sorting rules vary enormously across the world's scripts and locales.

Binary Comparison Pitfalls

The most obvious approach — comparing strings byte-by-byte or code-point-by-code-point — fails silently in many real-world situations. Consider the character é. It can be encoded as:

  • U+00E9 — a single precomposed code point (NFC form)
  • U+0065 U+0301 — the base letter e followed by a combining acute accent (NFD form)

These two sequences are canonically equivalent under the Unicode Standard but are binary-unequal. A naive comparison would declare them different strings, even though they represent identical text. This causes bugs in search, deduplication, password checks, and username lookups.
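The difference is easy to inspect directly. The following minimal sketch prints the code points and UTF-8 bytes of both encodings of é; the helper name code_points is hypothetical and used only for illustration.

```python
composed = "\u00e9"       # precomposed é (one code point)
decomposed = "e\u0301"    # e followed by a combining acute accent


def code_points(s):
    """Return the code points of s in U+XXXX notation (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in s]


print(code_points(composed))        # ['U+00E9']
print(code_points(decomposed))      # ['U+0065', 'U+0301']
print(composed.encode("utf-8"))     # b'\xc3\xa9'
print(decomposed.encode("utf-8"))   # b'e\xcc\x81'
print(composed == decomposed)       # False: the sequences differ code point by code point
```

Both strings render as the same glyph, yet they differ in length, code points, and bytes, which is why binary comparison reports them as unequal.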

NFC Normalization Before Comparing

The standard defense is to normalize both strings to the same Unicode normalization form before comparing. NFC (Canonical Decomposition followed by Canonical Composition) is the recommended form for most applications because it produces compact, precomposed forms that work well with legacy systems.

import unicodedata

a = "e\u0301"          # e + combining acute (NFD-style)
b = "\u00e9"           # precomposed é (NFC-style)

a == b                                        # False — binary comparison
unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)  # True

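Normalize to NFC before storing usernames, email addresses, or any other text that will be compared for equality, and apply the same normalization to lookup input so that both sides of every comparison are in the same form.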
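One way to apply this consistently is to normalize at the boundary where text enters the system. The sketch below is an assumption about how such a store might look, not a prescribed design: the UserDirectory class and its method names are hypothetical, and a real system would likely add case folding and other identifier rules on top.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


class UserDirectory:
    """Hypothetical in-memory store that keys every entry by its NFC form."""

    def __init__(self):
        self._users = {}

    def register(self, username: str, profile: dict) -> None:
        key = nfc(username)  # normalize once, at write time
        if key in self._users:
            raise ValueError(f"username already taken: {key!r}")
        self._users[key] = profile

    def lookup(self, username: str):
        # Normalize the query too, so both sides are in the same form.
        return self._users.get(nfc(username))


directory = UserDirectory()
directory.register("Ren\u00e9", {"id": 1})    # stored with precomposed é
print(directory.lookup("Rene\u0301"))         # {'id': 1}, found via the decomposed spelling
```

Registering the decomposed spelling afterwards raises ValueError, because both spellings map to the same key.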

Locale-Aware Collation with ICU

Sorting order is a separate problem from equality. In English, ä typically sorts near a, but in Swedish, ä sorts after z. In some French collations, notably Canadian French, accent differences are compared from right to left as a tiebreaker. These rules are formalized in the Unicode Collation Algorithm (UCA), tailored per locale by CLDR, and implemented by the ICU library (International Components for Unicode).

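In Python, the pyuca package implements the UCA with the default collation table, and PyICU exposes ICU's locale-aware collators. The standard locale module (locale.strcoll, locale.strxfrm) delegates to the platform's C library rather than to the UCA, so its results vary by system. In JavaScript, Intl.Collator provides locale-aware comparison, which major engines implement on top of ICU.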

// JavaScript locale-aware sort
const words = ["ä", "z", "a"];
words.sort(new Intl.Collator("sv").compare); // Swedish: ["a", "z", "ä"]
words.sort(new Intl.Collator("de").compare); // German: ["a", "ä", "z"]
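To illustrate the multi-level idea behind collation without third-party packages, here is a deliberately simplified Python sketch. It is not the UCA and has no locale tailoring: the function toy_collation_key is hypothetical and only compares base letters first and combining marks second, using the standard unicodedata module.

```python
import unicodedata


def toy_collation_key(text):
    """Toy two-level sort key: base letters first, combining marks second.

    This is NOT the Unicode Collation Algorithm. It only shows why a
    collation key has several levels.
    """
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    marks = "".join(ch for ch in decomposed if unicodedata.combining(ch))
    # Level 1: caseless base letters. Level 2: accents. Level 3: original text.
    return (base.casefold(), marks, text)


words = ["z", "\u00e4", "a"]                    # "\u00e4" is ä

print(sorted(words))                            # ['a', 'z', 'ä'] (code point order)
print(sorted(words, key=toy_collation_key))     # ['a', 'ä', 'z'] (ä grouped with a)
```

The second result happens to match the German ordering shown above, but reproducing the Swedish ordering requires locale-specific rules, which is the part a real collation library supplies.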

Case-Insensitive Comparison via Case Folding

Converting both strings to lowercase before comparing does not work correctly across all scripts. Unicode defines case folding (documented in CaseFolding.txt) as a locale-independent way to erase case distinctions for comparison purposes. Case folding handles edge cases that lowercasing misses, such as the German ß, which folds to ss, and the Greek final sigma ς, which folds to σ just as the capital Σ does.

# Python case folding
"Straße".casefold() == "STRASSE".casefold()  # True
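Case folding and normalization interact, because folding can produce text that is no longer normalized. The sketch below follows the pattern shown in Python's Unicode HOWTO, normalizing to NFD around the fold; the function name caseless_equal is an illustrative choice.

```python
import unicodedata


def nfd(text):
    return unicodedata.normalize("NFD", text)


def caseless_equal(a, b):
    """Compare two strings ignoring case and canonical-equivalence differences."""
    return nfd(nfd(a).casefold()) == nfd(nfd(b).casefold())


print("Stra\u00dfe".lower() == "STRASSE".lower())    # False: lower() keeps ß
print(caseless_equal("Stra\u00dfe", "STRASSE"))      # True: ß folds to ss
print(caseless_equal("\u00c9", "e\u0301"))           # True: É vs e + combining acute
print("\u03c2".casefold() == "\u03a3".casefold())    # True: ς and Σ both fold to σ
```

The folded strings are meant for comparison only and should not be displayed to users, since folding discards information such as the distinction between ß and ss.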

Quick Facts

Property | Value
Key pitfall | Canonically equivalent strings are binary-unequal
Recommended normalization | NFC for general text, NFD for internal processing
Collation standard | Unicode Collation Algorithm (UCA), CLDR locale rules
ICU library | International Components for Unicode
Case folding spec | Unicode CaseFolding.txt (UCD)
Python APIs | unicodedata.normalize(), str.casefold()
JS API | Intl.Collator, String.prototype.normalize()

Related Terms

More in Algorithms

Case Folding

Mapping characters to a common case form for case-insensitive comparison. More comprehensive …

Grapheme Cluster Boundary

Rules (UAX#29) for determining where one user-perceived character ends and another begins. …

NFC (Canonical Composition)

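Normalization Form C: canonical decomposition followed by canonical composition, generally producing the most compact form. Recommended for data storage and interchange, and the standard form for the Web.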

NFD (Canonical Decomposition)

Normalization Form D: full canonical decomposition without recomposition. The macOS HFS+ file system uses a variant of this form. é (U+00E9) → e + ◌́ (U+0065 + U+0301).

NFKC (Compatibility Composition)

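Normalization Form KC: compatibility decomposition followed by canonical composition. It merges compatibility variants of the same character (ﬁ→fi, ²→2, Ⅳ→IV) and is used for identifier comparison.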

NFKD (Compatibility Decomposition)

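Normalization Form KD: compatibility decomposition without recomposition. It is the most aggressive normalization form and discards the most formatting information.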

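Unicode Bidirectional Algorithm (UBA)

An algorithm that determines the display order of mixed-direction text (such as English with Arabic) from each character's bidirectional category and any explicit directional overrides.

Unicode Line Breaking Algorithm

Rules that determine where text may wrap to the next line, based on character properties, CJK word boundaries, and line-break opportunities.

Unicode Collation Algorithm (UCA)

The standard algorithm for comparing and sorting Unicode strings through multi-level comparison (base characters → diacritics → case → tiebreaker), with support for locale-specific tailoring.

Unicode Text Segmentation

Algorithms for finding boundaries in text: grapheme cluster, word, and sentence boundaries. Essential for cursor movement, text selection, and text processing.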