Algorithms

Case Folding

Mapping characters to a common case form for case-insensitive comparison. More comprehensive than lowercasing: German ß → ss, Turkish İ → i (with locale considerations).

What is Case Folding?

Case folding is a Unicode operation that converts text to a form suitable for case-insensitive comparison. It is defined in the Unicode Standard and supported by the CaseFolding.txt data file in the Unicode Character Database. Case folding is closely related to, but distinct from, simple lowercasing: while lowercasing converts a string to its lowercase representation for display, case folding converts a string to a canonical form specifically optimized for string comparison regardless of case.

The practical difference is that case folding handles edge cases that simple lowercasing misses — particularly in languages with complex case mapping behavior.

CaseFolding.txt: The Data File

The Unicode Consortium publishes CaseFolding.txt as part of the Unicode Character Database. It maps each character to its case-folded form using one of four status codes:

Status | Meaning
C (Common) | Safe for all contexts; included in both simple and full folding
F (Full) | Full case folding only; maps one character to multiple characters
S (Simple) | Simple case folding only; maps one character to one character
T (Turkic) | Optional mappings for uppercase I and İ in Turkic languages; used instead of the normal mappings for those two characters
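As a rough illustration of how the four status codes combine, the sketch below parses a handful of lines written in the CaseFolding.txt format and builds simple, full, and Turkic folds from them. The `SAMPLE` excerpt is reproduced by hand for illustration and should be checked against the published file; `parse_case_folding` and `fold` are hypothetical helper names, not part of any library.

```python
# Hand-copied excerpt in the CaseFolding.txt line format (an assumption;
# verify against the real file):  <code>; <status>; <mapping>; # <name>
SAMPLE = """\
0041; C; 0061; # LATIN CAPITAL LETTER A
0049; C; 0069; # LATIN CAPITAL LETTER I
0049; T; 0131; # LATIN CAPITAL LETTER I
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""


def parse_case_folding(text):
    """Return {status: {char: folded_string}} for text in CaseFolding.txt format."""
    tables = {"C": {}, "F": {}, "S": {}, "T": {}}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        code, status, mapping = [field.strip() for field in line.split(";")[:3]]
        folded = "".join(chr(int(cp, 16)) for cp in mapping.split())
        tables[status][chr(int(code, 16))] = folded
    return tables


def fold(s, tables, mode="full", turkic=False):
    """Fold s with C+F (full) or C+S (simple); T entries override when turkic=True."""
    table = dict(tables["C"])
    table.update(tables["F"] if mode == "full" else tables["S"])
    if turkic:
        table.update(tables["T"])
    # Characters without an entry fold to themselves.
    return "".join(table.get(ch, ch) for ch in s)


tables = parse_case_folding(SAMPLE)
print(fold("A\u00df", tables))                 # full fold of "Aß"
print(fold("A\u00df", tables, mode="simple"))  # simple fold of "Aß"
print(fold("I", tables, turkic=True))          # Turkic fold of "I"
```

Because the sample covers only a few characters, anything outside it is passed through unchanged; a real implementation would load the complete file.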

Simple vs. Full Case Folding

Simple case folding maps every character to exactly one character (a one-to-one mapping). It is suitable for environments where string length must be preserved.

Full case folding allows one character to map to a sequence of multiple characters. The classic example is the German sharp S:

  • ß (U+00DF, Latin Small Letter Sharp S)
  • Simple fold: ß → ß (no change, because ß has no single-character folding)
  • Full fold: ß → ss (two characters)

This means that a full case-fold comparison of "STRASSE" and "Straße" would correctly identify them as equal (both fold to "strasse"), while a simple lowercase comparison would not.

# Python uses full case folding via str.casefold()
"STRASSE".casefold() == "Straße".casefold()  # True
"STRASSE".lower() == "Straße".lower()         # False

# The key difference
"Straße".casefold()  # "strasse"
"Straße".lower()     # "straße"  ← ß preserved

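Python's str.casefold() implements full Unicode case folding. str.lower() applies Unicode lowercase mappings rather than case folding: it is meant for producing lowercase text, leaves ß unchanged, and should not be relied on for case-insensitive comparison.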
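The following is a minimal sketch of a caseless comparison built on str.casefold(), also showing that full folding can change a string's length. The helper name `caseless_equal` is an assumption made for illustration; it performs no Unicode normalization, which a production comparison would probably also want.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings using full case folding (no normalization step)."""
    return a.casefold() == b.casefold()


word = "Stra\u00dfe"            # "Straße"
folded = word.casefold()        # "strasse"

print(len(word), len(folded))   # 6 7: full folding expanded ß to ss
print(caseless_equal("STRASSE", word))       # True
print(caseless_equal("\ufb01sh", "FISH"))    # True: ligature U+FB01 folds to "fi"
```

Since the folded string may be longer than the original, offsets computed on the folded text do not necessarily line up with the original text.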

Locale-Sensitive Folding: The Turkish Problem

The most significant locale-specific case folding issue involves the Turkish and Azerbaijani I. In most languages:

  • Uppercase I → lowercase i
  • Uppercase İ does not exist (or is rare)

In Turkish and Azerbaijani:

  • Uppercase İ (U+0130, Latin Capital Letter I with Dot Above) → lowercase i (U+0069)
  • Uppercase I (U+0049, Latin Capital Letter I) → lowercase ı (U+0131, Latin Small Letter Dotless I)

The T (Turkic) status entries in CaseFolding.txt provide the Turkic-specific mappings. Default Unicode case folding, which omits the T entries, is incorrect for Turkish text: it maps I → i rather than I → ı, so "KISA" and "kısa" (meaning "short") compare as unequal while "KISA" and "kisa" compare as equal, which is the wrong result for Turkish.

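# Correct Turkish case comparison requires locale awareness.
# Python's str.casefold() is locale-independent and uses the default (non-Turkic) mappings:
"KISA".casefold() == "kisa".casefold()  # True  (I → i)
"KISA".casefold() == "kısa".casefold()  # False (ı is left unchanged)
# For Turkish or Azerbaijani text, use a locale-aware library such as PyICU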
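Where a full ICU dependency is not available, the two T mappings can be approximated by hand before applying the default fold. The sketch below assumes the input is known to be Turkish or Azerbaijani; `turkic_casefold` is a hypothetical helper, not a standard-library function, and it does not handle the decomposed sequence I + U+0307.

```python
def turkic_casefold(s: str) -> str:
    """Apply the two Turkic (T) mappings, then Python's default full case fold."""
    s = s.replace("\u0130", "i")   # İ (U+0130) → i
    s = s.replace("I", "\u0131")   # I (U+0049) → ı (U+0131)
    return s.casefold()            # ı has no default folding, so it survives


# "KISA" now matches "kısa" and no longer matches "kisa".
print(turkic_casefold("KISA") == turkic_casefold("k\u0131sa"))  # True
print(turkic_casefold("KISA") == turkic_casefold("kisa"))       # False
print(turkic_casefold("\u0130STANBUL"))                         # istanbul
```

Applying this helper to non-Turkic text would give wrong results (for example, English "I" would fold to ı), so the language of the text has to be known in advance.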

How Case Folding Differs from Lowercasing

Operation | Purpose | Handles ß→ss | Handles Turkish İ | String length
str.lower() | Display (lowercase) | No (ß→ß) | No (I→i) | Usually preserved (İ → i + U+0307 is an exception)
str.casefold() | Comparison | Yes (ß→ss) | No | May increase
Turkic case fold | Comparison in TR/AZ | Yes | Yes | May increase

Quick Facts

Property | Value
Data file | CaseFolding.txt in Unicode Character Database
Status codes | C (Common), F (Full), S (Simple), T (Turkic)
Key difference from lower() | Full folding expands ß→ss
Turkish exception | I→ı and İ→i (T status entries)
Python lowercasing (not a fold) | str.lower()
Python full fold | str.casefold()
Use case | Case-insensitive string comparison and search

Related Terms

More in Algorithms

Grapheme Cluster Boundary

Rules (UAX#29) for determining where one user-perceived character ends and another begins. …

NFC (Canonical Composition)

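Normalization Form C: decomposes and then canonically recomposes, producing the shortest form. Recommended for data storage and interchange, and the standard form on the Web.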

NFD (Canonical Decomposition)

Normalization Form D: fully decomposes without recomposing. Used by the macOS HFS+ file system. é (U+00E9) → e + ◌́ (U+0065 + U+0301).

NFKC (Compatibility Composition)

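Normalization Form KC: compatibility decomposition followed by canonical composition. Merges visually similar characters (ﬁ→fi, ²→2, Ⅳ→IV); used for identifier comparison.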

NFKD (Compatibility Decomposition)

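Normalization Form KD: compatibility decomposition without recomposition. The most aggressive normalization form, and the one that loses the most formatting information.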

String Comparison

Comparing Unicode strings requires normalization (NFC/NFD) and optionally collation (locale-aware sorting). Binary …

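Unicode Bidirectional Algorithm (UBA)

An algorithm that uses character bidirectional categories and explicit directional overrides to determine the display order of mixed-direction text (such as English with Arabic).

Unicode Line Breaking Algorithm

Rules that determine where text may wrap to the next line, based on character properties, CJK word boundaries, and line-break opportunities.

Unicode Collation Algorithm (UCA)

The standard algorithm for comparing and sorting Unicode strings through multi-level comparison (base characters → diacritics → case → tie-breaker), with support for locale-specific tailoring.

Unicode Text Segmentation

Algorithms for finding boundaries in text: grapheme cluster, word, and sentence boundaries. Essential for cursor movement, text selection, and text processing.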