The Developer's Unicode Handbook · Chapter 3
Comparison and Sorting
Sorting text correctly across languages requires understanding collation rules, locale sensitivity, and the Unicode Collation Algorithm. This chapter covers ICU collation, Python's locale, and JavaScript's Intl.Collator.
Two strings that look identical on screen can fail an equality check. Two strings that a user would intuitively sort the same way can end up in completely different positions. Unicode comparison and sorting is a minefield of normalization, locale rules, and edge cases that trip up even experienced developers. This chapter explains why these failures happen and how to implement comparison and sorting that actually works.
Why "café" Can Fail to Equal "café"
Unicode provides two ways to represent the character é: as a single precomposed character U+00E9 (LATIN SMALL LETTER E WITH ACUTE), or as the letter e followed by U+0301 (COMBINING ACUTE ACCENT). Both render identically. Both are valid Unicode. But they are different byte sequences, and naive string comparison treats them as different strings.
# Two visually identical strings
cafe_nfc = "caf\u00E9"   # NFC: e with acute as a single code point
cafe_nfd = "cafe\u0301"  # NFD: e followed by combining acute
print(cafe_nfc == cafe_nfd)  # False — raw comparison fails!
print(len(cafe_nfc))  # 4
print(len(cafe_nfd))  # 5

# The fix: normalize before comparing
import unicodedata

def normalize_equal(a: str, b: str) -> bool:
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(normalize_equal(cafe_nfc, cafe_nfd))  # True

# In practice, normalize all text at ingestion time
def normalize_text(s: str) -> str:
    return unicodedata.normalize("NFC", s)
The four normalization forms — NFC, NFD, NFKC, NFKD — differ in two dimensions:
- C (Composed) vs D (Decomposed): Whether to use precomposed characters or sequences of base character + combining marks.
- K (Compatibility): Whether to fold compatibility characters (like the ligature ﬁ → fi, or the circled number ① → 1).
For most applications, NFC is the right choice: it's compact, common in web content, and what most APIs produce. Use NFKC when you need to normalize special symbols and compatibility characters (search engines often do this).
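The compatibility (K) folding described above can be checked directly with Python's unicodedata module — a quick sketch using a few standard compatibility characters:

```python
import unicodedata

# NFKC folds compatibility characters into their plain equivalents
print(unicodedata.normalize("NFKC", "\uFB01"))  # ligature ﬁ → 'fi'
print(unicodedata.normalize("NFKC", "\u2460"))  # circled ① → '1'
print(unicodedata.normalize("NFKC", "\u00B2"))  # superscript ² → '2'

# NFC leaves compatibility characters untouched
print(unicodedata.normalize("NFC", "\uFB01"))   # still the ﬁ ligature
```

Note that compatibility folding is lossy (the distinction between ² and 2 is gone), which is why NFKC belongs in search indexes rather than in stored source text.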
Case Folding vs Lowercasing: The Turkish İ Problem
The instinctive approach to case-insensitive comparison is a.lower() == b.lower(). This works for English but fails spectacularly for Turkish and Azerbaijani, which have a different case mapping for the dotted/dotless I:
# Turkish I problem
english = "i"
turkish_dotless = "\u0131"         # ı LATIN SMALL LETTER DOTLESS I
turkish_dotted_capital = "\u0130"  # İ LATIN CAPITAL LETTER I WITH DOT ABOVE

# In English: "I".lower() == "i"
# In Turkish: "I".lower() == "ı" (dotless i)
#             "İ".lower() == "i" (dotted i)

# Python's str.lower() uses the Unicode default (not locale-aware)
print("I".lower())       # 'i' — English behavior
print("\u0130".lower())  # 'i\u0307' — İ → i + combining dot!

# For locale-aware case conversion, use ICU (e.g. the PyICU package)
# For simple cross-lingual normalization, use casefold() instead of lower()
print("Straße".lower())     # 'straße' — German sharp s preserved
print("Straße".casefold())  # 'strasse' — German sharp s expanded

# casefold() is correct for case-insensitive comparison:
def case_insensitive_equal(a: str, b: str) -> bool:
    return a.casefold() == b.casefold()

print(case_insensitive_equal("Straße", "STRASSE"))  # True
print(case_insensitive_equal("café", "CAFÉ"))       # True
For Turkish text specifically, you need locale-aware case conversion. Python's locale module has limited support; for production use, ICU (via pyicu) handles all edge cases correctly:
# Using PyICU for locale-aware case conversion
import icu  # pip install pyicu

turkish_locale = icu.Locale("tr_TR")
text = "\u0130stanbul"  # İstanbul
lower_turkish = icu.UnicodeString(text).toLower(turkish_locale)
print(str(lower_turkish))  # 'istanbul' (correct Turkish lowercasing)

lower_english = icu.UnicodeString(text).toLower(icu.Locale("en_US"))
print(str(lower_english))  # 'i\u0307stanbul' — default lowercasing keeps the dot as a combining mark
The Unicode Collation Algorithm
Sorting strings correctly across languages requires the Unicode Collation Algorithm (UCA), defined in Unicode Technical Standard #10. Naive byte-order sorting produces absurd results for anything beyond basic ASCII:
# Naive sorting — wrong for most languages
words = ["résumé", "resume", "RESUME", "résume"]
print(sorted(words))
# ['RESUME', 'resume', 'résume', 'résumé']
# Uppercase before lowercase, accented after unaccented — wrong!
# Python's default sort uses codepoint order
print(sorted(["é", "e", "f", "ê"]))
# ['e', 'f', 'é', 'ê'] ← é and ê sorted AFTER f, wrong!
For locale-aware sorting in Python, use the locale module (limited) or PyICU (full UCA support):
import locale
locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")  # requires the fr_FR.UTF-8 locale installed on the system
words = ["résumé", "resume", "éclair", "élan", "cafe", "café"]
print(sorted(words, key=locale.strxfrm))
# French-correct order: accent marks treated as secondary difference
# Better: PyICU Collator
import icu
collator = icu.Collator.createInstance(icu.Locale("fr_FR"))
words = ["résumé", "resume", "éclair", "élan", "cafe", "café"]
print(sorted(words, key=collator.getSortKey))
# Correctly sorted according to French rules
CLDR Tailoring and Locale-Specific Rules
The Unicode Common Locale Data Repository (CLDR) defines locale-specific collation rules that override the default UCA. These "tailorings" handle cases where a language has its own sorting conventions:
- Swedish: å, ä, ö sort after z, not mixed with a and o
- Lithuanian: y sorts between i and k
- Spanish (traditional): ch and ll treated as single letters, sorting after c and l
- German (phone book): ä sorts as ae, ö as oe, ü as ue
import icu
# Swedish sorting
swedish_collator = icu.Collator.createInstance(icu.Locale("sv_SE"))
words = ["apa", "öga", "arm", "åker", "boll"]
print(sorted(words, key=swedish_collator.getSortKey))
# ['apa', 'arm', 'boll', 'åker', 'öga'] — å and ö at end, correct!
# German phone book vs dictionary sorting
phone_collator = icu.Collator.createInstance(icu.Locale("de_DE@collation=phonebook"))
dict_collator = icu.Collator.createInstance(icu.Locale("de_DE"))
words = ["Müller", "Mueller", "Moser"]
print(sorted(words, key=phone_collator.getSortKey)) # Moser, Mueller, Müller (ü=ue)
print(sorted(words, key=dict_collator.getSortKey)) # Moser, Müller, Mueller
Database ORDER BY with Collation
Database sorting respects collation settings. PostgreSQL inherits its default collation from the locale the database was created with; a database initialized with the C locale sorts by raw byte order, which is fast but Unicode-wrong. For user-facing sorts, specify a collation explicitly:
-- PostgreSQL: locale-aware sorting
SELECT name FROM products
ORDER BY name COLLATE "fr-FR-x-icu"; -- French ICU collation
-- Or set column-level collation at creation:
CREATE TABLE products (
    name TEXT COLLATE "fr-FR-x-icu"
);
-- MySQL: use a proper utf8mb4 unicode collation
SELECT name FROM products
ORDER BY name COLLATE utf8mb4_unicode_ci;
-- Check available collations:
-- PostgreSQL: SELECT * FROM pg_collation WHERE collname LIKE '%fr%';
-- MySQL: SHOW COLLATION WHERE Charset = 'utf8mb4';
Django ORM supports collation in ordering:
from django.db.models.functions import Collate
# Django 3.2+
products = Product.objects.order_by(Collate("name", "fr-FR-x-icu"))
Natural Sorting: file1, file2, file10
Natural sorting treats embedded numbers as numbers, not as character sequences. Without it, file10 sorts before file2 because "1" < "2".
import re

def natural_sort_key(s: str) -> list:
    # Split the string into text and numeric parts for natural sorting.
    parts = re.split(r"(\d+)", s)
    return [int(p) if p.isdigit() else p.lower() for p in parts]

files = ["file10.txt", "file2.txt", "file1.txt", "file20.txt", "file3.txt"]
print(sorted(files))                        # Lexicographic: file1, file10, file2...
print(sorted(files, key=natural_sort_key))  # Natural: file1, file2, file3, file10, file20
# For production use: natsort library handles Unicode and edge cases
# pip install natsort
from natsort import natsorted, ns
files = ["chapter10.md", "chapter2.md", "chapter1.md", "CHAPTER3.md"]
print(natsorted(files, alg=ns.IGNORECASE))
# ['chapter1.md', 'chapter2.md', 'CHAPTER3.md', 'chapter10.md']
The Complete Comparison Checklist
For robust Unicode comparison and sorting, follow these steps in order:
- Normalize at ingestion: Apply NFC (or NFKC for search/comparison) to all text when it enters your system.
- Case fold for case-insensitive operations: Use .casefold(), not .lower().
- Use locale-aware collation for user-visible sorts: ICU Collator with the user's locale, or database ICU collation.
- For exact identity comparison: Normalize first, then compare.
- For search: Normalize both query and corpus, then case-fold both.
- For natural sort: Use natsort or equivalent.
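The normalization and case-folding steps of this checklist can be combined into a single comparison-key helper. A minimal stdlib-only sketch (the names comparison_key and text_equal are illustrative, not from any library; locale-aware collation would still require ICU):

```python
import unicodedata

def comparison_key(s: str) -> str:
    """Build a key for identity/search comparison: NFC-normalize, then case-fold.

    Case folding can change composition (e.g. İ folds to i + combining dot),
    so normalize again after folding.
    """
    return unicodedata.normalize("NFC", unicodedata.normalize("NFC", s).casefold())

def text_equal(a: str, b: str) -> bool:
    return comparison_key(a) == comparison_key(b)

# Composed vs decomposed é, plus a case difference:
print(text_equal("caf\u00E9", "CAFE\u0301"))  # True
print(text_equal("Straße", "STRASSE"))        # True
print(text_equal("resume", "résumé"))         # False — accents are a real difference
```

The second normalization pass matters: without it, two strings that fold to the same letters but in different composition states would still compare unequal.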
The fundamental insight is that Unicode comparison is not about bytes — it's about meaning. Two representations of the same character should compare equal, and similar characters should sort together based on the conventions of the relevant language and culture.