The Developer's Unicode Handbook · Chapter 3

Comparison and Sorting

Sorting text correctly across languages requires understanding collation rules, locale sensitivity, and the Unicode Collation Algorithm. This chapter covers ICU collation, Python's locale module, and JavaScript's Intl.Collator.

~4,000 words · ~16 min read

Two strings that look identical on screen can fail an equality check. Two strings that a user would intuitively sort the same way can end up in completely different positions. Unicode comparison and sorting is a minefield of normalization, locale rules, and edge cases that trip up even experienced developers. This chapter explains why these failures happen and how to implement comparison and sorting that actually works.

Why "café" Can Not Equal "café"

Unicode provides two ways to represent the character é: as a single precomposed character U+00E9 (LATIN SMALL LETTER E WITH ACUTE), or as the letter e followed by U+0301 (COMBINING ACUTE ACCENT). Both render identically. Both are valid Unicode. But they are different byte sequences, and naive string comparison treats them as different strings.

# Two visually identical strings
cafe_nfc = "caf\u00E9"       # NFC: e with acute as single codepoint
cafe_nfd = "cafe\u0301"      # NFD: e followed by combining acute

print(cafe_nfc == cafe_nfd)        # False — raw comparison fails!
print(len(cafe_nfc))               # 4
print(len(cafe_nfd))               # 5

# The fix: normalize before comparing
import unicodedata

def normalize_equal(a: str, b: str) -> bool:
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(normalize_equal(cafe_nfc, cafe_nfd))  # True

# In practice, normalize all text at ingestion time
def normalize_text(s: str) -> str:
    return unicodedata.normalize("NFC", s)

The four normalization forms — NFC, NFD, NFKC, NFKD — differ in two dimensions:

  • C (Composed) vs D (Decomposed): Whether to use precomposed characters or sequences of base character + combining marks.
  • K (Compatibility): Whether to fold compatibility characters (like the ligature fi, or the circled number ① → 1).

For most applications, NFC is the right choice: it's compact, common in web content, and what most APIs produce. Use NFKC when you need to normalize special symbols and compatibility characters (search engines often do this).
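
For instance, the K forms fold the fi ligature and circled digits that NFC leaves alone. A quick sketch using only the standard library:

```python
import unicodedata

ligature = "\ufb01le"   # "file" spelled with the U+FB01 fi ligature
circled = "\u2460"      # ① CIRCLED DIGIT ONE

# NFC leaves compatibility characters alone; NFKC folds them
print(unicodedata.normalize("NFC", ligature) == ligature)   # True — unchanged
print(unicodedata.normalize("NFKC", ligature))              # 'file'
print(unicodedata.normalize("NFKC", circled))               # '1'
```

This is also why NFKC is lossy: once folded, you can no longer tell the ligature apart from plain "fi", so apply it to search keys, not to stored originals.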

Case Folding vs Lowercasing: The Turkish İ Problem

The instinctive approach to case-insensitive comparison is a.lower() == b.lower(). This works for English but fails spectacularly for Turkish and Azerbaijani, which have a different case mapping for the dotted/dotless I:

# Turkish I problem
english = "i"
turkish_dotless = "\u0131"  # ı LATIN SMALL LETTER DOTLESS I
turkish_dotted_capital = "\u0130"  # İ LATIN CAPITAL LETTER I WITH DOT ABOVE

# In English: "I".lower() == "i"
# In Turkish: "I".lower() == "ı" (dotless i)
#             "İ".lower() == "i" (dotted i)

# Python's str.lower() uses the Unicode default (not locale-aware)
print("I".lower())           # 'i'   — English behavior
print("\u0130".lower())      # 'i\u0307'  — İ → i + combining dot!

# For locale-aware case folding, use icu4c or the `icu` package
# For simple cross-lingual normalization, use casefold() instead of lower()
print("Straße".lower())     # 'straße'  — German sharp-s preserved
print("Straße".casefold())  # 'strasse' — German sharp-s expanded

# casefold() is correct for case-insensitive comparison:
def case_insensitive_equal(a: str, b: str) -> bool:
    return a.casefold() == b.casefold()

print(case_insensitive_equal("Straße", "STRASSE"))  # True
print(case_insensitive_equal("café", "CAFÉ"))        # True

For Turkish text specifically, you need locale-aware case conversion. Python's locale module has limited support; for production use, ICU (via pyicu) handles all edge cases correctly:

# Using PyICU for locale-aware case conversion
import icu  # pip install pyicu

turkish_locale = icu.Locale("tr_TR")
text = "\\u0130stanbul"  # İstanbul

lower_turkish = icu.UnicodeString(text).toLower(turkish_locale)
print(str(lower_turkish))  # 'istanbul' (correct Turkish lowercasing)

lower_english = icu.UnicodeString(text).toLower(icu.Locale("en_US"))
print(str(lower_english))  # 'i\u0307stanbul' — default mapping adds a combining dot, NOT the Turkish result

The Unicode Collation Algorithm

Sorting strings correctly across languages requires the Unicode Collation Algorithm (UCA), defined in Unicode Technical Standard #10. Naive byte-order sorting produces absurd results for anything beyond basic ASCII:

# Naive sorting — wrong for most languages
words = ["résumé", "resume", "RESUME", "résume"]
print(sorted(words))
# ['RESUME', 'resume', 'résume', 'résumé']
# Uppercase before lowercase, accented after unaccented — wrong!

# Python's default sort uses codepoint order
print(sorted(["é", "e", "f", "ê"]))
# ['e', 'f', 'é', 'ê']  ← é and ê sorted AFTER f, wrong!

For locale-aware sorting in Python, use the locale module (limited) or PyICU (full UCA support):

import locale
# Requires the fr_FR.UTF-8 locale to be installed on the system
locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")

words = ["résumé", "resume", "éclair", "élan", "cafe", "café"]
print(sorted(words, key=locale.strxfrm))
# French-correct order: accent marks treated as secondary difference

# Better: PyICU Collator
import icu

collator = icu.Collator.createInstance(icu.Locale("fr_FR"))
words = ["résumé", "resume", "éclair", "élan", "cafe", "café"]
print(sorted(words, key=collator.getSortKey))
# Correctly sorted according to French rules
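
The reason ICU gets this right is that UCA comparison is multilevel: base letters first, accents second, case third. The toy_sort_key helper below is a simplified illustration of that idea, built only on the standard library — real UCA uses CLDR weight tables, and French additionally compares accents from the end of the word:

```python
import unicodedata

def toy_sort_key(s: str) -> tuple:
    # Illustration only: primary = base letters, secondary = accents, tertiary = case
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    accents = "".join(c for c in decomposed if unicodedata.combining(c))
    return (base.casefold(), accents, base)

words = ["résumé", "resume", "éclair", "élan", "cafe", "café"]
print(sorted(words, key=toy_sort_key))
# ['cafe', 'café', 'éclair', 'élan', 'resume', 'résumé']
```

Note how "cafe" and "café" tie at the primary level and are separated only by the accent at the secondary level — exactly the behavior codepoint sorting cannot give you.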

CLDR Tailoring and Locale-Specific Rules

The Unicode Common Locale Data Repository (CLDR) defines locale-specific collation rules that override the default UCA. These "tailorings" handle cases where a language has its own sorting conventions:

  • Swedish: å, ä, ö sort after z, not mixed with a and o
  • Lithuanian: y sorts between i and k
  • Spanish (traditional): ch and ll treated as single letters, sorting after c and l
  • German (phone book): ä sorts as ae, ö as oe, ü as ue

import icu

# Swedish sorting
swedish_collator = icu.Collator.createInstance(icu.Locale("sv_SE"))
words = ["apa", "öga", "arm", "åker", "boll"]
print(sorted(words, key=swedish_collator.getSortKey))
# ['apa', 'arm', 'boll', 'åker', 'öga']  — å and ö at end, correct!

# German phone book vs dictionary sorting
phone_collator = icu.Collator.createInstance(icu.Locale("de_DE@collation=phonebook"))
dict_collator = icu.Collator.createInstance(icu.Locale("de_DE"))

words = ["Müller", "Muffe", "Moser"]
print(sorted(words, key=phone_collator.getSortKey))  # Moser, Müller, Muffe — ü = ue, so "Mue" < "Muf"
print(sorted(words, key=dict_collator.getSortKey))   # Moser, Muffe, Müller — ü = u, so "Muf" < "Mul"

Database ORDER BY with Collation

Database sorting respects collation settings. PostgreSQL's default collation comes from the locale chosen when the cluster or database is created; with the C locale it sorts by raw byte order, which is fast but Unicode-wrong. For user-facing sorts, specify a collation explicitly:

-- PostgreSQL: locale-aware sorting
SELECT name FROM products
ORDER BY name COLLATE "fr-FR-x-icu";  -- French ICU collation

-- Or set column-level collation at creation:
CREATE TABLE products (
    name TEXT COLLATE "fr-FR-x-icu"
);

-- MySQL: use a proper utf8mb4 unicode collation
SELECT name FROM products
ORDER BY name COLLATE utf8mb4_unicode_ci;

-- Check available collations:
-- PostgreSQL: SELECT * FROM pg_collation WHERE collname LIKE '%fr%';
-- MySQL: SHOW COLLATION WHERE Charset = 'utf8mb4';

Django ORM supports collation in ordering:

from django.db.models.functions import Collate

# Django 3.2+
products = Product.objects.order_by(Collate("name", "fr-FR-x-icu"))

Natural Sorting: file1, file2, file10

Natural sorting treats embedded numbers as numbers, not as character sequences. Without it, file10 sorts before file2 because "1" < "2".

import re

def natural_sort_key(s: str) -> list:
    # Split string into text and numeric parts for natural sorting.
    parts = re.split(r"(\d+)", s)
    return [int(p) if p.isdigit() else p.lower() for p in parts]

files = ["file10.txt", "file2.txt", "file1.txt", "file20.txt", "file3.txt"]
print(sorted(files))                              # Lexicographic: file1, file10, file2...
print(sorted(files, key=natural_sort_key))        # Natural: file1, file2, file3, file10, file20

# For production use: natsort library handles Unicode and edge cases
# pip install natsort
from natsort import natsorted, ns

files = ["chapter10.md", "chapter2.md", "chapter1.md", "CHAPTER3.md"]
print(natsorted(files, alg=ns.IGNORECASE))
# ['chapter1.md', 'chapter2.md', 'CHAPTER3.md', 'chapter10.md']

The Complete Comparison Checklist

For robust Unicode comparison and sorting, follow these steps in order:

  1. Normalize at ingestion: Apply NFC (or NFKC for search/comparison) to all text when it enters your system.
  2. Case fold for case-insensitive operations: Use .casefold() not .lower().
  3. Use locale-aware collation for user-visible sorts: ICU Collator with the user's locale, or database ICU collation.
  4. For exact identity comparison: Normalize first, then compare.
  5. For search: Normalize both query and corpus, then case-fold both.
  6. For natural sort: Use natsort or equivalent.
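
As a minimal sketch, steps 1, 2, and 5 can be combined into a small matching helper (fold and search_match are illustrative names, not a standard API; strict correctness would also re-normalize after case folding, since folding can denormalize some strings):

```python
import unicodedata

def fold(s: str) -> str:
    # Step 1 + 2: compatibility-normalize, then case-fold
    return unicodedata.normalize("NFKC", s).casefold()

def search_match(query: str, text: str) -> bool:
    # Step 5: fold both query and corpus before substring matching
    return fold(query) in fold(text)

print(search_match("CAFE\u0301", "Je vais au café"))  # True — composed vs decomposed, case ignored
print(search_match("STRASSE", "Hauptstraße"))         # True — ß folds to ss
```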

The fundamental insight is that Unicode comparison is not about bytes — it's about meaning. Two representations of the same character should compare equal, and similar characters should sort together based on the conventions of the relevant language and culture.