Unicode Collation: Sorting Text Correctly
Sorting text correctly across languages requires the Unicode Collation Algorithm (UCA), which defines a multi-level comparison scheme that handles accents, case, and script-specific ordering rules. This guide explains how Unicode collation works, how it differs from simple byte-order sorting, and how to implement locale-sensitive text sorting using ICU and built-in language APIs such as JavaScript's Intl.Collator.
Sorting text sounds simple until you try to do it correctly across languages. Alphabetical order is not universal — Swedish treats ä as a letter that comes after z, while German traditionally treats ä as equivalent to ae, and both of those behaviors differ entirely from a naive sort that compares raw Unicode codepoints. The Unicode Collation Algorithm (UCA), defined in Unicode Technical Standard #10 (UTS #10), provides a systematic, extensible framework for language-aware text sorting.
Why Naive Codepoint Sorting Fails
The simplest approach to sorting Unicode strings is to compare their codepoints numerically. This works adequately for pure ASCII text but breaks immediately when you step outside it.
Consider these problems with codepoint-order sorting:
- Case sensitivity surprises: Uppercase letters (A–Z, U+0041–U+005A) have lower codepoints than lowercase (a–z, U+0061–U+007A), so "Zebra" sorts before "apple" — the opposite of what most users expect.
- Accented characters land in wrong places: "é" (U+00E9) sits far from "e" (U+0065) in codepoint space, so words with accents get pushed to the end of lists rather than interleaved with their base-letter neighbors.
- Decomposed vs composed forms: "é" can be represented as a single precomposed codepoint (U+00E9) or as "e" followed by a combining acute accent (U+0065 U+0301). These two representations are canonically equivalent but have completely different sort orders under codepoint comparison.
- Script mixing: CJK characters, Arabic letters, and Latin letters are scattered across very different codepoint ranges, making cross-script sorted order arbitrary and inconsistent.
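The first three failure modes are easy to reproduce in pure Python, whose default string comparison is exactly codepoint order (unicodedata is used here only to build the decomposed form):

```python
import unicodedata

# Default sort compares codepoints: uppercase and accented letters land
# in unintuitive positions.
words = ["Zebra", "apple", "émigré", "zebra"]
print(sorted(words))  # ['Zebra', 'apple', 'zebra', 'émigré']

# Canonically equivalent strings compare unequal under codepoint comparison.
composed = "é"                                  # U+00E9
decomposed = unicodedata.normalize("NFD", "é")  # U+0065 U+0301
print(composed == decomposed)   # False
print(decomposed < composed)    # True, because U+0065 < U+00E9
```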
The Multi-Level Comparison Model
The UCA solves these problems by decomposing every comparison into multiple independent levels. Each character (or sequence of characters) maps to a collation element containing a weight at each of four levels:
| Level | Called | What it compares |
|---|---|---|
| L1 | Primary | Base letter identity (a vs b vs c) |
| L2 | Secondary | Accent / diacritic differences (a vs á vs à) |
| L3 | Tertiary | Case and variant differences (a vs A) |
| L4 | Quaternary | Punctuation and special-character distinctions |
The algorithm first sorts entirely by L1 weights, ignoring accents and case. Only when two strings are L1-equal does it compare L2 weights to distinguish accented variants. Only when L1 and L2 are both equal does it fall back to L3 for case, and so on. This produces the intuitive result that "cafe" and "café" sort very close together, with the accented form immediately following the plain form, rather than being separated by an alphabetical gulf.
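The level-by-level fallback can be sketched in a few lines of Python. The weights below are invented for illustration (real implementations read them from DUCET), but the comparison logic mirrors the algorithm:

```python
# Hypothetical (primary, secondary, tertiary) weights, not real DUCET values.
WEIGHTS = {
    "a": (1, 0, 0), "A": (1, 0, 1),
    "c": (2, 0, 0),
    "e": (3, 0, 0), "é": (3, 1, 0),
    "f": (4, 0, 0),
}

def uca_sort_key(word):
    # All L1 weights first, then all L2, then all L3: any primary difference
    # outranks every secondary difference, and so on down the levels.
    elems = [WEIGHTS[ch] for ch in word]
    return ([p for p, _, _ in elems],
            [s for _, s, _ in elems],
            [t for _, _, t in elems])

words = ["caff", "café", "cafe"]
print(sorted(words, key=uca_sort_key))
# ['cafe', 'café', 'caff']: 'café' differs from 'cafe' only at L2,
# so it follows immediately instead of sorting past 'caff'
```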
Collation Element Tables
The UCA defines the Default Unicode Collation Element Table (DUCET), a large lookup table mapping Unicode characters and character sequences to their collation elements. DUCET is the baseline — a language-neutral starting point. A single character might have a simple one-element mapping, or a sequence of characters (a "contraction") might map to a single collation element that sorts differently than the individual characters would.
For example, the digraph "ch" in traditional Spanish sorting was historically treated as a single collation unit that sorted between "c" and "d". This requires a contraction entry mapping the two-character sequence "ch" to a collation element positioned between the elements for "c" and "d".
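A contraction can be sketched as a greedy longest-match scan over the weight table. The weights here are invented for illustration; real tables come from DUCET and CLDR:

```python
# Hypothetical primary weights; "ch" gets its own weight between "c" and "d".
PRIMARY = {"a": 10, "c": 30, "ch": 35, "d": 40, "i": 50, "n": 60, "o": 70, "u": 80}

def collation_elements(word):
    # Greedy longest-match scan: a two-character contraction is consumed
    # as a single collation unit whenever it appears in the table.
    elems, i = [], 0
    while i < len(word):
        if word[i:i + 2] in PRIMARY:
            elems.append(PRIMARY[word[i:i + 2]])
            i += 2
        else:
            elems.append(PRIMARY[word[i]])
            i += 1
    return elems

words = ["dado", "chico", "cuna"]
print(sorted(words))                          # codepoint order puts 'chico' first
print(sorted(words, key=collation_elements))  # traditional: cuna, chico, dado
```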
CLDR Locale-Specific Tailoring
DUCET alone is not enough. Different languages have different sorting conventions, and the Unicode Common Locale Data Repository (CLDR) provides locale tailorings — modifications layered on top of DUCET that implement language-specific rules.
Some illustrative examples:
Swedish: The letters ä, å, and ö are treated as distinct letters that sort at the very end of the alphabet, after z. Under DUCET, these would sort among the a-variants and o-variants respectively. Swedish tailoring reassigns their primary weights to positions after z.
German (DIN 5007-2, phonebook order): Ä, Ö, Ü are treated as expansions of AE, OE, UE respectively for sorting purposes. So "Müller" sorts as if it were "Mueller". This is specifically the phonebook collation; German dictionary collation treats umlauts differently.
Traditional Spanish: The historic ch/ll digraph contractions place these two-character sequences as atomic units in collation, between c/d and l/m respectively.
Japanese: Japanese sorting involves multiple scripts (Hiragana, Katakana, Kanji) with Hiragana and Katakana treated as equivalent at the primary level (both represent the same sounds), with script distinction pushed to a lower level.
Tailorings are expressed using the CLDR rule syntax, which specifies relationships like "ä sorts after z" using a compact notation.
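As a rough sketch (not the full CLDR Swedish tailoring), the Swedish rule that å, ä, and ö follow z could be written in this syntax, where `<` marks a primary (base-letter) difference and `<<<` a tertiary (case) difference:

```
&z < å <<< Å < ä <<< Ä < ö <<< Ö
```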
ICU Collation API Examples
The most widely used implementation of the UCA is the ICU library. Here is how you use locale-aware collation in several languages:
Python (via PyICU):

```python
import icu

# Swedish collation — ä sorts after z
collator = icu.Collator.createInstance(icu.Locale("sv_SE"))
words = ["zebra", "apple", "ärlig", "banana"]
sorted_words = sorted(words, key=collator.getSortKey)
# Result: ['apple', 'banana', 'zebra', 'ärlig']

# German phonebook collation
de_collator = icu.Collator.createInstance(icu.Locale("de@collation=phonebook"))
names = ["Müller", "Mueller", "Maier"]
sorted_names = sorted(names, key=de_collator.getSortKey)
# Müller and Mueller sort as neighbors because of the Ü → UE expansion
```
Java (built-in java.text.Collator):

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

Collator svCollator = Collator.getInstance(new Locale("sv", "SE"));
String[] words = {"zebra", "apple", "ärlig", "banana"};
Arrays.sort(words, svCollator);
// "ärlig" appears after "zebra" in Swedish order
```
JavaScript (Intl.Collator):

```javascript
const svCollator = new Intl.Collator('sv');
const words = ['zebra', 'apple', 'ärlig', 'banana'];
words.sort((a, b) => svCollator.compare(a, b));
// ['apple', 'banana', 'zebra', 'ärlig']
```
Sort Keys and Performance
The UCA generates a sort key — a sequence of bytes that encodes all the collation weight levels in a single string. Once sort keys are computed, comparing two strings reduces to a simple byte-by-byte comparison, which is fast. Sort keys can be stored in a database index column alongside the original text, enabling efficient ORDER BY collation without recomputing weights per comparison.
The tradeoff is that sort key generation is expensive upfront, and sort keys are locale-specific: a sort key computed for Swedish sorting cannot be directly compared against one computed for German phonebook order.
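The byte-string construction can be sketched with invented two-level weights (real sort keys come from the collator, e.g. PyICU's getSortKey). Each level's weights are concatenated with a level separator, so one bytewise comparison reproduces the multi-level result:

```python
# Hypothetical (primary, secondary) weights; all weights are > 0 so the
# 0x00 level separator sorts before any real weight byte.
WEIGHTS = {"a": (1, 1), "c": (2, 1), "e": (3, 1), "é": (3, 2), "f": (4, 1)}

def sort_key(word):
    elems = [WEIGHTS[ch] for ch in word]
    primaries = bytes(p for p, _ in elems)
    secondaries = bytes(s for _, s in elems)
    # Primary weights, separator, then secondary weights: any primary
    # difference is decided before the first secondary byte is reached.
    return primaries + b"\x00" + secondaries

assert sort_key("cafe") < sort_key("café") < sort_key("caff")
```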
Practical Guidance
When building applications that display sorted lists of names, words, or other text to users:
- Always use a locale-aware collator, never codepoint comparison or a plain `.sort()` on raw strings.
- Choose the correct CLDR locale for your audience. When multiple locales are plausible, consider exposing a language preference setting.
- For database-level sorting, configure the correct collation on the column at table creation time. Changing collation later requires rebuilding indexes.
- If you need sort keys in application code, cache them — generating a sort key per string per comparison call is wasteful.
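A minimal caching pattern looks like this; the key function here is a stand-in, and in real code you would wrap the collator's expensive getSortKey call:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def cached_sort_key(s):
    # Stand-in for an expensive call such as collator.getSortKey(s);
    # with the cache, each distinct string is transformed only once.
    return s.encode("utf-8")

names = ["Müller", "Maier", "Mueller"]
names.sort(key=cached_sort_key)
print(names)  # ['Maier', 'Mueller', 'Müller']
```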
The Unicode Collation Algorithm transforms text sorting from a seemingly trivial operation into a well-specified, internationally correct one. Using it correctly is one of the most impactful steps you can take toward genuinely multilingual software.