Unicode Collation Algorithm (UCA)

A standard algorithm for comparing and sorting Unicode strings through multi-level comparison: base characters → accents → case → tie-breakers. It can be customized per locale.

Sorting Strings Across Languages

ASCII sorting is simple: compare byte values. But for multilingual text, sorting by byte or code point value produces absurd results: "ä" sorts after "z", "纳" appears nowhere near "那" despite being phonetically similar, and "naïve" sorts far from "naive". The Unicode Collation Algorithm (UCA), specified in Unicode Technical Standard #10, provides a framework for language-aware sorting.
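The problem described above is easy to reproduce with Python's built-in string comparison, which orders strings by code point. The word list here is made up purely for illustration:

```python
# Illustrative word list (not from any real dataset).
words = ["zebra", "äpfel", "apple", "naive", "nap", "naïve"]

# Python's default string comparison is by code point, with no language awareness.
by_code_point = sorted(words)

# Sorting by UTF-8 bytes gives the same order, because UTF-8 preserves
# code point order.
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

print(by_code_point)
# "ä" (U+00E4) has a higher code point than "z" (U+007A), so "äpfel" lands
# after "zebra", and "nap" ends up between "naive" and "naïve".
```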

Multi-Level Comparison Keys

The UCA uses up to four comparison levels, applied in order. If two strings are equal at level 1, level 2 is consulted, and so on:

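| Level | Name | Distinguishes |
|---|---|---|
| L1 | Primary | Base characters (a ≠ b, a = á at this level) |
| L2 | Secondary | Accents / diacritics (a ≠ á) |
| L3 | Tertiary | Case / variants (a ≠ A, A ≠ fullwidth Ａ) |
| L4 | Quaternary | Punctuation and other special marks (when they are configured to be ignored at levels 1–3) |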

This means a case-insensitive, accent-insensitive search operates at level 1: "naïve" equals "naive" equals "NAIVE". A case-insensitive but accent-sensitive sort uses levels 1–2: "naive" < "naïve" because they differ at level 2.

import locale

# On Linux/macOS with proper locale support
locale.setlocale(locale.LC_ALL, "en_US.UTF-8")
words = ["naive", "naïve", "NAIVE", "résumé", "resume"]
words.sort(key=locale.strxfrm)
print(words)  # locale-aware sort

# For robust multilingual sorting, use the PyICU library
# pip install pyicu
from icu import Collator, Locale
collator = Collator.createInstance(Locale("en_US"))
words.sort(key=collator.getSortKey)
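To make the level structure concrete, here is a toy sort key built only from the standard library. It is a simplified sketch of the layering idea, not the real UCA: it uses no DUCET weights, and the function name `toy_sort_key` and its `strength` parameter are hypothetical names chosen for this illustration.

```python
import unicodedata

def toy_sort_key(s: str, strength: int = 3) -> tuple:
    """Toy multi-level key: same layering idea as the UCA, not the real algorithm."""
    decomposed = unicodedata.normalize("NFD", s)
    primary = []    # level 1: base letters, case-folded
    secondary = []  # level 2: combining marks attached to each base letter
    tertiary = []   # level 3: case flag per base letter (lowercase sorts first)
    for ch in decomposed:
        if unicodedata.combining(ch):
            # Attach the accent to the preceding base letter; a leading
            # combining mark with no base is simply dropped in this sketch.
            if secondary:
                secondary[-1] += ch
            continue
        primary.append(ch.casefold())
        secondary.append("")
        tertiary.append(ch.isupper())
    levels = (tuple(primary), tuple(secondary), tuple(tertiary))
    # Tuples compare left to right, so level 2 matters only when level 1 ties.
    return levels[:strength]

# Level 1 only: accents and case are ignored.
print(toy_sort_key("naïve", 1) == toy_sort_key("NAIVE", 1))

# Levels 1-3: an accent difference outranks a case difference.
print(sorted(["NAIVE", "naïve", "naive"], key=toy_sort_key))
```

Truncating the tuple with `strength` mirrors how a collator can be limited to primary or secondary strength for searching, while a full-strength key is used for sorting.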

CLDR Tailorings

The UCA defines a default collation order (DUCET — Default Unicode Collation Element Table), but different locales require different sort orders. The Unicode Common Locale Data Repository (CLDR) provides tailorings that modify the UCA for specific languages:

  • In traditional Swedish ordering, v and w are treated as equivalent at the primary level; Swedish also sorts å, ä, and ö as separate letters after z
  • In Spanish (traditional), ch sorts as a single unit after all other words beginning with c
  • In German (phone book order), ä = ae, so "Ärger" sorts as if spelled "Aerger", near "Aero"
  • In Japanese, hiragana and katakana may be treated as equivalent at level 1

Without CLDR tailorings, sorting with DUCET alone produces results that feel wrong to native speakers, for example for Swedish text or for German text expected in phone book order.
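As a rough illustration of what a tailoring does, the sketch below hand-codes the traditional Spanish rule that "ch" is a single unit sorting after "c". This is a toy stand-in, not CLDR data or the ICU implementation; `ALPHABET`, `RANK`, and `traditional_spanish_key` are hypothetical names, and only unaccented a–z is handled.

```python
import string

# Toy alphabet: a-z with the digraph "ch" inserted right after "c".
ALPHABET = []
for letter in string.ascii_lowercase:
    ALPHABET.append(letter)
    if letter == "c":
        ALPHABET.append("ch")
RANK = {element: i for i, element in enumerate(ALPHABET)}

def traditional_spanish_key(word: str) -> list:
    """Map a word to a list of ranks, treating "ch" as one collation element."""
    word = word.lower()
    key = []
    i = 0
    while i < len(word):
        if word.startswith("ch", i):
            key.append(RANK["ch"])   # contraction: two letters, one element
            i += 2
        else:
            # Characters outside the toy alphabet sort after everything else.
            key.append(RANK.get(word[i], len(RANK)))
            i += 1
    return key

words = ["chico", "cuna", "dama", "cosa"]
print(sorted(words))                               # plain code point order
print(sorted(words, key=traditional_spanish_key))  # "ch" words after other "c" words
```

In practice a collator built from CLDR data (for example through PyICU, as shown earlier) would supply this behaviour; the point here is only that a tailoring changes which character sequences count as units and where they rank.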

Practical Python Sorting

# Simple locale-aware sort (depends on OS locale support)
import locale
locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
german_words = ["Äpfel", "Apfel", "Zorn", "außen"]
german_words.sort(key=locale.strxfrm)

# Normalize to NFC before building the sort key, so that composed and
# decomposed spellings of the same text receive the same key
import unicodedata
def collation_key(s: str) -> str:
    return locale.strxfrm(unicodedata.normalize("NFC", s))
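The reason for normalizing first can be seen without any locale support. The following sketch compares a precomposed and a decomposed spelling of the same word; the helper name `nfc` is just a local shorthand for this example.

```python
import unicodedata

composed = "r\u00e9sum\u00e9"        # "résumé" using precomposed é (U+00E9)
decomposed = "re\u0301sume\u0301"    # same text as e + combining acute (U+0301)

def nfc(s: str) -> str:
    """Shorthand for NFC normalization."""
    return unicodedata.normalize("NFC", s)

# The two spellings render identically but are different code point sequences,
# so a naive comparison treats them as different strings.
raw_equal = composed == decomposed

# After normalization they are the same string, so any key derived from the
# normalized text is the same for both.
normalized_equal = nfc(composed) == nfc(decomposed)

print(raw_equal, normalized_equal)
```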

Quick Facts

| Property | Value |
|---|---|
| Specification | Unicode Technical Standard #10 (UTS #10) |
| Full name | Unicode Collation Algorithm (UCA) |
| Default table | DUCET (Default Unicode Collation Element Table) |
| Locale data | CLDR (Common Locale Data Repository) |
| Comparison levels | 4 levels (primary, secondary, tertiary, quaternary) |
| Python (basic) | locale.strxfrm() |
| Python (full ICU) | PyICU library: Collator.createInstance() |
| Java / ICU | java.text.Collator, com.ibm.icu.text.Collator |

Related Terms

More Terms in Algorithms

Case Folding

Mapping characters to a common case form for case-insensitive comparison. More comprehensive …

Grapheme Cluster Boundary

Rules (UAX#29) for determining where one user-perceived character ends and another begins. …

NFC (Canonical Composition)

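Normalization Form C: decomposes, then canonically recomposes, producing the shortest form. Recommended for data storage and interchange, and the standard form on the Web.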

NFD (Canonical Decomposition)

Normalization Form D: fully decomposes without recomposing. The macOS HFS+ file system uses a variant of it. é (U+00E9) → e + ◌́ (U+0065 + U+0301).

NFKC (Compatibility Composition)

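Normalization Form KC: compatibility decomposition followed by canonical composition. Unifies visually similar characters (ﬁ→fi, ²→2, Ⅳ→IV). Used for identifier comparison.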

NFKD (Compatibility Decomposition)

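Normalization Form KD: compatibility decomposition without recomposition. The most aggressive normalization, and the one that loses the most formatting information.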

String Comparison

Comparing Unicode strings requires normalization (NFC/NFD) and optionally collation (locale-aware sorting). Binary …

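Unicode Text Segmentation

Algorithms for finding text boundaries: grapheme cluster, word, and sentence boundaries. Essential for cursor movement, text selection, and text processing.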

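Unicode Bidirectional Algorithm (UBA)

An algorithm that determines the display order of mixed-direction text (e.g. English + Arabic) using each character's bidirectional category and explicit directional overrides.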

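Unicode Normalization

The process of converting Unicode text into a standard normal form. There are four forms: NFC (composition), NFD (decomposition), NFKC (compatibility composition), and NFKD (compatibility decomposition).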