What is 유니코드 정규화?

유니코드 텍스트를 표준 정규 형식으로 변환하는 과정. 네 가지 형식: NFC(합성), NFD(분해), NFKC(호환 합성), NFKD(호환 분해).

What is 대소문자 변환?

문자를 대문자, 소문자, 제목 대문자로 변환하는 규칙. 로케일에 따라 달라질 수 있으며(터키어 I 문제), 일대다 매핑도 가능합니다(ß → SS).

알고리즘

유니코드 정렬 알고리즘 (UCA)

기본 문자 → 악센트 → 대소문자 → 동점 비교자의 다단계 비교를 통해 유니코드 문자열을 비교하고 정렬하는 표준 알고리즘. 로케일 맞춤 설정이 가능합니다.

2022-08-29 · Updated 2024-08-22

Sorting Strings Across Languages

ASCII sorting is simple: compare byte values. But for multilingual text, byte-order sorting produces absurd results: "ä" sorts after "z" in ASCII, "纳" appears nowhere near "那" despite being phonetically similar, and "naïve" sorts far from "naive". The Unicode Collation Algorithm (UCA), specified in Unicode Technical Standard #10, provides a framework for language-aware sorting.

Multi-Level Comparison Keys

The UCA uses up to four comparison levels, applied in order. If two strings are equal at level 1, level 2 is consulted, and so on:

Level	Name	Distinguishes
L1	Primary	Base characters (a ≠ b, a = á at this level)
L2	Secondary	Accents / diacritics (a ≠ á)
L3	Tertiary	Case / variants (a ≠ A, ＡK ≠ AK)
L4	Quaternary	Punctuation, special marks

This means a case-insensitive, accent-insensitive search operates at level 1: "naïve" equals "naive" equals "NAIVE". A case-insensitive but accent-sensitive sort uses levels 1–2: "naive" < "naïve" because they differ at level 2.

import locale

# On Linux/macOS with proper locale support
locale.setlocale(locale.LC_ALL, "en_US.UTF-8")
words = ["naive", "naïve", "NAIVE", "résumé", "resume"]
words.sort(key=locale.strxfrm)
print(words)  # locale-aware sort

# For robust multilingual sorting, use the PyICU library
# pip install pyicu
from icu import Collator, Locale
collator = Collator.createInstance(Locale("en_US"))
words.sort(key=collator.getSortKey)

CLDR Tailorings

The UCA defines a default collation order (DUCET — Default Unicode Collation Element Table), but different locales require different sort orders. The Unicode Common Locale Data Repository (CLDR) provides tailorings that modify the UCA for specific languages:

In Swedish, v and w are treated as equivalent at the primary level (both sort before x)
In Spanish (traditional), ch sorts as a unit after all c words
In German (phone book order), ä = ae, so "Ärger" sorts with "Ärger" near "Aero"
In Japanese, hiragana and katakana may be treated as equivalent at level 1

Without CLDR tailorings, sorting Swedish or German text with the default UCA produces results that feel wrong to native speakers.

Practical Python Sorting

# Simple locale-aware sort (depends on OS locale support)
import locale
locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
german_words = ["Äpfel", "Apfel", "Zorn", "außen"]
german_words.sort(key=locale.strxfrm)

# Check if a string needs normalization before collation
import unicodedata
def collation_key(s: str) -> str:
    return locale.strxfrm(unicodedata.normalize("NFC", s))

Quick Facts

Property	Value
Specification	Unicode Technical Standard #10 (UTS #10)
Full name	Unicode Collation Algorithm (UCA)
Default table	DUCET (Default Unicode Collation Element Table)
Locale data	CLDR (Common Locale Data Repository)
Comparison levels	4 levels (primary, secondary, tertiary, quaternary)
Python (basic)	`locale.strxfrm()`
Python (full ICU)	`PyICU` library — `Collator.createInstance()`
Java / ICU	`java.text.Collator`, `com.ibm.icu.text.Collator`