The Developer's Unicode Handbook · Chapter 4
Search That Actually Works
Unicode-aware search requires case folding, accent-insensitive matching, and normalization. This chapter covers practical techniques for building search that works across all languages.
Search is where Unicode complexity becomes most visible to end users. A user searches for "cafe" and expects to find "café." A German user searches for "muller" and expects "Müller." A Japanese user types a phrase and expects matches regardless of the half-width/full-width distinction. Most search implementations silently fail all these cases. This chapter shows you how to build search that actually works.
The Foundation: Normalize Before You Search
Every Unicode-aware search implementation starts with normalization. Before indexing and before querying, normalize all text to the same form. This eliminates the NFC/NFD mismatch problem described in the previous chapter.
import unicodedata

def normalize_for_search(text: str) -> str:
    # Normalize text for indexing or querying.
    # Step 1: Unicode normalize (NFKD for maximum compatibility).
    # NFKD decomposes compatibility characters AND separates base + combining marks.
    return unicodedata.normalize("NFKD", text)

# Index time:
doc_text = "Résumé of François Müller"
indexed_text = normalize_for_search(doc_text)
# Query time (same normalization):
query = "resume"
normalized_query = normalize_for_search(query)
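To make Step 1 concrete, here is a small sketch of what NFKD buys you: compatibility characters such as ligatures and superscripts decompose to their plain equivalents, and accented letters split into a base letter plus a combining mark (ready for the diacritic folding covered later in this chapter):

```python
import unicodedata

print(unicodedata.normalize("NFKD", "ﬁle"))  # 'file': the ﬁ ligature decomposes
print(unicodedata.normalize("NFKD", "x²"))   # 'x2': superscript two becomes a plain 2
print(unicodedata.normalize("NFKD", "résumé") == "re\u0301sume\u0301")  # True
```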
Case-Insensitive Search Across Languages
Case folding must happen after normalization, and must use .casefold() not .lower():
def search_normalize(text: str) -> str:
    # Full normalization pipeline for case-insensitive search.
    # 1. NFKD normalization (decomposes everything)
    text = unicodedata.normalize("NFKD", text)
    # 2. Case fold
    return text.casefold()

def case_insensitive_contains(haystack: str, needle: str) -> bool:
    return search_normalize(needle) in search_normalize(haystack)

# Works across many languages:
print(case_insensitive_contains("ÜNIVERSITÄT", "universität"))  # True
print(case_insensitive_contains("Ελλάδα", "ελλάδα"))  # True
print(case_insensitive_contains("Straße", "strasse"))  # True (German ß casefolds to "ss")
# Note: "universitat" without the umlaut would NOT match yet;
# that requires diacritic folding, covered in the next section.
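The difference between .lower() and .casefold() is easy to see with German ß; a minimal sketch:

```python
# .lower() leaves ß alone; .casefold() applies the full Unicode case folding
print("Straße".lower())     # 'straße'
print("Straße".casefold())  # 'strasse'
print("STRASSE".casefold() == "Straße".casefold())  # True
print("STRASSE".lower() == "Straße".lower())        # False
```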
Accent-Insensitive Search: Folding Diacritics
After NFKD normalization, removing combining marks (diacritics) gives you accent-insensitive search. This lets "cafe" match "café" and "resume" match "résumé":
import unicodedata

def fold_diacritics(text: str) -> str:
    # Remove diacritical marks from text.
    # NFKD decomposes characters into base + combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    # Remove all combining characters (category Mn = Mark, Nonspacing)
    return "".join(
        c for c in decomposed
        if unicodedata.category(c) != "Mn"
    )

def accent_insensitive_contains(haystack: str, needle: str) -> bool:
    return fold_diacritics(needle).casefold() in fold_diacritics(haystack).casefold()
# Test cases
print(accent_insensitive_contains("café au lait", "cafe")) # True
print(accent_insensitive_contains("Ñoño", "nono")) # True
print(accent_insensitive_contains("Ångström", "angstrom")) # True
print(accent_insensitive_contains("naïve", "naive")) # True
Caution: Accent-insensitive search is not always desirable. In Spanish, "año" (year) and "ano" (anus) are different words; in French, "ou" (or) and "où" (where) are different. Make diacritic folding an option the user can control rather than an unconditional default.
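The collision is easy to demonstrate with the fold_diacritics() function from above (reproduced here so the sketch runs standalone):

```python
import unicodedata

def fold_diacritics(text: str) -> str:
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(fold_diacritics("año"))  # 'ano': Spanish "year" now collides with a very different word
print(fold_diacritics("où"))   # 'ou': French "where" collides with "or"
```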
CJK Full-Text Search: No Word Boundaries
Chinese, Japanese, and Korean (CJK) text has no spaces between words. A search for "東京都" must match inside "東京都知事" without relying on word boundaries. This requires a different approach than the simple in operator:
# A plain substring search works for CJK precisely because there are
# no word boundaries to respect — but we still need normalization
def cjk_search_normalize(text: str) -> str:
    # Normalization for CJK text search.
    # NFKC folds fullwidth/halfwidth compatibility forms,
    # e.g. ａｂｃ → abc, ① → 1
    return unicodedata.normalize("NFKC", text)

query = "\uff41\uff42\uff43"  # "ａｂｃ" (fullwidth Latin letters)
doc = "abc123"                # halfwidth (plain ASCII)

# Without normalization:
print(query in doc)  # False

# With NFKC normalization:
print(cjk_search_normalize(query) in cjk_search_normalize(doc))  # True
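NFKC also goes the other way for halfwidth katakana, which is common in legacy Japanese data: it composes the halfwidth form (where the dakuten is a separate character) into regular katakana. A small sketch:

```python
import unicodedata

halfwidth = "ﾃﾞｰﾀ"  # halfwidth katakana spelling of データ ("data")
print(unicodedata.normalize("NFKC", halfwidth))  # 'データ'
print(unicodedata.normalize("NFKC", halfwidth) == "データ")  # True
```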
For production CJK search, you need a tokenizer that understands word boundaries in Chinese and Japanese. The most common approach:
# Japanese tokenization with fugashi (MeCab wrapper)
# pip install fugashi unidic-lite
import fugashi

tagger = fugashi.Tagger()

def tokenize_japanese(text: str) -> list[str]:
    return [word.surface for word in tagger(text)]

text = "東京都に住んでいます"
tokens = tokenize_japanese(text)
print(tokens)  # ['東京', '都', 'に', '住ん', 'で', 'い', 'ます']
# Chinese tokenization with jieba
# pip install jieba
import jieba

def tokenize_chinese(text: str) -> list[str]:
    return list(jieba.cut(text))

text = "我在北京大学学习"
tokens = tokenize_chinese(text)
print(tokens)  # ['我', '在', '北京大学', '学习']
For Elasticsearch/OpenSearch, use the kuromoji analyzer for Japanese and smartcn for Chinese.
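When no tokenizer is available for a given language, character n-gram indexing is a common language-agnostic fallback for CJK substring search. A sketch (the function names are illustrative, not from any library):

```python
def char_bigrams(text: str) -> set[str]:
    # "東京都" -> {"東京", "京都"}
    return {text[i:i + 2] for i in range(len(text) - 1)}

def bigram_candidate(doc: str, query: str) -> bool:
    # A document is a match candidate if it contains every bigram of the
    # query; a final substring check rules out false positives.
    if len(query) < 2:
        return query in doc
    return char_bigrams(query) <= char_bigrams(doc) and query in doc

print(bigram_candidate("東京都知事", "東京都"))  # True
print(bigram_candidate("京都旅行", "東京都"))    # False
```

In a real index you would map each bigram to the set of document IDs containing it, so candidate lookup is an intersection of posting lists rather than a scan.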
Regex Unicode Categories
The regex module (not the built-in re) supports Unicode properties like \p{L} (any letter) and \p{N} (any number). This is essential for language-agnostic text processing:
import regex  # pip install regex

text = "Hello Привет 你好 مرحبا 123 ١٢٣"

# \p{L} matches any Unicode letter
letters = regex.findall(r"\p{L}+", text)
print(letters)  # ['Hello', 'Привет', '你好', 'مرحبا']

# \p{N} matches any Unicode number (including Arabic-Indic digits)
numbers = regex.findall(r"\p{N}+", text)
print(numbers)  # ['123', '١٢٣'] — Arabic-Indic digits included!

# \p{Lu} = uppercase letter, \p{Ll} = lowercase letter
# \p{Zs} = space separator (includes non-breaking spaces)
# \p{P} = punctuation
# \p{So} = other symbols (includes many emoji)

# Note: Python's built-in re is Unicode-aware for str patterns,
# but it has no \p{...} property classes; for those you need regex:
text = "über-cool résumé"
words = regex.findall(r"\p{L}+", text)
print(words)  # ['über', 'cool', 'résumé']
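These property classes compose into a simple language-agnostic tokenizer. A sketch (the pattern is illustrative; real tokenizers need more care around hyphens and apostrophes):

```python
import regex  # pip install regex

def unicode_tokens(text: str) -> list[str]:
    # Runs of letters (plus any combining marks) or digits become tokens
    return regex.findall(r"[\p{L}\p{M}]+|\p{N}+", text)

print(unicode_tokens("Hello, Привет! 你好 naïve 123 ١٢٣"))
# ['Hello', 'Привет', '你好', 'naïve', '123', '١٢٣']
```

Including \p{M} keeps combining marks attached to their base letters, so the tokenizer works on NFD text too.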
Database Full-Text Search with Unicode
PostgreSQL's tsvector respects the language-specific configuration you set, but you need to set it correctly:
-- Create a text search index with language-specific stemming
CREATE INDEX idx_fts ON articles
    USING GIN (to_tsvector('french', content));

-- Accent-insensitive matching: install the unaccent extension first
CREATE EXTENSION unaccent;

CREATE TEXT SEARCH CONFIGURATION french_unaccent (COPY = french);

ALTER TEXT SEARCH CONFIGURATION french_unaccent
    ALTER MAPPING FOR hword, hword_part, word
    WITH unaccent, french_stem;

SELECT * FROM articles
WHERE to_tsvector('french_unaccent', content)
      @@ plainto_tsquery('french_unaccent', 'cafe');
-- Matches: café, cafés, Café — accent-insensitive + stemmed
For Elasticsearch, configure a custom analyzer that applies normalization, accent folding, and stemming in the correct order:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "unicode_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding", "stop"]
        }
      }
    }
  }
}
asciifolding does roughly the same job as our fold_diacritics() function — it maps accented characters to their ASCII base. Note that it covers Latin-script characters only, unlike the Unicode-wide approach of stripping Mn-category marks.
Trigram Search for Fuzzy Matching
For "did you mean" fuzzy search across Unicode text, trigrams work well because they're language-agnostic:
def get_trigrams(text: str) -> set[str]:
    # Generate character trigrams from normalized text.
    normalized = fold_diacritics(text).casefold()
    padded = f" {normalized} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_similarity(a: str, b: str) -> float:
    # Compute trigram similarity between two strings (0.0–1.0).
    tri_a = get_trigrams(a)
    tri_b = get_trigrams(b)
    if not tri_a or not tri_b:
        return 0.0
    intersection = len(tri_a & tri_b)
    union = len(tri_a | tri_b)
    return intersection / union

print(trigram_similarity("café", "cafe"))     # 1.0 (identical after folding)
print(trigram_similarity("Müller", "Muller")) # 1.0 (identical after folding)
print(trigram_similarity("hello", "world"))   # Low similarity
PostgreSQL's pg_trgm extension provides trigram similarity search with GIN indexes — dramatically faster than the Python implementation for large datasets.
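The similarity function above extends naturally into a "did you mean" ranker. A sketch that reuses the helpers (reproduced so it runs standalone; the 0.3 threshold is an arbitrary illustration, tune it for your data):

```python
import unicodedata

def fold_diacritics(text: str) -> str:
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

def get_trigrams(text: str) -> set[str]:
    padded = f" {fold_diacritics(text).casefold()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_similarity(a: str, b: str) -> float:
    tri_a, tri_b = get_trigrams(a), get_trigrams(b)
    if not tri_a or not tri_b:
        return 0.0
    return len(tri_a & tri_b) / len(tri_a | tri_b)

def did_you_mean(query: str, vocabulary: list[str], threshold: float = 0.3) -> list[str]:
    # Rank vocabulary terms by similarity to the query, best first,
    # and drop anything below the threshold
    scored = sorted(
        ((trigram_similarity(query, term), term) for term in vocabulary),
        reverse=True,
    )
    return [term for score, term in scored if score >= threshold]

print(did_you_mean("muller", ["Müller", "Miller", "Mahler", "Weber"]))
# ['Müller', 'Miller']
```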
Building a Complete Search Pipeline
A production Unicode search pipeline for a multilingual application:
import unicodedata
import regex

def build_search_index_token(text: str) -> str:
    # Full search normalization pipeline.
    # Apply to both documents at index time and queries at search time.
    # 1. Unicode normalization (NFKD separates base + combining)
    text = unicodedata.normalize("NFKD", text)
    # 2. Remove combining marks (diacritics)
    text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    # 3. Case fold
    text = text.casefold()
    # 4. Collapse whitespace
    text = regex.sub(r"\s+", " ", text).strip()
    return text
# Index time
documents = [
    {"id": 1, "title": "Résumé of François"},
    {"id": 2, "title": "Müller's Guide to Über-cool Tech"},
    {"id": 3, "title": "日本語の文章"},
]

index = {
    build_search_index_token(doc["title"]): doc
    for doc in documents
}

# Search time
def search(query: str) -> list[dict]:
    normalized_query = build_search_index_token(query)
    return [doc for key, doc in index.items() if normalized_query in key]

print(search("resume"))  # Finds "Résumé of François"
print(search("muller"))  # Finds "Müller's Guide"
print(search("uber"))    # Finds "über-cool"
The key insight is that normalization must be symmetric: whatever transformation you apply to documents at index time, you must apply identically to queries at search time. A mismatch between index-time and query-time normalization is the most common cause of "the document is there but search can't find it" bugs.
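The failure mode is easy to reproduce. A minimal sketch: the document is indexed in NFC, the query arrives in NFD (as macOS file APIs produce), and the match fails until both sides go through the same normalization:

```python
import unicodedata

indexed = unicodedata.normalize("NFC", "café")  # index time: precomposed
query = unicodedata.normalize("NFD", "café")    # query time: decomposed

print(query in indexed)  # False: same visible text, no match (the bug)

norm = lambda s: unicodedata.normalize("NFKD", s)
print(norm(query) in norm(indexed))  # True: symmetric normalization fixes it
```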