The Developer's Unicode Handbook · Chapter 4
Search That Actually Works
Unicode-aware search requires case folding, accent-insensitive matching, and normalization. This chapter covers practical techniques for building search that works across all languages.
Search is where Unicode complexity becomes most visible to end users. A user searches for "cafe" and expects to find "café." A German user searches for "muller" and expects "Müller." A Japanese user types a phrase and expects matches regardless of the half-width/full-width distinction. Most search implementations silently fail all these cases. This chapter shows you how to build search that actually works.
The Foundation: Normalize Before You Search
Every Unicode-aware search implementation starts with normalization. Before indexing and before querying, normalize all text to the same form. This eliminates the NFC/NFD mismatch problem described in the previous chapter.
import unicodedata

def normalize_for_search(text: str) -> str:
    # Normalize text for indexing or querying.
    # Step 1: Unicode normalize (NFKD for maximum compatibility).
    # NFKD decomposes compatibility characters AND separates base + combining marks.
    return unicodedata.normalize("NFKD", text)

# Index time:
doc_text = "Résumé of François Müller"
indexed_text = normalize_for_search(doc_text)
# Query time (same normalization):
query = "resume"
normalized_query = normalize_for_search(query)
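To make Step 1 concrete, here is a small sketch of what NFKD buys you: compatibility characters such as ligatures and superscripts decompose to their plain equivalents, and accented letters split into a base letter plus a combining mark (ready for the diacritic folding covered later in this chapter):

```python
import unicodedata

print(unicodedata.normalize("NFKD", "ﬁle"))  # 'file': the ﬁ ligature decomposes
print(unicodedata.normalize("NFKD", "x²"))   # 'x2': superscript two becomes a plain 2
print(unicodedata.normalize("NFKD", "résumé") == "re\u0301sume\u0301")  # True
```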
Case-Insensitive Search Across Languages
Case folding must happen after normalization, and must use .casefold() not .lower():
def search_normalize(text: str) -> str:
    # Full normalization pipeline for case-insensitive search.
    # 1. NFKD normalization (decomposes everything)
    text = unicodedata.normalize("NFKD", text)
    # 2. Case fold
    return text.casefold()

def case_insensitive_contains(haystack: str, needle: str) -> bool:
    return search_normalize(needle) in search_normalize(haystack)

# Works across many languages:
print(case_insensitive_contains("ÜNIVERSITÄT", "universität"))  # True
print(case_insensitive_contains("Ελλάδα", "ελλάδα"))  # True
print(case_insensitive_contains("Straße", "strasse"))  # True (German ß casefolds to "ss")
# Note: "universitat" without the umlaut would NOT match yet;
# that requires diacritic folding, covered in the next section.
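The difference between .lower() and .casefold() is easy to see with German ß; a minimal sketch:

```python
# .lower() leaves ß alone; .casefold() applies the full Unicode case folding
print("Straße".lower())     # 'straße'
print("Straße".casefold())  # 'strasse'
print("STRASSE".casefold() == "Straße".casefold())  # True
print("STRASSE".lower() == "Straße".lower())        # False
```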
Accent-Insensitive Search: Folding Diacritics
After NFKD normalization, removing combining marks (diacritics) gives you accent-insensitive search. This lets "cafe" match "café" and "resume" match "résumé":
import unicodedata

def fold_diacritics(text: str) -> str:
    # Remove diacritical marks from text.
    # NFKD decomposes characters into base + combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    # Remove all combining characters (category Mn = Mark, Nonspacing)
    return "".join(
        c for c in decomposed
        if unicodedata.category(c) != "Mn"
    )

def accent_insensitive_contains(haystack: str, needle: str) -> bool:
    return fold_diacritics(needle).casefold() in fold_diacritics(haystack).casefold()
# Test cases
print(accent_insensitive_contains("café au lait", "cafe")) # True
print(accent_insensitive_contains("Ñoño", "nono")) # True
print(accent_insensitive_contains("Ångström", "angstrom")) # True
print(accent_insensitive_contains("naïve", "naive")) # True
Caution: Accent-insensitive search is not always desirable. In Spanish, "año" (year) and "ano" (anus) are different words; in French, "ou" (or) and "où" (where) are different. Make diacritic folding an option the user can control rather than an unconditional default.
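The collision is easy to demonstrate with the fold_diacritics() function from above (reproduced here so the sketch runs standalone):

```python
import unicodedata

def fold_diacritics(text: str) -> str:
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(fold_diacritics("año"))  # 'ano': Spanish "year" now collides with a very different word
print(fold_diacritics("où"))   # 'ou': French "where" collides with "or"
```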
CJK Full-Text Search: No Word Boundaries
Chinese, Japanese, and Korean (CJK) text has no spaces between words. A search for "東京都" must match inside "東京都知事" without relying on word boundaries. This requires a different approach than the simple in operator:
# A plain substring search works for CJK precisely because there are
# no word boundaries to respect — but we still need normalization
def cjk_search_normalize(text: str) -> str:
    # Normalization for CJK text search.
    # NFKC folds fullwidth/halfwidth compatibility forms,
    # e.g. ａｂｃ → abc, ① → 1
    return unicodedata.normalize("NFKC", text)

query = "\uff41\uff42\uff43"  # "ａｂｃ" (fullwidth Latin letters)
doc = "abc123"                # halfwidth (plain ASCII)

# Without normalization:
print(query in doc)  # False

# With NFKC normalization:
print(cjk_search_normalize(query) in cjk_search_normalize(doc))  # True
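NFKC also goes the other way for halfwidth katakana, which is common in legacy Japanese data: it composes the halfwidth form (where the dakuten is a separate character) into regular katakana. A small sketch:

```python
import unicodedata

halfwidth = "ﾃﾞｰﾀ"  # halfwidth katakana spelling of データ ("data")
print(unicodedata.normalize("NFKC", halfwidth))  # 'データ'
print(unicodedata.normalize("NFKC", halfwidth) == "データ")  # True
```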
For production CJK search, you need a tokenizer that understands word boundaries in Chinese and Japanese. The most common approach:
# Japanese tokenization with fugashi (MeCab wrapper)
# pip install fugashi unidic-lite
import fugashi

tagger = fugashi.Tagger()

def tokenize_japanese(text: str) -> list[str]:
    return [word.surface for word in tagger(text)]

text = "東京都に住んでいます"
tokens = tokenize_japanese(text)
print(tokens)  # ['東京', '都', 'に', '住ん', 'で', 'い', 'ます']
# Chinese tokenization with jieba
# pip install jieba
import jieba

def tokenize_chinese(text: str) -> list[str]:
    return list(jieba.cut(text))

text = "我在北京大学学习"
tokens = tokenize_chinese(text)
print(tokens)  # ['我', '在', '北京大学', '学习']
For Elasticsearch/OpenSearch, use the kuromoji analyzer for Japanese and smartcn for Chinese.
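When no tokenizer is available for a given language, character n-gram indexing is a common language-agnostic fallback for CJK substring search. A sketch (the function names are illustrative, not from any library):

```python
def char_bigrams(text: str) -> set[str]:
    # "東京都" -> {"東京", "京都"}
    return {text[i:i + 2] for i in range(len(text) - 1)}

def bigram_candidate(doc: str, query: str) -> bool:
    # A document is a match candidate if it contains every bigram of the
    # query; a final substring check rules out false positives.
    if len(query) < 2:
        return query in doc
    return char_bigrams(query) <= char_bigrams(doc) and query in doc

print(bigram_candidate("東京都知事", "東京都"))  # True
print(bigram_candidate("京都旅行", "東京都"))    # False
```

In a real index you would map each bigram to the set of document IDs containing it, so candidate lookup is an intersection of posting lists rather than a scan.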
Regex Unicode Categories
The regex module (not the built-in re) supports Unicode properties like \p{L} (any letter) and \p{N} (any number). This is essential for language-agnostic text processing:
import regex  # pip install regex

text = "Hello Привет 你好 مرحبا 123 ١٢٣"

# \p{L} matches any Unicode letter
letters = regex.findall(r"\p{L}+", text)
print(letters)  # ['Hello', 'Привет', '你好', 'مرحبا']

# \p{N} matches any Unicode number (including Arabic-Indic digits)
numbers = regex.findall(r"\p{N}+", text)
print(numbers)  # ['123', '١٢٣'] — Arabic-Indic digits included!

# \p{Lu} = uppercase letter, \p{Ll} = lowercase letter
# \p{Zs} = space separator (includes non-breaking spaces)
# \p{P} = punctuation
# \p{So} = other symbols (includes many emoji)

# Note: Python's built-in re is Unicode-aware for str patterns,
# but it has no \p{...} property classes; for those you need regex:
text = "über-cool résumé"
words = regex.findall(r"\p{L}+", text)
print(words)  # ['über', 'cool', 'résumé']
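These property classes compose into a simple language-agnostic tokenizer. A sketch (the pattern is illustrative; real tokenizers need more care around hyphens and apostrophes):

```python
import regex  # pip install regex

def unicode_tokens(text: str) -> list[str]:
    # Runs of letters (plus any combining marks) or digits become tokens
    return regex.findall(r"[\p{L}\p{M}]+|\p{N}+", text)

print(unicode_tokens("Hello, Привет! 你好 naïve 123 ١٢٣"))
# ['Hello', 'Привет', '你好', 'naïve', '123', '١٢٣']
```

Including \p{M} keeps combining marks attached to their base letters, so the tokenizer works on NFD text too.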
Database Full-Text Search with Unicode
PostgreSQL's tsvector respects the language-specific configuration you set, but you need to set it correctly:
-- Create a text search index with language-specific stemming
CREATE INDEX idx_fts ON articles
    USING GIN (to_tsvector('french', content));

-- Accent-insensitive matching: install the unaccent extension first
CREATE EXTENSION unaccent;

CREATE TEXT SEARCH CONFIGURATION french_unaccent (COPY = french);

ALTER TEXT SEARCH CONFIGURATION french_unaccent
    ALTER MAPPING FOR hword, hword_part, word
    WITH unaccent, french_stem;

SELECT * FROM articles
WHERE to_tsvector('french_unaccent', content)
      @@ plainto_tsquery('french_unaccent', 'cafe');
-- Matches: café, cafés, Café — accent-insensitive + stemmed
For Elasticsearch, configure a custom analyzer that applies normalization, accent folding, and stemming in the correct order:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "unicode_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding", "stop"]
        }
      }
    }
  }
}
asciifolding does roughly the same job as our fold_diacritics() function — it maps accented characters to their ASCII base. Note that it covers Latin-script characters only, unlike the Unicode-wide approach of stripping Mn-category marks.
Trigram Search for Fuzzy Matching
For "did you mean" fuzzy search across Unicode text, trigrams work well because they're language-agnostic:
def get_trigrams(text: str) -> set[str]:
    # Generate character trigrams from normalized text.
    normalized = fold_diacritics(text).casefold()
    padded = f" {normalized} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_similarity(a: str, b: str) -> float:
    # Compute trigram similarity between two strings (0.0–1.0).
    tri_a = get_trigrams(a)
    tri_b = get_trigrams(b)
    if not tri_a or not tri_b:
        return 0.0
    intersection = len(tri_a & tri_b)
    union = len(tri_a | tri_b)
    return intersection / union

print(trigram_similarity("café", "cafe"))     # 1.0 (identical after folding)
print(trigram_similarity("Müller", "Muller")) # 1.0 (identical after folding)
print(trigram_similarity("hello", "world"))   # Low similarity
PostgreSQL's pg_trgm extension provides trigram similarity search with GIN indexes — dramatically faster than the Python implementation for large datasets.
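The similarity function above extends naturally into a "did you mean" ranker. A sketch that reuses the helpers (reproduced so it runs standalone; the 0.3 threshold is an arbitrary illustration, tune it for your data):

```python
import unicodedata

def fold_diacritics(text: str) -> str:
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

def get_trigrams(text: str) -> set[str]:
    padded = f" {fold_diacritics(text).casefold()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_similarity(a: str, b: str) -> float:
    tri_a, tri_b = get_trigrams(a), get_trigrams(b)
    if not tri_a or not tri_b:
        return 0.0
    return len(tri_a & tri_b) / len(tri_a | tri_b)

def did_you_mean(query: str, vocabulary: list[str], threshold: float = 0.3) -> list[str]:
    # Rank vocabulary terms by similarity to the query, best first,
    # and drop anything below the threshold
    scored = sorted(
        ((trigram_similarity(query, term), term) for term in vocabulary),
        reverse=True,
    )
    return [term for score, term in scored if score >= threshold]

print(did_you_mean("muller", ["Müller", "Miller", "Mahler", "Weber"]))
# ['Müller', 'Miller']
```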
Building a Complete Search Pipeline
A production Unicode search pipeline for a multilingual application:
import unicodedata
import regex

def build_search_index_token(text: str) -> str:
    # Full search normalization pipeline.
    # Apply to both documents at index time and queries at search time.
    # 1. Unicode normalization (NFKD separates base + combining)
    text = unicodedata.normalize("NFKD", text)
    # 2. Remove combining marks (diacritics)
    text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    # 3. Case fold
    text = text.casefold()
    # 4. Collapse whitespace
    text = regex.sub(r"\s+", " ", text).strip()
    return text
# Index time
documents = [
    {"id": 1, "title": "Résumé of François"},
    {"id": 2, "title": "Müller's Guide to Über-cool Tech"},
    {"id": 3, "title": "日本語の文章"},
]

index = {
    build_search_index_token(doc["title"]): doc
    for doc in documents
}

# Search time
def search(query: str) -> list[dict]:
    normalized_query = build_search_index_token(query)
    return [doc for key, doc in index.items() if normalized_query in key]

print(search("resume"))  # Finds "Résumé of François"
print(search("muller"))  # Finds "Müller's Guide"
print(search("uber"))    # Finds "über-cool"
The key insight is that normalization must be symmetric: whatever transformation you apply to documents at index time, you must apply identically to queries at search time. A mismatch between index-time and query-time normalization is the most common cause of "the document is there but search can't find it" bugs.
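The failure mode is easy to reproduce. A minimal sketch: the document is indexed in NFC, the query arrives in NFD (as macOS file APIs produce), and the match fails until both sides go through the same normalization:

```python
import unicodedata

indexed = unicodedata.normalize("NFC", "café")  # index time: precomposed
query = unicodedata.normalize("NFD", "café")    # query time: decomposed

print(query in indexed)  # False: same visible text, no match (the bug)

norm = lambda s: unicodedata.normalize("NFKD", s)
print(norm(query) in norm(indexed))  # True: symmetric normalization fixes it
```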