Unicode for the Modern Web · Chapter 5

Databases and Unicode: Collation Matters

MySQL's utf8 vs utf8mb4 gotcha has bitten countless developers. This chapter covers proper database encoding setup, PostgreSQL ICU collation, case-insensitive queries, and sorting internationalized data.


Database Unicode configuration is the silent killer of multilingual applications. A poorly configured database silently truncates emoji, corrupts Arabic text, or produces collation bugs that make records unfindable even when they clearly exist. Most of these issues have been lurking in production systems for years without anyone noticing — until a user sends an emoji in a product review or the search feature fails to find records with accented names. This chapter covers the full Unicode story for the three most widely deployed relational databases.

MySQL: The utf8 vs utf8mb4 Disaster

MySQL's greatest Unicode mistake was naming an incomplete encoding utf8. MySQL's utf8 character set is a 3-byte-maximum encoding — it cannot store characters above U+FFFF. Emoji live in the supplementary planes (😀, for example, is U+1F600) and therefore require 4 bytes in real UTF-8. When you insert an emoji into a utf8 column, MySQL truncates the string at the emoji and stores nothing after it. In non-strict SQL mode there is no error, only a warning that most applications never check. The result is silent data loss.
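The 3-byte boundary is easy to check from application code before data ever reaches the database. A minimal Python sketch; the fits_in_mysql_utf8 helper is illustrative, not a MySQL API:

```python
# Sketch: why MySQL's 3-byte "utf8" cannot hold emoji.
# Any code point above U+FFFF needs 4 bytes in real UTF-8.
def fits_in_mysql_utf8(s: str) -> bool:
    """True if every character encodes to at most 3 UTF-8 bytes."""
    return all(ord(ch) <= 0xFFFF for ch in s)

print(len("é".encode("utf-8")))       # 2 bytes, fine in MySQL utf8
print(len("😀".encode("utf-8")))      # 4 bytes: U+1F600 needs utf8mb4
print(fits_in_mysql_utf8("café"))     # True
print(fits_in_mysql_utf8("great 😀")) # False
```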

The fix, available since MySQL 5.5.3 (2010), is utf8mb4:

-- Check current database charset
SHOW CREATE DATABASE myapp;

-- Create database with proper encoding
CREATE DATABASE myapp
  CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_ci;

-- Convert existing database
ALTER DATABASE myapp
  CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_ci;

-- Convert existing table
ALTER TABLE users
  CONVERT TO CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_ci;

-- Convert specific column
ALTER TABLE posts
  MODIFY COLUMN body TEXT
  CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_ci;

Also configure the MySQL connection to use utf8mb4:

# /etc/mysql/my.cnf or ~/.my.cnf
[client]
default-character-set = utf8mb4

[mysql]
default-character-set = utf8mb4

[mysqld]
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci

In your application, set the connection charset explicitly:

# Django settings.py
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'OPTIONS': {
            'charset': 'utf8mb4',
        },
    }
}

MySQL Collation: utf8mb4_unicode_ci vs utf8mb4_0900_ai_ci

Collation controls sorting and comparison. The two most important MySQL 8 collations:

Collation            Unicode version  Algorithm  Case         Accent
utf8mb4_unicode_ci   4.0 (2003)       UCA 4.0.0  insensitive  insensitive
utf8mb4_0900_ai_ci   9.0 (modern)     UCA 9.0.0  insensitive  insensitive
utf8mb4_0900_as_cs   9.0              UCA 9.0.0  sensitive    sensitive
utf8mb4_bin          n/a              binary     sensitive    sensitive

utf8mb4_0900_ai_ci is the MySQL 8 default and uses a more current Unicode algorithm. ai means accent-insensitive (café = cafe), ci means case-insensitive (A = a). For case-sensitive and accent-sensitive storage, use utf8mb4_0900_as_cs.
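The ai/ci semantics can be approximated in application code when you need the same matching behavior outside the database. A rough Python sketch of the strip-accents-and-fold-case idea (not the real UCA weight tables the database uses):

```python
import unicodedata

def ai_ci_key(s: str) -> str:
    """Rough analogue of an ai_ci collation key: strip accents, fold case."""
    # NFD splits 'é' into 'e' + U+0301; drop the combining marks (category Mn)
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(ch for ch in decomposed
                       if unicodedata.category(ch) != "Mn")
    return stripped.casefold()

print(ai_ci_key("Café") == ai_ci_key("CAFE"))  # True under ai_ci semantics
print("Café" == "CAFE")                        # False under binary comparison
```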

-- Full-text search with unicode collation
CREATE FULLTEXT INDEX idx_body ON posts(body)
  WITH PARSER ngram;  -- for CJK text (no spaces between words)

SELECT * FROM posts
WHERE MATCH(body) AGAINST ('검색어' IN BOOLEAN MODE);

PostgreSQL: Native UTF-8 Done Right

PostgreSQL handles Unicode correctly by default. When you create a database with ENCODING = 'UTF8' (the default for most installations), PostgreSQL stores and retrieves full UTF-8 including 4-byte sequences:

-- Create database (UTF8 is default in most installs)
CREATE DATABASE myapp
  ENCODING = 'UTF8'
  LC_COLLATE = 'en_US.UTF-8'
  LC_CTYPE = 'en_US.UTF-8';

-- Verify encoding
SELECT datname, pg_encoding_to_char(encoding) FROM pg_database;

PostgreSQL collation comes from a collation provider: the C library (the operating system's locale data) or ICU. For Unicode-correct collation, use ICU collations (PostgreSQL 10+):

-- Create ICU-based collation for German
CREATE COLLATION german_icu (
  provider = icu,
  locale = 'de-u-ks-level2'  -- BCP 47 with UCA options
);

-- Use in column definition
CREATE TABLE products (
  name TEXT COLLATE german_icu
);

-- Full-text search with language configuration
CREATE INDEX idx_fts ON articles
  USING GIN(to_tsvector('german', body));

SELECT * FROM articles
WHERE to_tsvector('german', body) @@ to_tsquery('german', 'Käse');

PostgreSQL's pg_trgm extension enables fuzzy search and handles Unicode correctly:

CREATE EXTENSION pg_trgm;
CREATE INDEX idx_name_trgm ON users USING GIN(name gin_trgm_ops);

-- Finds "Müller", "muller", "Muller" with similarity threshold
SELECT * FROM users
WHERE similarity(name, 'Muller') > 0.4
ORDER BY similarity(name, 'Muller') DESC;
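The idea behind similarity() can be sketched in a few lines of Python: extract padded trigrams from each string and compare the sets. This is a simplification; pg_trgm's actual padding and multibyte handling differ in detail:

```python
def trigrams(s: str) -> set[str]:
    """Extract pg_trgm-style trigrams: lowercase, pad with spaces."""
    padded = "  " + s.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a: str, b: str) -> float:
    """Shared trigrams over the union of trigrams, pg_trgm's core idea."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

print(similarity("Muller", "Muller"))  # 1.0
print(similarity("Muller", "Müller"))  # 0.4 with this sketch
```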

SQL Server: nvarchar vs varchar

SQL Server distinguishes Unicode (nvarchar, nchar, ntext) from non-Unicode (varchar, char, text) column types. The non-Unicode types use the database's default code page (often Windows-1252), which cannot store most Unicode characters:

-- WRONG: varchar stores only code page characters
CREATE TABLE users_bad (
  name VARCHAR(255)  -- cannot store Arabic, CJK, emoji
);

-- CORRECT: nvarchar stores full Unicode (UCS-2/UTF-16 internally)
CREATE TABLE users_good (
  name NVARCHAR(255)  -- stores any BMP character
);

-- Literals must use N prefix for Unicode strings
INSERT INTO users_good VALUES (N'こんにちは');  -- N prefix required!
INSERT INTO users_good VALUES ('こんにちは');   -- WRONG: stored as ? marks
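The "? marks" failure is easy to reproduce outside the database: encoding text to a legacy single-byte code page replaces every character it cannot represent. In this Python sketch, cp1252 stands in for the column's code page:

```python
# What varchar does to text outside its code page: each Japanese
# character is replaced by '?', and the original is unrecoverable.
text = "こんにちは"
mangled = text.encode("cp1252", errors="replace").decode("cp1252")
print(mangled)  # '?????'
```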

SQL Server 2019 added UTF-8 support for varchar columns using _UTF8 collations:

-- SQL Server 2019+: UTF-8 varchar (stores 4-byte emoji!)
ALTER TABLE users
  ALTER COLUMN name VARCHAR(255) COLLATE Latin1_General_100_CI_AS_SC_UTF8;

SQL Server collation names encode multiple properties:

  • CI — case-insensitive
  • CS — case-sensitive
  • AI — accent-insensitive
  • AS — accent-sensitive
  • SC — supplementary characters (required for emoji support)
  • UTF8 — UTF-8 encoding (SQL Server 2019+)
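A small illustrative Python helper (not part of any SQL Server API) that reads these flags out of a collation name:

```python
# Hypothetical helper for illustration: parse property flags
# from a SQL Server collation name like Latin1_General_100_CI_AS_SC_UTF8.
def collation_flags(name: str) -> dict[str, bool]:
    parts = set(name.upper().split("_"))
    return {
        "case_sensitive": "CS" in parts,
        "accent_sensitive": "AS" in parts,
        "supplementary_chars": "SC" in parts,
        "utf8": "UTF8" in parts,
    }

flags = collation_flags("Latin1_General_100_CI_AS_SC_UTF8")
print(flags)  # case-insensitive, accent-sensitive, supplementary-aware, UTF-8
```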

Normalization at the Database Layer

Should you normalize Unicode (NFC/NFD) before storing in the database? Yes. Store NFC-normalized strings and normalize before querying:

# Django model: normalize on save
import unicodedata

class UserProfile(models.Model):
    name = models.CharField(max_length=200)

    def save(self, *args, **kwargs):
        self.name = unicodedata.normalize('NFC', self.name)
        super().save(*args, **kwargs)

-- PostgreSQL: normalize function (PostgreSQL 13+)
INSERT INTO users (name)
VALUES (normalize(E'e\u0301', NFC));  -- stores é (U+00E9)

-- Check normalization form
SELECT 'café' IS NFC NORMALIZED;  -- true if already NFC

Without normalization, searches fail: a user who stores their name as café (NFD) cannot be found by searching for café (NFC), even though they look identical.
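The mismatch is easy to demonstrate in Python, where the two visually identical spellings compare unequal until normalized:

```python
import unicodedata

nfc = "caf\u00e9"    # 'café' with a precomposed é (NFC)
nfd = "cafe\u0301"   # 'café' as 'e' + combining acute accent (NFD)

print(nfc == nfd)                                # False: bytes differ
print(unicodedata.normalize("NFC", nfd) == nfc)  # True after normalizing
print(len(nfc), len(nfd))                        # 4 vs 5 code points
```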

Django ORM Considerations

Django with PostgreSQL uses UTF-8 automatically. With MySQL, you must configure the connection charset and ensure migrations create tables with utf8mb4:

# Django + MySQL: force utf8mb4 for all new tables
# migrations/0001_initial.py
from django.db import migrations

class Migration(migrations.Migration):
    operations = [
        migrations.RunSQL(
            "ALTER DATABASE CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci",
            reverse_sql=migrations.RunSQL.noop,
        ),
    ]

For case-insensitive Django lookups that respect Unicode, use icontains (which uses the database collation) rather than Python-level lower():

# Correct: database-level case-insensitive search using collation
User.objects.filter(name__icontains='münchen')

# Risky: Python-level lower() mishandles Turkish dotless i and German ß
matches = [u for u in User.objects.all() if u.name.lower() == 'münchen']
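The Python-level pitfalls are easy to verify; casefold() handles the German ß case that lower() misses, and Turkish İ lowercases to two code points:

```python
# German ß: lower() keeps 'ß', so 'STRASSE' and 'straße' never match
print("STRASSE".lower() == "straße".lower())        # False
print("STRASSE".casefold() == "straße".casefold())  # True: ß folds to ss

# Turkish dotted capital İ lowercases to 'i' + combining dot, not plain 'i'
print("İ".lower() == "i")  # False
print(len("İ".lower()))    # 2 code points
```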

Indexing Unicode Columns

Unicode columns with high cardinality (name, email, title) benefit from indexes, but be aware:

  • MySQL: indexes on utf8mb4 columns use 4 bytes per character. A VARCHAR(255) index uses up to 1020 bytes per row, which may exceed the InnoDB index key limit (767 bytes for older configurations). Use prefix indexes for long columns: CREATE INDEX idx_name ON users (name(191));
  • PostgreSQL: no such limit; full column indexing works fine. Use expression indexes for case-insensitive searches: CREATE INDEX idx_name_lower ON users (lower(name));
  • SQL Server: nvarchar columns in indexes use 2 bytes per character; nvarchar(max) cannot be fully indexed.
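The MySQL arithmetic above, as a quick Python check (the helper name is made up for illustration):

```python
# Worst-case utf8mb4 index key size: 4 bytes per character.
def mysql_index_bytes(chars: int) -> int:
    return chars * 4

print(mysql_index_bytes(255))  # 1020: exceeds the old 767-byte InnoDB limit
print(mysql_index_bytes(191))  # 764: why the classic name(191) prefix fits
```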

The investment in correct Unicode database configuration pays dividends for the entire lifetime of your application. Fix it during development, not after a user reports that their carefully typed name is stored as question marks.