🔧 Practical Unicode

Unicode in Databases

Storing Unicode text in a database requires choosing the right character set, collation, and column type -- a wrong choice can silently truncate emoji or produce incorrect sort orders. This guide covers Unicode database configuration for MySQL, PostgreSQL, SQLite, and SQL Server, including the utf8mb4 gotcha in MySQL and how to migrate existing data.


Databases are where text lives long-term, and getting Unicode support wrong at the database layer creates problems that are extraordinarily painful to fix after the fact. A column that silently truncates emoji, a collation that treats "ä" and "a" as identical, or a connection that converts UTF-8 to Latin-1 behind your back -- these issues corrupt data permanently.

This guide covers the Unicode configuration you need to get right in MySQL, PostgreSQL, SQLite, and SQL Server, with emphasis on the specific settings that trip people up.

Core Concepts

Before diving into specific databases, understand three related but distinct concepts:

Character Set (Encoding)

The character set determines which characters can be stored and how they are encoded as bytes. For Unicode, you want UTF-8 (or its database-specific variant).

Collation

The collation determines how characters are compared and sorted. Two strings can contain the same characters but sort differently depending on the collation. Collation also controls case sensitivity, accent sensitivity, and Unicode normalization behavior.

Connection Encoding

The connection encoding tells the database how the client is sending and expecting data. Even if your table uses UTF-8, a Latin-1 connection will garble any non-ASCII data in transit.

All three must be consistent for Unicode to work correctly.
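To see why a mismatched connection encoding is so destructive, here is a minimal Python simulation (pure stdlib, no database involved) of UTF-8 data being decoded as Latin-1:

```python
# Simulated encoding mismatch: UTF-8 bytes sent over a connection that
# the server interprets as Latin-1. Every multi-byte character is mangled.
original = "café"                       # 'é' is U+00E9, two bytes in UTF-8
wire_bytes = original.encode("utf-8")   # b'caf\xc3\xa9'
garbled = wire_bytes.decode("latin-1")  # each byte decoded as its own character
print(garbled)  # cafÃ©
```

This is exactly the "cafÃ©" mojibake pattern you see when a UTF-8 client talks through a Latin-1 connection. If the garbled string is then written back, the corruption becomes permanent.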

MySQL / MariaDB

MySQL's Unicode support has a notorious history. The original utf8 character set only supports characters in the Basic Multilingual Plane (U+0000 to U+FFFF) -- it uses a maximum of 3 bytes per character. This means emoji, historic scripts, and rare CJK characters are silently truncated or rejected.
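A quick way to see which characters need that fourth byte -- this Python check measures UTF-8 byte widths directly:

```python
# UTF-8 byte width per character; MySQL's legacy utf8 stores at most 3 bytes.
widths = {ch: len(ch.encode("utf-8")) for ch in ("A", "ü", "中", "😀")}
print(widths)  # {'A': 1, 'ü': 2, '中': 3, '😀': 4}
# Only the emoji needs 4 bytes -- exactly the case utf8mb3 cannot store.
```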

Always Use utf8mb4

The utf8mb4 character set supports the full Unicode range (up to 4 bytes per character). It should be the default for all new projects.

-- Create a database with utf8mb4
CREATE DATABASE myapp
  CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_ci;

-- Create a table with utf8mb4
CREATE TABLE users (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
    bio TEXT CHARACTER SET utf8mb4
);

utf8mb4 Collation Options

MySQL 8.0+ offers several utf8mb4 collations. Choose based on your needs:

Collation            Algorithm   Case         Accent       Speed    Use Case
utf8mb4_general_ci   Simplified  Insensitive  Insensitive  Fastest  Legacy, simple comparisons
utf8mb4_unicode_ci   UCA 4.0     Insensitive  Insensitive  Medium   General-purpose, better accuracy
utf8mb4_0900_ai_ci   UCA 9.0     Insensitive  Insensitive  Fast     Recommended default (MySQL 8.0+)
utf8mb4_0900_as_cs   UCA 9.0     Sensitive    Sensitive    Fast     Exact matching needed
utf8mb4_bin          Binary      Sensitive    Sensitive    Fastest  Code points, hashes, identifiers

The 0900 collations are based on Unicode Collation Algorithm (UCA) version 9.0 and are the best choice for MySQL 8.0+. For MariaDB, utf8mb4_unicode_ci remains the standard.

Connection Configuration

Setting the table character set is not enough. You must also configure the connection:

-- Set connection encoding (do this on every connection)
SET NAMES utf8mb4;

-- Or equivalently:
SET character_set_client = utf8mb4;
SET character_set_connection = utf8mb4;
SET character_set_results = utf8mb4;

In application code, set the connection charset in your database configuration:

# Django settings.py
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "OPTIONS": {
            "charset": "utf8mb4",
        },
    },
}

# SQLAlchemy
from sqlalchemy import create_engine

engine = create_engine(
    "mysql+pymysql://user:pass@host/db?charset=utf8mb4"
)

// Node.js (mysql2)
const connection = mysql.createConnection({
    host: "localhost",
    user: "root",
    database: "myapp",
    charset: "utf8mb4",
});

Migrating from utf8 to utf8mb4

If you have an existing database using MySQL's utf8 (3-byte):

-- Convert database default
ALTER DATABASE myapp CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

-- Convert each table
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

-- Verify
SELECT TABLE_NAME, TABLE_COLLATION
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'myapp';

Warning: Converting to utf8mb4 increases the maximum byte length of indexed VARCHAR columns. A VARCHAR(255) occupies up to 765 bytes under utf8 (within the old 767-byte InnoDB index prefix limit) but up to 1020 bytes under utf8mb4, which may exceed the index length limit on older MySQL versions. MySQL 5.7.7+ with innodb_large_prefix=ON (the default in 8.0) raises the limit to 3072 bytes and handles this.
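The arithmetic behind that warning, as a small Python sketch:

```python
# Worst-case byte sizes for an indexed VARCHAR(255) under each charset.
OLD_INDEX_PREFIX_LIMIT = 767  # bytes, InnoDB without innodb_large_prefix

utf8mb3_max = 255 * 3  # 765  -- fits under the old limit
utf8mb4_max = 255 * 4  # 1020 -- exceeds it
print(utf8mb3_max <= OLD_INDEX_PREFIX_LIMIT)  # True
print(utf8mb4_max <= OLD_INDEX_PREFIX_LIMIT)  # False
```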

MySQL Emoji Test

The quickest way to verify your Unicode setup is an emoji round-trip:

INSERT INTO users (name) VALUES ('Test 😀');
SELECT name, HEX(name) FROM users WHERE name LIKE '%😀%';
-- Should return the row with the emoji intact
-- HEX should show: 5465737420F09F9880 ('Test ' followed by the emoji's UTF-8 bytes)

If the emoji is silently dropped or replaced with ?, your character set is wrong.

PostgreSQL

PostgreSQL has had solid Unicode support from the start. The default encoding for new databases is UTF8, and it handles the full Unicode range without any of MySQL's historical gotchas.

Database Encoding

-- Create a UTF-8 database (this is typically the default)
CREATE DATABASE myapp
  ENCODING 'UTF8'
  LC_COLLATE 'en_US.UTF-8'
  LC_CTYPE 'en_US.UTF-8'
  TEMPLATE template0;

-- Check current encoding
SELECT pg_encoding_to_char(encoding) FROM pg_database WHERE datname = 'myapp';
-- Should return: UTF8

Collation

PostgreSQL uses the operating system's locale for collation by default. For predictable Unicode sorting, use ICU collations (available in PostgreSQL 10+):

-- Create a table with an ICU collation
CREATE TABLE articles (
    id SERIAL PRIMARY KEY,
    title TEXT COLLATE "und-x-icu"  -- Unicode default collation via ICU
);

-- Or specify a language-specific collation
CREATE TABLE german_words (
    id SERIAL PRIMARY KEY,
    word TEXT COLLATE "de-x-icu"  -- German sorting rules (ä sorts after a)
);

PostgreSQL 15+ supports ICU as the default collation provider for the entire database:

CREATE DATABASE myapp
  ENCODING 'UTF8'
  LOCALE_PROVIDER icu
  ICU_LOCALE 'und'
  TEMPLATE template0;

Connection Encoding

PostgreSQL connections default to the database encoding. You can verify and set it:

-- Check connection encoding
SHOW client_encoding;  -- Should be 'UTF8'

-- Set explicitly (rarely needed)
SET client_encoding TO 'UTF8';

PostgreSQL Unicode Operations

-- String length in characters (not bytes)
SELECT length('été');           -- 3 (characters)
SELECT octet_length('été');     -- 5 (bytes, UTF-8)

-- Unicode normalization (PostgreSQL 13+)
SELECT normalize(U&'e\0301', NFC);  -- Returns precomposed 'é' (U+00E9)
SELECT normalize('é', NFD);         -- Returns decomposed 'e' + U+0301

-- Check if normalized
SELECT 'é' IS NFC NORMALIZED;       -- true
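The same operations are available client-side: Python's unicodedata module mirrors what normalize() and the IS NORMALIZED predicate do on the server, which is useful when you want to normalize before data reaches the database.

```python
import unicodedata

decomposed = "e\u0301"  # 'e' + combining acute accent (2 code points)
precomposed = "\u00e9"  # 'é' as a single code point

# NFC composes, NFD decomposes -- mirroring PostgreSQL's normalize()
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True

# Counterpart of the IS NFC NORMALIZED check (Python 3.8+)
print(unicodedata.is_normalized("NFC", decomposed))   # False
print(unicodedata.is_normalized("NFC", precomposed))  # True
```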

Text Search and Unicode

PostgreSQL's full-text search handles Unicode well, but configure the correct dictionary for your language:

-- Create a text search configuration for multilingual content
CREATE TEXT SEARCH CONFIGURATION multilingual (COPY = simple);

-- Search with Unicode text
SELECT * FROM articles
WHERE to_tsvector('simple', title) @@ plainto_tsquery('simple', 'übung');

SQLite

SQLite stores text as UTF-8 by default and supports the full Unicode range. However, SQLite's built-in string functions are ASCII-only unless you compile with ICU support.

Default Behavior

-- SQLite always uses UTF-8 for TEXT columns
CREATE TABLE notes (
    id INTEGER PRIMARY KEY,
    content TEXT
);

-- Insert Unicode text (works fine)
INSERT INTO notes (content) VALUES ('Hello 🌍 你好');

-- length() counts Unicode code points for text (bytes only for BLOBs)
SELECT length('été');  -- Returns 3 (code points; the string is 5 bytes in UTF-8)

-- UPPER/LOWER only fold ASCII letters by default
SELECT upper('über');  -- Returns 'üBER', not 'ÜBER' (the non-ASCII ü is untouched)
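Both behaviors are easy to verify from Python's built-in sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
s = "\u00e9t\u00e9"  # 'été' with precomposed é (3 code points, 5 UTF-8 bytes)

# length() counts code points for text; casting to BLOB exposes the byte count
chars = conn.execute("SELECT length(?)", (s,)).fetchone()[0]
nbytes = conn.execute("SELECT length(CAST(? AS BLOB))", (s,)).fetchone()[0]
print(chars, nbytes)  # 3 5

# upper() folds only ASCII letters, leaving ü untouched
folded = conn.execute("SELECT upper(?)", ("über",)).fetchone()[0]
print(folded)  # üBER
```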

Enabling ICU Support

For proper Unicode case folding, collation, and string functions, compile SQLite with ICU:

# Python: Use the icu extension if available
import sqlite3

conn = sqlite3.connect(":memory:")

# Check if ICU is available
try:
    conn.execute("SELECT icu_load_collation('en_US', 'english')")
    print("ICU available")
except sqlite3.OperationalError:
    print("ICU not available -- Unicode functions are limited")

Collation in SQLite

Without ICU, SQLite offers three built-in collations:

Collation  Behavior
BINARY     Byte-by-byte comparison (default)
NOCASE     ASCII case-insensitive only
RTRIM      Like BINARY, but ignores trailing spaces

For Unicode-aware collation, use a custom collation function in your application:

import sqlite3
import unicodedata

def unicode_nocase(a: str, b: str) -> int:
    a_folded = unicodedata.normalize("NFC", a.casefold())
    b_folded = unicodedata.normalize("NFC", b.casefold())
    return (a_folded > b_folded) - (a_folded < b_folded)

conn = sqlite3.connect(":memory:")
conn.create_collation("UNICODE_NOCASE", unicode_nocase)

conn.execute("CREATE TABLE words (word TEXT COLLATE UNICODE_NOCASE)")
conn.execute("INSERT INTO words VALUES ('u\u0308bung')")  # 'übung' (decomposed ü)
conn.execute("INSERT INTO words VALUES ('U\u0308BUNG')")  # 'ÜBUNG' (decomposed Ü)

# Both match a case-insensitive search
rows = conn.execute("SELECT * FROM words WHERE word = 'U\u0308bung'").fetchall()
print(len(rows))  # 2

SQL Server

SQL Server uses nchar, nvarchar, and ntext types for Unicode data. These store text as UTF-16LE internally.

Use nvarchar, Not varchar

-- varchar: stores in the code page of the column's collation (may lose Unicode)
-- nvarchar: stores as UTF-16, supports full Unicode
CREATE TABLE users (
    id INT IDENTITY PRIMARY KEY,
    name NVARCHAR(255),          -- Unicode
    legacy_name VARCHAR(255)     -- Non-Unicode, avoid for international data
);

-- Always prefix string literals with N for Unicode
INSERT INTO users (name) VALUES (N'été 😀');

Collation

SQL Server 2019+ supports UTF-8 collations:

-- UTF-8 collation (SQL Server 2019+)
CREATE DATABASE myapp COLLATE Latin1_General_100_CI_AS_SC_UTF8;

-- With UTF-8 collation, varchar stores UTF-8 (not code page)
-- This saves space for mostly-ASCII data compared to nvarchar (UTF-16)
CREATE TABLE articles (
    id INT IDENTITY PRIMARY KEY,
    title VARCHAR(255) COLLATE Latin1_General_100_CI_AS_SC_UTF8
);
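The space savings are easy to quantify. This Python comparison (independent of SQL Server itself) shows the byte cost of ASCII-heavy text under UTF-8 versus UTF-16:

```python
# Byte cost of mostly-ASCII text under the two encodings
title = "Breaking News: Markets Rally"       # 28 ASCII characters
utf8_size = len(title.encode("utf-8"))       # 1 byte per character -> 28
utf16_size = len(title.encode("utf-16-le"))  # 2 bytes per character -> 56
print(utf8_size, utf16_size)  # 28 56
```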

Supplementary Character Support

For emoji and characters above U+FFFF, use _SC (Supplementary Characters) collations:

-- Without _SC: LEN counts UTF-16 code units, so a surrogate pair counts as 2
-- With _SC:    a supplementary character counts as a single character
SELECT LEN(N'😀');
-- Without _SC: returns 2 (surrogate pair)
-- With _SC:    returns 1 (single character)
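The factor of two comes from UTF-16 surrogate pairs, and Python shows the same accounting:

```python
# A supplementary character (above U+FFFF) needs two UTF-16 code units.
emoji = "\U0001F600"  # 😀
utf16_units = len(emoji.encode("utf-16-le")) // 2
print(utf16_units)  # 2 -- what LEN() reports without an _SC collation
print(len(emoji))   # 1 -- one code point, what an _SC collation reports
```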

Cross-Database Best Practices

Regardless of which database you use, follow these principles:

1. Set UTF-8 at Every Layer

Layer        MySQL              PostgreSQL  SQLite     SQL Server
Database     utf8mb4            UTF8        (default)  UTF8 collation
Table        utf8mb4            (inherits)  (default)  nvarchar
Connection   SET NAMES utf8mb4  (default)   (default)  N prefix
Application  charset=utf8mb4    (default)   (default)  Driver setting

2. Test with Edge Cases

Always verify your setup with these test strings:

ASCII:       Hello World
Accented:    été niño übung café
CJK:         中文测试 日本語テスト
Emoji:       😀🚀🌍
Mixed:       User: 张伟 (😀) from Berlin (über!)
4-byte:      😀 𝄞 (musical symbol) 𠀀 (CJK ext B)
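These strings are easy to turn into an automated round-trip test. The sketch below uses Python's stdlib sqlite3 as a stand-in for whatever database you are targeting; swap the connection for your actual driver:

```python
import sqlite3

SAMPLES = [
    "Hello World",
    "été niño übung café",
    "中文测试 日本語テスト",
    "😀🚀🌍",
    "User: 张伟 (😀) from Berlin (über!)",
]

# Insert every sample, read it back, and confirm nothing was altered.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (s TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [(s,) for s in SAMPLES])
stored = [row[0] for row in conn.execute("SELECT s FROM t")]
print(stored == SAMPLES)  # True -- every edge case survived intact
```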

3. Handle Normalization

Databases generally store text as-is without normalizing. Two strings that look identical may differ at the byte level if one uses NFC and the other NFD. Consider normalizing on input:

import unicodedata

def normalize_for_storage(text: str) -> str:
    return unicodedata.normalize("NFC", text)

4. Be Cautious with String Length

A VARCHAR(255) column may mean 255 bytes or 255 characters depending on the database and encoding:

Database                    VARCHAR(255) Means
MySQL (utf8mb4)             255 characters (up to 1020 bytes)
PostgreSQL                  255 characters
SQLite                      No length limit (255 is advisory)
SQL Server (nvarchar)       255 characters (510 bytes, UTF-16)
SQL Server (varchar UTF-8)  255 bytes
