Unicode in Databases
Storing Unicode text in a database requires choosing the right character set, collation, and column type; a wrong choice can silently truncate emoji or produce incorrect sort orders. This guide covers Unicode database configuration for MySQL, PostgreSQL, SQLite, and SQL Server, including the utf8mb4 gotcha in MySQL and how to migrate existing data.
Databases are where text lives long-term, and getting Unicode support wrong at the database layer creates problems that are extraordinarily painful to fix after the fact. A column that silently truncates emoji, a collation that treats "ä" and "a" as identical, or a connection that converts UTF-8 to Latin-1 behind your back -- these issues corrupt data permanently.
This guide covers the Unicode configuration you need to get right in MySQL, PostgreSQL, SQLite, and SQL Server, with emphasis on the specific settings that trip people up.
Core Concepts
Before diving into specific databases, understand three related but distinct concepts:
Character Set (Encoding)
The character set determines which characters can be stored and how they are encoded as bytes. For Unicode, you want UTF-8 (or its database-specific variant).
Collation
The collation determines how characters are compared and sorted. Two strings can contain the same characters but sort differently depending on the collation. Collation also controls case sensitivity, accent sensitivity, and Unicode normalization behavior.
Connection Encoding
The connection encoding tells the database how the client is sending and expecting data. Even if your table uses UTF-8, a Latin-1 connection will garble any non-ASCII data in transit.
All three must be consistent for Unicode to work correctly.
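The damage a mismatched layer causes can be reproduced in a few lines of Python. This sketch decodes UTF-8 bytes as Latin-1, which is exactly the corruption a mismatched connection encoding performs on data in transit:

```python
# Sketch: what a Latin-1 connection does to UTF-8 data in transit
text = "café"
utf8_bytes = text.encode("utf-8")       # b'caf\xc3\xa9'
garbled = utf8_bytes.decode("latin-1")  # wrong layer reinterprets the bytes
print(garbled)  # cafÃ© -- classic mojibake
```

Once the garbled form is written back to storage, the original bytes are unrecoverable without guessing the encoding chain.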
MySQL / MariaDB
MySQL's Unicode support has a notorious history. The original utf8 character set (now
aliased as utf8mb3) only supports characters in the Basic Multilingual Plane (U+0000 to
U+FFFF) -- it uses a maximum of 3 bytes per character. This means emoji, historic
scripts, and rare CJK characters are silently truncated or rejected.
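The 3-byte boundary is visible directly from UTF-8 encoded lengths; only characters outside the BMP need the fourth byte that utf8mb4 adds:

```python
# UTF-8 byte length per character: legacy utf8 (utf8mb3) stops at 3 bytes
for ch in ["A", "é", "中", "😀"]:
    print(f"U+{ord(ch):04X} {ch!r}: {len(ch.encode('utf-8'))} bytes")
# Only 😀 (U+1F600) needs 4 bytes, so it cannot be stored in utf8mb3
```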
Always Use utf8mb4
The utf8mb4 character set supports the full Unicode range (up to 4 bytes per character).
It should be the default for all new projects.
-- Create a database with utf8mb4
CREATE DATABASE myapp
    CHARACTER SET utf8mb4
    COLLATE utf8mb4_unicode_ci;
-- Create a table with utf8mb4
CREATE TABLE users (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
    bio TEXT CHARACTER SET utf8mb4
);
utf8mb4 Collation Options
MySQL 8.0+ offers several utf8mb4 collations. Choose based on your needs:
| Collation | Algorithm | Case | Accent | Speed | Use Case |
|---|---|---|---|---|---|
| utf8mb4_general_ci | Simplified | Insensitive | Insensitive | Fastest | Legacy, simple comparisons |
| utf8mb4_unicode_ci | UCA 4.0 | Insensitive | Insensitive | Medium | General-purpose, better accuracy |
| utf8mb4_0900_ai_ci | UCA 9.0 | Insensitive | Insensitive | Fast | Recommended default (MySQL 8.0+) |
| utf8mb4_0900_as_cs | UCA 9.0 | Sensitive | Sensitive | Fast | Exact matching needed |
| utf8mb4_bin | Binary | Sensitive | Sensitive | Fastest | Code points, hashes, identifiers |
The 0900 collations are based on Unicode Collation Algorithm (UCA) version 9.0 and are
the best choice for MySQL 8.0+. For MariaDB, utf8mb4_unicode_ci remains the standard.
Connection Configuration
Setting the table character set is not enough. You must also configure the connection:
-- Set connection encoding (do this on every connection)
SET NAMES utf8mb4;
-- Or equivalently:
SET character_set_client = utf8mb4;
SET character_set_connection = utf8mb4;
SET character_set_results = utf8mb4;
In application code, set the connection charset in your database configuration:
# Django settings.py
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "OPTIONS": {
            "charset": "utf8mb4",
        },
    },
}

# SQLAlchemy
engine = create_engine(
    "mysql+pymysql://user:pass@host/db?charset=utf8mb4"
)
// Node.js (mysql2)
const connection = mysql.createConnection({
  host: "localhost",
  user: "root",
  database: "myapp",
  charset: "utf8mb4",
});
Migrating from utf8 to utf8mb4
If you have an existing database using MySQL's utf8 (3-byte):
-- Convert database default
ALTER DATABASE myapp CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
-- Convert each table
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
-- Verify
SELECT TABLE_NAME, TABLE_COLLATION
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'myapp';
Warning: Converting to utf8mb4 increases the maximum byte length of indexed VARCHAR
columns. If you have a VARCHAR(255) with a unique index under utf8 (max 767 bytes),
it becomes 1020 bytes under utf8mb4, which may exceed the index length limit on older
MySQL versions. MySQL 5.7.7+ with innodb_large_prefix=ON (default in 8.0) handles this.
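The arithmetic behind that warning is worth spelling out; it also explains the historical VARCHAR(191) workaround found in many older schemas:

```python
# Index-prefix arithmetic for the utf8 -> utf8mb4 migration
LIMIT = 767  # InnoDB index limit on older MySQL (COMPACT row format)
print(255 * 3)     # 765: VARCHAR(255) fits under 3-byte utf8
print(255 * 4)     # 1020: the same column overflows under utf8mb4
print(LIMIT // 4)  # 191: largest fully indexable utf8mb4 VARCHAR
```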
MySQL Emoji Test
The quickest way to verify your Unicode setup is an emoji round-trip:
INSERT INTO users (name) VALUES ('Test 😀');
SELECT name, HEX(name) FROM users WHERE name LIKE '%😀%';
-- Should return the row with the emoji intact
-- HEX should show: 5465737420F09F9880
If the emoji is silently dropped or replaced with ?, your character set is wrong.
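The expected HEX() value can be computed client-side, which is handy when comparing against what the server actually stored:

```python
# UTF-8 bytes the server should store for 'Test 😀'
expected = "Test 😀".encode("utf-8").hex().upper()
print(expected)  # 5465737420F09F9880
```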
PostgreSQL
PostgreSQL has had solid Unicode support from the start. New databases typically default
to UTF8 encoding (inherited from the template database), and PostgreSQL handles the full
Unicode range without any of MySQL's historical gotchas.
Database Encoding
-- Create a UTF-8 database (this is typically the default)
CREATE DATABASE myapp
ENCODING 'UTF8'
LC_COLLATE 'en_US.UTF-8'
LC_CTYPE 'en_US.UTF-8'
TEMPLATE template0;
-- Check current encoding
SELECT pg_encoding_to_char(encoding) FROM pg_database WHERE datname = 'myapp';
-- Should return: UTF8
Collation
PostgreSQL uses the operating system's locale for collation by default. For predictable Unicode sorting, use ICU collations (available in PostgreSQL 10+):
-- Create a table with an ICU collation
CREATE TABLE articles (
    id SERIAL PRIMARY KEY,
    title TEXT COLLATE "und-x-icu"  -- Unicode default collation via ICU
);
-- Or specify a language-specific collation
CREATE TABLE german_words (
    id SERIAL PRIMARY KEY,
    word TEXT COLLATE "de-x-icu"  -- German sorting rules (ä sorts after a)
);
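The difference matters in practice: plain code-point ordering (what a BINARY-style collation gives you) sorts "Äpfel" after "Zebra", which no German reader expects. Python's default string sort is code-point order, so it demonstrates the problem directly:

```python
# Code-point ordering vs what a German speaker expects
words = ["Zebra", "Äpfel", "Apfel"]
print(sorted(words))  # ['Apfel', 'Zebra', 'Äpfel']: Ä (U+00C4) > Z (U+005A)
```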
PostgreSQL 15+ supports ICU as the default collation provider for the entire database:
CREATE DATABASE myapp
ENCODING 'UTF8'
LOCALE_PROVIDER icu
ICU_LOCALE 'und'
TEMPLATE template0;
Connection Encoding
PostgreSQL connections default to the database encoding. You can verify and set it:
-- Check connection encoding
SHOW client_encoding; -- Should be 'UTF8'
-- Set explicitly (rarely needed)
SET client_encoding TO 'UTF8';
PostgreSQL Unicode Operations
-- String length in characters (not bytes)
SELECT length('été');        -- 3 (characters)
SELECT octet_length('été');  -- 5 (bytes, UTF-8)
-- Unicode normalization (PostgreSQL 13+)
SELECT normalize(U&'e\0301', NFC);  -- returns precomposed 'é'
SELECT normalize('é', NFD);         -- returns decomposed e + U+0301
-- Check normalization with the IS NORMALIZED predicate (PostgreSQL 13+)
SELECT U&'e\0301' IS NFC NORMALIZED;  -- false (decomposed)
SELECT 'é' IS NFC NORMALIZED;         -- true (precomposed)
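The same checks can be run client-side with Python's unicodedata module, which is useful when deciding whether to normalize before INSERT:

```python
import unicodedata

decomposed = "e\u0301"  # e + combining acute accent
print(unicodedata.is_normalized("NFC", decomposed))          # False
print(unicodedata.normalize("NFC", decomposed) == "\u00e9")  # True: precomposed é
```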
Text Search and Unicode
PostgreSQL's full-text search handles Unicode well, but configure the correct dictionary for your language:
-- Create a text search configuration for multilingual content
CREATE TEXT SEARCH CONFIGURATION multilingual (COPY = simple);
-- Search with Unicode text
SELECT * FROM articles
WHERE to_tsvector('simple', title) @@ plainto_tsquery('simple', 'übung');
SQLite
SQLite stores text as UTF-8 by default and supports the full Unicode range. However, SQLite's built-in case folding and case-insensitive matching (upper(), lower(), LIKE, the NOCASE collation) handle only ASCII unless you compile with ICU support.
Default Behavior
-- SQLite stores TEXT as UTF-8 by default
CREATE TABLE notes (
    id INTEGER PRIMARY KEY,
    content TEXT
);
-- Insert Unicode text (works fine)
INSERT INTO notes (content) VALUES ('Hello 🌍 你好');
-- length() counts code points, not grapheme clusters
SELECT length('été');  -- 3 for the precomposed form, 5 if the accents are decomposed
-- UPPER/LOWER only fold ASCII by default
SELECT upper('été');   -- 'éTé': only the ASCII 't' is uppercased, not the é
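Both quirks (code-point counting and ASCII-only case folding) are easy to confirm from Python's built-in sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# length() counts code points: the decomposed form of 'été' has 5, not 3
print(conn.execute("SELECT length(?)", ("e\u0301te\u0301",)).fetchone()[0])  # 5
# upper() folds only ASCII: the precomposed é is left alone
print(conn.execute("SELECT upper(?)", ("été",)).fetchone()[0])  # éTé
```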
Enabling ICU Support
For proper Unicode case folding, collation, and string functions, compile SQLite with ICU:
# Python: check whether SQLite was built with the ICU extension
import sqlite3

conn = sqlite3.connect(":memory:")
# icu_load_collation() is registered by SQLite's ICU extension
try:
    conn.execute("SELECT icu_load_collation('en_US', 'english')")
    print("ICU available")
except sqlite3.OperationalError:
    print("ICU not available -- Unicode functions are limited")
Collation in SQLite
Without ICU, SQLite offers three built-in collations:
| Collation | Behavior |
|---|---|
| BINARY | Byte-by-byte comparison (default) |
| NOCASE | ASCII case-insensitive only |
| RTRIM | Like BINARY, but ignores trailing spaces |
For Unicode-aware collation, use a custom collation function in your application:
import sqlite3
import unicodedata

def unicode_nocase(a: str, b: str) -> int:
    a_folded = unicodedata.normalize("NFC", a.casefold())
    b_folded = unicodedata.normalize("NFC", b.casefold())
    return (a_folded > b_folded) - (a_folded < b_folded)

conn = sqlite3.connect(":memory:")
conn.create_collation("UNICODE_NOCASE", unicode_nocase)
conn.execute("CREATE TABLE words (word TEXT COLLATE UNICODE_NOCASE)")
conn.execute("INSERT INTO words VALUES ('übung')")
conn.execute("INSERT INTO words VALUES ('ÜBUNG')")

# Both match a case-insensitive search
rows = conn.execute("SELECT * FROM words WHERE word = 'Übung'").fetchall()
print(len(rows))  # 2
SQL Server
SQL Server uses the nchar and nvarchar types (plus the deprecated ntext) for Unicode data. These store text as UTF-16LE internally.
Use nvarchar, Not varchar
-- varchar: stores in the code page of the column's collation (may lose Unicode)
-- nvarchar: stores as UTF-16, supports full Unicode
CREATE TABLE users (
    id INT IDENTITY PRIMARY KEY,
    name NVARCHAR(255),       -- Unicode
    legacy_name VARCHAR(255)  -- Non-Unicode, avoid for international data
);
-- Always prefix string literals with N for Unicode
INSERT INTO users (name) VALUES (N'été 😀');
Collation
SQL Server 2019+ supports UTF-8 collations:
-- UTF-8 collation (SQL Server 2019+)
CREATE DATABASE myapp COLLATE Latin1_General_100_CI_AS_SC_UTF8;
-- With UTF-8 collation, varchar stores UTF-8 (not code page)
-- This saves space for mostly-ASCII data compared to nvarchar (UTF-16)
CREATE TABLE articles (
    id INT IDENTITY PRIMARY KEY,
    title VARCHAR(255) COLLATE Latin1_General_100_CI_AS_SC_UTF8
);
Supplementary Character Support
For emoji and characters above U+FFFF, use _SC (Supplementary Characters) collations:
-- Without _SC: supplementary characters are treated as two separate units
-- With _SC: supplementary characters are treated as a single character
SELECT LEN(N'😀');
-- Without _SC: returns 2 (surrogate pair)
-- With _SC: returns 1 (single character)
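The 2-vs-1 difference comes straight from UTF-16: anything above U+FFFF is stored as a surrogate pair, which Python can show directly:

```python
s = "😀"  # U+1F600
utf16 = s.encode("utf-16-le")  # the encoding SQL Server uses for nvarchar
print(len(utf16) // 2)  # 2 UTF-16 code units (a surrogate pair)
print(len(s))           # 1 code point
```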
Cross-Database Best Practices
Regardless of which database you use, follow these principles:
1. Set UTF-8 at Every Layer
| Layer | MySQL | PostgreSQL | SQLite | SQL Server |
|---|---|---|---|---|
| Database | utf8mb4 | UTF8 | (default) | UTF-8 collation |
| Table | utf8mb4 | (inherits) | (default) | nvarchar |
| Connection | SET NAMES utf8mb4 | (default) | (default) | N prefix |
| Application | charset=utf8mb4 | (default) | (default) | Driver setting |
2. Test with Edge Cases
Always verify your setup with these test strings:
ASCII: Hello World
Accented: été niño übung café
CJK: 中文测试 日本語テスト
Emoji: 😀🚀🌍
Mixed: User: 张伟 (😀) from Berlin (über!)
4-byte: 😀 𝄞 (musical symbol) 𠀀 (CJK ext B)
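A quick round-trip harness catches most setup problems. Sketched here against SQLite so it runs anywhere; swap in your real connection and parameter style to test your own database:

```python
import sqlite3

SAMPLES = ["Hello World", "été niño übung café", "中文测试 日本語テスト", "😀🚀🌍", "𝄞 𠀀"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (s TEXT)")
for s in SAMPLES:
    conn.execute("INSERT INTO t VALUES (?)", (s,))
    (back,) = conn.execute("SELECT s FROM t WHERE s = ?", (s,)).fetchone()
    assert back == s, f"round-trip corrupted: {s!r} -> {back!r}"
print("all round-trips OK")
```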
3. Handle Normalization
Databases generally store text as-is without normalizing. Two strings that look identical may differ at the byte level if one uses NFC and the other NFD. Consider normalizing on input:
import unicodedata

def normalize_for_storage(text: str) -> str:
    return unicodedata.normalize("NFC", text)
4. Be Cautious with String Length
A VARCHAR(255) column may mean 255 bytes or 255 characters depending on the database and encoding:
| Database | VARCHAR(255) Means |
|---|---|
| MySQL (utf8mb4) | 255 characters (up to 1020 bytes) |
| PostgreSQL | 255 characters |
| SQLite | No length limit (255 is advisory) |
| SQL Server (nvarchar) | 255 UTF-16 code units (510 bytes); supplementary characters count as 2 |
| SQL Server (varchar UTF-8) | 255 bytes |
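When sizing columns, it helps to see how the same string measures under each unit:

```python
s = "café"  # 4 user-perceived characters
print(len(s))                      # 4 code points
print(len(s.encode("utf-8")))      # 5 bytes in UTF-8 (MySQL utf8mb4, PostgreSQL, SQLite)
print(len(s.encode("utf-16-le")))  # 8 bytes in UTF-16 (SQL Server nvarchar)
```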