Unicode in SQL
SQL databases store text using character sets and collations that determine how characters are saved, compared, and sorted, with UTF-8 and UTF-16 being the most common encodings. This guide covers Unicode in MySQL, PostgreSQL, SQLite, and SQL Server, explaining how to choose the right charset, collation, and column type for international data.
Database Unicode support is one of the most common sources of encoding bugs in production systems. A query that works perfectly in development can silently truncate emoji, mangle accented characters, or produce wrong sort orders when the database, table, or column uses the wrong character set or collation. This guide covers Unicode handling across the four most popular relational databases: MySQL, PostgreSQL, SQLite, and SQL Server.
The Core Problem
When you store the string "café🐍" in a database column, three things must agree:
- Character set — what set of characters can be stored (e.g., UTF-8)
- Encoding — how those characters are stored as bytes
- Collation — how strings are compared and sorted
If any one of these is wrong, you get data loss, mojibake, or incorrect query results.
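As a quick illustration (a MySQL sketch, assuming a utf8mb4 connection), re-reading UTF-8 bytes through the latin1 character set produces the classic mojibake signature:
-- The UTF-8 bytes for 'é' (0xC3 0xA9), reinterpreted as latin1, become two characters
SELECT CONVERT(CAST(CONVERT('é' USING utf8mb4) AS BINARY) USING latin1);
-- 'Ã©'  <-- the familiar double-encoding artifact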
MySQL / MariaDB
MySQL has the most confusing Unicode setup of any major database, primarily because of a historical naming mistake.
utf8 vs utf8mb4
MySQL's utf8 character set is not real UTF-8. It supports only up to 3
bytes per character, which means it cannot store characters above U+FFFF —
including nearly all emoji, many CJK extension characters, and musical symbols.
| Character Set | Max Bytes | Unicode Range | Emoji Support |
|---|---|---|---|
| utf8 (utf8mb3) | 3 | BMP only (U+0000–U+FFFF) | No |
| utf8mb4 | 4 | Full Unicode (U+0000–U+10FFFF) | Yes |
Always use utf8mb4. The utf8 alias was a deliberate compromise made in
MySQL 4.1 (2004) to save one byte per character in index calculations. It has
caused untold data loss since then.
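A quick sketch of the failure mode, assuming strict SQL mode (without it, MySQL silently truncates the value at the first 4-byte character instead of raising an error):
-- Hypothetical table using the legacy 3-byte charset
CREATE TABLE t_mb3 (s VARCHAR(10) CHARACTER SET utf8mb3);
INSERT INTO t_mb3 VALUES ('🐍');
-- ERROR 1366 (HY000): Incorrect string value: '\xF0\x9F\x90\x8D' for column 's' at row 1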
-- Create a database with full Unicode support
CREATE DATABASE myapp
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
-- Create a table
CREATE TABLE users (
id INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
bio TEXT CHARACTER SET utf8mb4
);
-- Insert emoji safely
INSERT INTO users (name, bio) VALUES ('Alice', 'Loves snakes 🐍');
MySQL Collations
A collation determines how strings are compared and sorted:
| Collation | Behavior | Use Case |
|---|---|---|
| utf8mb4_general_ci | Fast, approximate Unicode sorting | Legacy apps |
| utf8mb4_unicode_ci | UCA-based, more correct sorting | Most applications |
| utf8mb4_0900_ai_ci | Unicode 9.0, accent-insensitive (MySQL 8.0+) | Modern apps |
| utf8mb4_0900_as_cs | Accent-sensitive, case-sensitive | Exact matching |
| utf8mb4_bin | Binary comparison (byte-by-byte) | Hash lookups, exact match |
The ci suffix means case-insensitive, cs means case-sensitive, ai means
accent-insensitive, as means accent-sensitive.
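A small comparison sketch (assuming a utf8mb4 connection on MySQL 8.0) shows what the ai/ci suffixes actually do:
SELECT 'résumé' = 'RESUME' COLLATE utf8mb4_0900_ai_ci; -- 1: case and accents ignored
SELECT 'résumé' = 'RESUME' COLLATE utf8mb4_0900_as_cs; -- 0: accents and case both matter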
-- MySQL 8.0 default collation
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
-- Check current settings
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';
Connection Charset
Setting the database to utf8mb4 is not enough — the connection must also
use utf8mb4:
SET NAMES utf8mb4;
-- or equivalently:
SET character_set_client = utf8mb4;
SET character_set_connection = utf8mb4;
SET character_set_results = utf8mb4;
In application code:
# Python (MySQLdb / PyMySQL)
connection = pymysql.connect(host='localhost', charset='utf8mb4')
# PHP (PDO)
$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8mb4', $user, $pass);
// Node.js (mysql2)
const conn = await mysql.createConnection({ charset: 'utf8mb4' });
Migrating from utf8 to utf8mb4
-- Convert database
ALTER DATABASE myapp CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
-- Convert each table
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
-- Check for index length issues (VARCHAR(255) in utf8mb4 = 1020 bytes)
-- InnoDB max key length is 3072 bytes (MySQL 5.7+), 767 bytes (MySQL 5.6)
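If a conversion hits those limits, one common workaround, sketched here with a hypothetical email column, is to shorten the indexed length so that 191 × 4 = 764 bytes stays under the old 767-byte cap:
-- Hypothetical column; adjust names and lengths to your schema
ALTER TABLE users MODIFY email VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
-- or keep the column width and index only a prefix
CREATE INDEX idx_users_email ON users (email(191));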
PostgreSQL
PostgreSQL has the simplest and most correct Unicode setup. Its TEXT type
stores variable-length strings with no artificial length limit, and encoding is
set at database creation.
Database Encoding
-- Create database with UTF-8 (the default on most modern systems)
CREATE DATABASE myapp
ENCODING 'UTF8'
LC_COLLATE 'en_US.UTF-8'
LC_CTYPE 'en_US.UTF-8';
-- Check encoding
SHOW server_encoding; -- UTF8
SHOW client_encoding; -- UTF8
PostgreSQL validates that all text is valid UTF-8 on insertion. Invalid byte sequences are rejected with an error, preventing data corruption:
-- This will ERROR if the bytes are not valid UTF-8
INSERT INTO users (name) VALUES (E'\x80invalid');
-- ERROR: invalid byte sequence for encoding "UTF8": 0x80
Text Types
| Type | Description |
|---|---|
| TEXT | Unlimited variable-length string (preferred) |
| VARCHAR(n) | Variable-length with max n characters (not bytes) |
| CHAR(n) | Fixed-length, space-padded (rarely useful) |
PostgreSQL's VARCHAR(n) limit counts characters, not bytes. MySQL's VARCHAR(n) is also measured in characters, but the byte cost per character (and therefore the index size) depends on the character set.
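A minimal check of the character-based limit; PostgreSQL rejects over-length values rather than truncating them:
CREATE TABLE demo (v VARCHAR(5));
INSERT INTO demo VALUES ('café🐍');   -- OK: 5 characters (9 bytes)
INSERT INTO demo VALUES ('café🐍!');  -- ERROR: value too long for type character varying(5)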
ICU Collations
PostgreSQL 10 introduced ICU (International Components for Unicode) collation support, and PostgreSQL 15 added ICU as a database-level locale provider, bringing locale-aware sorting that is versioned and behaves consistently across operating systems:
-- Create a column with German collation
CREATE TABLE products (
name TEXT COLLATE "de-x-icu"
);
-- Or at the database level
CREATE DATABASE myapp
ENCODING 'UTF8'
LOCALE_PROVIDER icu
ICU_LOCALE 'und'
ICU_RULES '&V << w <<< W'; -- Custom sort rules (PostgreSQL 16+)
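A quick way to see an ICU collation in action, assuming the server was built with ICU so the predefined de-x-icu collation exists:
-- Under German dictionary rules, Ä sorts with A rather than after Z
SELECT 'Äpfel' < 'Banane' COLLATE "de-x-icu"; -- true
SELECT 'Äpfel' < 'Banane' COLLATE "C";        -- false: byte order puts Ä (0xC3 0x84) after B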
Useful Functions
-- Character length (code points)
SELECT length('café🐍'); -- 5
-- Byte length
SELECT octet_length('café🐍'); -- 9
-- Unicode code point
SELECT ascii('→'); -- 8594
-- Character from code point
SELECT chr(8594); -- →
-- Normalize (PostgreSQL 13+)
SELECT normalize('cafe' || E'\u0301', NFC); -- café (precomposed)
-- Regular expressions with Unicode
SELECT 'café' ~ '\mcaf'; -- true (word boundary match)
SQLite
SQLite's Unicode support is minimal by design but sufficient for most use cases.
Encoding
SQLite stores all text as UTF-8 internally (or UTF-16 if the database was created with PRAGMA encoding = 'UTF-16'). The TEXT affinity accepts any string and performs no encoding validation:
CREATE TABLE notes (
id INTEGER PRIMARY KEY,
content TEXT
);
INSERT INTO notes (content) VALUES ('café🐍日本語');
SELECT length(content) FROM notes; -- 8 (code points, not bytes)
length() counts Unicode code points in SQLite, which is correct for most
purposes.
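If you need the byte length instead, casting to BLOB makes length() count bytes (a small sketch against the notes table above):
SELECT length(CAST(content AS BLOB)) FROM notes; -- 18 bytes of UTF-8 for 'café🐍日本語'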
Collation Limitations
SQLite's built-in collations handle case-insensitive comparison only for ASCII. The
NOCASE collation folds ASCII A–Z to a–z but does not treat é = É or ü = Ü as equal:
-- ASCII-only case folding
SELECT 'café' = 'CAFÉ' COLLATE NOCASE; -- 0 (false!) — only ASCII folded
SELECT 'cafe' = 'CAFE' COLLATE NOCASE; -- 1 (true)
For proper Unicode collation, you must register a custom collation function via your application or use the ICU extension:
# Python: register a Unicode-aware collation with the sqlite3 module
import locale
import sqlite3
# Collation order follows the system locale (assumes a UTF-8 locale is configured)
locale.setlocale(locale.LC_COLLATE, '')
def unicode_collation(a, b):
    # strcoll returns negative, zero, or positive, which is what SQLite expects
    return locale.strcoll(a, b)
conn = sqlite3.connect(':memory:')
conn.create_collation('UNICODE', unicode_collation)
# Queries can now use: ORDER BY content COLLATE UNICODE
The ICU Extension
SQLite's optional ICU extension provides LIKE, UPPER(), LOWER(), and
collation support for all Unicode characters:
-- With ICU extension loaded
SELECT upper('café'); -- 'CAFÉ' (correct)
SELECT 'café' LIKE 'CAF_'; -- true (Unicode-aware LIKE)
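Loading the extension and registering a collation might look like the sketch below; the extension filename and the locale passed to icu_load_collation are assumptions that depend on your build:
-- .load ./libSqliteIcu                         (sqlite3 CLI; filename varies by build)
SELECT icu_load_collation('de_DE', 'german');   -- register an ICU collation under a name
CREATE TABLE words (w TEXT COLLATE german);     -- collations must be re-registered per connection
SELECT w FROM words ORDER BY w COLLATE german;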
SQL Server
SQL Server uses a two-tier system: traditional code-page-based types (CHAR,
VARCHAR) and Unicode types (NCHAR, NVARCHAR).
Unicode Types: The N Prefix
| Type | Encoding | Max Size | Unicode |
|---|---|---|---|
| CHAR(n) | Code page (e.g., Latin1) | 8,000 bytes | No |
| VARCHAR(n) | Code page | 8,000 bytes / MAX | No |
| NCHAR(n) | UCS-2 / UTF-16 | 4,000 chars | Yes |
| NVARCHAR(n) | UCS-2 / UTF-16 | 4,000 chars / MAX | Yes |
Always use NVARCHAR for text that may contain non-ASCII characters.
-- Store Unicode text
CREATE TABLE users (
id INT IDENTITY PRIMARY KEY,
name NVARCHAR(100), -- Unicode-safe
legacy_name VARCHAR(100) -- ASCII/code-page only
);
-- The N prefix for Unicode string literals
INSERT INTO users (name) VALUES (N'café🐍');
INSERT INTO users (name) VALUES (N'日本語');
-- Without N prefix, the literal is interpreted as the database's code page
INSERT INTO users (legacy_name) VALUES ('café'); -- may lose characters
Supplementary Characters (SQL Server 2012+)
Earlier versions of SQL Server used UCS-2 (a fixed 2-byte encoding that cannot
represent characters above U+FFFF). Modern versions use UTF-16 when the
collation name includes _SC (Supplementary Characters) or _UTF8:
-- Collation with supplementary character support
ALTER DATABASE mydb COLLATE Latin1_General_100_CI_AS_SC;
-- UTF-8 collation (SQL Server 2019+)
ALTER DATABASE mydb COLLATE Latin1_General_100_CI_AS_SC_UTF8;
With the _UTF8 collation suffix (SQL Server 2019+), even VARCHAR columns
store UTF-8 data, combining storage efficiency with full Unicode support.
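A short sketch of a UTF-8 VARCHAR column (SQL Server 2019+); the posts table is illustrative:
CREATE TABLE posts (
    body VARCHAR(MAX) COLLATE Latin1_General_100_CI_AS_SC_UTF8
);
INSERT INTO posts (body) VALUES (N'café🐍');
SELECT DATALENGTH(body) FROM posts; -- 9 bytes of UTF-8 (vs 12 bytes as NVARCHAR/UTF-16)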
Useful Functions
-- Character length
SELECT LEN(N'café🐍'); -- 5 with an _SC collation (6 otherwise: the surrogate pair counts as two code units)
-- Byte length
SELECT DATALENGTH(N'café🐍'); -- 12 (UTF-16: 4*2 + 2*2 for surrogate pair)
-- Unicode code point
SELECT UNICODE(N'→'); -- 8594
-- Character from code point
SELECT NCHAR(8594); -- →
Cross-Database Comparison
| Feature | MySQL | PostgreSQL | SQLite | SQL Server |
|---|---|---|---|---|
| Full UTF-8 | utf8mb4 | Default | Default | _UTF8 collation |
| Emoji support | utf8mb4 only | Yes | Yes | NVARCHAR + _SC |
| Length function | Characters | Characters | Code points | Characters |
| Encoding validation | Loose | Strict (rejects invalid) | None | Loose |
| ICU collation | No | 10+ | Extension | No (own rules) |
| Normalization | Manual | normalize() (13+) | Manual | Manual |
Common Mistakes
- MySQL utf8 instead of utf8mb4 — the number one cause of emoji being replaced with ? or causing insert errors.
- Forgetting connection charset — the database is utf8mb4 but the connection sends latin1, causing double-encoding.
- SQL Server without N prefix — 'café' is interpreted in the server's code page; N'café' is Unicode.
- Comparing unnormalized strings — café (NFC) and café (NFD) may not match in a binary collation (see the sketch after this list).
- Index length limits — utf8mb4 uses 4 bytes per character; a VARCHAR(255) column needs 1,020 bytes of index space.
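The normalization pitfall above can be demonstrated directly in PostgreSQL 13+ using Unicode-escape literals; equality under the default deterministic collation comes down to a byte comparison:
SELECT U&'caf\00E9' = U&'cafe\0301';                        -- false: NFC and NFD bytes differ
SELECT normalize(U&'caf\00E9') = normalize(U&'cafe\0301');  -- true after NFC normalization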
Choosing the right character set, collation, and column type at the start of a
project is far easier than migrating later. When in doubt: PostgreSQL with UTF-8,
MySQL with utf8mb4, SQLite with default UTF-8, and SQL Server with NVARCHAR.