💻 Unicode in Code

Unicode in SQL

SQL databases store text in encodings and collations that determine how characters are saved, compared, and sorted, with UTF-8 and UTF-16 being the most common choices. This guide covers Unicode in MySQL, PostgreSQL, SQLite, and SQL Server, explaining how to choose the right charset, collation, and column type for international data.


Database Unicode support is one of the most common sources of encoding bugs in production systems. A query that works perfectly in development can silently truncate emoji, mangle accented characters, or produce wrong sort orders when the database, table, or column uses the wrong character set or collation. This guide covers Unicode handling across the four most popular relational databases: MySQL, PostgreSQL, SQLite, and SQL Server.

The Core Problem

When you store the string "café🐍" in a database column, three things must agree:

  1. Character set — what repertoire of characters can be stored (e.g., Unicode)
  2. Encoding — how those characters are stored as bytes (e.g., UTF-8)
  3. Collation — how strings are compared and sorted

If any one of these is wrong, you get data loss, mojibake, or incorrect query results.
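The failure modes are easy to reproduce outside a database. A minimal Python sketch of the mismatch — the classic mojibake round trip, where valid UTF-8 bytes are decoded with the wrong charset:

```python
# "café" encoded as UTF-8 but decoded as Latin-1: classic mojibake.
text = "café"
utf8_bytes = text.encode("utf-8")        # b'caf\xc3\xa9' (é is 2 bytes)
mangled = utf8_bytes.decode("latin-1")   # 'cafÃ©' — each byte read as one char
print(mangled)

# Decoding the same bytes with the right charset recovers the text.
print(utf8_bytes.decode("utf-8"))        # café
```

This is exactly what happens when a database column holds UTF-8 bytes but the client connection declares latin1.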

MySQL / MariaDB

MySQL has the most confusing Unicode setup of any major database, primarily because of a historical naming mistake.

utf8 vs utf8mb4

MySQL's utf8 character set is not real UTF-8. It supports only up to 3 bytes per character, which means it cannot store characters above U+FFFF — including all emoji, many CJK extension characters, and musical symbols.

| Character Set | Max Bytes | Unicode Range | Emoji Support |
|---|---|---|---|
| utf8 (utf8mb3) | 3 | BMP only (U+0000–U+FFFF) | No |
| utf8mb4 | 4 | Full Unicode (U+0000–U+10FFFF) | Yes |

Always use utf8mb4. The utf8 alias was a deliberate compromise made in MySQL 4.1 (2004) to save one byte per character in index calculations. It has caused untold data loss since then.
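The 3-byte ceiling is easy to check directly: any character above U+FFFF needs four bytes in UTF-8, which is exactly what MySQL's old utf8 cannot store. A quick Python check:

```python
# 🐍 is U+1F40D, well above the U+FFFF ceiling of MySQL's 3-byte utf8.
snake = "🐍"
print(hex(ord(snake)))               # 0x1f40d
print(len(snake.encode("utf-8")))    # 4 bytes in UTF-8 -> needs utf8mb4

# A BMP character like é fits within the old 3-byte utf8.
print(len("é".encode("utf-8")))      # 2
```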

-- Create a database with full Unicode support
CREATE DATABASE myapp
  CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_ci;

-- Create a table
CREATE TABLE users (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
    bio TEXT CHARACTER SET utf8mb4
);

-- Insert emoji safely
INSERT INTO users (name, bio) VALUES ('Alice', 'Loves snakes 🐍');

MySQL Collations

A collation determines how strings are compared and sorted:

| Collation | Behavior | Use Case |
|---|---|---|
| utf8mb4_general_ci | Fast, approximate Unicode sorting | Legacy apps |
| utf8mb4_unicode_ci | UCA-based, more correct sorting | Most applications |
| utf8mb4_0900_ai_ci | Unicode 9.0 rules, accent- and case-insensitive (MySQL 8.0+) | Modern apps |
| utf8mb4_0900_as_cs | Accent-sensitive, case-sensitive | Exact matching |
| utf8mb4_bin | Binary comparison (byte-by-byte) | Hash lookups, exact match |

The ci suffix means case-insensitive, cs means case-sensitive, ai means accent-insensitive, as means accent-sensitive.

-- MySQL 8.0 default collation
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

-- Check current settings
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';

Connection Charset

Setting the database to utf8mb4 is not enough — the connection must also use utf8mb4:

SET NAMES utf8mb4;
-- or equivalently:
SET character_set_client = utf8mb4;
SET character_set_connection = utf8mb4;
SET character_set_results = utf8mb4;

In application code:

# Python (PyMySQL)
connection = pymysql.connect(host='localhost', charset='utf8mb4')

// PHP (PDO)
$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8mb4', $user, $pass);

// Node.js (mysql2)
const conn = await mysql.createConnection({ charset: 'utf8mb4' });

Migrating from utf8 to utf8mb4

-- Convert database
ALTER DATABASE myapp CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Convert each table
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Check for index length issues (VARCHAR(255) in utf8mb4 = 1020 bytes)
-- InnoDB max key length is 3072 bytes (MySQL 5.7+), 767 bytes (MySQL 5.6)

PostgreSQL

PostgreSQL has the simplest and most correct Unicode setup. Its TEXT type stores variable-length strings with no artificial length limit, and encoding is set at database creation.

Database Encoding

-- Create database with UTF-8 (the default on most modern systems)
CREATE DATABASE myapp
  ENCODING 'UTF8'
  LC_COLLATE 'en_US.UTF-8'
  LC_CTYPE 'en_US.UTF-8';

-- Check encoding
SHOW server_encoding;    -- UTF8
SHOW client_encoding;    -- UTF8

PostgreSQL validates that all text is valid UTF-8 on insertion. Invalid byte sequences are rejected with an error, preventing data corruption:

-- This will ERROR if the bytes are not valid UTF-8
INSERT INTO users (name) VALUES (E'\x80invalid');
-- ERROR: invalid byte sequence for encoding "UTF8": 0x80
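Python's UTF-8 decoder rejects the same byte, which makes it a convenient stand-in for checking data before it reaches the database — a sketch of the analogous behavior:

```python
# 0x80 is a continuation byte and cannot start a UTF-8 sequence.
raw = b"\x80invalid"
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc.reason)   # invalid start byte

# errors="replace" is a lossy fallback when rejection is not an option.
print(raw.decode("utf-8", errors="replace"))  # �invalid
```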

Text Types

| Type | Description |
|---|---|
| TEXT | Unlimited variable-length string (preferred) |
| VARCHAR(n) | Variable-length with max n characters (not bytes) |
| CHAR(n) | Fixed-length, space-padded (rarely useful) |

PostgreSQL's VARCHAR(n) counts characters, not bytes — unlike MySQL where the behavior depends on the character set.
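The character-vs-byte distinction is easy to see in Python, where len() counts code points and encoding exposes the byte count:

```python
s = "café🐍"
print(len(s))                  # 5 code points -> fits in VARCHAR(5) in PostgreSQL
print(len(s.encode("utf-8")))  # 9 bytes (1+1+1+2+4) -> what MySQL index limits count
```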

ICU Collations (PostgreSQL 10+)

PostgreSQL 10 introduced ICU (International Components for Unicode) collation support, bringing locale-aware sorting that behaves consistently across operating systems; PostgreSQL 15 added ICU as a database-level locale provider:

-- Create a column with German collation
CREATE TABLE products (
    name TEXT COLLATE "de-x-icu"
);

-- Or at the database level (LOCALE_PROVIDER: PostgreSQL 15+)
CREATE DATABASE myapp
  ENCODING 'UTF8'
  LOCALE_PROVIDER icu
  ICU_LOCALE 'und'
  ICU_RULES '&V << w <<< W';  -- custom sort rules (PostgreSQL 16+)

Useful Functions

-- Character length (code points)
SELECT length('café🐍');           -- 5

-- Byte length
SELECT octet_length('café🐍');     -- 9

-- Unicode code point
SELECT ascii('→');                  -- 8594

-- Character from code point
SELECT chr(8594);                   -- →

-- Normalize (PostgreSQL 13+)
SELECT normalize('cafe' || E'\u0301', NFC);  -- café (precomposed)

-- Regular expressions with Unicode
SELECT 'café' ~ '\mcaf';           -- true (word boundary match)
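The same checks can be reproduced with Python's standard library, which is handy for sanity-checking what the database returns (unicodedata.normalize mirrors PostgreSQL's normalize):

```python
import unicodedata

s = "café🐍"
print(len(s))                  # 5  (length: code points)
print(len(s.encode("utf-8")))  # 9  (octet_length: bytes)
print(ord("→"))                # 8594  (ascii)
print(chr(8594))               # →  (chr)

# NFC normalization: "cafe" + combining acute becomes the precomposed form.
decomposed = "cafe\u0301"
print(unicodedata.normalize("NFC", decomposed))  # café
```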

SQLite

SQLite's Unicode support is minimal by design but sufficient for most use cases.

Encoding

SQLite stores all text as UTF-8 internally by default (UTF-16 can be selected per database with PRAGMA encoding before the first table is created). The TEXT affinity accepts any valid UTF-8 string:

CREATE TABLE notes (
    id INTEGER PRIMARY KEY,
    content TEXT
);

INSERT INTO notes (content) VALUES ('café🐍日本語');
SELECT length(content) FROM notes;  -- 8 (code points, not bytes)

length() counts Unicode code points in SQLite, which is correct for most purposes.
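Python's built-in sqlite3 module can verify this directly:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, content TEXT)")
conn.execute("INSERT INTO notes (content) VALUES ('café🐍日本語')")

# length() counts code points, even though the UTF-8 storage is far longer.
(n,) = conn.execute("SELECT length(content) FROM notes").fetchone()
print(n)   # 8
```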

Collation Limitations

SQLite's built-in collation only handles ASCII case-insensitive comparison. The NOCASE collation folds ASCII A–Z to a–z but does not handle é = É or ü = Ü:

-- ASCII-only case folding
SELECT 'café' = 'CAFÉ' COLLATE NOCASE;  -- 0 (false!) — only ASCII folded
SELECT 'cafe' = 'CAFE' COLLATE NOCASE;  -- 1 (true)
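This is easy to verify from Python's built-in sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
(ascii_eq,) = conn.execute("SELECT 'cafe' = 'CAFE' COLLATE NOCASE").fetchone()
(accent_eq,) = conn.execute("SELECT 'café' = 'CAFÉ' COLLATE NOCASE").fetchone()
print(ascii_eq, accent_eq)   # 1 0 — é/É are not folded by NOCASE
```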

For proper Unicode collation, you must register a custom collation function via your application or use the ICU extension:

# Python: register a Unicode-aware collation
import sqlite3
import locale

# Use the environment's locale; without this, strcoll falls back to the C locale
# and compares by code point, just like SQLite's default BINARY collation.
locale.setlocale(locale.LC_COLLATE, '')

def unicode_collation(a, b):
    return locale.strcoll(a, b)

conn = sqlite3.connect(':memory:')
conn.create_collation('UNICODE', unicode_collation)
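Because locale.strcoll depends on the process locale, a deterministic alternative is a collation built on str.casefold, which gives Unicode-aware case-insensitive matching with no locale setup (CASEFOLD is a name chosen here, not a SQLite built-in):

```python
import sqlite3

def casefold_collation(a: str, b: str) -> int:
    # Unicode-aware case-insensitive comparison via str.casefold.
    a, b = a.casefold(), b.casefold()
    return (a > b) - (a < b)

conn = sqlite3.connect(":memory:")
conn.create_collation("CASEFOLD", casefold_collation)

# Unlike NOCASE, this folds non-ASCII letters such as É -> é.
(eq,) = conn.execute("SELECT 'café' = 'CAFÉ' COLLATE CASEFOLD").fetchone()
print(eq)   # 1
```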

The ICU Extension

SQLite's optional ICU extension provides LIKE, UPPER(), LOWER(), and collation support for all Unicode characters:

-- With ICU extension loaded
SELECT upper('café');      -- 'CAFÉ' (correct)
SELECT 'café' LIKE 'CAF_'; -- 1 (true, Unicode-aware LIKE)

SQL Server

SQL Server uses a two-tier system: traditional code-page-based types (CHAR, VARCHAR) and Unicode types (NCHAR, NVARCHAR).

Unicode Types: The N Prefix

| Type | Encoding | Max Size | Unicode |
|---|---|---|---|
| CHAR(n) | Code page (e.g., Latin1) | 8,000 bytes | No |
| VARCHAR(n) | Code page | 8,000 bytes / MAX | No |
| NCHAR(n) | UCS-2 / UTF-16 | 4,000 chars | Yes |
| NVARCHAR(n) | UCS-2 / UTF-16 | 4,000 chars / MAX | Yes |

Always use NVARCHAR for text that may contain non-ASCII characters.

-- Store Unicode text
CREATE TABLE users (
    id INT IDENTITY PRIMARY KEY,
    name NVARCHAR(100),      -- Unicode-safe
    legacy_name VARCHAR(100)  -- ASCII/code-page only
);

-- The N prefix for Unicode string literals
INSERT INTO users (name) VALUES (N'café🐍');
INSERT INTO users (name) VALUES (N'日本語');

-- Without N prefix, the literal is interpreted as the database's code page
INSERT INTO users (legacy_name) VALUES ('café');  -- may lose characters

Supplementary Characters (SQL Server 2012+)

NVARCHAR storage is always a sequence of 16-bit code units, but older collations treat it as UCS-2 and see a character above U+FFFF (stored as a surrogate pair) as two separate characters in string functions. Collations whose names include _SC (Supplementary Characters, SQL Server 2012+) or _UTF8 make functions such as LEN and SUBSTRING handle surrogate pairs correctly:

-- Collation with supplementary character support
ALTER DATABASE mydb COLLATE Latin1_General_100_CI_AS_SC;

-- UTF-8 collation (SQL Server 2019+)
ALTER DATABASE mydb COLLATE Latin1_General_100_CI_AS_SC_UTF8;

With the _UTF8 collation suffix (SQL Server 2019+), even VARCHAR columns store UTF-8 data, combining storage efficiency with full Unicode support.
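The storage trade-off behind that choice is easy to quantify: UTF-8 halves the size of ASCII-heavy text relative to UTF-16, while most CJK text is smaller in UTF-16. A Python comparison:

```python
ascii_text = "hello world"
cjk_text = "日本語のテキスト"

print(len(ascii_text.encode("utf-8")))      # 11 bytes
print(len(ascii_text.encode("utf-16-le")))  # 22 bytes — UTF-16 doubles ASCII
print(len(cjk_text.encode("utf-8")))        # 24 bytes (3 per character)
print(len(cjk_text.encode("utf-16-le")))    # 16 bytes (2 per character)
```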

Useful Functions

-- Character length
SELECT LEN(N'café🐍');          -- 5 with an _SC collation (6 otherwise: 🐍 counts as two UTF-16 code units)

-- Byte length
SELECT DATALENGTH(N'café🐍');   -- 12 (UTF-16: 4*2 + 2*2 for surrogate pair)

-- Unicode code point
SELECT UNICODE(N'→');            -- 8594

-- Character from code point
SELECT NCHAR(8594);              -- →
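The DATALENGTH arithmetic can be checked by encoding the same string as UTF-16 in Python:

```python
s = "café🐍"
utf16 = s.encode("utf-16-le")
print(len(utf16))       # 12 bytes: four BMP chars * 2 + one surrogate pair * 4
print(len(utf16) // 2)  # 6 UTF-16 code units (what LEN returns without _SC)
```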

Cross-Database Comparison

| Feature | MySQL | PostgreSQL | SQLite | SQL Server |
|---|---|---|---|---|
| Full UTF-8 | utf8mb4 | Default | Default | _UTF8 collation |
| Emoji support | utf8mb4 only | Yes | Yes | NVARCHAR + _SC |
| Length function | Characters | Characters | Code points | Characters |
| Encoding validation | Loose | Strict (rejects invalid) | None | Loose |
| ICU collation | No | Yes (10+) | Extension | No (own rules) |
| Normalization | Manual | normalize() (13+) | Manual | Manual |

Common Mistakes

  1. MySQL utf8 instead of utf8mb4 — the number one cause of emoji being replaced with ? or causing insert errors.
  2. Forgetting connection charset — database is utf8mb4 but connection sends latin1, causing double-encoding.
  3. SQL Server without N prefix — 'café' is interpreted in the server's code page; N'café' is Unicode.
  4. Comparing unnormalized strings — café (NFC) and café (NFD) may not match in a binary collation.
  5. Index length limits — utf8mb4 uses 4 bytes per character; a VARCHAR(255) column needs 1,020 bytes of index space.
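Mistake 2, the double-encoding failure, can be reproduced (and repaired) in a few lines of Python:

```python
# UTF-8 bytes sent over a connection that declares latin1 get encoded twice.
good = "café".encode("utf-8")                     # b'caf\xc3\xa9'
doubled = good.decode("latin-1").encode("utf-8")  # b'caf\xc3\x83\xc2\xa9'
print(doubled.decode("utf-8"))                    # cafÃ© — the stored mojibake

# Repair: undo the wrong step, then decode correctly.
repaired = doubled.decode("utf-8").encode("latin-1").decode("utf-8")
print(repaired)                                   # café
```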

Choosing the right character set, collation, and column type at the start of a project is far easier than migrating later. When in doubt: PostgreSQL with UTF-8, MySQL with utf8mb4, SQLite with default UTF-8, and SQL Server with NVARCHAR.
