💻 Unicode in Code

Unicode in SQL

SQL databases store text in encodings and collations that determine how characters are saved, compared, and sorted, with UTF-8 and UTF-16 being the most common choices. This guide covers Unicode in MySQL, PostgreSQL, SQLite, and SQL Server, explaining how to choose the right charset, collation, and column type for international data.


Database Unicode support is one of the most common sources of encoding bugs in production systems. A query that works perfectly in development can silently truncate emoji, mangle accented characters, or produce wrong sort orders when the database, table, or column uses the wrong character set or collation. This guide covers Unicode handling across the four most popular relational databases: MySQL, PostgreSQL, SQLite, and SQL Server.

The Core Problem

When you store the string "café🐍" in a database column, three things must agree:

  1. Character set — what repertoire of characters can be stored (e.g., Unicode)
  2. Encoding — how those characters are stored as bytes (e.g., UTF-8)
  3. Collation — how strings are compared and sorted

If any one of these is wrong, you get data loss, mojibake, or incorrect query results.
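The failure modes are easy to reproduce outside a database. A minimal Python sketch of the mismatch — the classic mojibake round trip, where valid UTF-8 bytes are decoded with the wrong charset:

```python
# "café" encoded as UTF-8 but decoded as Latin-1: classic mojibake.
text = "café"
utf8_bytes = text.encode("utf-8")        # b'caf\xc3\xa9' (é is 2 bytes)
mangled = utf8_bytes.decode("latin-1")   # 'cafÃ©' — each byte read as one char
print(mangled)

# Decoding the same bytes with the right charset recovers the text.
print(utf8_bytes.decode("utf-8"))        # café
```

This is exactly what happens when a database column holds UTF-8 bytes but the client connection declares latin1.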

MySQL / MariaDB

MySQL has the most confusing Unicode setup of any major database, primarily because of a historical naming mistake.

utf8 vs utf8mb4

MySQL's utf8 character set is not real UTF-8. It supports only up to 3 bytes per character, which means it cannot store characters above U+FFFF — including all emoji, many CJK extension characters, and musical symbols.

| Character Set | Max Bytes | Unicode Range | Emoji Support |
|---|---|---|---|
| utf8 (utf8mb3) | 3 | BMP only (U+0000–U+FFFF) | No |
| utf8mb4 | 4 | Full Unicode (U+0000–U+10FFFF) | Yes |

Always use utf8mb4. The utf8 alias was a deliberate compromise made in MySQL 4.1 (2004) to save one byte per character in index calculations. It has caused untold data loss since then.
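The 3-byte ceiling is easy to check directly: any character above U+FFFF needs four bytes in UTF-8, which is exactly what MySQL's old utf8 cannot store. A quick Python check:

```python
# 🐍 is U+1F40D, well above the U+FFFF ceiling of MySQL's 3-byte utf8.
snake = "🐍"
print(hex(ord(snake)))               # 0x1f40d
print(len(snake.encode("utf-8")))    # 4 bytes in UTF-8 -> needs utf8mb4

# A BMP character like é fits within the old 3-byte utf8.
print(len("é".encode("utf-8")))      # 2
```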

-- Create a database with full Unicode support
CREATE DATABASE myapp
  CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_ci;

-- Create a table
CREATE TABLE users (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
    bio TEXT CHARACTER SET utf8mb4
);

-- Insert emoji safely
INSERT INTO users (name, bio) VALUES ('Alice', 'Loves snakes 🐍');

MySQL Collations

A collation determines how strings are compared and sorted:

| Collation | Behavior | Use Case |
|---|---|---|
| utf8mb4_general_ci | Fast, approximate Unicode sorting | Legacy apps |
| utf8mb4_unicode_ci | UCA-based, more correct sorting | Most applications |
| utf8mb4_0900_ai_ci | Unicode 9.0 rules, accent- and case-insensitive (MySQL 8.0+) | Modern apps |
| utf8mb4_0900_as_cs | Accent-sensitive, case-sensitive | Exact matching |
| utf8mb4_bin | Binary comparison (byte-by-byte) | Hash lookups, exact match |

The ci suffix means case-insensitive, cs means case-sensitive, ai means accent-insensitive, as means accent-sensitive.

-- MySQL 8.0 default collation
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

-- Check current settings
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';

Connection Charset

Setting the database to utf8mb4 is not enough — the connection must also use utf8mb4:

SET NAMES utf8mb4;
-- or equivalently:
SET character_set_client = utf8mb4;
SET character_set_connection = utf8mb4;
SET character_set_results = utf8mb4;

In application code:

# Python (PyMySQL)
connection = pymysql.connect(host='localhost', charset='utf8mb4')

// PHP (PDO)
$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8mb4', $user, $pass);

// Node.js (mysql2)
const conn = await mysql.createConnection({ charset: 'utf8mb4' });

Migrating from utf8 to utf8mb4

-- Convert database
ALTER DATABASE myapp CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Convert each table
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Check for index length issues (VARCHAR(255) in utf8mb4 = 1020 bytes)
-- InnoDB max key length is 3072 bytes (MySQL 5.7+), 767 bytes (MySQL 5.6)

PostgreSQL

PostgreSQL has the simplest and most correct Unicode setup. Its TEXT type stores variable-length strings with no artificial length limit, and encoding is set at database creation.

Database Encoding

-- Create database with UTF-8 (the default on most modern systems)
CREATE DATABASE myapp
  ENCODING 'UTF8'
  LC_COLLATE 'en_US.UTF-8'
  LC_CTYPE 'en_US.UTF-8';

-- Check encoding
SHOW server_encoding;    -- UTF8
SHOW client_encoding;    -- UTF8

PostgreSQL validates that all text is valid UTF-8 on insertion. Invalid byte sequences are rejected with an error, preventing data corruption:

-- This will ERROR if the bytes are not valid UTF-8
INSERT INTO users (name) VALUES (E'\x80invalid');
-- ERROR: invalid byte sequence for encoding "UTF8": 0x80
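Python's UTF-8 decoder rejects the same byte, which makes it a convenient stand-in for checking data before it reaches the database — a sketch of the analogous behavior:

```python
# 0x80 is a continuation byte and cannot start a UTF-8 sequence.
raw = b"\x80invalid"
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc.reason)   # invalid start byte

# errors="replace" is a lossy fallback when rejection is not an option.
print(raw.decode("utf-8", errors="replace"))  # �invalid
```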

Text Types

| Type | Description |
|---|---|
| TEXT | Unlimited variable-length string (preferred) |
| VARCHAR(n) | Variable-length with max n characters (not bytes) |
| CHAR(n) | Fixed-length, space-padded (rarely useful) |

PostgreSQL's VARCHAR(n) counts characters, not bytes — unlike MySQL where the behavior depends on the character set.
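The character-vs-byte distinction is easy to see in Python, where len() counts code points and encoding exposes the byte count:

```python
s = "café🐍"
print(len(s))                  # 5 code points -> fits in VARCHAR(5) in PostgreSQL
print(len(s.encode("utf-8")))  # 9 bytes (1+1+1+2+4) -> what MySQL index limits count
```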

ICU Collations (PostgreSQL 10+)

PostgreSQL 10 introduced ICU (International Components for Unicode) collation support, bringing locale-aware sorting that behaves consistently across operating systems; PostgreSQL 15 added ICU as a database-level locale provider:

-- Create a column with German collation
CREATE TABLE products (
    name TEXT COLLATE "de-x-icu"
);

-- Or at the database level (LOCALE_PROVIDER: PostgreSQL 15+)
CREATE DATABASE myapp
  ENCODING 'UTF8'
  LOCALE_PROVIDER icu
  ICU_LOCALE 'und'
  ICU_RULES '&V << w <<< W';  -- custom sort rules (PostgreSQL 16+)

Useful Functions

-- Character length (code points)
SELECT length('café🐍');           -- 5

-- Byte length
SELECT octet_length('café🐍');     -- 9

-- Unicode code point
SELECT ascii('→');                  -- 8594

-- Character from code point
SELECT chr(8594);                   -- →

-- Normalize (PostgreSQL 13+)
SELECT normalize('cafe' || E'\u0301', NFC);  -- café (precomposed)

-- Regular expressions with Unicode
SELECT 'café' ~ '\mcaf';           -- true (word boundary match)
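The same checks can be reproduced with Python's standard library, which is handy for sanity-checking what the database returns (unicodedata.normalize mirrors PostgreSQL's normalize):

```python
import unicodedata

s = "café🐍"
print(len(s))                  # 5  (length: code points)
print(len(s.encode("utf-8")))  # 9  (octet_length: bytes)
print(ord("→"))                # 8594  (ascii)
print(chr(8594))               # →  (chr)

# NFC normalization: "cafe" + combining acute becomes the precomposed form.
decomposed = "cafe\u0301"
print(unicodedata.normalize("NFC", decomposed))  # café
```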

SQLite

SQLite's Unicode support is minimal by design but sufficient for most use cases.

Encoding

SQLite stores all text as UTF-8 internally by default (UTF-16 can be selected per database with PRAGMA encoding before the first table is created). The TEXT affinity accepts any valid UTF-8 string:

CREATE TABLE notes (
    id INTEGER PRIMARY KEY,
    content TEXT
);

INSERT INTO notes (content) VALUES ('café🐍日本語');
SELECT length(content) FROM notes;  -- 8 (code points, not bytes)

length() counts Unicode code points in SQLite, which is correct for most purposes.
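Python's built-in sqlite3 module can verify this directly:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, content TEXT)")
conn.execute("INSERT INTO notes (content) VALUES ('café🐍日本語')")

# length() counts code points, even though the UTF-8 storage is far longer.
(n,) = conn.execute("SELECT length(content) FROM notes").fetchone()
print(n)   # 8
```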

Collation Limitations

SQLite's built-in collation only handles ASCII case-insensitive comparison. The NOCASE collation folds ASCII A–Z to a–z but does not handle é = É or ü = Ü:

-- ASCII-only case folding
SELECT 'café' = 'CAFÉ' COLLATE NOCASE;  -- 0 (false!) — only ASCII folded
SELECT 'cafe' = 'CAFE' COLLATE NOCASE;  -- 1 (true)
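This is easy to verify from Python's built-in sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
(ascii_eq,) = conn.execute("SELECT 'cafe' = 'CAFE' COLLATE NOCASE").fetchone()
(accent_eq,) = conn.execute("SELECT 'café' = 'CAFÉ' COLLATE NOCASE").fetchone()
print(ascii_eq, accent_eq)   # 1 0 — é/É are not folded by NOCASE
```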

For proper Unicode collation, you must register a custom collation function via your application or use the ICU extension:

# Python: register a Unicode-aware collation
import sqlite3
import locale

# Use the environment's locale; without this, strcoll falls back to the C locale
# and compares by code point, just like SQLite's default BINARY collation.
locale.setlocale(locale.LC_COLLATE, '')

def unicode_collation(a, b):
    return locale.strcoll(a, b)

conn = sqlite3.connect(':memory:')
conn.create_collation('UNICODE', unicode_collation)
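Because locale.strcoll depends on the process locale, a deterministic alternative is a collation built on str.casefold, which gives Unicode-aware case-insensitive matching with no locale setup (CASEFOLD is a name chosen here, not a SQLite built-in):

```python
import sqlite3

def casefold_collation(a: str, b: str) -> int:
    # Unicode-aware case-insensitive comparison via str.casefold.
    a, b = a.casefold(), b.casefold()
    return (a > b) - (a < b)

conn = sqlite3.connect(":memory:")
conn.create_collation("CASEFOLD", casefold_collation)

# Unlike NOCASE, this folds non-ASCII letters such as É -> é.
(eq,) = conn.execute("SELECT 'café' = 'CAFÉ' COLLATE CASEFOLD").fetchone()
print(eq)   # 1
```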

The ICU Extension

SQLite's optional ICU extension provides LIKE, UPPER(), LOWER(), and collation support for all Unicode characters:

-- With ICU extension loaded
SELECT upper('café');      -- 'CAFÉ' (correct)
SELECT 'café' LIKE 'CAF_'; -- 1 (true, Unicode-aware LIKE)

SQL Server

SQL Server uses a two-tier system: traditional code-page-based types (CHAR, VARCHAR) and Unicode types (NCHAR, NVARCHAR).

Unicode Types: The N Prefix

| Type | Encoding | Max Size | Unicode |
|---|---|---|---|
| CHAR(n) | Code page (e.g., Latin1) | 8,000 bytes | No |
| VARCHAR(n) | Code page | 8,000 bytes / MAX | No |
| NCHAR(n) | UCS-2 / UTF-16 | 4,000 chars | Yes |
| NVARCHAR(n) | UCS-2 / UTF-16 | 4,000 chars / MAX | Yes |

Always use NVARCHAR for text that may contain non-ASCII characters.

-- Store Unicode text
CREATE TABLE users (
    id INT IDENTITY PRIMARY KEY,
    name NVARCHAR(100),      -- Unicode-safe
    legacy_name VARCHAR(100)  -- ASCII/code-page only
);

-- The N prefix for Unicode string literals
INSERT INTO users (name) VALUES (N'café🐍');
INSERT INTO users (name) VALUES (N'日本語');

-- Without N prefix, the literal is interpreted as the database's code page
INSERT INTO users (legacy_name) VALUES ('café');  -- may lose characters

Supplementary Characters (SQL Server 2012+)

NVARCHAR storage is always a sequence of 16-bit code units, but older collations treat it as UCS-2 and see a character above U+FFFF (stored as a surrogate pair) as two separate characters in string functions. Collations whose names include _SC (Supplementary Characters, SQL Server 2012+) or _UTF8 make functions such as LEN and SUBSTRING handle surrogate pairs correctly:

-- Collation with supplementary character support
ALTER DATABASE mydb COLLATE Latin1_General_100_CI_AS_SC;

-- UTF-8 collation (SQL Server 2019+)
ALTER DATABASE mydb COLLATE Latin1_General_100_CI_AS_SC_UTF8;

With the _UTF8 collation suffix (SQL Server 2019+), even VARCHAR columns store UTF-8 data, combining storage efficiency with full Unicode support.
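The storage trade-off behind that choice is easy to quantify: UTF-8 halves the size of ASCII-heavy text relative to UTF-16, while most CJK text is smaller in UTF-16. A Python comparison:

```python
ascii_text = "hello world"
cjk_text = "日本語のテキスト"

print(len(ascii_text.encode("utf-8")))      # 11 bytes
print(len(ascii_text.encode("utf-16-le")))  # 22 bytes — UTF-16 doubles ASCII
print(len(cjk_text.encode("utf-8")))        # 24 bytes (3 per character)
print(len(cjk_text.encode("utf-16-le")))    # 16 bytes (2 per character)
```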

Useful Functions

-- Character length
SELECT LEN(N'café🐍');          -- 5 with an _SC collation (6 otherwise: 🐍 counts as two UTF-16 code units)

-- Byte length
SELECT DATALENGTH(N'café🐍');   -- 12 (UTF-16: 4*2 + 2*2 for surrogate pair)

-- Unicode code point
SELECT UNICODE(N'→');            -- 8594

-- Character from code point
SELECT NCHAR(8594);              -- →
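The DATALENGTH arithmetic can be checked by encoding the same string as UTF-16 in Python:

```python
s = "café🐍"
utf16 = s.encode("utf-16-le")
print(len(utf16))       # 12 bytes: four BMP chars * 2 + one surrogate pair * 4
print(len(utf16) // 2)  # 6 UTF-16 code units (what LEN returns without _SC)
```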

Cross-Database Comparison

| Feature | MySQL | PostgreSQL | SQLite | SQL Server |
|---|---|---|---|---|
| Full UTF-8 | utf8mb4 | Default | Default | _UTF8 collation |
| Emoji support | utf8mb4 only | Yes | Yes | NVARCHAR + _SC |
| Length function | Characters | Characters | Code points | Characters |
| Encoding validation | Loose | Strict (rejects invalid) | None | Loose |
| ICU collation | No | Yes (10+) | Extension | No (own rules) |
| Normalization | Manual | normalize() (13+) | Manual | Manual |

Common Mistakes

  1. MySQL utf8 instead of utf8mb4 — the number one cause of emoji being replaced with ? or causing insert errors.
  2. Forgetting connection charset — database is utf8mb4 but connection sends latin1, causing double-encoding.
  3. SQL Server without N prefix — 'café' is interpreted in the server's code page; N'café' is Unicode.
  4. Comparing unnormalized strings — café (NFC) and café (NFD) may not match in a binary collation.
  5. Index length limits — utf8mb4 uses 4 bytes per character; a VARCHAR(255) column needs 1,020 bytes of index space.
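Mistake 2, the double-encoding failure, can be reproduced (and repaired) in a few lines of Python:

```python
# UTF-8 bytes sent over a connection that declares latin1 get encoded twice.
good = "café".encode("utf-8")                     # b'caf\xc3\xa9'
doubled = good.decode("latin-1").encode("utf-8")  # b'caf\xc3\x83\xc2\xa9'
print(doubled.decode("utf-8"))                    # cafÃ© — the stored mojibake

# Repair: undo the wrong step, then decode correctly.
repaired = doubled.decode("utf-8").encode("latin-1").decode("utf-8")
print(repaired)                                   # café
```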

Choosing the right character set, collation, and column type at the start of a project is far easier than migrating later. When in doubt: PostgreSQL with UTF-8, MySQL with utf8mb4, SQLite with default UTF-8, and SQL Server with NVARCHAR.
