
Unicode in PHP

PHP's built-in string functions operate on bytes rather than Unicode characters, which means strlen, substr, and similar functions produce wrong results for multibyte text without the mbstring extension. This guide explains how to handle Unicode correctly in PHP using mb_ functions, intl, and modern PHP practices.


PHP has a complicated relationship with Unicode. Unlike Python 3 or Swift, PHP strings are not sequences of Unicode code points — they are raw byte sequences. A string in PHP is literally a counted array of bytes with no inherent encoding. This design dates back to PHP 3 (1998), long before UTF-8 became the web's dominant encoding. Modern PHP (8.x) has excellent Unicode support, but only if you use the right functions. This guide covers everything you need to handle Unicode safely in PHP.

PHP Strings Are Byte Sequences

The single most important fact about PHP strings: strlen() counts bytes, not characters.

$text = "café";
echo strlen($text);     // 5 — not 4! The "é" is 2 bytes in UTF-8
echo $text[3];          // first byte of "é" — a broken character

$emoji = "🐍";
echo strlen($emoji);    // 4 — the snake emoji is 4 bytes in UTF-8

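PHP string functions like substr(), strpos(), strtolower(), and str_replace() all operate on raw bytes. On multi-byte UTF-8 text this causes quiet bugs: substr() can slice through the middle of a character, strpos() returns byte offsets rather than character positions, and strtolower() does not convert non-ASCII letters correctly. Plain search-and-replace with str_replace() is safe on valid UTF-8, because one character's byte sequence never appears inside another's, but anything that counts, indexes, or changes case needs an encoding-aware function.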
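
To make the byte-level behavior concrete, the short sketch below slices the same string with substr() and mb_substr() and checks each result with mb_check_encoding(). The variable names are illustrative, and the snippet assumes the source file is saved as UTF-8.

```php
<?php
$text = "café";                 // 5 bytes in UTF-8: 63 61 66 c3 a9

// bin2hex() exposes the raw bytes that PHP actually stores.
$hex = bin2hex($text);          // "636166c3a9"

// Cutting after 4 bytes keeps only the first half of "é".
$broken = substr($text, 0, 4);  // "caf\xC3"
$brokenIsValid = mb_check_encoding($broken, 'UTF-8');   // false

// The character-aware cut keeps the string valid.
$intact = mb_substr($text, 0, 4, 'UTF-8');              // "café"
$intactIsValid = mb_check_encoding($intact, 'UTF-8');   // true

var_dump($hex, $brokenIsValid, $intactIsValid);
```

The broken string looks almost right when printed, which is why this class of bug often goes unnoticed until the data reaches a JSON encoder or a database.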

The mbstring Extension

PHP's primary Unicode support comes from the mbstring (multi-byte string) extension. It ships with PHP but may need to be enabled in php.ini:

extension=mbstring

The mb_* functions mirror the standard string functions but are encoding-aware:

$text = "café";

// Byte-level (wrong for Unicode)
echo strlen($text);          // 5
echo strtoupper($text);      // "CAFé" — é is not uppercased

// Character-level (correct)
echo mb_strlen($text, 'UTF-8');          // 4
echo mb_strtoupper($text, 'UTF-8');      // "CAFÉ"
echo mb_substr($text, 0, 3, 'UTF-8');   // "caf"
echo mb_strpos($text, 'é', 0, 'UTF-8'); // 3

Key mb_* Functions

| Standard function | mb_* equivalent | Purpose |
| --- | --- | --- |
| strlen() | mb_strlen() | Character count |
| strpos() | mb_strpos() | Find position |
| substr() | mb_substr() | Extract substring |
| strtolower() | mb_strtolower() | Lowercase |
| strtoupper() | mb_strtoupper() | Uppercase |
| substr_count() | mb_substr_count() | Count occurrences |
| str_split() | mb_str_split() | Split to array (PHP 7.4+) |
| (none) | mb_detect_encoding() | Guess encoding |
| (none) | mb_convert_encoding() | Convert between encodings |
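
As an illustration of combining these functions, here is a minimal sketch of a character-aware truncation helper. The function name truncate_chars() and its parameters are hypothetical, not part of mbstring.

```php
<?php
/**
 * Hypothetical helper: shorten $text to at most $max characters,
 * appending $ellipsis when something was cut off.
 */
function truncate_chars(string $text, int $max, string $ellipsis = '…'): string
{
    if (mb_strlen($text, 'UTF-8') <= $max) {
        return $text;
    }
    // Reserve room for the ellipsis so the result stays within $max
    // (assuming $max is at least as long as the ellipsis).
    $keep = max(0, $max - mb_strlen($ellipsis, 'UTF-8'));
    return mb_substr($text, 0, $keep, 'UTF-8') . $ellipsis;
}

echo truncate_chars("日本語テスト", 4), "\n";   // 日本語…
echo truncate_chars("café", 10), "\n";          // café
```

The same function written with strlen() and substr() would count three bytes per Japanese character and could return a string that ends in half a character.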

Setting a Default Encoding

Rather than passing 'UTF-8' to every function call, set the internal encoding globally:

mb_internal_encoding('UTF-8');

// Now you can omit the encoding parameter
echo mb_strlen("café");     // 4
echo mb_strtoupper("café"); // "CAFÉ"

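In php.ini, set default_charset rather than mbstring.internal_encoding. The mbstring-specific directive has been deprecated since PHP 5.6, and mbstring falls back to default_charset (which itself defaults to UTF-8) when it is unset:

default_charset = "UTF-8"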

PHP 8.x Unicode Improvements

Releases from PHP 7.4 through 8.4 brought several improvements for Unicode handling:

mb_str_split() (PHP 7.4+)

Splits a string into an array of characters, respecting multi-byte encoding:

$chars = mb_str_split("café🐍", 1, 'UTF-8');
// ["c", "a", "f", "é", "🐍"]
echo count($chars);  // 5

mb_str_pad() (PHP 8.3+)

A multi-byte-safe version of str_pad():

echo mb_str_pad("日本", 10, "*");
// "日本********" (10 characters in total; str_pad() would count the 6 bytes and add only 4 asterisks)
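
For projects that still run on PHP versions before 8.3, similar behavior can be approximated with mb_strlen() and mb_substr(). The sketch below is a hypothetical right-padding fallback named pad_right_chars(); it is not the implementation of mb_str_pad() and covers only the pad-right case.

```php
<?php
/**
 * Hypothetical fallback for PHP versions before 8.3: right-pad $text
 * to $length characters (not bytes) using $pad.
 */
function pad_right_chars(string $text, int $length, string $pad = ' '): string
{
    $missing = $length - mb_strlen($text, 'UTF-8');
    if ($missing <= 0 || $pad === '') {
        return $text;
    }
    // Repeat the pad string enough times, then trim to the exact size.
    $repeats = intdiv($missing, mb_strlen($pad, 'UTF-8')) + 1;
    $padding = mb_substr(str_repeat($pad, $repeats), 0, $missing, 'UTF-8');
    return $text . $padding;
}

echo pad_right_chars("日本", 10, "*"), "\n";   // 日本********
echo str_pad("日本", 10, "*"), "\n";           // 日本**** (str_pad sees 6 bytes)
```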

Improved mb_detect_encoding() (PHP 8.1+)

The encoding detection logic was rewritten in PHP 8.1 to be more accurate and less prone to false positives.

The intl Extension

The intl extension (built on ICU) provides advanced Unicode features that mbstring lacks:

// Locale-aware collation
$collator = new Collator('de_DE');
$words = ["Äpfel", "Orangen", "Bananen"];
$collator->sort($words);
// ["Äpfel", "Bananen", "Orangen"] — German alphabetical order

// Unicode normalization
$nfc = Normalizer::normalize("e\u{0301}", Normalizer::FORM_C);
echo $nfc;           // "é" — single precomposed character
echo mb_strlen($nfc); // 1

// Grapheme-aware operations
$family = "👨‍👩‍👧";
echo grapheme_strlen($family);    // 1 — one grapheme cluster
echo mb_strlen($family);          // 5 — five code points
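
Normalization matters most when comparing strings, since two visually identical strings can differ byte for byte. The following sketch assumes the intl extension is loaded; same_text() is a hypothetical helper name used only for illustration.

```php
<?php
// Requires the intl extension.
$precomposed = "\u{00E9}";      // é as one code point
$decomposed  = "e\u{0301}";     // e followed by a combining acute accent

$rawEqual = ($precomposed === $decomposed);   // false: different bytes

/** Hypothetical helper: compare two strings after NFC normalization. */
function same_text(string $a, string $b): bool
{
    return Normalizer::normalize($a, Normalizer::FORM_C)
        === Normalizer::normalize($b, Normalizer::FORM_C);
}

$normalizedEqual = same_text($precomposed, $decomposed);   // true

// grapheme_substr() cuts on user-perceived characters, so a
// combining sequence is never split.
$firstTwo = grapheme_substr("e\u{0301}a\u{0308}z", 0, 2);  // "e\u{0301}a\u{0308}"
```

Normalizing once at the input boundary, rather than at every comparison, is usually the simpler design if the application controls where text enters.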

Encoding Conversion

Converting between encodings is essential when processing data from legacy systems, databases, or external APIs:

// Convert Shift-JIS to UTF-8
$utf8 = mb_convert_encoding($shiftjis_data, 'UTF-8', 'SJIS');

// Convert from multiple possible source encodings
$utf8 = mb_convert_encoding($data, 'UTF-8', 'UTF-8,SJIS,EUC-JP,ISO-8859-1');

// Convert to HTML entities (for safe embedding in ASCII contexts)
$safe = mb_encode_numericentity("日本語", [0x80, 0xFFFF, 0, 0xFFFF], 'UTF-8');
// "&#26085;&#26412;&#35486;"

Detecting Unknown Encodings

$encoding = mb_detect_encoding($data, ['UTF-8', 'SJIS', 'EUC-JP', 'ISO-8859-1'], true);
if ($encoding === false) {
    throw new RuntimeException("Unable to detect encoding");
}
$utf8 = mb_convert_encoding($data, 'UTF-8', $encoding);

The true parameter enables strict mode, reducing false positives.
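
Because detection is ultimately a guess, one alternative is to try a short list of candidate encodings in priority order and accept the first one in which the input is valid. The sketch below does this with mb_check_encoding(); the to_utf8() name and the default candidate list are assumptions for illustration. Single-byte encodings such as ISO-8859-1 accept any byte sequence, so they belong at the end of the list.

```php
<?php
/**
 * Hypothetical helper: return $data as UTF-8, trying each candidate
 * encoding in order and using the first one in which $data is valid.
 */
function to_utf8(string $data, array $candidates = ['UTF-8', 'ISO-8859-1']): string
{
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($data, $encoding)) {
            return $encoding === 'UTF-8'
                ? $data
                : mb_convert_encoding($data, 'UTF-8', $encoding);
        }
    }
    throw new RuntimeException('Input is not valid in any candidate encoding');
}

$latin1 = "caf\xE9";                  // "café" encoded as ISO-8859-1
echo bin2hex(to_utf8($latin1)), "\n"; // 636166c3a9
```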

Unicode Escape Sequences

PHP 7.0+ supports Unicode escape sequences in double-quoted strings:

echo "\u{0041}";        // "A"
echo "\u{00E9}";        // "é"
echo "\u{1F40D}";       // "🐍"
echo "\u{2192}";        // "→"

// Variable-length hex — 1 to 6 digits
echo "\u{41}";          // "A" — same as \u{0041}

These escapes work only in double-quoted strings and heredocs, not in single-quoted strings:

echo '\u{0041}';        // Literal: \u{0041} — no interpretation
echo "\u{0041}";        // "A" — interpreted as Unicode escape
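
Escape sequences are resolved when the script is compiled, so they only help with code points known in advance. For code points computed at runtime, mb_chr() and mb_ord() (available since PHP 7.2) do the conversion in both directions. The code_points() helper below is a hypothetical name for illustration.

```php
<?php
// mb_chr() and mb_ord() convert between code points and characters.
$snake = mb_chr(0x1F40D, 'UTF-8');          // same string as "\u{1F40D}"
$codePoint = mb_ord("é", 'UTF-8');          // 233, i.e. 0xE9

/** Hypothetical helper: format each character in U+XXXX notation. */
function code_points(string $text): array
{
    return array_map(
        fn (string $ch): string => sprintf('U+%04X', mb_ord($ch, 'UTF-8')),
        mb_str_split($text, 1, 'UTF-8')
    );
}

echo implode(' ', code_points("a→🐍")), "\n";   // U+0061 U+2192 U+1F40D
```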

Regular Expressions with Unicode

PHP's PCRE (preg_*) functions need the u modifier for UTF-8 support:

// Without /u — treats bytes as individual characters
preg_match('/./', "🐍");     // matches first BYTE, not the emoji

// With /u — treats the string as UTF-8 code points
preg_match('/./u', "🐍");    // matches the full emoji

// Unicode property escapes with /u
preg_match('/\p{L}+/u', "café");       // matches "café" — \p{L} = letter
preg_match('/\p{Emoji}+/u', "🐍🎉");  // matches the emoji sequence (needs PCRE2 10.40+, as bundled with PHP 8.2+)
preg_match('/\p{Han}+/u', "漢字");     // matches CJK characters
preg_match('/\p{Cyrillic}+/u', "Привет"); // matches Cyrillic text

Without the u modifier, \w, \d, and . operate on single bytes, which will silently corrupt multi-byte characters in your matches.
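
One consequence of the u modifier is that the subject string must be valid UTF-8: on malformed input the preg_* call fails instead of matching. The sketch below shows a hypothetical extract_words() helper that accounts for this, followed by a direct check of the error code.

```php
<?php
/**
 * Hypothetical helper: extract runs of letters from UTF-8 text.
 * Returns an empty array when the input is not valid UTF-8.
 */
function extract_words(string $text): array
{
    $count = preg_match_all('/\p{L}+/u', $text, $matches);
    if ($count === false) {
        // With /u, malformed input makes the call fail rather than match.
        return [];
    }
    return $matches[0];
}

print_r(extract_words("naïve café, Привет!"));   // naïve, café, Привет

// Invalid UTF-8 under /u: preg_match() returns false and
// preg_last_error() reports PREG_BAD_UTF8_ERROR.
$result = preg_match('/./u', "\xFF");
$error  = preg_last_error();
```

Checking for false explicitly matters because a failed match and "no match" (0) are easy to confuse in a loose comparison.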

Common Pitfalls

1. Using Standard String Functions on UTF-8

// WRONG: substr() counts bytes, not characters
$first3 = substr("日本語テスト", 0, 3);
echo $first3;  // "日" only, because each of these characters is 3 bytes; a length of 4 would cut "本" mid-sequence

// CORRECT
$first3 = mb_substr("日本語テスト", 0, 3, 'UTF-8');
echo $first3;  // "日本語"

2. JSON Encoding

json_encode() requires valid UTF-8 input and, by default, escapes every non-ASCII character as a \uXXXX sequence:

$data = ['name' => '日本語'];
echo json_encode($data);
// {"name":"\u65e5\u672c\u8a9e"} — escaped by default

echo json_encode($data, JSON_UNESCAPED_UNICODE);
// {"name":"日本語"} — human-readable

If the input contains invalid UTF-8, json_encode() returns false. Always validate encoding first:

if (!mb_check_encoding($input, 'UTF-8')) {
    $input = mb_convert_encoding($input, 'UTF-8', 'UTF-8'); // replaces invalid bytes with "?" (the default substitute character)
}
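
json_encode() also has flags of its own for dealing with malformed input. The sketch below compares the default behavior with JSON_INVALID_UTF8_IGNORE and JSON_INVALID_UTF8_SUBSTITUTE (both PHP 7.2+) and JSON_THROW_ON_ERROR (PHP 7.3+). Which one fits depends on whether silently altering the data is acceptable in your application.

```php
<?php
$bad = "a\xFFb";   // \xFF can never appear in valid UTF-8

// Default behavior: encoding fails and returns false.
$plain = json_encode($bad);
$reason = json_last_error_msg();

// Drop the invalid bytes...
$ignored = json_encode($bad, JSON_INVALID_UTF8_IGNORE);         // "ab" (with quotes)

// ...or replace them with U+FFFD, the Unicode replacement character.
$substituted = json_encode($bad, JSON_INVALID_UTF8_SUBSTITUTE); // "a\ufffdb"

// Raise an exception instead of returning false.
try {
    json_encode($bad, JSON_THROW_ON_ERROR);
    $threw = false;
} catch (JsonException $e) {
    $threw = true;
}
```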

3. Database Connections

Always set the connection charset to utf8mb4 (not utf8 in MySQL, which is limited to 3-byte characters and cannot store emoji):

// PDO
$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8mb4', $user, $pass);

// MySQLi
$mysqli = new mysqli('localhost', $user, $pass, 'test');
$mysqli->set_charset('utf8mb4');

4. HTTP Headers

Always declare UTF-8 in your Content-Type header:

header('Content-Type: text/html; charset=UTF-8');

5. File BOM (Byte Order Mark)

Some editors prepend a UTF-8 BOM (\xEF\xBB\xBF) to files. This can break HTTP headers, JSON output, and XML parsing. Strip it when reading:

$content = file_get_contents('data.txt');
$content = preg_replace('/^\xEF\xBB\xBF/', '', $content);
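
The same check can be written without a regular expression. The sketch below uses str_starts_with(), which requires PHP 8.0 or later; strip_utf8_bom() is a hypothetical helper name.

```php
<?php
/** Hypothetical helper: remove a leading UTF-8 BOM, if present. */
function strip_utf8_bom(string $content): string
{
    $bom = "\xEF\xBB\xBF";
    // Byte-level functions are the right tool here: the BOM is a fixed
    // three-byte prefix, so no character awareness is needed.
    return str_starts_with($content, $bom)
        ? substr($content, strlen($bom))
        : $content;
}

echo strip_utf8_bom("\xEF\xBB\xBFhello"), "\n";   // hello
```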

Quick Reference

| Task | Code |
| --- | --- |
| Character count | mb_strlen($str, 'UTF-8') |
| Substring | mb_substr($str, $start, $len, 'UTF-8') |
| Uppercase | mb_strtoupper($str, 'UTF-8') |
| Lowercase | mb_strtolower($str, 'UTF-8') |
| Find position | mb_strpos($str, $needle, 0, 'UTF-8') |
| Split to chars | mb_str_split($str, 1, 'UTF-8') |
| Convert encoding | mb_convert_encoding($str, 'UTF-8', $from) |
| Unicode escape | "\u{1F40D}" → "🐍" |
| Regex with Unicode | preg_match('/\p{L}+/u', $str) |
| Normalize NFC | Normalizer::normalize($str, Normalizer::FORM_C) |
| Grapheme count | grapheme_strlen($str) |
| JSON with Unicode | json_encode($data, JSON_UNESCAPED_UNICODE) |

PHP's byte-oriented string design is a historical artifact, but the mbstring and intl extensions provide everything you need for robust Unicode handling. The critical rule: never use byte-oriented functions such as strlen(), substr(), or strtoupper() on user-supplied text when you mean characters. Use the mb_* equivalents for counting, slicing, and case conversion, declare UTF-8 in your HTTP headers and database connections, and validate encoding at your application's input boundaries.
