Unicode in PHP
PHP's built-in string functions operate on bytes rather than Unicode characters, which means strlen, substr, and similar functions produce wrong results for multibyte text without the mbstring extension. This guide explains how to handle Unicode correctly in PHP using mb_ functions, intl, and modern PHP practices.
PHP has a complicated relationship with Unicode. Unlike Python 3 or Swift, PHP
strings are not sequences of Unicode code points — they are raw byte sequences.
A string in PHP is literally a counted array of bytes with no inherent
encoding. This design dates back to PHP 3 (1998), long before UTF-8 became the
web's dominant encoding. Modern PHP (8.x) has excellent Unicode support, but only
if you use the right functions. This guide covers everything you need to handle
Unicode safely in PHP.
PHP Strings Are Byte Sequences
The single most important fact about PHP strings: strlen() counts bytes,
not characters.
$text = "café";
echo strlen($text); // 5 — not 4! The "é" is 2 bytes in UTF-8
echo $text[3]; // first byte of "é" — a broken character
$emoji = "🐍";
echo strlen($emoji); // 4 — the snake emoji is 4 bytes in UTF-8
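PHP string functions like substr(), strpos(), strtolower(), and
str_replace() all operate on raw bytes. On multi-byte UTF-8 text this fails
silently: substr() can slice through the middle of a character, strpos()
returns byte offsets rather than character offsets, and case conversion misses
or mangles non-ASCII letters. Whole-string replacement with str_replace() is
safe on valid UTF-8, because one character's byte sequence never appears inside
another's, but any offset or length arithmetic is still measured in bytes.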
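To see the gap between bytes and characters directly, it can help to print both views of the same string. The sketch below is illustrative only; `describe_utf8()` is a hypothetical helper, not part of PHP, and it assumes mbstring is loaded.

```php
<?php
// Hypothetical helper: summarise a UTF-8 string at the byte and code point level.
function describe_utf8(string $text): array
{
    return [
        'bytes'       => strlen($text),             // raw byte count
        'code_points' => mb_strlen($text, 'UTF-8'), // Unicode code points
        'hex'         => bin2hex($text),            // the underlying bytes in hex
    ];
}

// "é" is written as an escape so the example does not depend on editor normalization.
print_r(describe_utf8("caf\u{00E9}"));
// bytes => 5, code_points => 4, hex => 636166c3a9
```

The trailing `c3a9` in the hex dump is the two-byte UTF-8 encoding of "é", which is why the byte count is one higher than the character count.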
The mbstring Extension
PHP's primary Unicode support comes from the mbstring (multi-byte string)
extension. It ships with PHP but may need to be enabled in php.ini:
extension=mbstring
The mb_* functions mirror the standard string functions but are encoding-aware:
$text = "café";
// Byte-level (wrong for Unicode)
echo strlen($text); // 5
echo strtoupper($text); // "CAFé" — é is not uppercased
// Character-level (correct)
echo mb_strlen($text, 'UTF-8'); // 4
echo mb_strtoupper($text, 'UTF-8'); // "CAFÉ"
echo mb_substr($text, 0, 3, 'UTF-8'); // "caf"
echo mb_strpos($text, 'é', 0, 'UTF-8'); // 3
Key mb_* Functions
| Standard Function | mb_* Equivalent | Purpose |
|---|---|---|
| strlen() | mb_strlen() | Character count |
| strpos() | mb_strpos() | Find position |
| substr() | mb_substr() | Extract substring |
| strtolower() | mb_strtolower() | Lowercase |
| strtoupper() | mb_strtoupper() | Uppercase |
| substr_count() | mb_substr_count() | Count occurrences |
| str_split() | mb_str_split() | Split to array (PHP 7.4+) |
| (none) | mb_detect_encoding() | Guess encoding |
| (none) | mb_convert_encoding() | Convert between encodings |
Setting a Default Encoding
Rather than passing 'UTF-8' to every function call, set the internal encoding
globally:
mb_internal_encoding('UTF-8');
// Now you can omit the encoding parameter
echo mb_strlen("café"); // 4
echo mb_strtoupper("café"); // "CAFÉ"
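In php.ini, set default_charset rather than mbstring.internal_encoding, which has been deprecated since PHP 5.6. mbstring falls back to default_charset when no internal encoding is set:
default_charset = "UTF-8"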
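Because the effective encoding depends on server configuration, an application may want to verify it once at startup. This is a minimal sketch; `assert_utf8_runtime()` is a hypothetical name, and whether to throw or silently correct the setting is a design choice left to you.

```php
<?php
// Hypothetical bootstrap check: fail fast if the runtime is not set up for UTF-8.
function assert_utf8_runtime(): void
{
    if (!extension_loaded('mbstring')) {
        throw new RuntimeException('The mbstring extension is not loaded');
    }
    // Called without arguments, mb_internal_encoding() returns the current setting.
    if (strcasecmp(mb_internal_encoding(), 'UTF-8') !== 0) {
        mb_internal_encoding('UTF-8');
    }
}

assert_utf8_runtime();
echo mb_internal_encoding(), "\n"; // UTF-8
```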
PHP 8.x Unicode Improvements
PHP 7.4 and the 8.x releases brought several improvements for Unicode handling:
mb_str_split() (PHP 7.4+)
Splits a string into an array of characters, respecting multi-byte encoding:
$chars = mb_str_split("café🐍", 1, 'UTF-8');
// ["c", "a", "f", "é", "🐍"]
echo count($chars); // 5
mb_str_pad() (PHP 8.3+)
A multi-byte-safe version of str_pad():
echo mb_str_pad("日本", 10, "*");
// "日本********": padded to 10 characters, not 10 bytes
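On PHP versions before 8.3, similar behaviour can be approximated with mb_strlen() and mb_substr(). The following is a sketch under that assumption; `pad_right_chars()` is a hypothetical helper that only pads on the right, and it counts code points, not display width.

```php
<?php
// Hypothetical fallback: right-pad to a target length measured in characters.
function pad_right_chars(string $text, int $length, string $pad = ' '): string
{
    if ($pad === '') {
        throw new InvalidArgumentException('Pad string must not be empty');
    }
    if (function_exists('mb_str_pad')) {
        // PHP 8.3+: delegate to the built-in function.
        return mb_str_pad($text, $length, $pad, STR_PAD_RIGHT, 'UTF-8');
    }
    $missing = $length - mb_strlen($text, 'UTF-8');
    if ($missing <= 0) {
        return $text; // already long enough
    }
    // Repeat the pad string enough times, then trim to the exact character count.
    $repeats = (int) ceil($missing / mb_strlen($pad, 'UTF-8'));
    return $text . mb_substr(str_repeat($pad, $repeats), 0, $missing, 'UTF-8');
}

echo pad_right_chars("日本", 5, "*"), "\n"; // 日本***
```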
Improved mb_detect_encoding() (PHP 8.1+)
The encoding detection logic was rewritten in PHP 8.1 to be more accurate and less prone to false positives.
The intl Extension
The intl extension (built on ICU) provides advanced Unicode features that
mbstring lacks:
// Locale-aware collation
$collator = new Collator('de_DE');
$words = ["Äpfel", "Orangen", "Bananen"];
$collator->sort($words);
// ["Äpfel", "Bananen", "Orangen"] — German alphabetical order
// Unicode normalization
$nfc = Normalizer::normalize("e\u{0301}", Normalizer::FORM_C);
echo $nfc; // "é" — single precomposed character
echo mb_strlen($nfc); // 1
// Grapheme-aware operations
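$family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}"; // man, woman, girl joined by zero-width joiners (U+200D)
echo grapheme_strlen($family); // 1: one grapheme cluster
echo mb_strlen($family); // 5: three emoji plus two joiners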
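Normalization matters most when comparing strings, since the same visible text can arrive in composed or decomposed form. A minimal sketch, assuming the intl extension is loaded; `same_text()` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: compare two strings after normalising both to NFC.
// Assumes the intl extension is loaded (Normalizer is part of intl).
function same_text(string $a, string $b): bool
{
    $na = Normalizer::normalize($a, Normalizer::FORM_C);
    $nb = Normalizer::normalize($b, Normalizer::FORM_C);
    if ($na === false || $nb === false) {
        // normalize() returns false on invalid input; fall back to a byte comparison.
        return $a === $b;
    }
    return $na === $nb;
}

$precomposed = "caf\u{00E9}";  // é as a single code point
$decomposed  = "cafe\u{0301}"; // e followed by a combining acute accent

var_dump($precomposed === $decomposed);         // bool(false): different bytes
var_dump(same_text($precomposed, $decomposed)); // bool(true): same text after NFC
```

Normalizing once at the input boundary, rather than at every comparison, is usually the simpler design.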
Encoding Conversion
Converting between encodings is essential when processing data from legacy systems, databases, or external APIs:
// Convert Shift-JIS to UTF-8
$utf8 = mb_convert_encoding($shiftjis_data, 'UTF-8', 'SJIS');
// Convert from multiple possible source encodings
$utf8 = mb_convert_encoding($data, 'UTF-8', 'UTF-8,SJIS,EUC-JP,ISO-8859-1');
// Convert to HTML entities (for safe embedding in ASCII contexts)
$safe = mb_encode_numericentity("日本語", [0x80, 0xFFFF, 0, 0xFFFF], 'UTF-8');
// "&#26085;&#26412;&#35486;"
Detecting Unknown Encodings
$encoding = mb_detect_encoding($data, ['UTF-8', 'SJIS', 'EUC-JP', 'ISO-8859-1'], true);
if ($encoding === false) {
throw new RuntimeException("Unable to detect encoding");
}
$utf8 = mb_convert_encoding($data, 'UTF-8', $encoding);
The true parameter enables strict mode, reducing false positives.
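Detection and conversion can be combined into one function that leaves valid UTF-8 untouched. This is a sketch, not a definitive implementation: `to_utf8()` is a hypothetical helper, and the default candidate list is an assumption you should replace with the encodings you actually expect.

```php
<?php
// Hypothetical helper: return $data as UTF-8, checking for valid UTF-8 first.
// The candidate list is an assumption; detection is a guess, not a guarantee.
function to_utf8(string $data, array $candidates = ['SJIS', 'EUC-JP', 'ISO-8859-1']): string
{
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data; // already valid UTF-8, leave it alone
    }
    $encoding = mb_detect_encoding($data, $candidates, true);
    if ($encoding === false) {
        throw new RuntimeException('Unable to detect encoding');
    }
    return mb_convert_encoding($data, 'UTF-8', $encoding);
}

// "\xE9" is "é" in ISO-8859-1 and is not valid UTF-8 on its own.
echo to_utf8("caf\xE9", ['ISO-8859-1']), "\n"; // café
```

Checking for valid UTF-8 before detecting avoids re-encoding text that is already correct, which is a common source of double-encoded output.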
Unicode Escape Sequences
PHP 7.0+ supports Unicode escape sequences in double-quoted strings:
echo "\u{0041}"; // "A"
echo "\u{00E9}"; // "é"
echo "\u{1F40D}"; // "🐍"
echo "\u{2192}"; // "→"
// Variable-length hex — 1 to 6 digits
echo "\u{41}"; // "A" — same as \u{0041}
These escapes work only in double-quoted strings and heredocs, not in single-quoted strings:
echo '\u{0041}'; // Literal: \u{0041} — no interpretation
echo "\u{0041}"; // "A" — interpreted as Unicode escape
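Escape sequences are resolved when the script is compiled, so they only cover code points known in advance. For code points computed at runtime, mbstring offers mb_chr() and mb_ord() (PHP 7.2+). A short sketch; `code_point_label()` is a hypothetical helper.

```php
<?php
// Hypothetical helper: format the first character of a string in U+XXXX notation.
function code_point_label(string $char): string
{
    return sprintf('U+%04X', mb_ord($char, 'UTF-8'));
}

$snake = mb_chr(0x1F40D, 'UTF-8');       // build a character from a runtime code point
echo code_point_label($snake), "\n";     // U+1F40D
echo code_point_label("\u{00E9}"), "\n"; // U+00E9
```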
Regular Expressions with Unicode
PHP's PCRE (preg_*) functions need the u modifier for UTF-8 support:
// Without /u — treats bytes as individual characters
preg_match('/./', "🐍"); // matches first BYTE, not the emoji
// With /u — treats the string as UTF-8 code points
preg_match('/./u', "🐍"); // matches the full emoji
// Unicode property escapes with /u
preg_match('/\p{L}+/u', "café"); // matches "café" — \p{L} = letter
preg_match('/\p{Emoji}+/u', "🐍🎉"); // matches the emoji sequence; needs a recent PCRE2 (10.40 or later), older builds reject \p{Emoji}
preg_match('/\p{Han}+/u', "漢字"); // matches CJK characters
preg_match('/\p{Cyrillic}+/u', "Привет"); // matches Cyrillic text
Without the u modifier, \w, \d, and . operate on single bytes, which
will silently corrupt multi-byte characters in your matches.
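One side effect of the u modifier is that PCRE validates the subject string, so malformed UTF-8 makes the call return false instead of a match count. The sketch below surfaces that failure; `count_letters()` is a hypothetical helper, and preg_last_error_msg() requires PHP 8.0+.

```php
<?php
// Hypothetical helper: count letters, reporting invalid UTF-8 instead of failing silently.
function count_letters(string $text): int
{
    $count = preg_match_all('/\p{L}/u', $text);
    if ($count === false) {
        // With /u, a malformed UTF-8 subject makes the preg_* call fail.
        throw new InvalidArgumentException(preg_last_error_msg());
    }
    return $count;
}

echo count_letters("caf\u{00E9} 123"), "\n"; // 4

try {
    count_letters("abc\xFF"); // \xFF can never appear in valid UTF-8
} catch (InvalidArgumentException $e) {
    echo "Rejected: ", $e->getMessage(), "\n";
}
```

Checking for false matters because a bare `if (preg_match(...))` treats an encoding failure the same as "no match".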
Common Pitfalls
1. Using Standard String Functions on UTF-8
// WRONG — corrupts multi-byte characters
$first3 = substr("日本語テスト", 0, 3);
echo $first3; // "日" only: each character is 3 bytes, and any length not divisible by 3 would cut mid-character
// CORRECT
$first3 = mb_substr("日本語テスト", 0, 3, 'UTF-8');
echo $first3; // "日本語"
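Sometimes the limit really is in bytes, for example a fixed-size database column or protocol field. mbstring's mb_strcut() cuts by byte count but backs off to the nearest character boundary. A minimal sketch; `truncate_to_bytes()` is a hypothetical wrapper.

```php
<?php
// Hypothetical helper: fit text into a byte budget without leaving
// half a character at the end.
function truncate_to_bytes(string $text, int $maxBytes): string
{
    return mb_strcut($text, 0, $maxBytes, 'UTF-8');
}

echo truncate_to_bytes("日本語テスト", 4), "\n";          // 日 (3 bytes; the next character would not fit)
echo strlen(truncate_to_bytes("日本語テスト", 10)), "\n"; // 9
```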
2. JSON Encoding
json_encode() escapes non-ASCII characters as \uXXXX sequences by default and
requires valid UTF-8 input:
$data = ['name' => '日本語'];
echo json_encode($data);
// {"name":"\u65e5\u672c\u8a9e"} — escaped by default
echo json_encode($data, JSON_UNESCAPED_UNICODE);
// {"name":"日本語"} — human-readable
If the input contains invalid UTF-8, json_encode() returns false. Always
validate encoding first:
if (!mb_check_encoding($input, 'UTF-8')) {
$input = mb_convert_encoding($input, 'UTF-8', 'UTF-8'); // replaces invalid bytes with the substitute character ("?" by default)
}
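As an alternative to pre-validating, json_encode() accepts flags that change how failures are handled: JSON_THROW_ON_ERROR (PHP 7.3+) and JSON_INVALID_UTF8_SUBSTITUTE (PHP 7.2+). The sketch below combines them; `to_json()` is a hypothetical wrapper, and the `mixed` type requires PHP 8.0+.

```php
<?php
// Hypothetical helper: encode to JSON, throwing on failure instead of returning
// false, and replacing invalid UTF-8 bytes with U+FFFD.
function to_json(mixed $data): string
{
    return json_encode(
        $data,
        JSON_THROW_ON_ERROR | JSON_UNESCAPED_UNICODE | JSON_INVALID_UTF8_SUBSTITUTE
    );
}

echo to_json(['name' => "日本語"]), "\n"; // {"name":"日本語"}

// The invalid byte \xFF is replaced by U+FFFD rather than aborting the encode.
$decoded = json_decode(to_json(['name' => "abc\xFF"]), true);
var_dump($decoded['name'] === "abc\u{FFFD}"); // bool(true)
```

Substitution keeps the request alive but hides bad input, so logging when it happens is worth considering.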
3. Database Connections
Always set the connection charset to utf8mb4 (not utf8 in MySQL, which is
limited to 3-byte characters and cannot store emoji):
// PDO
$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8mb4', $user, $pass);
// MySQLi
$mysqli = new mysqli('localhost', $user, $pass, 'test');
$mysqli->set_charset('utf8mb4');
4. HTTP Headers
Always declare UTF-8 in your Content-Type header:
header('Content-Type: text/html; charset=UTF-8');
5. File BOM (Byte Order Mark)
Some editors prepend a UTF-8 BOM (\xEF\xBB\xBF) to files. This can break
HTTP headers, JSON output, and XML parsing. Strip it when reading:
$content = file_get_contents('data.txt');
$content = preg_replace('/^\xEF\xBB\xBF/', '', $content);
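The same check can be written without the regex engine. This is a sketch assuming PHP 8.0+ for str_starts_with(); `strip_utf8_bom()` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: remove a leading UTF-8 BOM using plain byte functions.
// Byte-level functions are appropriate here because the BOM is a fixed byte sequence.
function strip_utf8_bom(string $content): string
{
    $bom = "\xEF\xBB\xBF";
    return str_starts_with($content, $bom) ? substr($content, 3) : $content;
}

var_dump(strip_utf8_bom("\xEF\xBB\xBFhello")); // string(5) "hello"
var_dump(strip_utf8_bom("hello"));             // string(5) "hello"
```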
Quick Reference
| Task | Code |
|---|---|
| Character count | mb_strlen($str, 'UTF-8') |
| Substring | mb_substr($str, $start, $len, 'UTF-8') |
| Uppercase | mb_strtoupper($str, 'UTF-8') |
| Lowercase | mb_strtolower($str, 'UTF-8') |
| Find position | mb_strpos($str, $needle, 0, 'UTF-8') |
| Split to chars | mb_str_split($str, 1, 'UTF-8') |
| Convert encoding | mb_convert_encoding($str, 'UTF-8', $from) |
| Unicode escape | "\u{1F40D}" → "🐍" |
| Regex with Unicode | preg_match('/\p{L}+/u', $str) |
| Normalize NFC | Normalizer::normalize($str, Normalizer::FORM_C) |
| Grapheme count | grapheme_strlen($str) |
| JSON with Unicode | json_encode($data, JSON_UNESCAPED_UNICODE) |
PHP's byte-oriented string design is a historical artifact, but the mbstring
and intl extensions provide everything you need for robust Unicode handling.
The critical rule: never use byte-oriented string functions for character-level work on user-supplied text.
Always use mb_* equivalents, always declare UTF-8 in your HTTP headers and
database connections, and always validate encoding at your application's input
boundaries.