Unicode in C/C++
C and C++ have a complicated relationship with Unicode. Both languages predate
the Unicode standard, and their original character types -- char (typically
8 bits) and wchar_t (platform-dependent width) -- were designed for a world
of single-byte encodings. Over the past two decades, both standards have added
new types and literals for Unicode, but the ecosystem remains fragmented.
Understanding the current state of Unicode in C and C++ requires knowing which
features are available in which standard version, and which libraries fill the
remaining gaps.
The Legacy: char and wchar_t
char
The char type is at least 8 bits. It holds a single byte, which means it can
represent one ASCII character or one byte of a multi-byte encoding like UTF-8:
char ascii = 'A'; // 0x41
char utf8[] = "café"; // 6 bytes: 63 61 66 C3 A9 00
printf("%zu\n", strlen(utf8)); // 5 (bytes, not characters)
The C standard says nothing about the encoding of char strings. On modern
systems, char strings are typically UTF-8, but this is a convention, not a
guarantee.
wchar_t
wchar_t was intended as a "wide character" type large enough to hold any
character in the system's character set. Unfortunately, its size varies by
platform:
| Platform | wchar_t Size | Encoding |
|---|---|---|
| Linux / macOS | 4 bytes | UTF-32 |
| Windows | 2 bytes | UTF-16 |
This makes wchar_t non-portable for Unicode work. A wchar_t on Windows
cannot hold a supplementary character (U+10000 and above) in a single value,
while on Linux it can. Code that uses wchar_t behaves differently on each
platform.
wchar_t w = L'A'; // wide character literal
wchar_t ws[] = L"Hello"; // wide string literal
Modern Unicode Types (C11 / C++11 and Later)
char16_t and char32_t
C11 and C++11 introduced two new character types with fixed, portable widths:
| Type | Size | Encoding | Literal Prefix |
|---|---|---|---|
| char16_t | 16 bits | UTF-16 | u |
| char32_t | 32 bits | UTF-32 | U |
#include <uchar.h> // C11
// or in C++:
// #include <cuchar>
char16_t utf16_char = u'A'; // U+0041
char32_t utf32_char = U'😀'; // U+1F600
char16_t utf16_str[] = u"Hello, 世界";
char32_t utf32_str[] = U"Hello, 世界";
char32_t can always hold a complete Unicode code point. char16_t may
require a surrogate pair for supplementary characters, just like Java's char.
char8_t (C++20)
C++20 added char8_t, a distinct type for UTF-8 code units:
// C++20
char8_t c = u8'A'; // UTF-8 code unit
const char8_t* s = u8"Hello, 世界"; // UTF-8 string
Before C++20, u8"..." literals produced const char*. The introduction of
char8_t as a distinct type broke backward compatibility -- existing code using
u8 strings with std::string stopped compiling. This remains controversial.
String Literal Prefixes Summary
| Prefix | Type | Encoding | Standard |
|---|---|---|---|
| none | const char* | Implementation-defined (usually UTF-8) | C89 / C++98 |
| L | const wchar_t* | Implementation-defined | C89 / C++98 |
| u8 | const char* (pre-C++20) / const char8_t* (C++20) | UTF-8 | C11 / C++11 |
| u | const char16_t* | UTF-16 | C11 / C++11 |
| U | const char32_t* | UTF-32 | C11 / C++11 |
Unicode Escapes
Both C and C++ support Unicode escape sequences in string and character literals:
// Universal Character Names (UCN)
char euro[] = "\u20AC"; // € (4-digit hex)
char emoji[] = "\U0001F600"; // 😀 (8-digit hex)
char16_t ch16 = u'\u03C3'; // σ (U+03C3)
char32_t ch32 = U'\U0001F3B5'; // 🎵 (U+1F3B5)
The \u form requires exactly 4 hex digits (BMP only). The \U form
requires exactly 8 hex digits (full Unicode range).
Named Character Escapes (C++23)
C++23 introduced named character escapes using the Unicode character name:
// C++23
char32_t snowman = U'\N{SNOWMAN}'; // U+2603 ☃
const char* s = "\N{GREEK SMALL LETTER SIGMA}"; // σ
This dramatically improves readability for code that embeds special characters.
Working with UTF-8 in C
The most practical approach to Unicode in C is to use plain char strings
with UTF-8 encoding. Most modern C libraries, POSIX APIs, and network
protocols use UTF-8 natively.
String Length: Bytes vs. Characters
#include <string.h>
#include <stdio.h>
const char* s = "café";
printf("byte length: %zu\n", strlen(s)); // 5 (includes 2-byte é)
// To count UTF-8 code points, you need a helper function:
size_t utf8_strlen(const char* s) {
size_t count = 0;
while (*s) {
// Count bytes that are NOT continuation bytes (0x80-0xBF)
if ((*s & 0xC0) != 0x80) count++;
s++;
}
return count;
}
printf("char count: %zu\n", utf8_strlen("café")); // 4
Iterating Over UTF-8 Code Points
#include <stdint.h>
// Decode one UTF-8 code point, advance the pointer
uint32_t utf8_decode(const char** p) {
const unsigned char* s = (const unsigned char*)*p;
uint32_t cp;
int bytes;
if (s[0] < 0x80) {
cp = s[0]; bytes = 1;
} else if ((s[0] & 0xE0) == 0xC0) {
cp = s[0] & 0x1F; bytes = 2;
} else if ((s[0] & 0xF0) == 0xE0) {
cp = s[0] & 0x0F; bytes = 3;
} else {
cp = s[0] & 0x07; bytes = 4;
}
for (int i = 1; i < bytes; i++) {
cp = (cp << 6) | (s[i] & 0x3F);
}
*p += bytes;
return cp;
}
In practice, use a library like ICU or utf8proc rather than hand-rolling UTF-8 decoding.
Working with UTF-8 in C++
std::string with UTF-8
std::string is a container of char bytes with no encoding awareness.
You can store UTF-8 in it, but size(), operator[], and iterators all
operate on bytes:
#include <string>
std::string s = "café"; // UTF-8 bytes in a std::string (in C++20, u8"..." yields char8_t and no longer converts)
s.size(); // 5 (bytes)
s[3]; // 0xC3 (first byte of é, NOT the character)
std::u8string (C++20)
C++20 provides std::u8string (backed by char8_t) as a type-safe
UTF-8 string, but its utility is limited because the standard library
offers no Unicode-aware operations on it:
std::u8string s = u8"café";
s.size(); // 5 (still bytes)
// No built-in way to count code points or iterate characters
std::u16string and std::u32string
std::u16string s16 = u"Hello, 世界"; // UTF-16
std::u32string s32 = U"Hello, 世界"; // UTF-32
s32.size(); // 9 (code points -- finally a useful character count!)
s32[8]; // U'界' (U+754C)
std::u32string is the only standard string type where size() equals the
code point count.
Conversion Between Encodings
C11: mbrtoc16 / mbrtoc32
C11 provides functions to convert between multi-byte (typically UTF-8) and
char16_t / char32_t:
#include <uchar.h>
#include <string.h>
#include <locale.h>
setlocale(LC_ALL, "en_US.UTF-8");
const char* utf8 = "é";
char32_t cp;
mbstate_t state = {0};
mbrtoc32(&cp, utf8, strlen(utf8), &state);
// cp = 0x00E9 (U+00E9)
C++11: std::codecvt (Deprecated in C++17)
The <codecvt> header was deprecated in C++17 and removed in C++26 because
it had design flaws and poor platform support:
// DEPRECATED -- do not use in new code
#include <codecvt>
#include <locale>
std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
std::u32string u32 = converter.from_bytes("café");
There is no drop-in standard replacement; use a third-party library such as ICU or simdutf for conversions.
Libraries for Unicode in C/C++
The standard library provides only basic building blocks. For serious Unicode work, use one of these libraries:
| Library | Features | License |
|---|---|---|
| ICU (International Components for Unicode) | Full Unicode support: normalization, collation, regex, BiDi, transliteration | ICU License (permissive) |
| utf8proc (Julia project) | Lightweight: normalization, category lookup, case folding | MIT |
| simdutf | Blazing-fast UTF-8/UTF-16/UTF-32 validation and conversion | Apache-2.0 / MIT |
| utfcpp | Header-only UTF-8 iteration and validation for C++ | BSL-1.0 |
ICU Example
#include <unicode/unistr.h> // ICU C++ API
#include <unicode/normlzr.h>
icu::UnicodeString s = icu::UnicodeString::fromUTF8("café");
s.length(); // 4 (code units / BMP code points)
// Normalization
UErrorCode err = U_ZERO_ERROR;
icu::UnicodeString nfc;
icu::Normalizer::normalize(s, UNORM_NFC, 0, nfc, err);
// Case conversion with locale
s.toUpper(icu::Locale("de")); // locale-aware uppercase
Compiler and Source Encoding
Modern compilers assume UTF-8 source files by default:
| Compiler | Default Source Encoding | Flag to Set |
|---|---|---|
| GCC | UTF-8 | -finput-charset=UTF-8 |
| Clang | UTF-8 | -finput-charset=UTF-8 |
| MSVC | System locale (not UTF-8!) | /utf-8 (sets both source and execution charset) |
Always pass /utf-8 to MSVC to avoid encoding issues on Windows. Without
it, string literals may be interpreted as the system's legacy code page.
# GCC / Clang (usually unnecessary, but explicit)
g++ -finput-charset=UTF-8 -fexec-charset=UTF-8 main.cpp
# MSVC (essential)
cl /utf-8 main.cpp
Common Pitfalls
1. strlen() Counts Bytes, Not Characters
strlen("café"); // 5, not 4
strlen("日本語"); // 9, not 3
strlen("😀"); // 4, not 1
2. wchar_t Is Not Portable
Code that uses wchar_t for "Unicode support" will behave differently on
Windows (2-byte UTF-16) and Linux (4-byte UTF-32).
3. Splitting UTF-8 Strings at Arbitrary Byte Offsets
char s[] = "café";
s[4] = '\0'; // Splits the 2-byte é → invalid UTF-8
4. Assuming One Byte Per Character
Any code that uses char indexing to access "the nth character" is broken for
non-ASCII UTF-8 text.
Quick Reference
| Task | C | C++ |
|---|---|---|
| UTF-8 literal | "text" or u8"text" | "text" or u8"text" |
| UTF-16 literal | u"text" | u"text" |
| UTF-32 literal | U"text" | U"text" |
| UTF-8 char type | char | char8_t (C++20) |
| Byte length | strlen(s) | s.size() |
| Code point type | char32_t | char32_t |
| Normalize | ICU / utf8proc | ICU / utf8proc |
| Validate UTF-8 | Manual / simdutf | Manual / simdutf |
| MSVC flag | /utf-8 | /utf-8 |
Unicode support in C and C++ has improved dramatically since C11/C++11, but it
remains a story of building blocks rather than a complete solution. The standard
provides the types and literals (char8_t, char16_t, char32_t, u8/u/U
prefixes, named escapes in C++23), but full-featured Unicode processing --
normalization, segmentation, collation, BiDi -- still requires external libraries
like ICU. The single most impactful step you can take is to standardize on UTF-8
for all text, pass /utf-8 to MSVC, and use a well-tested library for anything
beyond basic string storage.