💻 Unicode in Code

Unicode in C/C++

C and C++ have historically poor Unicode support, with char being a single byte and wchar_t having platform-dependent width, leading to decades of encoding bugs. This guide covers modern approaches to Unicode in C/C++, including char8_t, char16_t, char32_t, and the ICU library.


C and C++ have a complicated relationship with Unicode. Both languages predate the Unicode standard, and their original character types -- char (typically 8 bits) and wchar_t (platform-dependent width) -- were designed for a world of single-byte encodings. Over the past two decades, both standards have added new types and literals for Unicode, but the ecosystem remains fragmented. Understanding the current state of Unicode in C and C++ requires knowing which features are available in which standard version, and which libraries fill the remaining gaps.

The Legacy: char and wchar_t

char

The char type is at least 8 bits. It holds a single byte, which means it can represent one ASCII character or one byte of a multi-byte encoding like UTF-8:

char ascii = 'A';            // 0x41
char utf8[] = "café";        // 6 bytes: 63 61 66 C3 A9 00
printf("%zu\n", strlen(utf8)); // 5 (bytes, not characters)

The C standard says nothing about the encoding of char strings. On modern systems, char strings are typically UTF-8, but this is a convention, not a guarantee.

wchar_t

wchar_t was intended as a "wide character" type large enough to hold any character in the system's character set. Unfortunately, its size varies by platform:

Platform        wchar_t Size   Encoding
Linux / macOS   4 bytes        UTF-32
Windows         2 bytes        UTF-16

This makes wchar_t non-portable for Unicode work. A wchar_t on Windows cannot hold a supplementary character (U+10000 and above) in a single value, while on Linux it can. Code that uses wchar_t behaves differently on each platform.

wchar_t w = L'A';               // wide character literal
wchar_t ws[] = L"Hello";        // wide string literal

Modern Unicode Types (C11 / C++11 and Later)

char16_t and char32_t

C11 and C++11 introduced two new character types with fixed, portable widths:

Type       Size      Encoding   Literal Prefix
char16_t   16 bits   UTF-16     u
char32_t   32 bits   UTF-32     U

#include <uchar.h>       // C11
// or in C++:
// #include <cuchar>

char16_t utf16_char = u'A';               // U+0041
char32_t utf32_char = U'😀';              // U+1F600

char16_t utf16_str[] = u"Hello, 世界";
char32_t utf32_str[] = U"Hello, 世界";

char32_t can always hold a complete Unicode code point. char16_t may require a surrogate pair for supplementary characters, just like Java's char.

char8_t (C++20)

C++20 added char8_t, a distinct type for UTF-8 code units:

// C++20
char8_t c = u8'A';                        // UTF-8 code unit
const char8_t* s = u8"Hello, 世界";       // UTF-8 string

Before C++20, u8"..." literals produced const char*. The introduction of char8_t as a distinct type broke backward compatibility -- existing code using u8 strings with std::string stopped compiling. This remains controversial.

String Literal Prefixes Summary

Prefix   Type                                               Encoding                                 Standard
(none)   const char*                                        Implementation-defined (usually UTF-8)   C89 / C++98
L        const wchar_t*                                     Implementation-defined                   C89 / C++98
u8       const char* (pre-C++20) / const char8_t* (C++20)   UTF-8                                    C11 / C++11
u        const char16_t*                                    UTF-16                                   C11 / C++11
U        const char32_t*                                    UTF-32                                   C11 / C++11

Unicode Escapes

Both C and C++ support Unicode escape sequences in string and character literals:

// Universal Character Names (UCN)
char euro[]    = "\u20AC";            // € (4-digit hex)
char emoji[]   = "\U0001F600";        // 😀 (8-digit hex)
char16_t ch16  = u'\u03C3';          // σ (U+03C3)
char32_t ch32  = U'\U0001F3B5';      // 🎵 (U+1F3B5)

The \u form requires exactly 4 hex digits (BMP only). The \U form requires exactly 8 hex digits (full Unicode range).

Named Character Escapes (C++23)

C++23 introduced named character escapes using the Unicode character name:

// C++23
char32_t snowman = U'\N{SNOWMAN}';            // U+2603 ☃
const char* s = "\N{GREEK SMALL LETTER SIGMA}"; // σ

This dramatically improves readability for code that embeds special characters.

Working with UTF-8 in C

The most practical approach to Unicode in C is to use plain char strings with UTF-8 encoding. Most modern C libraries, POSIX APIs, and network protocols use UTF-8 natively.

String Length: Bytes vs. Characters

#include <string.h>
#include <stdio.h>

const char* s = "café";
printf("byte length: %zu\n", strlen(s));  // 5 (includes 2-byte é)

// To count UTF-8 code points, you need a helper function:
size_t utf8_strlen(const char* s) {
    size_t count = 0;
    while (*s) {
        // Count bytes that are NOT continuation bytes (0x80-0xBF).
        // Cast to unsigned char so the mask works even when char is signed.
        if (((unsigned char)*s & 0xC0) != 0x80) count++;
        s++;
    }
    return count;
}

printf("char count: %zu\n", utf8_strlen("café"));  // 4

Iterating Over UTF-8 Code Points

#include <stdint.h>

// Decode one UTF-8 code point and advance the pointer.
// NOTE: no validation -- assumes well-formed UTF-8 (real code must
// reject bad continuation bytes, overlong forms, and surrogates).
uint32_t utf8_decode(const char** p) {
    const unsigned char* s = (const unsigned char*)*p;
    uint32_t cp;
    int bytes;

    if (s[0] < 0x80) {
        cp = s[0]; bytes = 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        cp = s[0] & 0x1F; bytes = 2;
    } else if ((s[0] & 0xF0) == 0xE0) {
        cp = s[0] & 0x0F; bytes = 3;
    } else {
        cp = s[0] & 0x07; bytes = 4;
    }
    for (int i = 1; i < bytes; i++) {
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    *p += bytes;
    return cp;
}

In practice, use a library like ICU or utf8proc rather than hand-rolling UTF-8 decoding.

Working with UTF-8 in C++

std::string with UTF-8

std::string is a container of char bytes with no encoding awareness. You can store UTF-8 in it, but size(), operator[], and iterators all operate on bytes:

#include <string>

std::string s = "café";            // UTF-8 bytes in a std::string
                                   // (in C++20, u8"café" no longer converts to std::string)
s.size();                          // 5 (bytes)
s[3];                              // 0xC3 (first byte of é, NOT the character)

std::u8string (C++20)

C++20 provides std::u8string (backed by char8_t) as a type-safe UTF-8 string, but its utility is limited because the standard library offers no Unicode-aware operations on it:

std::u8string s = u8"café";
s.size();                          // 5 (still bytes)
// No built-in way to count code points or iterate characters

std::u16string and std::u32string

std::u16string s16 = u"Hello, 世界";    // UTF-16
std::u32string s32 = U"Hello, 世界";    // UTF-32

s32.size();   // 9 (code points -- finally a useful character count!)
s32[8];       // U'界' (U+754C)

std::u32string is the only standard string type where size() equals the code point count.

Conversion Between Encodings

C11: mbrtoc16 / mbrtoc32

C11 provides functions to convert between multi-byte (typically UTF-8) and char16_t / char32_t:

#include <uchar.h>
#include <locale.h>

setlocale(LC_ALL, "en_US.UTF-8");

const char* utf8 = "é";
char32_t cp;
mbstate_t state = {0};
mbrtoc32(&cp, utf8, strlen(utf8), &state);
// cp = 0x00E9 (U+00E9)

C++11: std::codecvt (Deprecated in C++17)

The <codecvt> header was deprecated in C++17 and removed in C++26 because it had design flaws and poor platform support:

// DEPRECATED -- do not use in new code
#include <codecvt>
#include <locale>

std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
std::u32string u32 = converter.from_bytes("café");

There is no drop-in standard replacement; use a third-party library such as ICU, simdutf, or utfcpp for encoding conversions.

Libraries for Unicode in C/C++

The standard library provides only basic building blocks. For serious Unicode work, use one of these libraries:

Library                                      Features                                                                       License
ICU (International Components for Unicode)   Full Unicode support: normalization, collation, regex, BiDi, transliteration   ICU License (permissive)
utf8proc (Julia project)                     Lightweight: normalization, category lookup, case folding                      MIT
simdutf                                      Blazing-fast UTF-8/UTF-16/UTF-32 validation and conversion                     Apache-2.0 / MIT
utfcpp                                       Header-only UTF-8 iteration and validation for C++                             BSL-1.0

ICU Example

#include <unicode/unistr.h>    // ICU C++ API
#include <unicode/normlzr.h>
#include <unicode/locid.h>

icu::UnicodeString s = icu::UnicodeString::fromUTF8("café");
s.length();                    // 4 (UTF-16 code units)

// Normalization (legacy Normalizer API; new code can use icu::Normalizer2)
UErrorCode err = U_ZERO_ERROR;
icu::UnicodeString nfc;
icu::Normalizer::normalize(s, UNORM_NFC, 0, nfc, err);

// Case conversion with locale
s.toUpper(icu::Locale("de"));  // locale-aware uppercase

Compiler and Source Encoding

Modern compilers assume UTF-8 source files by default:

Compiler   Default Source Encoding      Flag to Set
GCC        UTF-8                        -finput-charset=UTF-8
Clang      UTF-8                        -finput-charset=UTF-8
MSVC       System locale (not UTF-8!)   /utf-8 (sets both source and execution charset)

Always pass /utf-8 to MSVC to avoid encoding issues on Windows. Without it, string literals may be interpreted as the system's legacy code page.

# GCC / Clang (usually unnecessary, but explicit)
g++ -finput-charset=UTF-8 -fexec-charset=UTF-8 main.cpp

# MSVC (essential)
cl /utf-8 main.cpp

Common Pitfalls

1. strlen() Counts Bytes, Not Characters

strlen("café");      // 5, not 4
strlen("日本語");    // 9, not 3
strlen("😀");        // 4, not 1

2. wchar_t Is Not Portable

Code that uses wchar_t for "Unicode support" will behave differently on Windows (2-byte UTF-16) and Linux (4-byte UTF-32).

3. Splitting UTF-8 Strings at Arbitrary Byte Offsets

char s[] = "café";
s[4] = '\0';            // Splits the 2-byte é → invalid UTF-8
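Truncating safely means backing up to the start of a code point before cutting. A sketch of a helper (the name utf8_truncate is ours; assumes valid UTF-8 input):

```c
#include <string.h>

// Truncate a UTF-8 string to at most max_bytes without splitting
// a multi-byte sequence.
void utf8_truncate(char* s, size_t max_bytes) {
    if (strlen(s) <= max_bytes) return;
    size_t i = max_bytes;
    // Back up past continuation bytes (10xxxxxx) to a sequence start
    while (i > 0 && ((unsigned char)s[i] & 0xC0) == 0x80) i--;
    s[i] = '\0';
}
```

For "café" with max_bytes 4, the cut point lands on the lead byte 0xC3, so the whole é is dropped and the result is the valid string "caf".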

4. Assuming One Byte Per Character

Any code that uses char indexing to access "the nth character" is broken for non-ASCII UTF-8 text.

Quick Reference

Task              C                    C++
UTF-8 literal     "text" or u8"text"   "text" or u8"text"
UTF-16 literal    u"text"              u"text"
UTF-32 literal    U"text"              U"text"
UTF-8 char type   char                 char8_t (C++20)
Byte length       strlen(s)            s.size()
Code point type   char32_t             char32_t
Normalize         ICU / utf8proc       ICU / utf8proc
Validate UTF-8    Manual / simdutf     Manual / simdutf
MSVC flag         N/A                  /utf-8

Unicode support in C and C++ has improved dramatically since C11/C++11, but it remains a story of building blocks rather than a complete solution. The standard provides the types and literals (char8_t, char16_t, char32_t, u8/u/U prefixes, named escapes in C++23), but full-featured Unicode processing -- normalization, segmentation, collation, BiDi -- still requires external libraries like ICU. The single most impactful step you can take is to standardize on UTF-8 for all text, pass /utf-8 to MSVC, and use a well-tested library for anything beyond basic string storage.
