Unicode in C/C++
C and C++ have a complicated relationship with Unicode. Both languages predate
the Unicode standard, and their original character types -- char (typically
8 bits) and wchar_t (platform-dependent width) -- were designed for a world
of single-byte encodings. Over the past two decades, both standards have added
new types and literals for Unicode, but the ecosystem remains fragmented.
Understanding the current state of Unicode in C and C++ requires knowing which
features are available in which standard version, and which libraries fill the
remaining gaps.
The Legacy: char and wchar_t
char
The char type is at least 8 bits. It holds a single byte, which means it can
represent one ASCII character or one byte of a multi-byte encoding like UTF-8:
char ascii = 'A'; // 0x41
char utf8[] = "café"; // 6 bytes: 63 61 66 C3 A9 00
printf("%zu\n", strlen(utf8)); // 5 (bytes, not characters)
The C standard says nothing about the encoding of char strings. On modern
systems, char strings are typically UTF-8, but this is a convention, not a
guarantee.
wchar_t
wchar_t was intended as a "wide character" type large enough to hold any
character in the system's character set. Unfortunately, its size varies by
platform:
| Platform | wchar_t Size | Encoding |
|---|---|---|
| Linux / macOS | 4 bytes | UTF-32 |
| Windows | 2 bytes | UTF-16 |
This makes wchar_t non-portable for Unicode work. A wchar_t on Windows
cannot hold a supplementary character (U+10000 and above) in a single value,
while on Linux it can. Code that uses wchar_t behaves differently on each
platform.
wchar_t w = L'A'; // wide character literal
wchar_t ws[] = L"Hello"; // wide string literal
Modern Unicode Types (C11 / C++11 and Later)
char16_t and char32_t
C11 and C++11 introduced two new character types with fixed, portable widths:
| Type | Size | Encoding | Literal Prefix |
|---|---|---|---|
| char16_t | 16 bits | UTF-16 | u |
| char32_t | 32 bits | UTF-32 | U |
#include <uchar.h> // C11
// or in C++:
// #include <cuchar>
char16_t utf16_char = u'A'; // U+0041
char32_t utf32_char = U'😀'; // U+1F600
char16_t utf16_str[] = u"Hello, 世界";
char32_t utf32_str[] = U"Hello, 世界";
char32_t can always hold a complete Unicode code point. char16_t may
require a surrogate pair for supplementary characters, just like Java's char.
char8_t (C++20)
C++20 added char8_t, a distinct type for UTF-8 code units:
// C++20
char8_t c = u8'A'; // UTF-8 code unit
const char8_t* s = u8"Hello, 世界"; // UTF-8 string
Before C++20, u8"..." literals produced const char*. The introduction of
char8_t as a distinct type broke backward compatibility -- existing code using
u8 strings with std::string stopped compiling. This remains controversial.
String Literal Prefixes Summary
| Prefix | Type | Encoding | Standard |
|---|---|---|---|
| none | const char* | Implementation-defined (usually UTF-8) | C89 / C++98 |
| L | const wchar_t* | Implementation-defined | C89 / C++98 |
| u8 | const char* (pre-C++20) / const char8_t* (C++20) | UTF-8 | C11 / C++11 |
| u | const char16_t* | UTF-16 | C11 / C++11 |
| U | const char32_t* | UTF-32 | C11 / C++11 |
Unicode Escapes
Both C and C++ support Unicode escape sequences in string and character literals:
// Universal Character Names (UCN)
char euro[] = "\u20AC"; // € (4-digit hex)
char emoji[] = "\U0001F600"; // 😀 (8-digit hex)
char16_t ch16 = u'\u03C3'; // σ (U+03C3)
char32_t ch32 = U'\U0001F3B5'; // 🎵 (U+1F3B5)
The \u form requires exactly 4 hex digits (BMP only). The \U form
requires exactly 8 hex digits (full Unicode range).
Named Character Escapes (C++23)
C++23 introduced named character escapes using the Unicode character name:
// C++23
char32_t snowman = U'\N{SNOWMAN}'; // U+2603 ☃
const char* s = "\N{GREEK SMALL LETTER SIGMA}"; // σ
This dramatically improves readability for code that embeds special characters.
Working with UTF-8 in C
The most practical approach to Unicode in C is to use plain char strings
with UTF-8 encoding. Most modern C libraries, POSIX APIs, and network
protocols use UTF-8 natively.
String Length: Bytes vs. Characters
#include <string.h>
#include <stdio.h>
const char* s = "café";
printf("byte length: %zu\n", strlen(s)); // 5 (includes 2-byte é)
// To count UTF-8 code points, you need a helper function:
size_t utf8_strlen(const char* s) {
size_t count = 0;
while (*s) {
// Count bytes that are NOT continuation bytes (0x80-0xBF)
if ((*s & 0xC0) != 0x80) count++;
s++;
}
return count;
}
printf("char count: %zu\n", utf8_strlen("café")); // 4
Iterating Over UTF-8 Code Points
#include <stdint.h>
// Decode one UTF-8 code point, advance the pointer
uint32_t utf8_decode(const char** p) {
const unsigned char* s = (const unsigned char*)*p;
uint32_t cp;
int bytes;
if (s[0] < 0x80) {
cp = s[0]; bytes = 1;
} else if ((s[0] & 0xE0) == 0xC0) {
cp = s[0] & 0x1F; bytes = 2;
} else if ((s[0] & 0xF0) == 0xE0) {
cp = s[0] & 0x0F; bytes = 3;
} else {
cp = s[0] & 0x07; bytes = 4;
}
for (int i = 1; i < bytes; i++) {
cp = (cp << 6) | (s[i] & 0x3F);
}
*p += bytes;
return cp;
}
In practice, use a library like ICU or utf8proc rather than hand-rolling UTF-8 decoding.
Working with UTF-8 in C++
std::string with UTF-8
std::string is a container of char bytes with no encoding awareness.
You can store UTF-8 in it, but size(), operator[], and iterators all
operate on bytes:
#include <string>
std::string s = "café"; // UTF-8 bytes in a std::string (in C++20, u8"..." yields char8_t and no longer converts)
s.size(); // 5 (bytes)
s[3]; // 0xC3 (first byte of é, NOT the character)
std::u8string (C++20)
C++20 provides std::u8string (backed by char8_t) as a type-safe
UTF-8 string, but its utility is limited because the standard library
offers no Unicode-aware operations on it:
std::u8string s = u8"café";
s.size(); // 5 (still bytes)
// No built-in way to count code points or iterate characters
std::u16string and std::u32string
std::u16string s16 = u"Hello, 世界"; // UTF-16
std::u32string s32 = U"Hello, 世界"; // UTF-32
s32.size(); // 9 (code points -- finally a useful character count!)
s32[8]; // U'界' (U+754C)
std::u32string is the only standard string type where size() equals the
code point count.
Conversion Between Encodings
C11: mbrtoc16 / mbrtoc32
C11 provides functions to convert between multi-byte (typically UTF-8) and
char16_t / char32_t:
#include <uchar.h>
#include <string.h>
#include <locale.h>
setlocale(LC_ALL, "en_US.UTF-8");
const char* utf8 = "é";
char32_t cp;
mbstate_t state = {0};
mbrtoc32(&cp, utf8, strlen(utf8), &state);
// cp = 0x00E9 (U+00E9)
C++11: std::codecvt (Deprecated in C++17)
The <codecvt> header was deprecated in C++17 and removed in C++26 because
it had design flaws and poor platform support:
// DEPRECATED -- do not use in new code
#include <codecvt>
#include <locale>
std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
std::u32string u32 = converter.from_bytes("café");
There is no drop-in standard replacement; use a third-party library such as ICU or simdutf for conversions.
Libraries for Unicode in C/C++
The standard library provides only basic building blocks. For serious Unicode work, use one of these libraries:
| Library | Features | License |
|---|---|---|
| ICU (International Components for Unicode) | Full Unicode support: normalization, collation, regex, BiDi, transliteration | ICU License (permissive) |
| utf8proc (Julia project) | Lightweight: normalization, category lookup, case folding | MIT |
| simdutf | Blazing-fast UTF-8/UTF-16/UTF-32 validation and conversion | Apache-2.0 / MIT |
| utfcpp | Header-only UTF-8 iteration and validation for C++ | BSL-1.0 |
ICU Example
#include <unicode/unistr.h> // ICU C++ API
#include <unicode/normlzr.h>
icu::UnicodeString s = icu::UnicodeString::fromUTF8("café");
s.length(); // 4 (code units / BMP code points)
// Normalization
UErrorCode err = U_ZERO_ERROR;
icu::UnicodeString nfc;
icu::Normalizer::normalize(s, UNORM_NFC, 0, nfc, err);
// Case conversion with locale
s.toUpper(icu::Locale("de")); // locale-aware uppercase
Compiler and Source Encoding
Modern compilers assume UTF-8 source files by default:
| Compiler | Default Source Encoding | Flag to Set |
|---|---|---|
| GCC | UTF-8 | -finput-charset=UTF-8 |
| Clang | UTF-8 | -finput-charset=UTF-8 |
| MSVC | System locale (not UTF-8!) | /utf-8 (sets both source and execution charset) |
Always pass /utf-8 to MSVC to avoid encoding issues on Windows. Without
it, string literals may be interpreted as the system's legacy code page.
# GCC / Clang (usually unnecessary, but explicit)
g++ -finput-charset=UTF-8 -fexec-charset=UTF-8 main.cpp
# MSVC (essential)
cl /utf-8 main.cpp
Common Pitfalls
1. strlen() Counts Bytes, Not Characters
strlen("café"); // 5, not 4
strlen("日本語"); // 9, not 3
strlen("😀"); // 4, not 1
2. wchar_t Is Not Portable
Code that uses wchar_t for "Unicode support" will behave differently on
Windows (2-byte UTF-16) and Linux (4-byte UTF-32).
3. Splitting UTF-8 Strings at Arbitrary Byte Offsets
char s[] = "café";
s[4] = '\0'; // Splits the 2-byte é → invalid UTF-8
4. Assuming One Byte Per Character
Any code that uses char indexing to access "the nth character" is broken for
non-ASCII UTF-8 text.
Quick Reference
| Task | C | C++ |
|---|---|---|
| UTF-8 literal | "text" or u8"text" | "text" or u8"text" |
| UTF-16 literal | u"text" | u"text" |
| UTF-32 literal | U"text" | U"text" |
| UTF-8 char type | char | char8_t (C++20) |
| Byte length | strlen(s) | s.size() |
| Code point type | char32_t | char32_t |
| Normalize | ICU / utf8proc | ICU / utf8proc |
| Validate UTF-8 | Manual / simdutf | Manual / simdutf |
| MSVC flag | /utf-8 | /utf-8 |
Unicode support in C and C++ has improved dramatically since C11/C++11, but it
remains a story of building blocks rather than a complete solution. The standard
provides the types and literals (char8_t, char16_t, char32_t, u8/u/U
prefixes, named escapes in C++23), but full-featured Unicode processing --
normalization, segmentation, collation, BiDi -- still requires external libraries
like ICU. The single most impactful step you can take is to standardize on UTF-8
for all text, pass /utf-8 to MSVC, and use a well-tested library for anything
beyond basic string storage.