The Unicode Odyssey · Chapter 6

Unicode in Your Programming Language

Every programming language handles Unicode differently. This chapter compares string internals, iteration, and slicing across Python, JavaScript, Java, Rust, Go, Swift, and C/C++ — with practical examples and gotchas.


Every programming language makes decisions about how to represent and manipulate text internally — decisions that shape how you, the programmer, experience Unicode. Some languages made these decisions before Unicode existed and carry the weight of historical choices. Others were designed with Unicode from the start. Understanding your language's Unicode model is essential to writing correct internationalization code — and to understanding why bugs that seem impossible actually happen all the time.

Python: From Latin-1 to Full Unicode

Python 2 treated strings as byte sequences by default — str was bytes, unicode was the special type for Unicode text. This caused endless confusion, as mixing the two types would trigger implicit decoding using the ASCII codec, producing UnicodeDecodeError at runtime whenever non-ASCII bytes were encountered.

Python 3 made the clean break: str is always Unicode text (a sequence of codepoints), and bytes is the separate type for raw binary data. The conversion between them is explicit and requires specifying an encoding:

s = "café"
b = s.encode("utf-8")      # bytes: b'caf\xc3\xa9'
s2 = b.decode("utf-8")     # back to str: "café"

Python 3 strings are sequences of codepoints internally. However, the internal representation (CPython implementation detail) is flexible — PEP 393 introduced a "compact" representation that uses 1 byte per character for strings whose characters all fit in Latin-1, 2 bytes per character if all fit in the BMP, and 4 bytes per character for strings with supplementary plane characters. This optimizes memory while presenting a uniform codepoint-sequence interface.
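The width selection is observable in CPython through sys.getsizeof. A minimal sketch — the exact byte counts are implementation details that vary by version and platform, but the per-character growth is visible:

```python
import sys

# CPython-specific sketch (PEP 393): per-character storage width is
# chosen from the widest character in the string. Exact sizes are
# implementation details and vary by version and platform.
latin1 = "a" * 1000            # all chars fit in Latin-1: 1 byte each
bmp = "\u03b1" * 1000          # GREEK SMALL LETTER ALPHA: 2 bytes each
astral = "\U0001F600" * 1000   # GRINNING FACE, supplementary plane: 4 bytes each

assert sys.getsizeof(latin1) < sys.getsizeof(bmp) < sys.getsizeof(astral)
```

Note that a single supplementary-plane character widens the whole string to 4 bytes per character — appending one emoji to a long ASCII string quadruples its storage.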

s = "café"
len(s)           # 4 (codepoints)
s[0]             # 'c'
s[-1]            # 'é'

# But codepoint != grapheme cluster:
s2 = "cafe\u0301"   # e + combining acute
len(s2)          # 5 codepoints (but 4 grapheme clusters)
# To count grapheme clusters, use the 'grapheme' library

Python's str.encode() and bytes.decode() support over 100 codecs, including all the legacy encodings from the code page era — a necessary feature for processing historical data.
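A minimal sketch of why the codec name matters: the same byte decodes to different characters under different legacy encodings, so historical data is unreadable without knowing (or guessing) its codec.

```python
# Sketch: byte 0xE9 means different things under different legacy
# codecs, so the correct codec name must be known in advance.
data = b"caf\xe9"                 # 'café' in Latin-1
print(data.decode("latin-1"))     # café
print(data.decode("cp1252"))      # café (0xE9 is é here as well)
print(data.decode("mac_roman"))   # cafÈ (Mac Roman maps 0xE9 to È)
# data.decode("utf-8") raises UnicodeDecodeError: a lone 0xE9 byte
# is not valid UTF-8.
```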

JavaScript: The UTF-16 Legacy

JavaScript (ECMAScript) specifies strings as sequences of 16-bit code units — essentially, UTF-16 without the BOM. This was a consequence of the timing: JavaScript was created in 1995, when the Unicode Consortium still believed all characters would fit in the BMP, and 16-bit code units seemed sufficient.

When supplementary plane characters (emoji, rare CJK, mathematical symbols) were added to Unicode, JavaScript's string model became a source of endless confusion:

const emoji = "\u{1F600}";   // 😀 GRINNING FACE
emoji.length;                  // 2 (two UTF-16 surrogate code units)
emoji[0];                      // "\uD83D" (high surrogate — not a valid character)
emoji[1];                      // "\uDE00" (low surrogate)
emoji.charCodeAt(0);           // 55357 (0xD83D)
emoji.codePointAt(0);          // 128512 (0x1F600, correct)
[...emoji].length;             // 1 (spread iterates codepoints)

The .codePointAt() method (ES6) and the u flag for regular expressions (ES6) are the tools for working correctly with supplementary plane characters. String spread [...str] iterates codepoints. But many APIs — .slice(), .charAt(), .indexOf(), string destructuring without spread — operate on code units and can split surrogate pairs.

For grapheme cluster-aware operations, the Intl.Segmenter API (ES2022) is the standard solution:

const segmenter = new Intl.Segmenter();
const clusters = [...segmenter.segment("caf\u0065\u0301")];
clusters.length;  // 4 (grapheme clusters)

JavaScript's TextEncoder and TextDecoder APIs (available in browsers, Web Workers, and Node.js) handle encoding conversion. TextEncoder always produces UTF-8; TextDecoder defaults to UTF-8 but can decode many legacy encodings.

Java: UTF-16 Strings, Evolving

Java was also designed when UTF-16 seemed sufficient. char in Java is a 16-bit unsigned integer representing a UTF-16 code unit. String is a sequence of chars.

The consequences mirror JavaScript's: supplementary plane characters require two char values (a surrogate pair), and code that processes strings character-by-character using charAt() can break on emoji and rare CJK characters.

Java provides proper Unicode-aware alternatives:

String s = new String(Character.toChars(0x1F600));  // 😀
s.length();              // 2 (code units)
s.codePointCount(0, 2); // 1 (codepoints)
s.codePoints().count();  // 1 (stream of codepoints)

Java's Character class provides methods for working with codepoints: isHighSurrogate(), isLowSurrogate(), toCodePoint(), isSupplementaryCodePoint().

Since Java 9 (JEP 254, "Compact Strings"), the JVM uses an internal optimization similar to Python's: strings whose characters all fit in Latin-1 are stored one byte per char, with UTF-16 used otherwise. This is transparent to the String API.

Rust: Validated UTF-8, Always

Rust takes the strongest stance of any mainstream language: str and String are guaranteed to be valid UTF-8. The compiler enforces this invariant — it's not a convention, it's a type guarantee.

let s = "café";       // str literal, UTF-8 bytes in binary
s.len();              // 5 (bytes, not chars!)
s.chars().count();    // 4 (Unicode scalar values / codepoints)

Note the distinction: .len() returns byte length, .chars() iterates codepoints (Rust calls them "chars" — actually Unicode scalar values, excluding surrogates). For grapheme clusters, the unicode-segmentation crate is needed:

use unicode_segmentation::UnicodeSegmentation;
let clusters: Vec<&str> = "café".graphemes(true).collect();
clusters.len();  // 4

Rust strings cannot be indexed by integer position (s[0] does not compile); range slicing works on byte positions, so &s[0..3] yields the first 3 bytes as a &str. Slicing at a position that is not a character boundary panics, which is intentional: the language forces you to think about what you're indexing.

Go: UTF-8 Native, Simple Model

Go stores strings as byte slices with no encoding guarantee in the type system — but by convention and idiom, Go strings are UTF-8. The rune type (alias for int32) represents a Unicode codepoint.

s := "café"
len(s)                        // 5 (bytes in UTF-8)
len([]rune(s))                // 4 (rune/codepoint count)

for i, r := range s {         // range iterates runes
    fmt.Printf("%d: %c (U+%04X)\n", i, r, r)
}
// 0: c (U+0063)
// 1: a (U+0061)
// 2: f (U+0066)
// 3: é (U+00E9)  -- starts at byte 3, is 2 bytes

The range loop over a string automatically decodes UTF-8 and yields codepoints. This is idiomatic Go: strings as bytes for storage, runes for character processing. The golang.org/x/text/unicode/norm package provides normalization; for grapheme cluster segmentation, third-party packages such as github.com/rivo/uniseg are the usual choice.

Swift: Extended Grapheme Clusters as Default

Swift made a philosophically interesting choice: Character in Swift is a grapheme cluster, not a codepoint. This is arguably the most linguistically correct approach of any mainstream language — users think in grapheme clusters, so Swift's basic character type reflects that.

let s = "café"
s.count            // 4 (grapheme clusters)
let e = s.last!    // "é" (one Character, possibly multiple codepoints)

// The family emoji:
let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"
family.count       // 1 (one grapheme cluster!)

Internally, Swift strings are UTF-8. Multiple views are available:

s.utf8.count        // bytes (5 for "café")
s.utf16.count       // UTF-16 code units
s.unicodeScalars.count  // Unicode scalar values (codepoints)
s.count             // grapheme clusters (user-perceived characters)

Swift's approach is elegant for user-facing applications but can surprise developers accustomed to other languages' string models.

C/C++: The Wild West

C and C++ have multiple string abstractions with different Unicode properties:

  • char* / std::string: Nominally bytes; by convention often UTF-8 in modern code, but not enforced
  • wchar_t / std::wstring: Platform-dependent width — 2 bytes (UTF-16) on Windows, 4 bytes (UTF-32) on Linux/macOS
  • char16_t / std::u16string: 16-bit UTF-16 code units (C++11)
  • char32_t / std::u32string: 32-bit UTF-32 codepoints (C++11)
  • char8_t / std::u8string: 8-bit UTF-8 code units (C++20)

The consequence of this multiplicity: every C/C++ project that handles internationalized text essentially rolls its own Unicode policy. The ICU (International Components for Unicode) library provides a comprehensive implementation for those who want correct behavior, but its integration is non-trivial.

Common Mistakes Across Languages

Mistake 1: Using byte/code-unit length for user-facing character count. Use grapheme cluster count for anything users see as "N characters."

Mistake 2: String slicing without character boundary awareness. Always slice at character boundaries (codepoints at minimum, grapheme clusters ideally). Slicing in the middle of a multibyte sequence produces garbled text or runtime errors.
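A minimal Python sketch of the failure mode: truncating the UTF-8 bytes of a string mid-sequence raises an error, while slicing the str itself stays on codepoint boundaries.

```python
s = "café"
raw = s.encode("utf-8")        # 5 bytes; the é is \xc3\xa9

# Truncating the bytes mid-sequence leaves an incomplete character:
try:
    raw[:4].decode("utf-8")    # cuts the é in half
except UnicodeDecodeError as e:
    print("truncated mid-character:", e)

# Slicing the str operates on codepoints, so boundaries are safe:
print(s[:3])                   # caf
```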

Mistake 3: Case conversion without locale. Turkish has dotted and dotless I: 'i'.toUpperCase() should return 'İ' in a Turkish locale, not 'I'. Some languages' case conversion APIs accept a locale parameter (Java's toUpperCase(Locale), JavaScript's toLocaleUpperCase()); others, such as Python and Rust, only perform locale-independent case mapping.
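A Python sketch of the problem: str.upper() and str.lower() apply the default (locale-independent) Unicode case mappings, so the Turkish mapping is simply unavailable without an external library such as PyICU. The mappings are also not always one-to-one:

```python
# Python's case mappings are locale-independent, so the Turkish
# mapping i -> İ never happens here.
print("i".upper())             # 'I', not the 'İ' Turkish expects
# The default Unicode mapping lowercases İ (U+0130) to
# 'i' + COMBINING DOT ABOVE, so case conversion can change length:
print("\u0130".lower())
print(len("\u0130".lower()))   # 2 codepoints
```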

Mistake 4: Comparing strings without normalization. "é" as U+00E9 and "é" as U+0065 U+0301 compare unequal in most languages unless you normalize first. Always normalize before comparison or database storage.
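In Python, for example, unicodedata.normalize makes the two representations comparable:

```python
import unicodedata

precomposed = "caf\u00e9"      # é as one codepoint (U+00E9)
decomposed = "cafe\u0301"      # e + COMBINING ACUTE ACCENT

print(precomposed == decomposed)                                # False
# Normalizing both sides to the same form makes them equal:
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```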

Mistake 5: Assuming incoming bytes are valid UTF-8. Data from external sources (files, network, user input) can contain invalid UTF-8. Validate and handle errors explicitly rather than assuming correctness.
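A Python sketch of explicit handling: strict decoding (the default) raises on invalid bytes, while errors="replace" substitutes U+FFFD so processing can continue with an explicit, visible loss.

```python
data = b"ok \xff\xfe bytes"    # 0xFF and 0xFE never appear in valid UTF-8

try:
    data.decode("utf-8")       # errors="strict" is the default
except UnicodeDecodeError:
    print("invalid input rejected")

# Lossy but explicit: each bad byte becomes U+FFFD REPLACEMENT CHARACTER
print(data.decode("utf-8", errors="replace"))   # ok ?? bytes (two U+FFFD)
```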

Understanding your language's string model isn't an academic exercise — it's the foundation of writing software that handles international text correctly, and the key to understanding why certain bugs appear only with certain inputs.