The Developer's Unicode Handbook
Solving Real-World Unicode Bugs
A practical, code-heavy handbook that developers can reference when building internationalized applications. Each chapter solves a specific class of Unicode-related bugs.
String Length Is a Lie
Python's len() counts code points, while JavaScript's .length and Java's .length() count UTF-16 code units; none of them counts what a user perceives as a character. This chapter explains the difference between code units, code points, and grapheme clusters, with code examples for counting characters correctly.
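As a minimal sketch of the three ways to count, the snippet below uses only the Python standard library. The helper names are hypothetical, and approx_graphemes is a deliberately crude stand-in: real grapheme segmentation follows UAX #29 and needs a third-party library or ICU.

```python
import unicodedata


def code_points(s: str) -> int:
    """Number of Unicode code points (what Python's len() reports)."""
    return len(s)


def utf16_code_units(s: str) -> int:
    """Number of UTF-16 code units (what JavaScript and Java report)."""
    return len(s.encode("utf-16-le")) // 2


def approx_graphemes(s: str) -> int:
    """Crude approximation only: count code points that are not marks.

    This is NOT the UAX #29 algorithm; it ignores ZWJ sequences,
    regional-indicator pairs, Hangul jamo, and more.
    """
    return sum(1 for ch in s if not unicodedata.category(ch).startswith("M"))


accented = "e\u0301"      # 'e' followed by COMBINING ACUTE ACCENT
emoji = "\U0001F600"      # a single code point outside the BMP

print(code_points(accented), approx_graphemes(accented))
print(code_points(emoji), utf16_code_units(emoji))
```

The approximation breaks down quickly: a family emoji built from three emoji joined by U+200D is counted as five, although users see a single character.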
The Encoding Minefield
File I/O, HTTP headers, database collation, BOM detection — encoding issues lurk everywhere. This chapter provides a systematic approach to choosing and using encodings correctly across your entire stack.
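One piece of this, BOM detection, can be sketched with the standard codecs module. The function name detect_bom is an assumption made for illustration, and a missing BOM says nothing about the actual encoding, so treat the result as a hint rather than a verdict.

```python
import codecs
from typing import Optional

# Order matters: the UTF-32 LE BOM (FF FE 00 00) begins with the
# UTF-16 LE BOM (FF FE), so the longer signatures are checked first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_bom(data: bytes) -> Optional[str]:
    """Return a codec name that consumes the BOM, or None if no BOM is found."""
    for bom, codec in _BOMS:
        if data.startswith(bom):
            return codec
    return None


raw = codecs.BOM_UTF8 + "héllo".encode("utf-8")
codec = detect_bom(raw) or "utf-8"   # falling back to UTF-8 is a policy choice
print(raw.decode(codec))
```

The codec names returned here ("utf-8-sig", "utf-16", "utf-32") strip the BOM during decoding, so the decoded text does not begin with a stray U+FEFF.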
Comparison and Sorting
Sorting text correctly across languages requires understanding collation rules, locale sensitivity, and the Unicode Collation Algorithm. This chapter covers ICU collation, Python's locale module, and JavaScript's Intl.Collator.
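To illustrate why plain sorting falls short, here is a small sketch using Python's locale module. The helper locale_sorted is hypothetical, and its output depends on which locales are installed on the machine; ICU-based collation is the more portable route but needs a third-party binding.

```python
import locale

words = ["zebra", "Äpfel", "apple"]

# Default sort compares code points, so 'Ä' (U+00C4) lands after 'z'.
by_code_point = sorted(words)
print(by_code_point)


def locale_sorted(items, locale_name=""):
    """Sort using the C library's collation for locale_name.

    An empty name means "the user's default locale". Raises locale.Error
    if the requested locale is not installed on this system.
    """
    previous = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return sorted(items, key=locale.strxfrm)
    finally:
        # setlocale changes process-wide state, so always restore it.
        locale.setlocale(locale.LC_COLLATE, previous)
```

Calling locale_sorted(words, "de_DE.UTF-8") would typically place "Äpfel" next to "apple" on a system that has that locale installed. Because setlocale is process-wide, this approach is not safe to use from multiple threads.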
Search That Actually Works
Unicode-aware search requires case folding, accent-insensitive matching, and normalization. This chapter covers practical techniques for building search that works across all languages.
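The three ingredients named above can be combined into a single search key, sketched below with the standard unicodedata module. The names search_key and contains are hypothetical, and stripping accents is a policy assumption: in some languages an accented letter is a distinct letter, so this is not appropriate everywhere.

```python
import unicodedata


def search_key(text: str) -> str:
    """Build a comparison key: decompose, drop combining marks, case-fold."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # casefold() is more aggressive than lower(); e.g. 'ß' becomes 'ss'.
    folded = without_marks.casefold()
    # Re-normalize because case folding can produce non-normalized output.
    return unicodedata.normalize("NFKC", folded)


def contains(haystack: str, needle: str) -> bool:
    """Accent- and case-insensitive substring test using search_key."""
    return search_key(needle) in search_key(haystack)


print(contains("My Résumé.pdf", "resume"))
print(search_key("Straße"))
```

In a real index the key would be computed once at indexing time and once per query, rather than on every comparison.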
Input Validation Done Right
Validating international text is harder than it looks. This chapter covers Unicode categories, identifier rules, email address validation, and IDNA2008 as a guide to Unicode-aware input validation.
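A category-based check might look like the sketch below. The function validate_username, its length limit, and the set of allowed categories are all assumptions chosen for illustration; a production policy would more likely build on UAX #31 identifier profiles or IDNA2008 rules.

```python
import unicodedata

# Hypothetical policy: letters (L*), numbers (N*), marks (M*), and underscore.
ALLOWED_MAJOR_CATEGORIES = ("L", "N", "M")


def validate_username(raw: str, max_len: int = 32) -> str:
    """Normalize to NFKC, then reject any character outside the policy.

    Returns the normalized name, or raises ValueError.
    """
    name = unicodedata.normalize("NFKC", raw)
    if not name or len(name) > max_len:
        raise ValueError("username must be 1 to %d code points" % max_len)
    for ch in name:
        category = unicodedata.category(ch)
        if ch != "_" and category[0] not in ALLOWED_MAJOR_CATEGORIES:
            raise ValueError(
                "disallowed character U+%04X (category %s)" % (ord(ch), category)
            )
    return name


print(validate_username("Jose\u0301"))   # decomposed input is stored composed
```

Validating after normalization means that invisible format characters such as U+200B ZERO WIDTH SPACE (category Cf) are rejected, and that two visually identical spellings are stored as the same string.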
Rendering Complex Scripts
Text shaping turns code points into glyphs. This chapter explores HarfBuzz, CoreText, Pango, OpenType features, and the font fallback chain — everything you need to render text correctly.
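The font fallback chain can be pictured with a toy model. Everything in this sketch is hypothetical: the font names and coverage ranges are made up, and real engines consult each font's character map and itemize by script and grapheme cluster before shaping, rather than per code point.

```python
from itertools import groupby

# Made-up fonts with made-up coverage, listed in priority order.
FALLBACK_CHAIN = [
    ("BodyLatin", [(0x0020, 0x024F)]),
    ("BodyArabic", [(0x0600, 0x06FF)]),
    ("BodyCJK", [(0x4E00, 0x9FFF)]),
]
LAST_RESORT = "LastResort"


def pick_font(ch: str) -> str:
    """Return the first font in the chain whose coverage includes ch."""
    cp = ord(ch)
    for name, ranges in FALLBACK_CHAIN:
        if any(low <= cp <= high for low, high in ranges):
            return name
    return LAST_RESORT


def font_runs(text: str):
    """Split text into consecutive runs that share the same chosen font."""
    return [(font, "".join(chars)) for font, chars in groupby(text, key=pick_font)]


print(font_runs("Hi 漢字"))
```

Each run would then be handed to a shaper such as HarfBuzz together with the selected font; the shaping step itself is outside the scope of this sketch.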
Security Hardening
Practical defensive techniques against Unicode attacks: confusable detection with ICU, normalization before comparison, bidi sandboxing, and secure identifier validation.
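Two of these techniques, normalization before comparison and bidi sandboxing, can be sketched with the standard library alone. The function names are hypothetical, and the comparison below only approximates NFKC_Casefold; confusable detection across scripts needs the Unicode confusables data (for example through ICU), which this sketch does not attempt.

```python
import unicodedata

# Explicit bidirectional formatting characters defined by UAX #9.
BIDI_CONTROLS = {
    "\u202A", "\u202B", "\u202C", "\u202D", "\u202E",  # embeddings, overrides, PDF
    "\u2066", "\u2067", "\u2068", "\u2069",            # isolates and PDI
    "\u200E", "\u200F", "\u061C",                      # LRM, RLM, ALM
}


def find_bidi_controls(text: str):
    """List (index, code point) pairs for every bidi control in text."""
    return [(i, "U+%04X" % ord(ch)) for i, ch in enumerate(text) if ch in BIDI_CONTROLS]


def isolate(text: str) -> str:
    """Strict policy (an assumption): drop bidi controls, then wrap in FSI...PDI."""
    cleaned = "".join(ch for ch in text if ch not in BIDI_CONTROLS)
    return "\u2068" + cleaned + "\u2069"


def same_identifier(a: str, b: str) -> bool:
    """Compare after NFKC normalization and case folding."""
    def key(s: str) -> str:
        folded = unicodedata.normalize("NFKC", s).casefold()
        return unicodedata.normalize("NFKC", folded)
    return key(a) == key(b)


print(find_bidi_controls("abc\u202Edef"))
print(same_identifier("\uFB01le", "FILE"))
```

Note that same_identifier("p\u0430ypal", "paypal") is False even though the strings look alike, because Cyrillic U+0430 does not normalize to Latin "a"; that gap is what confusable detection is meant to close.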
Testing Unicode
A comprehensive guide to writing robust Unicode tests: edge case characters, boundary tests, fuzzing with Unicode data, and pytest fixtures for internationalized applications.
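As a starting point, a table of edge-case strings plus a couple of invariants might look like the sketch below. The names EDGE_CASES and check_invariants are hypothetical, and the invariants shown are generic examples rather than a complete test plan.

```python
import unicodedata

# A small, hypothetical corpus of strings that commonly expose bugs.
EDGE_CASES = {
    "empty": "",
    "combining_mark": "e\u0301",
    "astral_emoji": "\U0001F600",
    "zwj_sequence": "\U0001F468\u200D\U0001F469\u200D\U0001F467",
    "right_to_left": "\u05E9\u05DC\u05D5\u05DD",
    "dotless_i": "\u0131",
    "ligature": "\uFB01",
    "bidi_override": "abc\u202Edef",
}


def check_invariants(s: str) -> None:
    """Assert properties that should hold for any well-formed string."""
    # UTF-8 must round-trip without loss.
    assert s.encode("utf-8").decode("utf-8") == s
    # Normalization must be idempotent for every form.
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        once = unicodedata.normalize(form, s)
        assert unicodedata.normalize(form, once) == once


for label, sample in EDGE_CASES.items():
    check_invariants(sample)
```

With pytest, the same table could be fed to pytest.mark.parametrize so that each entry is reported as its own test case. Lone surrogates are left out on purpose, because they cannot be encoded as UTF-8 and need separate tests.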