
The Developer's Unicode Handbook

Solving Real-World Unicode Bugs

A practical, code-heavy handbook that developers can reference when building internationalized applications. Each chapter solves a specific class of Unicode-related bugs.

8 chapters · 30,500 words · ~122 min read

String Length Is a Lie

Python's len() counts code points, while JavaScript's .length and Java's .length() count UTF-16 code units; none of them reports the number of characters a user actually sees. This chapter explains the difference between code units, code points, and grapheme clusters, with code examples for counting characters correctly.

~4,000 words · ~16 min
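
As a taste of what the chapter covers, here is a minimal sketch of the three ways to count, using only Python's standard library. The function names are illustrative, not taken from the handbook. The grapheme counter is a deliberate approximation: it only skips combining marks, so it overcounts emoji ZWJ sequences, flags, and conjoining Hangul jamo. Real segmentation follows UAX #29 and generally needs a dedicated library.

```python
import unicodedata


def utf16_code_units(s: str) -> int:
    """Length as JavaScript's .length or Java's .length() would report it."""
    return len(s.encode("utf-16-le")) // 2


def code_points(s: str) -> int:
    """Length as Python's len() reports it."""
    return len(s)


def approx_grapheme_count(s: str) -> int:
    """Rough approximation only: count code points that are not combining
    marks (general categories Mn, Mc, Me). Not a UAX #29 implementation."""
    return sum(1 for ch in s if not unicodedata.category(ch).startswith("M"))


accented = "e\u0301"      # 'e' followed by COMBINING ACUTE ACCENT
emoji = "\U0001F600"      # one code point outside the Basic Multilingual Plane

print(utf16_code_units(accented), code_points(accented), approx_grapheme_count(accented))
print(utf16_code_units(emoji), code_points(emoji), approx_grapheme_count(emoji))
```

The accented letter is two code points but one user-perceived character, and the emoji is one code point but two UTF-16 code units, which is exactly where the three counts diverge.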

The Encoding Minefield

File I/O, HTTP headers, database collation, BOM detection — encoding issues lurk everywhere. This chapter provides a systematic approach to choosing and using encodings correctly across your entire stack.

~4,500 words · ~18 min
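
BOM detection, one of the topics listed above, can be sketched in a few lines of Python. This is a hedged illustration, not the handbook's own code: the names sniff_bom and decode_bytes are hypothetical, and falling back to UTF-8 when no BOM is present is an assumption that will not suit every data source.

```python
import codecs

# Order matters: the UTF-32 LE BOM begins with the same two bytes as the
# UTF-16 LE BOM, so the longer UTF-32 marks must be checked first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data: bytes, default: str = "utf-8") -> str:
    """Return a codec name based on a leading BOM, or `default` if none."""
    for bom, codec_name in _BOMS:
        if data.startswith(bom):
            return codec_name
    return default


def decode_bytes(data: bytes) -> str:
    """Decode using the sniffed codec; these codecs consume the BOM."""
    return data.decode(sniff_bom(data))
```

A BOM is only a hint. Where a protocol or file format declares its encoding explicitly (an HTTP Content-Type header, for example), that declaration is usually the better source of truth.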

Comparison and Sorting

Sorting text correctly across languages requires understanding collation rules, locale sensitivity, and the Unicode Collation Algorithm. This chapter covers ICU collation, Python's locale module, and JavaScript's Intl.Collator.

~4,000 words · ~16 min

Search That Actually Works

Unicode-aware search requires case folding, accent-insensitive matching, and normalization. This chapter covers practical techniques for building search that works across all languages.

~3,500 words · ~14 min

Input Validation Done Right

Validating international text is harder than it looks. Unicode categories, identifier rules, email address validation, and IDNA2008 — this chapter provides the definitive guide to Unicode-aware input validation.

~4,000 words · ~16 min
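
To make the idea of category-based validation concrete, here is a hedged sketch built on Python's standard library. The function names, the length limit, and the choice of rejected categories are all assumptions for illustration. In particular, rejecting every format character (Cf) also rejects the zero-width joiner and non-joiner, which some scripts and emoji sequences legitimately need, so a production policy would likely be more selective.

```python
import unicodedata

# General categories rejected in this sketch: control, format, surrogate,
# private use, and unassigned code points.
_REJECTED_CATEGORIES = {"Cc", "Cf", "Cs", "Co", "Cn"}


def is_valid_display_name(raw: str, max_len: int = 64) -> bool:
    """Accept non-empty, trimmed text made of visible, assigned characters."""
    name = unicodedata.normalize("NFC", raw)
    if not name or len(name) > max_len:
        return False
    if name != name.strip():  # leading or trailing whitespace
        return False
    return all(unicodedata.category(ch) not in _REJECTED_CATEGORIES for ch in name)


def is_valid_identifier(raw: str) -> bool:
    """Apply Python's own identifier rules after NFKC normalization.
    An application may well want stricter or different rules."""
    return unicodedata.normalize("NFKC", raw).isidentifier()
```

Note that max_len here counts code points after NFC normalization, which is itself a policy decision; limits on bytes or grapheme clusters would give different results.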

Rendering Complex Scripts

Text shaping turns code points into glyphs. This chapter explores HarfBuzz, CoreText, Pango, OpenType features, and the font fallback chain — everything you need to render text correctly.

~3,500 words · ~14 min

Security Hardening

Practical defensive techniques against Unicode attacks: confusable detection with ICU, normalization before comparison, bidi sandboxing, and secure identifier validation.

~4,000 words · ~16 min
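
Two of the techniques listed above, normalization before comparison and handling of bidi controls, can be sketched with the standard library alone. The names below are hypothetical, and the set of bidi control characters is written out by hand for illustration. Confusable detection is intentionally not attempted here: it needs the Unicode confusables data (available through ICU, as the chapter notes), and normalization alone does not catch look-alikes from different scripts.

```python
import unicodedata

# Explicit bidirectional formatting characters: embeddings, overrides,
# isolates, and the implicit directional marks.
BIDI_CONTROLS = frozenset(
    "\u202a\u202b\u202c\u202d\u202e"   # LRE, RLE, PDF, LRO, RLO
    "\u2066\u2067\u2068\u2069"         # LRI, RLI, FSI, PDI
    "\u200e\u200f\u061c"               # LRM, RLM, ALM
)


def has_bidi_controls(text: str) -> bool:
    """True if the text contains any bidi formatting character."""
    return any(ch in BIDI_CONTROLS for ch in text)


def strip_bidi_controls(text: str) -> str:
    """Remove bidi formatting characters (a blunt policy; may harm RTL text)."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)


def canonical_key(identifier: str) -> str:
    """Normalize and case-fold so equivalent spellings compare equal."""
    folded = unicodedata.normalize("NFKC", identifier).casefold()
    return unicodedata.normalize("NFKC", folded)


def same_identifier(a: str, b: str) -> bool:
    return canonical_key(a) == canonical_key(b)
```

With this sketch, a fullwidth spelling of an identifier compares equal to its ASCII form, while a name containing Cyrillic "а" in place of Latin "a" still compares as different, which is why confusable detection is a separate step.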

Testing Unicode

A comprehensive guide to writing robust Unicode tests: edge case characters, boundary tests, fuzzing with Unicode data, and pytest fixtures for internationalized applications.

~3,000 words · ~12 min
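
The edge-case idea can be illustrated without any test framework. The sketch below defines a small, hand-picked table of awkward strings and two generic property checks; both the table contents and the function names are assumptions for illustration, and the same table could be fed to a parametrized pytest fixture.

```python
import unicodedata

# A few strings that commonly expose Unicode bugs.
EDGE_CASES = {
    "empty": "",
    "combining_mark": "e\u0301",
    "astral_emoji": "\U0001F600",
    "zwj_sequence": "\U0001F468\u200D\U0001F469\u200D\U0001F467",
    "right_to_left": "\u05e9\u05dc\u05d5\u05dd",
    "dotless_i": "\u0131",
    "embedded_nul": "a\x00b",
}


def survives_roundtrip(text: str, encoding: str = "utf-8") -> bool:
    """Encoding then decoding should give back the same string."""
    return text.encode(encoding).decode(encoding) == text


def normalization_is_idempotent(text: str, form: str = "NFC") -> bool:
    """Normalizing an already-normalized string should change nothing."""
    once = unicodedata.normalize(form, text)
    return unicodedata.normalize(form, once) == once


for label, sample in EDGE_CASES.items():
    assert survives_roundtrip(sample), label
    assert normalization_is_idempotent(sample), label
```

Replacing the two property checks with calls into the code under test (a slug generator, a truncation helper, a database round trip) turns the table into a quick regression suite.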