Advanced Topics
Deep technical topics for experienced developers
5 guides in this series
Sorting text correctly across languages requires the Unicode Collation Algorithm (UCA), which defines a multi-level comparison scheme that handles accents, case, and script-specific ordering rules. This guide explains how Unicode collation works, how it differs from simple byte-order sorting, and how to implement locale-sensitive text sorting using ICU and the Intl APIs built into languages such as JavaScript.
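As a quick illustration of the difference between code-unit ordering and locale-sensitive collation, here is a minimal TypeScript sketch. It assumes a runtime with the standard `Intl.Collator` API (for example, a current Node.js or browser); the word list and the choice of the `"en"` locale are illustrative only and are not taken from the guide itself.

```typescript
// Illustrative word list (an assumption, not from the guide).
// "\u00E9clair" is "éclair" with a precomposed é (U+00E9).
const words: string[] = ["zebra", "\u00E9clair", "eagle"];

// Without a comparator, sort() orders strings by UTF-16 code units,
// so é (U+00E9) lands after every ASCII letter, including "z".
const byCodeUnit: string[] = words.slice().sort();

// Intl.Collator compares using locale-aware collation rules, so é is
// treated as a variant of "e" and sorts among the other "e" words.
const collator = new Intl.Collator("en");
const byCollation: string[] = words.slice().sort(collator.compare);

console.log(byCodeUnit.join(", "));  // eagle, zebra, éclair
console.log(byCollation.join(", ")); // eagle, éclair, zebra
```

Passing a different locale tag to `Intl.Collator` can change the result, since collation rules are tailored per language.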
ICU (International Components for Unicode) is the reference implementation library for Unicode and internationalization, providing collation, date/number formatting, transliteration, and text segmentation in C/C++ and Java. This guide introduces the ICU library, its most important components, and how to integrate it into C++, Java, and Python projects.
Unicode 16.0 added thousands of new characters, but many scripts, historic writing systems, and symbols are still waiting to be encoded, and the Consortium continues to evolve its processes. This article explores which characters and scripts are still missing from Unicode, the pipeline of pending proposals, and the long-term future of the standard.
Modern programming language designers must decide how to handle Unicode in identifiers, string literals, source file encoding, and operator symbols — decisions with deep implications for readability and security. This guide explores how Python, Rust, Swift, and other languages approach Unicode in their language specifications and the trade-offs involved.
Unicode normalization must often be applied at scale in search engines, databases, and text processing pipelines, where the performance cost of NFC vs NFD vs NFKC can matter significantly. This guide presents benchmarks of Unicode normalization performance across Python, JavaScript, Java, and Rust, with practical guidance for choosing the right form for high-throughput text workloads.
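For readers unfamiliar with the forms being compared, the following TypeScript sketch shows what NFC, NFD, and NFKC do to a couple of sample strings. It uses only the standard `String.prototype.normalize` method; the sample strings and variable names are illustrative assumptions, and the sketch says nothing about relative performance.

```typescript
// "é" written two ways: precomposed U+00E9, and "e" plus combining acute U+0301.
const precomposed: string = "\u00E9";
const decomposed: string = "e\u0301";

// The raw strings differ even though they render identically.
const rawEqual: boolean = precomposed === decomposed; // false

// NFC composes both spellings to the same single code point.
const nfcEqual: boolean =
  precomposed.normalize("NFC") === decomposed.normalize("NFC"); // true

// NFD decomposes the precomposed form into base letter + combining mark.
const nfdLength: number = precomposed.normalize("NFD").length; // 2

// NFKC also folds compatibility characters, e.g. the "fi" ligature U+FB01.
const nfkcLigature: string = "\uFB01".normalize("NFKC"); // "fi"

console.log(rawEqual, nfcEqual, nfdLength, nfkcLigature);
```

Because NFKC and NFKD discard compatibility distinctions, they are usually applied to search keys or identifiers rather than to text that must round-trip unchanged.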