Advanced Topics — Guide Series

1

Unicode Collation: Sorting Text Correctly

Sorting text correctly across languages requires the Unicode Collation Algorithm (UCA), which defines a multi-level comparison scheme that handles accents, case, and script-specific ordering rules. This guide explains how Unicode collation works, how it differs from simple byte-order sorting, and how to implement locale-sensitive text sorting using ICU and language-level Intl APIs.
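
To make the contrast with byte-order sorting concrete, here is a minimal Python sketch of a multi-level sort key. It is a toy illustration of the idea, not the UCA itself: `toy_collation_key` is a hypothetical helper that uses only the standard `unicodedata` module, ignores locale tailoring and script-specific rules, and is no substitute for ICU or an Intl collator.

```python
import unicodedata

def toy_collation_key(text):
    """Hypothetical three-level key: base letters, then accents, then case."""
    decomposed = unicodedata.normalize("NFD", text)
    base = [ch for ch in decomposed if not unicodedata.combining(ch)]
    marks = [ch for ch in decomposed if unicodedata.combining(ch)]
    primary = "".join(base).casefold()              # level 1: letters only
    secondary = "".join(marks)                      # level 2: accents break ties
    tertiary = tuple(ch.isupper() for ch in base)   # level 3: lowercase first
    return (primary, secondary, tertiary)

words = ["zebra", "Zebra", "\u00e9clair", "apple", "Eclair", "eclair"]

# Plain sorted() compares code points: uppercase first, accented letters last.
by_code_point = sorted(words)
# ['Eclair', 'Zebra', 'apple', 'eclair', 'zebra', 'éclair']

# The toy key groups words by base letters before looking at accents or case.
by_toy_key = sorted(words, key=toy_collation_key)
# ['apple', 'eclair', 'Eclair', 'éclair', 'zebra', 'Zebra']
```

The key is a tuple, so later levels are consulted only when earlier ones tie, which is the property that code-point ordering lacks.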

2

ICU Library: International Components for Unicode

ICU (International Components for Unicode) is the reference library for Unicode and internationalization support, providing collation, date/number formatting, transliteration, and text segmentation in C/C++ and Java. This guide introduces the ICU library, its most important components, and how to integrate it into C++, Java, and Python projects.

3

The Future of Unicode: What Comes After 16.0?

Unicode 16.0 added thousands of new characters, but there are still hundreds of scripts, historic languages, and symbols waiting for encoding, and the Consortium continues to evolve its processes. This article explores which characters and scripts are still missing from Unicode, the pipeline of pending proposals, and the long-term future of the standard.

4

Unicode in Compilers and Programming Language Design

Modern programming language designers must decide how to handle Unicode in identifiers, string literals, source file encoding, and operator symbols — decisions with deep implications for readability and security. This guide explores how Python, Rust, Swift, and other languages approach Unicode in their language specifications and the trade-offs involved.
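
As a small illustration of the readability and security trade-offs, the following Python sketch shows two behaviors of Python's own identifier handling: visually similar names from different scripts remain distinct, while compatibility characters are folded together because Python normalizes identifiers to NFKC. The variable names are made up for the example, and other languages may behave differently.

```python
# Two names that can look identical on screen but differ in one code point.
latin = "scope"
mixed = "s\u0441ope"  # second letter is CYRILLIC SMALL LETTER ES (U+0441)

both_valid = latin.isidentifier() and mixed.isidentifier()  # True
same_name = latin == mixed                                  # False

# Python applies NFKC to identifiers while parsing, so a name written with
# the ligature U+FB01 (LATIN SMALL LIGATURE FI) is stored as plain "file".
namespace = {}
exec("\ufb01le = 1", namespace)
stored_as_file = "file" in namespace                        # True
```

The first case is why linters often flag mixed-script identifiers; the second shows that normalization at parse time removes one class of lookalikes but not all of them.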

5

Unicode Normalization Performance: Benchmarks

Unicode normalization must often be applied at scale in search engines, databases, and text processing pipelines, where the performance cost of NFC vs NFD vs NFKC can matter significantly. This guide presents benchmarks of Unicode normalization performance across Python, JavaScript, Java, and Rust, with practical guidance for choosing the right form for high-throughput text workloads.
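
A rough way to compare the forms on your own data is a small timing harness. The sketch below uses only Python's standard `timeit` and `unicodedata` modules; `time_normalization` and the sample text are assumptions made for illustration, and the numbers it produces depend on the input, the interpreter version, and the machine, so they should not be read as benchmark results.

```python
import timeit
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def time_normalization(text, repeats=50):
    """Return {form: seconds} for normalizing `text` `repeats` times."""
    return {
        form: timeit.timeit(
            lambda f=form: unicodedata.normalize(f, text), number=repeats
        )
        for form in FORMS
    }

# Sample mixing a combining accent, a compatibility ligature, and
# precomposed letters, repeated to give the timer something to measure.
sample = "Cafe\u0301 \ufb01nance \u00e9t\u00e9 " * 1000

timings = time_normalization(sample)
for form, seconds in timings.items():
    print(f"{form}: {seconds:.4f} s")
```

Measuring on representative text matters because input that is already in the target form can usually be processed much faster than text that needs rewriting.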