ICU Library: International Components for Unicode
ICU (International Components for Unicode) is the reference implementation library for Unicode and internationalization, providing collation, date/number formatting, transliteration, and text segmentation in C/C++ and Java. This guide introduces the ICU library, its most important components, and how to integrate it into C++, Java, and Python projects.
Much of the software that handles multilingual text correctly relies on ICU, the International Components for Unicode library. ICU provides production-quality implementations of the Unicode and internationalization standards for C, C++, and Java, and it is embedded in most major operating systems, browsers, and language runtimes.
History: From IBM to Open Source
ICU traces its origins to Taligent, a joint venture founded by Apple and IBM in the early 1990s. When Taligent wound down, its internationalization work continued inside IBM, which maintained a Java edition (ICU4J) and a C/C++ edition (ICU4C). IBM open-sourced the library in 1999 under the name "IBM Classes for Unicode", later renamed International Components for Unicode. Since 2016 the project has been hosted by the Unicode Consortium and released under the permissive Unicode License.
Today, ICU is developed as a collaborative open-source project at icu.unicode.org. The C/C++ and Java versions are maintained in sync, implementing the same algorithms and data (primarily sourced from CLDR) with parallel APIs.
Key Components
ICU is a large library covering many internationalization concerns. Its core components include:
BreakIterator
BreakIterator implements the Unicode Line Breaking Algorithm (UAX #14) and the grapheme cluster, word, and sentence boundary rules of Unicode Text Segmentation (UAX #29). It finds the boundaries between user-perceived characters, words, sentences, and line-break opportunities in a string, something naive index arithmetic gets wrong whenever text contains combining marks, surrogate pairs, emoji sequences, or scripts such as Thai that do not separate words with spaces.
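// C++ example: word boundaries
#include <memory>
#include "unicode/brkiter.h"

UErrorCode status = U_ZERO_ERROR;
// The iterator keeps a reference to the text, so the string must outlive it.
icu::UnicodeString text(u"Hello, world! This is ICU.");
std::unique_ptr<icu::BreakIterator> bi(
    icu::BreakIterator::createWordInstance(icu::Locale::getUS(), status));
if (U_SUCCESS(status)) {
    bi->setText(text);
    int32_t pos = bi->first();
    while (pos != icu::BreakIterator::DONE) {
        // pos is a word boundary (a UTF-16 offset into text)
        pos = bi->next();
    }
}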
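To make the "naive index arithmetic" problem concrete, here is a minimal Python sketch that uses only the standard library, not ICU. The helper `approx_grapheme_clusters` is a hypothetical name, and it handles only combining marks; the real UAX #29 rules that BreakIterator implements also cover Hangul syllables, emoji sequences, and more.

```python
import unicodedata


def approx_grapheme_clusters(text: str) -> list[str]:
    """Group each base code point with the combining marks that follow it.

    Simplified illustration only: this is not the UAX #29 algorithm.
    """
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # attach the mark to the preceding base
        else:
            clusters.append(ch)
    return clusters


word = "cafe\u0301"  # "café" spelled as e + U+0301 COMBINING ACUTE ACCENT
print(len(word))                            # 5 code points
print(len(approx_grapheme_clusters(word)))  # 4 user-perceived characters
print(word[:4] == "cafe")                   # True: naive slicing drops the accent
```

Slicing by code point index silently separates the accent from its base letter, which is the class of bug a boundary-aware iterator is meant to prevent.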
Collator
Collator implements the Unicode Collation Algorithm with CLDR locale tailorings. It generates locale-aware sort keys and provides string comparison functions suitable for sorting and searching.
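The idea of a sort key can be sketched in a few lines of Python. This is a toy, single-level key built from a hand-written Swedish alphabet; the names `SWEDISH_ALPHABET` and `toy_swedish_sort_key` are assumptions made for illustration. A real collator uses multi-level weights from the Unicode Collation Algorithm with CLDR tailorings rather than a simple alphabet lookup.

```python
import unicodedata

# Hand-written for illustration: in Swedish, å, ä, and ö follow z.
SWEDISH_ALPHABET = unicodedata.normalize("NFC", "abcdefghijklmnopqrstuvwxyzåäö")
RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def toy_swedish_sort_key(word: str) -> tuple[int, ...]:
    """Map a word to a tuple of alphabet positions (case-insensitive)."""
    normalized = unicodedata.normalize("NFC", word.lower())
    # Characters outside the toy alphabet sort after it, by code point.
    return tuple(RANK.get(ch, len(RANK) + ord(ch)) for ch in normalized)


words = ["zebra", "ärlig", "öl", "apple", "åka"]
print(sorted(words))                            # code point order: ä before å
print(sorted(words, key=toy_swedish_sort_key))  # toy Swedish order: å before ä
```

Precomputing a key per string and comparing keys is the same trade-off a collator's sort keys offer: more work up front, cheaper comparisons when sorting large lists.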
Normalizer2
Normalizer2 implements Unicode normalization (NFC, NFD, NFKC, NFKD) efficiently using the Quick_Check optimization to skip strings that are already in normal form. It also supports custom normalization mappings.
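The check-before-normalize pattern can be tried with Python's standard `unicodedata` module, which is separate from ICU but exposes the same four normalization forms. The wrapper `to_nfc` below is a hypothetical helper, shown only as a sketch of the idea.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in NFC, skipping the rewrite when it is already normalized."""
    if unicodedata.is_normalized("NFC", text):  # available in Python 3.8+
        return text
    return unicodedata.normalize("NFC", text)


decomposed = "cafe\u0301"  # e + U+0301 COMBINING ACUTE ACCENT
precomposed = "caf\u00e9"  # U+00E9 LATIN SMALL LETTER E WITH ACUTE

print(decomposed == precomposed)          # False: different code point sequences
print(to_nfc(decomposed) == precomposed)  # True: equal after normalization
print(unicodedata.normalize("NFKC", "\ufb01") == "fi")  # True: ligature is folded
```

Because most real-world text is already in NFC, the early return avoids allocating a new string in the common case.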
Transliterator
Transliterator converts text between scripts or performs rule-based text transformations. It includes built-in rules for Latin ↔ Cyrillic, Latin ↔ Arabic, Latin ↔ Devanagari, Hiragana ↔ Katakana, and many more. The rule system is powerful enough to express complex many-to-many script conversions.
// Java: transliterate Latin to Devanagari
import com.ibm.icu.text.Transliterator;

Transliterator t = Transliterator.getInstance("Latin-Devanagari");
String result = t.transliterate("namaste");
// → "नमस्ते"
NumberFormat and DecimalFormat
These format numbers according to locale conventions: decimal separators (period vs comma), grouping separators, currency symbols and placement, percent notation, and locale-specific numeral systems (Arabic-Indic digits, for example).
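As a rough illustration of what locale conventions change, the following Python sketch swaps separators and digits using a small hand-written table. The `SYMBOLS` table, its keys, and `toy_format_number` are hypothetical and cover only three cases with Western-style three-digit grouping; ICU reads the real symbols and patterns from CLDR.

```python
# Illustrative values only; a real formatter takes these from CLDR locale data.
SYMBOLS = {
    "en": {"group": ",", "decimal": ".", "digits": "0123456789"},
    "de": {"group": ".", "decimal": ",", "digits": "0123456789"},
    "ar-arab": {
        "group": "\u066c",    # ARABIC THOUSANDS SEPARATOR
        "decimal": "\u066b",  # ARABIC DECIMAL SEPARATOR
        "digits": "\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669",
    },
}


def toy_format_number(value: float, locale_id: str) -> str:
    """Format with two decimals, then substitute locale symbols and digits."""
    sym = SYMBOLS[locale_id]
    ascii_text = f"{value:,.2f}"  # e.g. "1,234,567.89"
    table = str.maketrans(
        "0123456789,.", sym["digits"] + sym["group"] + sym["decimal"]
    )
    return ascii_text.translate(table)


for locale_id in SYMBOLS:
    print(locale_id, toy_format_number(1234567.891, locale_id))
```

Using a single translation table replaces both separators in one pass, so the comma and period cannot overwrite each other when they swap roles.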
DateFormat and Calendar
ICU provides locale-aware date and time formatting covering the Gregorian, Islamic, Hebrew, Buddhist, Japanese imperial, and other calendar systems. It handles time zone rules from the IANA Time Zone Database, including historical zone transitions.
MessageFormat
MessageFormat supports locale-sensitive message interpolation including plural rules (ICU's CLDR-based plural rules handle the complex plural forms of languages like Russian, Arabic, and Polish) and select rules (for grammatical gender and other categories).
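To show why plural handling needs rules rather than a singular/plural switch, here is a small Python sketch of the integer plural categories for Russian as CLDR defines them. The function names and the `FORMS` table are hypothetical; MessageFormat selects the category from CLDR data rather than from hand-written code, and non-integer values (which fall into the "other" category) are left out.

```python
def russian_plural_category(n: int) -> str:
    """Return the plural category for a non-negative integer in Russian."""
    mod10, mod100 = n % 10, n % 100
    if mod10 == 1 and mod100 != 11:
        return "one"   # 1, 21, 31, ... but not 11
    if 2 <= mod10 <= 4 and not 12 <= mod100 <= 14:
        return "few"   # 2-4, 22-24, ... but not 12-14
    return "many"      # 0, 5-20, 25-30, ...


# Hypothetical message table: one word form per plural category.
FORMS = {"one": "файл", "few": "файла", "many": "файлов"}


def files_message(n: int) -> str:
    return f"{n} {FORMS[russian_plural_category(n)]}"


for n in (1, 2, 5, 11, 21, 22, 25, 111):
    print(files_message(n))
```

Keeping the category logic separate from the message table mirrors how plural-aware message formats work: translators supply one string per category, and the rules decide which one applies.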
How Major Platforms Use ICU Internally
ICU is so fundamental that it is embedded in most of the software you use daily:
Android: Android bundles ICU as part of the platform. Since Android 7.0 (Nougat, API level 24), a subset of the ICU4J APIs has been exposed publicly under the android.icu.* package, and the framework's own java.text and java.util locale classes are implemented on top of ICU.
macOS and iOS: Apple's CoreFoundation and Foundation frameworks use ICU internally for Unicode normalization, collation, break iteration, and more. The system ICU version is updated with OS releases.
Node.js and V8: The V8 JavaScript engine embeds ICU4C for Intl object support — Intl.Collator, Intl.DateTimeFormat, Intl.NumberFormat, and Intl.Segmenter all delegate to ICU underneath. Node.js ships with a full ICU data file by default since v13.
Python: The PyICU package provides Python bindings to ICU4C. CPython's own locale module is much more limited and depends on the C library's locale support, so PyICU is a common choice when an application needs full ICU functionality.
Chromium/Chrome: Chromium bundles ICU4C as a third-party dependency for text layout, collation, and Intl support.
Firefox: Similarly embeds ICU4C.
Basic Usage Examples
C++ normalization:
#include "unicode/normalizer2.h"
#include "unicode/unistr.h"

UErrorCode status = U_ZERO_ERROR;
// getNFCInstance returns a shared singleton; do not delete it.
const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
icu::UnicodeString input(u"cafe\u0301");  // "cafe" + U+0301 combining acute
if (U_SUCCESS(status)) {
    icu::UnicodeString normalized = nfc->normalize(input, status);
    // normalized == u"caf\u00E9" (precomposed é)
}
Java collation:
import com.ibm.icu.text.Collator;
import com.ibm.icu.util.ULocale;

Collator collator = Collator.getInstance(new ULocale("sv_SE"));
collator.setStrength(Collator.SECONDARY); // accent-sensitive, case-insensitive
int result = collator.compare("zebra", "\u00E4rlig"); // "ärlig"
// result < 0: "zebra" sorts before "ärlig" because ä follows z in the Swedish alphabet
ICU4X: The Rust Rewrite
ICU4C and ICU4J, while extremely capable, carry decades of accumulated API surface and face challenges in modern constrained environments (WebAssembly, embedded systems, mobile). The Unicode Consortium initiated ICU4X as a ground-up rewrite in Rust, designed with different priorities:
- Modular: data and code are separated so clients can ship only the locale data they need.
- WASM-friendly: compiles to small WebAssembly modules for use in browsers and edge functions.
- Zero-copy data loading: locale data can be embedded in the binary or loaded from a buffer without allocation-heavy parsing.
- Multilingual clients: FFI bindings allow use from JavaScript, Python, C, C++, and Dart.
ICU4X reached version 1.0 in 2022 and is production-ready for an expanding set of components. It is not yet a complete drop-in replacement for ICU4C/ICU4J, but it represents the future direction for Unicode internationalization infrastructure.
When to Use ICU
Use ICU (or a library that wraps it, like platform Intl APIs) whenever you need to:
- Sort user-visible lists of names or words in a locale-aware way
- Find word, sentence, or line boundaries in text
- Format numbers, dates, currencies, or messages for a locale
- Normalize Unicode strings for storage or comparison
- Transliterate between scripts
For most application code, the platform's built-in APIs (Intl in JavaScript, java.text on the JVM and Android) are sufficient; they are typically implemented on top of ICU or draw on the same CLDR data. When you need more control, such as custom tailoring rules, a specific ICU version, or features not exposed through the standard APIs, using ICU directly through ICU4C, ICU4J, or PyICU gives you the full capability of the library.