Unicode for the Modern Web · Chapter 7
Internationalization: i18n Best Practices
This chapter covers Accept-Language, locale-aware formatting, pluralization, text direction, and language tags: the pieces you need to build web applications that work in any language.
Internationalization (i18n) and localization (l10n) are related but distinct disciplines. Internationalization is the engineering process of designing your software to support any locale without code changes — extracting strings, abstracting date/number formatting, handling text direction. Localization is the translation and cultural adaptation process applied to an internationalized codebase for a specific locale. This chapter focuses on the engineering side: the code patterns, APIs, and edge cases that make multilingual web applications work correctly.
The i18n / l10n Distinction
Think of i18n as building a stage with adjustable lighting, sound, and sets. Localization is configuring that stage for a specific performance. You do i18n once per feature; you do l10n for each language you add.
The most common i18n mistakes:
- Hardcoding English strings in templates instead of using translation keys
- Using + to concatenate translated strings (word order varies by language)
- Assuming all text is left-to-right
- Calling Date.toLocaleDateString() without a locale argument (the result then follows the runtime's default locale, which on a server is rarely the user's)
- Formatting numbers with toString() instead of Intl.NumberFormat
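The concatenation problem is easiest to see with a sentence whose placeholders change order between languages. The following is a minimal sketch, assuming a hypothetical in-memory CATALOG and a hypothetical translate() helper (neither comes from a real library). Each locale gets one full-sentence template with named placeholders, so translators can reorder them freely.

```python
# Hypothetical catalog: one complete sentence per locale, with named placeholders.
CATALOG = {
    "en": {"NEW_MESSAGES": "{count} new messages from {sender}"},
    "ja": {"NEW_MESSAGES": "{sender}さんから{count}件の新しいメッセージ"},
}


def translate(locale: str, key: str, **params: object) -> str:
    """Look up the template for a locale and fill in its named placeholders."""
    return CATALOG[locale][key].format(**params)


# Concatenation bakes the English word order into the code:
#   str(count) + " new messages from " + sender
# A template keeps the word order in the translation, where it belongs.
print(translate("en", "NEW_MESSAGES", count=3, sender="Aiko"))
print(translate("ja", "NEW_MESSAGES", count=3, sender="Aiko"))
```

In the Japanese template the sender comes before the count, which no arrangement of English fragments joined with + could produce. Plural handling is left out of this sketch on purpose.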
ICU MessageFormat
ICU (International Components for Unicode) MessageFormat is the industry standard for translatable strings that contain variables, plurals, and genders. A simple translation key:
# English
GREETING = "Hello, {name}!"
# French
GREETING = "Bonjour, {name} !" # note: space before !
But real messages need plurals, and plural rules differ dramatically across languages:
- English: 1 item, 2 items (two forms: one, other)
- French: singular for 0 and 1, plural from 2 up (two main forms, but 0 takes the singular)
- Arabic: 0, 1, 2, numbers ending in 03–10, numbers ending in 11–99, and everything else such as 100–102 (six forms)
- Chinese/Japanese: one form, no plural marking
- Russian: 1, 21, 31 (one); 2–4, 22–24 (few); 0, 5–20, 25–30 (many); fractions such as 1.5 (other); four forms keyed on the last digits
ICU MessageFormat handles this:
# English ICU message
ITEM_COUNT = "{count, plural,
    =0 {No items}
    one {# item}
    other {# items}
}"
# Arabic ICU message
ITEM_COUNT = "{count, plural,
    zero {لا عناصر}
    one {عنصر واحد}
    two {عنصران}
    few {# عناصر}
    many {# عنصرًا}
    other {# عنصر}
}"
The plural categories (zero, one, two, few, many, other) are defined by the Unicode CLDR project for each locale. You do not hardcode the thresholds — the ICU library applies the correct rules for the current locale.
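To make the category selection concrete, here is a rough sketch of what a plural-rule lookup does for whole numbers. The function plural_category() is hypothetical, and its rules are hand-transcribed, integer-only simplifications of the CLDR rules for three locales. Production code should rely on an ICU or CLDR-backed library rather than a table like this.

```python
def plural_category(locale: str, n: int) -> str:
    """Return the plural category for a non-negative integer n.

    Illustrative only: integer rules for 'en', 'ru', and 'ar', transcribed by hand.
    """
    if locale == "en":
        return "one" if n == 1 else "other"

    if locale == "ru":
        last, last_two = n % 10, n % 100
        if last == 1 and last_two != 11:
            return "one"   # 1, 21, 31, ... but not 11
        if 2 <= last <= 4 and not 12 <= last_two <= 14:
            return "few"   # 2-4, 22-24, ... but not 12-14
        return "many"      # 0, 5-20, 25-30, ...; "other" is used for fractions

    if locale == "ar":
        if n in (0, 1, 2):
            return ("zero", "one", "two")[n]
        last_two = n % 100
        if 3 <= last_two <= 10:
            return "few"   # 3-10, 103-110, ...
        if 11 <= last_two <= 99:
            return "many"  # 11-99, 111-199, ...
        return "other"     # 100, 101, 102, 200, ...

    raise ValueError(f"no rules transcribed for {locale!r}")


for n in (1, 2, 5, 11, 21, 103):
    print(n, plural_category("ru", n), plural_category("ar", n))
```

The thresholds depend on the last one or two digits rather than on the magnitude of the number, which is why a check such as count == 1 cannot be generalized across locales.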
// Using react-intl (the React bindings from the FormatJS project)
import { useIntl } from 'react-intl';

function ItemCount({ count }) {
  const intl = useIntl();
  return <p>{intl.formatMessage(
    { id: 'ITEM_COUNT', defaultMessage: '{count, plural, one {# item} other {# items}}' },
    { count }
  )}</p>;
}
Date and Time Formatting
Intl.DateTimeFormat provides locale-aware date formatting without any library dependencies:
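const date = new Date('2024-03-15T14:30:00Z');

// Different locales, same instant. timeZone is pinned to UTC so the output
// does not depend on the time zone of the machine running the code.
const locales = ['en-US', 'de-DE', 'ja-JP', 'ar-EG', 'ko-KR'];
locales.forEach(locale => {
  const fmt = new Intl.DateTimeFormat(locale, {
    year: 'numeric', month: 'long', day: 'numeric',
    hour: '2-digit', minute: '2-digit',
    timeZone: 'UTC'
  });
  console.log(`${locale}: ${fmt.format(date)}`);
});
// Approximate output; exact strings vary with the runtime's CLDR version:
// en-US: March 15, 2024 at 02:30 PM
// de-DE: 15. März 2024 um 14:30
// ja-JP: 2024年3月15日 14:30
// ar-EG: ١٥ مارس ٢٠٢٤ في ٠٢:٣٠ م
// ko-KR: 2024년 3월 15일 오후 02:30
// Note: 'ar-SA' defaults to the Islamic (Umm al-Qura) calendar and would print a Hijri date.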
Time zones are critical for correctness. Always store timestamps in UTC; convert to local time only at display time:
const fmt = new Intl.DateTimeFormat('en-US', {
  timeZone: 'America/New_York',
  dateStyle: 'full',
  timeStyle: 'long'
});
fmt.format(new Date()); // Correctly displayed in Eastern Time
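Python equivalent using Babel (which ships CLDR data rather than wrapping ICU):
from datetime import datetime

import pytz
from babel.dates import format_datetime

dt = datetime(2024, 3, 15, 14, 30, tzinfo=pytz.UTC)
eastern = pytz.timezone('America/New_York')
format_datetime(dt, format='long', locale='de_DE', tzinfo=eastern)
# Roughly '15. März 2024 um 10:30:00' followed by a zone designator; the exact
# pattern depends on the Babel/CLDR version. Daylight time is in effect on this
# date, so the offset is UTC-4 (EDT), not EST.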
Number Formatting
Numbers vary by locale in decimal separator, grouping separator, digit characters, and currency symbol placement:
const value = 1234567.89;
new Intl.NumberFormat('en-US').format(value); // "1,234,567.89"
new Intl.NumberFormat('de-DE').format(value); // "1.234.567,89"
new Intl.NumberFormat('fr-FR').format(value); // "1 234 567,89" (narrow no-break space, U+202F)
new Intl.NumberFormat('ar-EG').format(value); // "١٬٢٣٤٬٥٦٧٫٨٩" (Arabic-Indic digits)
new Intl.NumberFormat('hi-IN').format(value); // "12,34,567.89" (Indian grouping)
// Currency
new Intl.NumberFormat('ja-JP', { style: 'currency', currency: 'JPY' }).format(1500);
// "￥1,500" (the ja locale uses the fullwidth yen sign, U+FFE5)
new Intl.NumberFormat('de-DE', { style: 'currency', currency: 'EUR' }).format(1500);
// "1.500,00 €"
Note the Indian number grouping: 12,34,567 keeps the last three digits together and groups everything above them in pairs. A regex that assumes ASCII digits and Western three-digit grouping will reject correctly formatted numbers from these locales.
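The grouping rule can be spelled out in a few lines. This is only a sketch for integers, with a hypothetical group_indian() helper and a deliberately naive WESTERN_GROUPED regex to show the validation failure. Real formatting should go through a locale-aware formatter.

```python
import re


def group_indian(n: int) -> str:
    """Group an integer the Indian way: last three digits, then pairs."""
    digits = str(abs(n))
    head, tail = digits[:-3], digits[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])  # peel two-digit groups off the right
        head = head[:-2]
    if head:
        groups.insert(0, head)       # leftover one or two leading digits
    groups.append(tail)
    return ("-" if n < 0 else "") + ",".join(groups)


# A typical "properly formatted number" check that assumes groups of three.
WESTERN_GROUPED = re.compile(r"^\d{1,3}(,\d{3})*$")

print(group_indian(1234567))                         # 12,34,567
print(bool(WESTERN_GROUPED.match("1,234,567")))      # True
print(bool(WESTERN_GROUPED.match("12,34,567")))      # False: valid in hi-IN, rejected here
```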
Text Direction: dir="auto" and <bdi>
User-generated content can be in any direction. Setting dir="auto" tells the browser to pick the element's base direction from the first strongly directional character in its content; the Unicode bidirectional algorithm then lays out the text accordingly:
<!-- User's name might be Arabic or English -->
<p dir="auto">{{ user.name }}</p>
<!-- Search result with unknown-direction title -->
<li dir="auto">{{ result.title }}</li>
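The first-strong heuristic behind dir="auto" can be approximated on the server, for example to pick a text alignment for a plain-text email. The sketch below uses Python's standard unicodedata module; first_strong_direction() is a hypothetical helper and a simplification of what browsers do, since it ignores markup and nested isolates.

```python
import unicodedata


def first_strong_direction(text: str, default: str = "ltr") -> str:
    """Return 'ltr' or 'rtl' based on the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":            # strong left-to-right (Latin, CJK, ...)
            return "ltr"
        if bidi_class in ("R", "AL"):    # strong right-to-left (Hebrew, Arabic)
            return "rtl"
    return default                        # only digits, spaces, or punctuation


print(first_strong_direction("Alice"))       # ltr
print(first_strong_direction("علي"))         # rtl
print(first_strong_direction("123 שלום"))    # rtl: digits and spaces are not strong
```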
When embedding bidirectional text in a larger sentence, use <bdi> (the Bidirectional Isolate element) to prevent the embedded text from affecting the surrounding layout:
<!-- Without bdi: "Posted by علي (500 points)" might render incorrectly -->
<p>Posted by <bdi>{{ username }}</bdi> ({{ points }} points)</p>
<!-- The username is isolated from the surrounding LTR text -->
The CSS equivalent is unicode-bidi: isolate. Prefer HTML <bdi> when the content will contain text from unknown languages.
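Outside HTML, for instance in a notification title or a plain-text email, there is no element to wrap the value in. Unicode provides isolate control characters that play a similar role: FIRST STRONG ISOLATE (U+2068) and POP DIRECTIONAL ISOLATE (U+2069). The following is a small sketch with hypothetical helper names; whether the receiving renderer honors these characters is an assumption you would need to verify.

```python
FSI = "\u2068"  # FIRST STRONG ISOLATE: direction is taken from the wrapped text
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE: closes the isolate


def isolate(value: str) -> str:
    """Wrap a value of unknown direction so it cannot reorder its surroundings."""
    return f"{FSI}{value}{PDI}"


def posted_by(username: str, points: int) -> str:
    """Plain-text counterpart of the <bdi> example above."""
    return f"Posted by {isolate(username)} ({points} points)"


print(repr(posted_by("علي", 500)))
```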
Language Tags: BCP 47
Language tags follow BCP 47, which combines ISO 639 language codes with optional script, region, and variant subtags, in that order:
| Tag | Meaning |
|---|---|
| en | English |
| en-US | English (United States) |
| zh-Hans | Chinese (Simplified) |
| zh-Hant-TW | Chinese (Traditional, Taiwan) |
| sr-Latn | Serbian (Latin script) |
| es-419 | Spanish (Latin America) |
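The subtags in the table can be told apart by shape alone: two or three letters for the language, four letters for the script, two letters or three digits for the region. The sketch below relies on that observation. parse_tag() and LanguageTag are hypothetical names, and the pattern deliberately ignores variants, extensions, and private-use subtags, so it is not a full BCP 47 parser.

```python
import re
from typing import NamedTuple, Optional


class LanguageTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


# Simplified pattern: language[-script][-region] only.
_TAG = re.compile(
    r"^(?P<language>[a-z]{2,3})"
    r"(?:-(?P<script>[a-z]{4}))?"
    r"(?:-(?P<region>[a-z]{2}|[0-9]{3}))?$",
    re.IGNORECASE,
)


def parse_tag(tag: str) -> LanguageTag:
    """Split a simple language tag and apply the conventional casing."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"unsupported or malformed tag: {tag!r}")
    script = match.group("script")
    region = match.group("region")
    return LanguageTag(
        language=match.group("language").lower(),
        script=script.title() if script else None,
        region=region.upper() if region else None,
    )


for tag in ("en", "zh-Hant-TW", "sr-latn", "es-419"):
    print(tag, parse_tag(tag))
```

Tags are case-insensitive, so the sketch normalizes to the conventional casing (lowercase language, title-case script, uppercase region) before any comparison.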
In HTML, declare the language on the root element for accessibility and search engine understanding:
<html lang="zh-Hans">
In HTTP, the Accept-Language header expresses the user's language preferences with quality values:
Accept-Language: ko-KR,ko;q=0.9,en-US;q=0.8,en;q=0.7
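Do not simply split the header on commas and take the first entry. Parse the quality values and match them against the languages you actually support:
from django.http import HttpResponse
from django.utils.translation import get_language_from_request

def my_view(request):
    # Django matches the header against the LANGUAGES setting;
    # this yields 'ko' for the header above if 'ko' is listed there.
    lang = get_language_from_request(request)
    return HttpResponse(f"Selected language: {lang}")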
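For readers not using Django, the matching logic can be sketched with the standard library alone. The functions parse_accept_language() and best_language() below are hypothetical helpers, not part of any framework. The fallback from a regional tag to its base language is a simplification of what full matching algorithms do, and the wildcard entry * is not treated specially.

```python
from typing import List, Sequence, Tuple


def parse_accept_language(header: str) -> List[Tuple[str, float]]:
    """Return (tag, q) pairs sorted by descending quality value."""
    prefs = []
    for item in header.split(","):
        parts = item.strip().split(";")
        tag = parts[0].strip().lower()
        if not tag:
            continue
        q = 1.0  # a missing q parameter means q=1
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q as "not acceptable"
        prefs.append((tag, q))
    # sorted() is stable, so entries with equal q keep their header order.
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def best_language(header: str, supported: Sequence[str], default: str = "en") -> str:
    """Pick the first acceptable tag that is supported, trying the base language too."""
    by_lower = {tag.lower(): tag for tag in supported}
    for tag, q in parse_accept_language(header):
        if q <= 0:
            continue
        if tag in by_lower:
            return by_lower[tag]
        base = tag.split("-")[0]
        if base in by_lower:
            return by_lower[base]
    return default


header = "ko-KR,ko;q=0.9,en-US;q=0.8,en;q=0.7"
print(best_language(header, ["en", "ko"]))  # ko
print(best_language(header, ["en", "fr"]))  # en
```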
Message Catalogs: gettext and ICU
The two dominant message catalog formats:
gettext (.po/.mo files) — used by Django, WordPress, GNU software:
# locale/ko/LC_MESSAGES/django.po
msgid "Hello, %(name)s!"
msgstr "안녕하세요, %(name)s!"
msgid "You have %(count)d message."
msgid_plural "You have %(count)d messages."
msgstr[0] "메시지가 %(count)d개 있습니다."
# Python code (for example, in a Django view)
from django.utils.translation import ngettext, gettext as _

greeting = _("Hello, %(name)s!") % {'name': user.name}
msg = ngettext(
    "You have %(count)d message.",
    "You have %(count)d messages.",
    count
) % {'count': count}
ICU MessageFormat (.json or .properties files) — used by FormatJS, Angular, Java:
{
  "GREETING": "Hello, {name}!",
  "MESSAGE_COUNT": "{count, plural, one {You have # message.} other {You have # messages.}}"
}
ICU is more expressive (it handles gender, select, and complex plural rules natively) but requires an ICU-capable library. gettext is simpler and has wider tooling support.
Testing with Pseudolocalization
Pseudolocalization replaces every translatable string with a transformed copy: ASCII letters are swapped for accented look-alikes, and the string is padded to simulate longer translations. It reveals i18n bugs without a human translator:
"Hello, World!" → "[Ĥéļļö, Ŵöŕļď!]"
The brackets catch truncated strings. The accented Latin characters catch encoding issues. Expanded strings (typically 30–40% longer than English) reveal layout problems — buttons that clip text, modals that overflow.
# Simple pseudolocalization function
def pseudolocalize(text: str) -> str:
    CHAR_MAP = str.maketrans(
        'abcdefghijklmnopqrstuvwxyz'
        'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
        'àbçdéfĝĥïĵķļmñöþqŕšţüvŵxÿž'
        'ÀBÇDÉFĜĤÏĴĶĻMÑÖÞQŔŠŢÜVŴXŸŽ'
    )
    translated = text.translate(CHAR_MAP)
    # Expand by ~30% to simulate verbose languages (German, Finnish)
    expansion = "~" * (len(text) // 3)
    return f"[{translated}{expansion}]"

pseudolocalize("Submit")  # "[Šübmïţ~~]"
Running your CI suite with pseudolocalized strings as the active language catches broken string concatenation, missing i18n wrappers, and layout bugs before any real translator sees the product.
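One practical wrinkle: a character-by-character transform also rewrites placeholder names such as {name} or %(count)d, which breaks interpolation at runtime. A possible refinement, sketched below with a hypothetical pseudolocalize_safe() function, leaves simple placeholders untouched. It does not understand nested ICU plural or select syntax, so treat it as a starting point only.

```python
import re

_CHAR_MAP = str.maketrans(
    'abcdefghijklmnopqrstuvwxyz'
    'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
    'àbçdéfĝĥïĵķļmñöþqŕšţüvŵxÿž'
    'ÀBÇDÉFĜĤÏĴĶĻMÑÖÞQŔŠŢÜVŴXŸŽ'
)

# Matches simple placeholders only: {name} and %(name)s / %(count)d.
_PLACEHOLDER = re.compile(r"(\{[^{}]*\}|%\([^)]+\)[a-z])")


def pseudolocalize_safe(text: str) -> str:
    """Pseudolocalize visible text while leaving simple placeholders intact."""
    # With one capturing group, re.split() puts the placeholders at odd indices.
    parts = _PLACEHOLDER.split(text)
    transformed = [
        part if index % 2 else part.translate(_CHAR_MAP)
        for index, part in enumerate(parts)
    ]
    # Pad by about a third of the visible (non-placeholder) length.
    visible = sum(len(part) for index, part in enumerate(parts) if index % 2 == 0)
    return f"[{''.join(transformed)}{'~' * (visible // 3)}]"


print(pseudolocalize_safe("Hello, {name}!"))
print(pseudolocalize_safe("You have %(count)d messages."))
```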
Putting It Together: i18n Checklist
Before launching a multilingual feature, verify:
[ ] All UI strings wrapped in translation function (no hardcoded English)
[ ] Pluralization uses ngettext/ICU plural, not if(count === 1)
[ ] Dates formatted with Intl.DateTimeFormat or locale-aware library
[ ] Numbers formatted with Intl.NumberFormat
[ ] User-generated content wrapped in <bdi> or has dir="auto"
[ ] HTML lang attribute set correctly
[ ] Accept-Language header parsed for language selection
[ ] Database columns are UTF-8 with appropriate collation
[ ] Tested with pseudolocalization in CI
[ ] RTL layout tested with an RTL locale (ar, he, fa, ur)
Internationalization is not a feature to be added later — retrofitting it into a large codebase is expensive. Build it in from the start, and every language you add becomes a straightforward content and translation task rather than an engineering project.