Writing Systems of the World · Chapter 2

The Arabic Script: Right-to-Left and Beyond

The Arabic script is one of the world's most widely used writing systems, with complex contextual shaping and right-to-left rendering. This chapter explores its beauty, its technical challenges, and its Unicode implementation.

~4,500 words · ~18 min read

Open a webpage in Arabic and something immediately strikes the eye: text flows right to left, letters connect fluidly within words, and the same letter changes shape depending on where it sits in a word. The Arabic script is among the most widely used writing systems in the world, serving not just Arabic but Persian (Farsi), Urdu, Pashto, Kurdish, and dozens of other languages across the Middle East, Central Asia, and South Asia. Understanding how Arabic works — and why it presents unique challenges for software engineers — requires grasping both its calligraphic beauty and its computational complexity.

The Script's Origins

Arabic script descends from the Aramaic branch of the Semitic writing tradition, via the Nabataean script of ancient Jordan. The earliest distinctively Arabic inscriptions date from around the 4th century CE. By the 7th century, when the Quran was revealed to the Prophet Muhammad, Arabic had a writing system — but one without vowel marks, used primarily by those already fluent in the language.

As Islam spread beyond the Arabian Peninsula, it carried the Arabic script with it. Non-Arab converts needed to read the Quran accurately, which required a way to mark vowels. This need produced harakat — the diacritical vowel marks added above and below letters — and a rich tradition of Quranic calligraphy that persists today.

Contextual Shaping: The Core Challenge

The defining feature of Arabic script for software rendering is contextual shaping: each letter has up to four different visual forms depending on its position within a word.

Form      Context                                  Example for ع (Ain)
Isolated  Joined to neither neighbor               ع
Initial   Joined only to the following letter      عـ
Medial    Joined to both neighbors                 ـعـ
Final     Joined only to the preceding letter      ـع

In Unicode, the code points represent abstract letters, not specific visual forms. The rendering engine — the text shaping library — must determine the correct visual form based on context. This is fundamentally different from Latin, where letter shapes are fixed regardless of position.

The Unicode Arabic block (U+0600–U+06FF) contains the primary Arabic characters, supplemented by Arabic Supplement (U+0750–U+077F), Arabic Extended-A (U+08A0–U+08FF), Arabic Extended-B (U+0870–U+089F), and Arabic Presentation Forms blocks. The Presentation Forms blocks (U+FB50–U+FDFF and U+FE70–U+FEFF) contain precomposed contextual forms included for compatibility with older encodings — but modern practice uses the base characters and relies on shaping engines.
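The relationship between abstract letters and the compatibility presentation forms can be inspected with Python's standard unicodedata module. The sketch below is illustrative only: the helper name base_letter is a hypothetical choice, and it assumes NFKC normalization is an acceptable way to fold legacy presentation forms back to base letters in your data.

```python
import unicodedata

AIN = "\u0639"  # ARABIC LETTER AIN, the abstract letter stored in text

# Compatibility presentation forms of Ain (Arabic Presentation Forms-B).
AIN_FORMS = {
    "isolated": "\uFEC9",
    "final": "\uFECA",
    "initial": "\uFECB",
    "medial": "\uFECC",
}

def base_letter(ch: str) -> str:
    """Fold a presentation form back to its base letter using NFKC."""
    return unicodedata.normalize("NFKC", ch)

for form, ch in AIN_FORMS.items():
    # Each form carries a compatibility decomposition tag such as "<initial> 0639".
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
    print(f"  decomposition: {unicodedata.decomposition(ch)}")
    print(f"  folds to AIN:  {base_letter(ch) == AIN}")
```

Normalizing input this way before storage or search means the shaping engine, not the stored text, decides which visual form appears.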

Connecting vs. Non-Connecting Letters

Not all Arabic letters connect to the following letter. Six letters (و, ز, ر, ذ, د, ا and their variants) only connect to the preceding letter — they have only isolated and final forms, no initial or medial. This means these letters always break the connection chain, forcing the following letter into initial or isolated form.

For example, the word باب (bāb, "door") contains the non-connecting ا:

  • ب (Ba) appears in initial form
  • ا (Alef) appears in final form (after Ba)
  • ب (Ba) appears in isolated form (because Alef doesn't connect forward)

Getting this logic right in rendering is non-trivial, especially for complex words with multiple non-connecting letters.
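The joining rule described above can be sketched in a few lines of Python. This is a deliberately simplified model, not how a production shaping engine works: the letter sets are hard-coded for illustration, the function name joining_forms is hypothetical, and the input is assumed to contain only bare Arabic letters (no harakat, spaces, or other characters).

```python
# Letters that join only to the preceding letter: the six named above plus
# common variants. Hard-coded for illustration; a real engine reads the
# Joining_Type property from the Unicode data files instead.
RIGHT_JOINING = set(
    "\u0627\u0622\u0623\u0625"  # alef and its madda/hamza variants
    "\u062F\u0630"              # dal, thal
    "\u0631\u0632"              # reh, zain
    "\u0648\u0624"              # waw, waw with hamza
    "\u0629"                    # teh marbuta
)
NON_JOINING = {"\u0621"}  # standalone hamza joins on neither side

def joins_forward(ch: str) -> bool:
    """True if the letter can connect to the letter after it."""
    return ch not in RIGHT_JOINING and ch not in NON_JOINING

def joins_backward(ch: str) -> bool:
    """True if the letter can connect to the letter before it."""
    return ch not in NON_JOINING

def joining_forms(word: str) -> list[str]:
    """Return the contextual form name for each letter, in logical order."""
    forms = []
    for i, ch in enumerate(word):
        prev = word[i - 1] if i > 0 else None
        nxt = word[i + 1] if i + 1 < len(word) else None
        before = prev is not None and joins_forward(prev) and joins_backward(ch)
        after = nxt is not None and joins_forward(ch) and joins_backward(nxt)
        if before and after:
            forms.append("medial")
        elif before:
            forms.append("final")
        elif after:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms

# bab ("door"): beh, alef, beh
print(joining_forms("\u0628\u0627\u0628"))  # ['initial', 'final', 'isolated']
```

A word made entirely of non-connecting letters, such as دار (dār, "house"), comes out as three isolated forms under the same rule.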

Harakat: Vowel Diacritics

Classical Arabic, Quranic text, children's books, and poetry use harakat — small marks that indicate short vowels and certain consonantal features. These are combining characters applied above or below base letters:

Diacritic  Code point  Unicode name     Function
َ          U+064E      ARABIC FATHA     /a/ vowel
ِ          U+0650      ARABIC KASRA     /i/ vowel
ُ          U+064F      ARABIC DAMMA     /u/ vowel
ً          U+064B      ARABIC FATHATAN  /an/ (nunation)
ٍ          U+064D      ARABIC KASRATAN  /in/ (nunation)
ٌ          U+064C      ARABIC DAMMATAN  /un/ (nunation)
ْ          U+0652      ARABIC SUKUN     No vowel follows the consonant
ّ          U+0651      ARABIC SHADDA    Gemination (doubled consonant)

In everyday modern Arabic text, harakat are usually omitted. Readers are expected to infer the correct vowels from context, which is a significant challenge for language learners and a reason that Arabic text-to-speech and NLP systems must handle vocalization explicitly.
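Because vocalized and unvocalized spellings of the same word are different code point sequences, search and comparison code often strips harakat first. The following is a minimal sketch, assuming that only the eight marks in the range U+064B–U+0652 should be removed; the function name strip_harakat is illustrative, and real text may contain other marks (Quranic annotation signs, for instance) that this does not touch.

```python
import unicodedata

# The eight harakat from the table above occupy U+064B..U+0652.
HARAKAT = {chr(cp) for cp in range(0x064B, 0x0653)}

def strip_harakat(text: str) -> str:
    """Remove short-vowel and related marks, leaving the base letters."""
    return "".join(ch for ch in text if ch not in HARAKAT)

# kataba ("he wrote"): kaf, teh, beh, each followed by a fatha
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
bare = strip_harakat(vocalized)

print(len(vocalized), len(bare))  # 6 3
for ch in sorted(HARAKAT):
    # Every harakah is a combining mark with a non-zero combining class.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} ccc={unicodedata.combining(ch)}")
```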

The Bidirectional Algorithm

Arabic and Hebrew text in Unicode follows the Unicode Bidirectional Algorithm (UAX #9), which governs how text of mixed directionality is rendered. When an Arabic paragraph contains a URL, a product name in Latin, or a quoted English phrase, the rendering engine must correctly determine the visual order of characters.

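The algorithm assigns each character a bidi class. The classes most relevant to Arabic text are:

  • AL (Arabic Letter): strong right-to-left; letters of the Arabic script
  • R (Right-to-Left): strong right-to-left; Hebrew and other right-to-left letters
  • L (Left-to-Right): strong left-to-right; Latin and other left-to-right letters
  • EN (European Number): Western Arabic digits (0–9); their resolved direction depends on the surrounding text
  • AN (Arabic Number): Arabic-Indic digits (U+0660–U+0669)
  • NSM (Non-Spacing Mark): combining marks such as harakat; inherits from the base character
  • B, S, WS: paragraph separators, segment separators, whitespace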
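The class assigned to any character can be looked up with unicodedata.bidirectional from Python's standard library. The sketch below simply prints the class for a handful of sample characters; the sample list and the helper name bidi_classes are illustrative choices.

```python
import unicodedata

SAMPLES = [
    ("\u0639", "Arabic letter ain"),
    ("\u05D0", "Hebrew letter alef"),
    ("a", "Latin small letter a"),
    ("5", "European digit five"),
    ("\u0665", "Arabic-Indic digit five"),
    ("\u064E", "fatha (combining mark)"),
    (" ", "space"),
    ("\t", "tab"),
    ("\n", "line feed"),
]

def bidi_classes(text: str) -> list[str]:
    """Return the bidi class of each character in logical order."""
    return [unicodedata.bidirectional(ch) for ch in text]

for ch, label in SAMPLES:
    print(f"U+{ord(ch):04X} {label}: {unicodedata.bidirectional(ch)}")
```

These per-character classes are only the input to the algorithm; resolving embedding levels and reordering runs is a separate and considerably longer process.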

Special control characters manage explicit embedding:

  • U+202B RIGHT-TO-LEFT EMBEDDING (RLE)
  • U+202A LEFT-TO-RIGHT EMBEDDING (LRE)
  • U+202C POP DIRECTIONAL FORMATTING (PDF)
  • U+2067 RIGHT-TO-LEFT ISOLATE (RLI), the preferred modern approach
  • U+2066 LEFT-TO-RIGHT ISOLATE (LRI)
  • U+2069 POP DIRECTIONAL ISOLATE (PDI)

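The "isolate" controls (added in Unicode 6.3, 2013) are preferred because the surrounding text treats an isolated run as a single neutral unit: its contents cannot change the ordering of what comes before or after it, which the older embedding controls could. They do not by themselves eliminate "bidi spoofing", a security vulnerability where malicious actors use bidi controls of either kind to make filenames or code appear different from their actual character sequence. Software that displays untrusted text should still detect or escape such controls.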
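Two common tasks follow from these controls: wrapping an inserted string so it cannot disturb its surroundings, and flagging bidi controls in untrusted input. The sketch below shows one way to do both. The function names isolate and find_bidi_controls are hypothetical, and the control set also includes the override characters U+202D/U+202E and FIRST STRONG ISOLATE U+2068, which are not listed above.

```python
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

# Embeddings, overrides, and isolates, plus their terminators.
BIDI_CONTROLS = set("\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069")

def isolate(text: str, direction: str = "rtl") -> str:
    """Wrap text in an isolate so it is ordered independently of its context."""
    opener = RLI if direction == "rtl" else LRI
    return f"{opener}{text}{PDI}"

def find_bidi_controls(text: str) -> list[tuple[int, str]]:
    """Return (index, code point) for every bidi control character found."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]

wrapped = isolate("\u0628\u0627\u0628")           # bab, isolated as RTL
print(find_bidi_controls(wrapped))                # [(0, 'U+2067'), (4, 'U+2069')]
print(find_bidi_controls("report.txt"))           # []
```

A filename or source file that reports any hits from find_bidi_controls is worth displaying with the controls made visible rather than rendered.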

Arabic for Non-Arabic Languages

Arabic script is used for over two dozen languages beyond Arabic itself, each requiring additional characters:

Persian (Farsi) adds four letters not in Arabic:

  • U+067E پ PEH (p sound)
  • U+0686 چ TCHEH (ch sound)
  • U+0698 ژ JEH (zh sound)
  • U+06AF گ GAF (g sound)

Urdu adds further characters for South Asian sounds:

  • U+06BA ں NOON GHUNNA (nasalization)
  • U+06BE ھ HEH DOACHASHMEE (aspirated h)
  • U+0679 ٹ TTEH (retroflex t)

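N'Ko, a script created in 1949 for the Manding languages of West Africa, is sometimes mentioned alongside Arabic because it is also written right to left with joined letters. It is an independent script rather than an extension of Arabic, and it has its own dedicated Unicode block (U+07C0–U+07FF).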

These extensions mean that a "universal Arabic keyboard" is a chimera — Urdu, Pashto, and Persian typists all need different key layouts even though they write in what appears to be the same script.
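The Persian and Urdu additions listed above are ordinary code points in the main Arabic block, which can be confirmed with unicodedata. The sketch below is illustrative; the dictionary EXTENSIONS and the helper names describe and in_basic_arabic_block are hypothetical.

```python
import unicodedata

# Extension letters named in the lists above, grouped by language.
EXTENSIONS = {
    "Persian": "\u067E\u0686\u0698\u06AF",
    "Urdu": "\u06BA\u06BE\u0679",
}

def describe(ch: str) -> str:
    """Format a character as 'U+XXXX NAME' using its formal Unicode name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

def in_basic_arabic_block(ch: str) -> bool:
    """True if the character lies in the Arabic block U+0600..U+06FF."""
    return 0x0600 <= ord(ch) <= 0x06FF

for language, letters in EXTENSIONS.items():
    print(language)
    for ch in letters:
        print(f"  {describe(ch)} (basic block: {in_basic_arabic_block(ch)})")
```

Since the letters share one block and one shaping model, the differences between languages show up in keyboard layouts and fonts rather than in the encoding itself.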

Rendering Engines and Text Shaping

Correct Arabic rendering requires a sophisticated text shaping engine. The major options:

  • HarfBuzz: The open-source shaping engine used by Android, Linux desktops (typically via Pango), Firefox, Chrome, and most modern software. It performs Arabic joining analysis and applies the font's OpenType GSUB/GPOS tables.
  • Uniscribe: Microsoft's older Windows text shaping engine, which handles Arabic for Windows applications; newer applications typically use DirectWrite.
  • Core Text: Apple's text rendering framework, used on macOS and iOS.

Text shaping operates in three major stages:

  1. Reordering: Characters are reordered from logical (storage) order to visual (display) order according to the bidi algorithm.
  2. Shaping: Contextual letter forms are selected based on surrounding characters.
  3. Positioning: Diacritics and combining marks are correctly placed above/below base letters, potentially stacking multiple marks.

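Web developers working with Arabic should mark up their HTML with dir="rtl" on the html element (or on the specific elements that contain right-to-left text). The CSS property direction: rtl has the same visual effect, but the HTML attribute is generally preferred because direction is part of the content and should survive when styles are not applied. The lang="ar" (or lang="fa", lang="ur") attribute is equally important: it allows browsers to select language-appropriate fonts, OpenType features, and line-breaking behavior.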
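When the direction of a string is not known in advance (user-generated content, for example), a common heuristic is to take the direction of its first strong character, which is also the idea behind dir="auto" in HTML. The following Python sketch is a simplified version of that heuristic: the function name base_direction is hypothetical, and unlike the full bidi algorithm it does not skip over text inside isolates.

```python
import unicodedata

def base_direction(text: str, default: str = "ltr") -> str:
    """Guess 'rtl' or 'ltr' from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    # No strong character found (digits, punctuation, empty string).
    return default

print(base_direction("\u0628\u0627\u0628 door"))  # rtl
print(base_direction("door \u0628\u0627\u0628"))  # ltr
print(base_direction("123"))                       # ltr
```

The result could be written into a dir attribute when rendering a template; leading digits and punctuation are skipped because they are not strong characters.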

The Beauty and the Bug

Arabic calligraphy is among the world's great art forms, with named styles like Naskh, Thuluth, Diwani, Kufic, and Nastaliq. Nastaliq — the traditional style for Urdu and Persian — is particularly challenging to render digitally because of its dramatically sloping baseline and complex ligature system. Early Unicode-era Urdu computing often fell back to Naskh, considered less elegant by Urdu speakers, simply because Nastaliq shaping was too complex for available engines.

Today, specialized OpenType fonts such as Noto Nastaliq Urdu implement Nastaliq rendering rules. That a historical calligraphic style can run on 21st-century rendering engines across billions of mobile phones speaks to the extraordinary depth of Unicode's Arabic infrastructure.