
Arabic Script Deep Dive

Arabic is the third most widely used writing system in the world, written right-to-left with letters that change shape depending on their position in a word. Unicode encodes only the abstract letters and leaves this contextual shaping to the rendering engine. This guide explores the Arabic script in Unicode, covering the Arabic block, presentation forms, bidirectional handling, and software support.


Arabic is one of the great writing systems of human civilization. Used by more than 400 million native speakers and over 1.8 billion Muslims who read the Quran, the Arabic script stretches across dozens of languages — from Arabic and Persian to Urdu, Pashto, Kurdish, Sindhi, and Malay (in the Jawi tradition). Its right-to-left directionality, contextual letter shaping, and rich typographic tradition make it one of the most complex scripts that Unicode must represent. This guide explores how Unicode encodes the Arabic script, how contextual shaping works, and what developers need to know to handle Arabic text correctly.

History and Reach

Arabic script descends from the Nabataean alphabet, itself derived from Aramaic. By the 7th century, the rapid spread of Islam carried the Arabic script across North Africa, the Middle East, Central Asia, and Southeast Asia. At its peak, Arabic script was used to write languages as distant as Ottoman Turkish, Swahili, and Bosnian.

Today, Arabic is the third most widely used script in the world (after Latin and Chinese). Languages that use Arabic script include:

Language | Speakers | Region
Arabic | 400M+ | Middle East, North Africa
Persian (Farsi) | 110M+ | Iran, Afghanistan, Tajikistan
Urdu | 230M+ | Pakistan, India
Pashto | 60M+ | Afghanistan, Pakistan
Kurdish (Sorani) | 8M+ | Iraq, Iran
Sindhi | 30M+ | Pakistan, India
Uyghur | 12M+ | China (Xinjiang)
Malay (Jawi) | 15M+ | Malaysia, Brunei

Each language adds its own letters to the base Arabic alphabet. Persian adds four letters (پ, چ, ژ, گ), Urdu adds more, and Sindhi has one of the largest Arabic-based alphabets with 52 letters.

The Arabic Alphabet

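The base Arabic alphabet has 28 letters, all consonants (three of them, Alef, Waw, and Yeh, also stand for long vowels). Short vowels are optionally indicated by diacritics called harakat (حركات). The three short-vowel marks, together with two related marks (sukun and shaddah), are: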

Mark | Name | Unicode | Sound
َ | Fathah | U+064E | /a/
ِ | Kasrah | U+0650 | /i/
ُ | Dammah | U+064F | /u/
ْ | Sukun | U+0652 | (no vowel)
ّ | Shaddah | U+0651 | (gemination)

In everyday writing — newspapers, books, street signs — these diacritics are omitted. Native readers infer the vowels from context. They appear fully in the Quran, children's textbooks, and dictionaries.
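
Because the same word can appear with or without harakat, search and comparison code often needs to ignore them. The following is a minimal sketch, assuming that the marks to drop are exactly U+064B through U+0652 (the tanwin marks plus the five marks in the table); the helper name `strip_harakat` is hypothetical, and a real application may also want to handle marks outside this range, such as the superscript Alef (U+0670).

```python
# Assumed range: U+064B (fathatan) through U+0652 (sukun), inclusive.
HARAKAT = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_harakat(text: str) -> str:
    """Drop short-vowel and related marks so vocalized and bare spellings match."""
    return "".join(ch for ch in text if ch not in HARAKAT)


vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"  # k-t-b with a fathah on each letter
bare = "\u0643\u062A\u0628"                          # the same word without marks
print(strip_harakat(vocalized) == bare)  # True
```

Stripping is best applied to a search key only; the stored text should keep whatever marks the author wrote.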

Unicode Blocks for Arabic

Unicode allocates several blocks for the Arabic script:

Block | Range | Size (code points) | Purpose
Arabic | U+0600–U+06FF | 256 | Core letters, harakat, digits
Arabic Supplement | U+0750–U+077F | 48 | Additional letters, mainly for African languages
Arabic Extended-A | U+08A0–U+08FF | 96 | Quranic annotation signs, additional letters for African and Asian languages
Arabic Extended-B | U+0870–U+089F | 48 | Additional characters
Arabic Extended-C | U+10EC0–U+10EFF | 64 | Additional characters
Arabic Presentation Forms-A | U+FB50–U+FDFF | 688 | Ligatures and contextual forms
Arabic Presentation Forms-B | U+FE70–U+FEFF | 144 | Positional forms

The core Arabic block (U+0600–U+06FF) is where almost all practical text encoding happens. The Presentation Forms blocks are a legacy from older encoding systems and should generally not be used in new text — more on that below.
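
When auditing data, it can help to see which of these blocks a character falls in. Below is a small sketch that hard-codes the ranges from the table above; the names `ARABIC_BLOCKS` and `arabic_block` are hypothetical, and the list would need updating if future Unicode versions add blocks.

```python
from typing import Optional

# (first code point, last code point, block name), copied from the table above.
ARABIC_BLOCKS = [
    (0x0600, 0x06FF, "Arabic"),
    (0x0750, 0x077F, "Arabic Supplement"),
    (0x0870, 0x089F, "Arabic Extended-B"),
    (0x08A0, 0x08FF, "Arabic Extended-A"),
    (0xFB50, 0xFDFF, "Arabic Presentation Forms-A"),
    (0xFE70, 0xFEFF, "Arabic Presentation Forms-B"),
    (0x10EC0, 0x10EFF, "Arabic Extended-C"),
]


def arabic_block(ch: str) -> Optional[str]:
    """Return the name of the Arabic block containing ch, or None if there is none."""
    cp = ord(ch)
    for first, last, name in ARABIC_BLOCKS:
        if first <= cp <= last:
            return name
    return None


print(arabic_block("\u0628"))  # Arabic
print(arabic_block("\uFE91"))  # Arabic Presentation Forms-B
print(arabic_block("A"))       # None
```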

Key Code Points

U+0627  ا  ARABIC LETTER ALEF
U+0628  ب  ARABIC LETTER BEH
U+062A  ت  ARABIC LETTER TEH
U+062B  ث  ARABIC LETTER THEH
U+062C  ج  ARABIC LETTER JEEM
U+062D  ح  ARABIC LETTER HAH
U+062E  خ  ARABIC LETTER KHAH
U+062F  د  ARABIC LETTER DAL
U+0631  ر  ARABIC LETTER REH
U+0633  س  ARABIC LETTER SEEN
U+0634  ش  ARABIC LETTER SHEEN
U+0635  ص  ARABIC LETTER SAD
U+0639  ع  ARABIC LETTER AIN
U+063A  غ  ARABIC LETTER GHAIN
U+0641  ف  ARABIC LETTER FEH
U+0642  ق  ARABIC LETTER QAF
U+0643  ك  ARABIC LETTER KAF
U+0644  ل  ARABIC LETTER LAM
U+0645  م  ARABIC LETTER MEEM
U+0646  ن  ARABIC LETTER NOON
U+0647  ه  ARABIC LETTER HEH
U+0648  و  ARABIC LETTER WAW
U+064A  ي  ARABIC LETTER YEH

Contextual Shaping: The Heart of Arabic

The defining feature of Arabic script is that letters change shape based on their position in a word. Each letter has up to four forms:

Position | Description | Example (Beh ب)
Isolated | Not connected to any letter | ب
Initial | Connected to the following letter only | بـ
Medial | Connected on both sides | ـبـ
Final | Connected to the preceding letter only | ـب

Some letters (like Dal د, Reh ر, Waw و, Alef ا) are non-joining on the left — they connect to the preceding letter but never to the following one. This means they only have isolated and final forms, and they break the cursive chain.

In Unicode, you store only the abstract letter (e.g., U+0628 for Beh). The rendering engine (via OpenType shaping) picks the correct glyph form based on context. This is fundamentally different from the legacy Presentation Forms approach, where each positional form had its own code point.
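
To make the selection rule concrete, here is a simplified sketch of how a shaper could decide which form each letter takes. It is an illustration under stated assumptions, not how any particular engine is implemented: the joining data is a small hand-written table covering only U+0621 through U+064A, every letter not listed as right-joining or non-joining is treated as dual-joining, and combining marks are assumed to have been removed beforehand. All names (`RIGHT_JOINING`, `positional_forms`, and so on) are hypothetical.

```python
from typing import List

# Letters that join only to the preceding letter (hand-picked subset).
RIGHT_JOINING = set("\u0622\u0623\u0624\u0625\u0627\u0629\u062F\u0630\u0631\u0632\u0648")
# Hamza on the line never joins.
NON_JOINING = {"\u0621"}


def is_basic_letter(ch: str) -> bool:
    return 0x0621 <= ord(ch) <= 0x064A


def joins_backward(ch: str) -> bool:
    """True if ch can connect to the letter before it."""
    return is_basic_letter(ch) and ch not in NON_JOINING


def joins_forward(ch: str) -> bool:
    """True if ch can connect to the letter after it (dual-joining only)."""
    return joins_backward(ch) and ch not in RIGHT_JOINING


def positional_forms(word: str) -> List[str]:
    forms = []
    for i, ch in enumerate(word):
        before = i > 0 and joins_forward(word[i - 1]) and joins_backward(ch)
        after = i + 1 < len(word) and joins_forward(ch) and joins_backward(word[i + 1])
        if before and after:
            forms.append("medial")
        elif before:
            forms.append("final")
        elif after:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms


# Meem, Reh, Hah, Beh, Alef: Reh breaks the chain, so Hah starts a new one.
print(positional_forms("\u0645\u0631\u062D\u0628\u0627"))
# ['initial', 'final', 'initial', 'medial', 'final']
```

A production shaper reads joining behavior from the Unicode Character Database rather than a hand-written set, and skips transparent characters such as harakat when looking for neighbors.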

How OpenType Shaping Works

Modern text rendering uses the OpenType arab script tag and applies shaping features such as these, roughly in this order (the exact sequence depends on the shaping engine):

  1. ccmp — Compose and decompose characters
  2. isol — Select isolated forms
  3. init — Select initial forms
  4. medi — Select medial forms
  5. fina — Select final forms
  6. rlig — Apply required ligatures (like Lam-Alef لا)
  7. calt — Contextual alternates
  8. liga — Standard ligatures
  9. mkmk — Mark-to-mark positioning (stacking diacritics)

The Lam-Alef ligature (لا) is mandatory in Arabic typography — the combination of Lam (ل) followed by Alef (ا) must always render as a single connected form.

Why You Should Avoid Presentation Forms

The Arabic Presentation Forms-A (U+FB50–U+FDFF) and Forms-B (U+FE70–U+FEFF) blocks contain pre-shaped positional forms. For example:

Abstract | Letter | Isolated | Initial | Medial | Final
U+0628 | ب | U+FE8F | U+FE91 | U+FE92 | U+FE90
U+062A | ت | U+FE95 | U+FE97 | U+FE98 | U+FE96

These exist for backward compatibility with older encoding standards. In modern Unicode text:

  • Always use abstract characters (U+0600–U+06FF range)
  • Never store presentation forms in data — they defeat searching, sorting, and normalization
  • If you encounter presentation forms, normalize them back to abstract characters

import unicodedata

# Normalize presentation form back to abstract character
text_with_pf = "\uFE91"  # ARABIC LETTER BEH INITIAL FORM
normalized = unicodedata.normalize("NFKC", text_with_pf)
print(f"U+{ord(normalized):04X}")  # U+0628 — the abstract BEH
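
One caveat: NFKC also rewrites compatibility characters that have nothing to do with Arabic (for example, it turns ① into 1). If that is not wanted, a narrower approach is to fold only characters in the two Presentation Forms blocks. The sketch below assumes this per-character strategy is acceptable for your data; `fold_presentation_forms` is a hypothetical helper, not a standard library function.

```python
import unicodedata


def in_presentation_forms(ch: str) -> bool:
    """True for code points in Arabic Presentation Forms-A or Forms-B."""
    cp = ord(ch)
    return 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFF


def fold_presentation_forms(text: str) -> str:
    """Apply NFKC only to presentation forms; leave every other character alone."""
    return "".join(
        unicodedata.normalize("NFKC", ch) if in_presentation_forms(ch) else ch
        for ch in text
    )


shaped = "\uFE91\uFE8E \u2460"  # BEH initial form, ALEF final form, space, circled one
folded = fold_presentation_forms(shaped)
print([f"U+{ord(c):04X}" for c in folded])
# ['U+0628', 'U+0627', 'U+0020', 'U+2460']
```

Ligature code points expand to several letters (U+FEFB becomes Lam followed by Alef), so the output can be longer than the input.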

Bidirectional Text (Bidi)

Arabic is written right-to-left (RTL), but Arabic text frequently contains left-to-right (LTR) content — numbers, Latin words, URLs, code snippets. The Unicode Bidirectional Algorithm (UBA, UAX #9) governs how these directions are mixed.

How Bidi Works

Every Unicode character has an intrinsic Bidi_Class property:

Class | Direction | Examples
R | Right-to-left | Hebrew letters
AL | Right-to-left (Arabic letter) | Arabic script letters
L | Left-to-right | Latin letters
EN | European Number | 0-9 digits
AN | Arabic Number | Arabic-Indic digits (٠-٩)
WS | Whitespace | Space, tab
ON | Other Neutral | Punctuation, symbols

The UBA processes text in these phases:

  1. Determine the paragraph embedding level (RTL if first strong char is Arabic)
  2. Resolve explicit embedding levels (from embedding, override, and isolate characters)
  3. Resolve weak types (numbers, separators)
  4. Resolve neutral types (punctuation, spaces)
  5. Reorder characters for display
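
The first phase can be approximated in a few lines. This is a simplified sketch of the "first strong character" heuristic, not a full implementation of the UBA paragraph rules: it does not skip text inside directional isolates, and the function name `base_direction` and its `default` parameter are hypothetical.

```python
import unicodedata


def base_direction(text: str, default: str = "ltr") -> str:
    """Guess the paragraph direction from the first strong character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):
            return "rtl"
        if bidi == "L":
            return "ltr"
    # No strong character found (for example, digits and punctuation only).
    return default


print(base_direction("\u0645\u0631\u062D\u0628\u0627 Unicode"))  # rtl
print(base_direction("Unicode \u0645\u0631\u062D\u0628\u0627"))  # ltr
print(base_direction("123 \u0645\u0631\u062D\u0628\u0627"))      # rtl
```

Digits and spaces are weak or neutral, which is why the third example still resolves to RTL.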

Common Bidi Pitfalls

Punctuation placement: In Arabic text, a period at the end of a sentence should appear on the left side. But if the sentence ends with a Latin word and the paragraph's base direction is LTR (for example, because dir="rtl" is missing), the period can "jump" to the wrong side:

Expected:   .هذا هو Unicode   (period on left)
Bug:        هذا هو Unicode.   (period attached to Latin word)

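Numbers and parentheses: Arabic uses two number systems: Western (0-9) and Arabic-Indic (٠-٩). Neither is a strong LTR type in the Bidi algorithm; they are the weak types EN and AN. The digits within a number always run left to right, but where the whole number lands relative to its neighbors depends on the surrounding text, which can cause unexpected reordering near RTL text. Parentheses are neutral, mirrored characters, so they can likewise end up on the wrong side when they sit between RTL and LTR runs.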
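
The Bidi properties involved can be inspected directly. The snippet below is a quick check using Python's standard `unicodedata` module; the variable names are only illustrative.

```python
import unicodedata

# Western five, Arabic-Indic five, Extended Arabic-Indic five.
digits = ["5", "\u0665", "\u06F5"]
digit_classes = {f"U+{ord(ch):04X}": unicodedata.bidirectional(ch) for ch in digits}
print(digit_classes)  # {'U+0035': 'EN', 'U+0665': 'AN', 'U+06F5': 'EN'}

# Parentheses are neutral (ON) and flagged as mirrored.
print(unicodedata.bidirectional("("), unicodedata.mirrored("("))  # ON 1
```

Note that the Extended Arabic-Indic digits used for Persian and Urdu are classified as EN, like Western digits, not AN.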

Bidi Control Characters

Unicode provides explicit control characters for cases where the algorithm produces wrong results:

Character | Code Point | Purpose
RLM | U+200F | Right-to-left mark (invisible RTL strong character)
LRM | U+200E | Left-to-right mark (invisible LTR strong character)
RLE | U+202B | Right-to-left embedding (discouraged)
RLO | U+202E | Right-to-left override (discouraged)
LRI | U+2066 | Left-to-right isolate (recommended)
RLI | U+2067 | Right-to-left isolate (recommended)
FSI | U+2068 | First strong isolate (recommended)
PDI | U+2069 | Pop directional isolate

The isolate characters (RLI, LRI, FSI, PDI) from Unicode 6.3 are preferred over the older embedding/override characters because they do not affect surrounding text.
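
In plain-text contexts where markup is unavailable (log messages, notifications, formatted strings), isolates can be inserted programmatically. A minimal sketch follows; the `isolate` helper and its `direction` parameter are hypothetical names, and the file name is made up for the example.

```python
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"


def isolate(text: str, direction: str = "auto") -> str:
    """Wrap text in a directional isolate so it cannot reorder its surroundings."""
    opener = {"ltr": LRI, "rtl": RLI, "auto": FSI}[direction]
    return f"{opener}{text}{PDI}"


# An Arabic label followed by an isolated LTR file name.
label = "\u0627\u0644\u0645\u0644\u0641: "
message = label + isolate("report_v2.txt", "ltr")
print([f"U+{ord(c):04X}" for c in isolate("ab")])
# ['U+2068', 'U+0061', 'U+0062', 'U+2069']
```

With "auto", the FSI character lets the renderer pick the direction from the first strong character of the wrapped text, which is useful when the inserted value is user-supplied.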

HTML and CSS for Arabic

<!-- Set document direction -->
<html dir="rtl" lang="ar">

<!-- Isolate a specific passage -->
<p>النص العربي <bdi>Unicode 15.1</bdi> في النص.</p>

<!-- CSS logical properties (preferred over left/right) -->
<style>
  .arabic-text {
    direction: rtl;
    unicode-bidi: isolate;
    text-align: start;      /* Instead of "right" */
    margin-inline-start: 1rem; /* Instead of margin-left */
  }
</style>

Arabic Digits

Unicode includes two sets of digit characters used in Arabic-script contexts:

Name | Characters | Range | Used In
Arabic-Indic Digits | ٠١٢٣٤٥٦٧٨٩ | U+0660–U+0669 | Egypt, Middle East
Extended Arabic-Indic Digits | ۰۱۲۳۴۵۶۷۸۹ | U+06F0–U+06F9 | Iran, Pakistan (Persian, Urdu)

Many Arabic-speaking countries use Western digits (0-9) in digital contexts, while the Arabic-Indic forms appear in traditional and formal text.
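
Since the same number can arrive in any of the three digit sets, input handling often folds them to ASCII before parsing or comparing. The sketch below shows one way to do that with a translation table; `TO_ASCII_DIGITS` and `to_ascii_digits` are hypothetical names. Python's built-in `int()` already accepts these digits, so the explicit mapping matters mainly when the result must stay a string.

```python
# Map both Arabic-script digit sets onto ASCII 0-9.
TO_ASCII_DIGITS = {}
for value in range(10):
    TO_ASCII_DIGITS[0x0660 + value] = str(value)  # Arabic-Indic
    TO_ASCII_DIGITS[0x06F0 + value] = str(value)  # Extended Arabic-Indic


def to_ascii_digits(text: str) -> str:
    """Replace Arabic-Indic and Extended Arabic-Indic digits with 0-9."""
    return text.translate(TO_ASCII_DIGITS)


print(to_ascii_digits("\u0664\u0662"))  # 42
print(to_ascii_digits("\u06F4\u06F2"))  # 42
print(int("\u0664\u0662"))              # 42
```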

Working with Arabic Text in Code

Python

import unicodedata

# Arabic text is fully supported in Python 3 strings
greeting = "\u0645\u0631\u062D\u0628\u0627"  # مرحبا (Marhaba)
print(len(greeting))  # 5: one code point per letter

# Inspect each character's name and Bidi class (AL for Arabic letters)
for char in greeting:
    print(f"U+{ord(char):04X} {unicodedata.name(char)} "
          f"bidi={unicodedata.bidirectional(char)}")

# Strings are stored in logical (reading) order, not visual order.
# Don't reverse Arabic strings for display; the Bidi algorithm handles direction.

JavaScript

// Sample text: مرحبا (Marhaba), stored in logical order
const text = "\u0645\u0631\u062D\u0628\u0627";

// Intl.Segmenter for proper grapheme handling
const segmenter = new Intl.Segmenter("ar", { granularity: "grapheme" });
const graphemes = [...segmenter.segment(text)];
console.log(graphemes.length); // 5

// Regex: match Arabic script characters
const arabicPattern = /\p{Script=Arabic}/u;
console.log(arabicPattern.test(text)); // true

Summary

Arabic script is one of the most beautiful and complex writing systems in Unicode. Its contextual shaping, right-to-left directionality, and extensive use across dozens of languages demand careful handling from developers and designers. The key takeaways are:

  1. Use abstract characters (U+0600–U+06FF), never presentation forms
  2. Let the shaping engine handle contextual forms — do not manually select positional glyphs
  3. Test Bidi behavior with mixed Arabic and Latin text, especially around punctuation and numbers
  4. Use CSS logical properties (inline-start, inline-end) instead of left/right for RTL layouts
  5. Normalize text to NFC to ensure consistent representation of characters with diacritics
  6. Support harakat (vowel marks) even if they are optional — they are essential for Quranic text, educational materials, and disambiguation
