📜 Script Stories

Arabic Script Deep Dive

Arabic is the third most widely used writing system in the world, written right-to-left with letters that change shape depending on their position in a word — a complexity that Unicode handles with contextual shaping. This guide explores the Arabic script in Unicode, covering the Arabic block, presentation forms, bidirectional handling, and software support.

Published 2023-05-01 · Updated 2025-03-06

Arabic is one of the great writing systems of human civilization. Used by more than 400 million native speakers and over 1.8 billion Muslims who read the Quran, the Arabic script stretches across dozens of languages — from Arabic and Persian to Urdu, Pashto, Kurdish, Sindhi, and Malay (in the Jawi tradition). Its right-to-left directionality, contextual letter shaping, and rich typographic tradition make it one of the most complex scripts that Unicode must represent. This guide explores how Unicode encodes the Arabic script, how contextual shaping works, and what developers need to know to handle Arabic text correctly.

History and Reach

Arabic script descends from the Nabataean alphabet, itself derived from Aramaic. From the 7th century onward, the rapid spread of Islam carried the Arabic script across North Africa, the Middle East, Central Asia, and Southeast Asia. At its peak, Arabic script was used to write languages as distant as Ottoman Turkish, Swahili, and Bosnian.

Today, Arabic is the third most widely used script in the world (after Latin and Chinese). Languages that use Arabic script include:

Language	Speakers	Region
Arabic	400M+	Middle East, North Africa
Persian (Farsi)	110M+	Iran, Afghanistan (Tajik Persian in Tajikistan is written in Cyrillic)
Urdu	230M+	Pakistan, India
Pashto	60M+	Afghanistan, Pakistan
Kurdish (Sorani)	8M+	Iraq, Iran
Sindhi	30M+	Pakistan, India
Uyghur	12M+	China (Xinjiang)
Malay (Jawi)	15M+	Malaysia, Brunei

Each language adds its own letters to the base Arabic alphabet. Persian adds four letters (پ, چ, ژ, گ), Urdu adds more, and Sindhi has one of the largest Arabic-based alphabets with 52 letters.

The Arabic Alphabet

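The base Arabic alphabet has 28 letters. All of them represent consonants, although Alef, Waw, and Yeh also stand for long vowels. Short vowels are optionally indicated by diacritics called harakat (حركات). The three short-vowel marks, together with the two most common related marks (sukun and shaddah), are: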

Mark	Name	Unicode	Sound
َ	Fathah	U+064E	/a/
ِ	Kasrah	U+0650	/i/
ُ	Dammah	U+064F	/u/
ْ	Sukun	U+0652	(no vowel)
ّ	Shaddah	U+0651	(gemination)

In everyday writing (newspapers, books, street signs) these diacritics are usually omitted, and native readers infer the vowels from context. Full vocalization appears in the Quran, children's textbooks, and dictionaries.
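Because the same word can appear with or without harakat, search and comparison code often strips the marks first. The following is a minimal sketch, assuming that only the common marks in U+064B–U+0652 need to be removed; Quranic annotation marks are deliberately left out, and `strip_harakat` is an illustrative name rather than a standard function.

```python
# Common vocalization marks, U+064B..U+0652:
# tanwin forms, fatha, damma, kasra, shadda, sukun.
HARAKAT = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_harakat(text: str) -> str:
    """Return text with the common Arabic vowel marks removed."""
    return "".join(ch for ch in text if ch not in HARAKAT)


vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kataba, fully vocalized
bare = "\u0643\u062A\u0628"                         # the same word, unvocalized

print(len(vocalized), len(bare))          # 6 3
print(strip_harakat(vocalized) == bare)   # True
```

Stripping is appropriate for matching only. The stored text should keep its marks, since they carry meaning in vocalized material.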

Unicode Blocks for Arabic

Unicode allocates several blocks for the Arabic script:

Block	Range	Block size (code points)	Purpose
Arabic	U+0600–U+06FF	256	Core letters, harakat, digits
Arabic Supplement	U+0750–U+077F	48	Additional letters for African languages
Arabic Extended-A	U+08A0–U+08FF	96	Quranic annotations, additional letters (e.g., Rohingya written in Arabic script)
Arabic Extended-B	U+0870–U+089F	48	Additional characters
Arabic Extended-C	U+10EC0–U+10EFF	64	Additional characters
Arabic Presentation Forms-A	U+FB50–U+FDFF	688	Ligatures and contextual forms
Arabic Presentation Forms-B	U+FE70–U+FEFF	144	Positional forms

The core Arabic block (U+0600–U+06FF) is where almost all practical text encoding happens. The Presentation Forms blocks are a legacy from older encoding systems and should generally not be used in new text — more on that below.
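The ranges in the table can be turned into a small lookup, which is handy for flagging text that falls outside the core block. This is a sketch that hard-codes only the ranges listed above; `ARABIC_BLOCKS` and `arabic_block` are illustrative names, not part of any library.

```python
from typing import Optional

# (first code point, last code point, block name), taken from the table above.
ARABIC_BLOCKS = [
    (0x0600, 0x06FF, "Arabic"),
    (0x0750, 0x077F, "Arabic Supplement"),
    (0x0870, 0x089F, "Arabic Extended-B"),
    (0x08A0, 0x08FF, "Arabic Extended-A"),
    (0xFB50, 0xFDFF, "Arabic Presentation Forms-A"),
    (0xFE70, 0xFEFF, "Arabic Presentation Forms-B"),
    (0x10EC0, 0x10EFF, "Arabic Extended-C"),
]


def arabic_block(ch: str) -> Optional[str]:
    """Return the Arabic block name for a single character, or None."""
    cp = ord(ch)
    for first, last, name in ARABIC_BLOCKS:
        if first <= cp <= last:
            return name
    return None


print(arabic_block("\u0628"))  # Arabic
print(arabic_block("\uFE91"))  # Arabic Presentation Forms-B
print(arabic_block("A"))       # None
```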

Key Code Points

U+0627  ا  ARABIC LETTER ALEF
U+0628  ب  ARABIC LETTER BEH
U+062A  ت  ARABIC LETTER TEH
U+062B  ث  ARABIC LETTER THEH
U+062C  ج  ARABIC LETTER JEEM
U+062D  ح  ARABIC LETTER HAH
U+062E  خ  ARABIC LETTER KHAH
U+062F  د  ARABIC LETTER DAL
U+0631  ر  ARABIC LETTER REH
U+0633  س  ARABIC LETTER SEEN
U+0634  ش  ARABIC LETTER SHEEN
U+0635  ص  ARABIC LETTER SAD
U+0639  ع  ARABIC LETTER AIN
U+063A  غ  ARABIC LETTER GHAIN
U+0641  ف  ARABIC LETTER FEH
U+0642  ق  ARABIC LETTER QAF
U+0643  ك  ARABIC LETTER KAF
U+0644  ل  ARABIC LETTER LAM
U+0645  م  ARABIC LETTER MEEM
U+0646  ن  ARABIC LETTER NOON
U+0647  ه  ARABIC LETTER HEH
U+0648  و  ARABIC LETTER WAW
U+064A  ي  ARABIC LETTER YEH

Contextual Shaping: The Heart of Arabic

The defining feature of Arabic script is that letters change shape based on their position in a word. Each joining letter has up to four forms:

Position	Description	Example (Beh ب)
Isolated	Not connected to any letter	ب
Initial	Connected to the following letter only	بـ
Medial	Connected on both sides	ـبـ
Final	Connected to the preceding letter only	ـب

Some letters (like Dal د, Reh ر, Waw و, Alef ا) are non-joining on the left — they connect to the preceding letter but never to the following one. This means they only have isolated and final forms, and they break the cursive chain.

In Unicode, you store only the abstract letter (e.g., U+0628 for Beh). The rendering engine (via OpenType shaping) picks the correct glyph form based on context. This is fundamentally different from the legacy Presentation Forms approach, where each positional form had its own code point.
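To make the position rules concrete, here is a toy model of how a shaper decides which form each letter takes. It is a simplified sketch, not a real shaping engine: it assumes the input is a single word made only of Arabic letters (no spaces or harakat), it lists just a handful of right-joining letters by hand, and it treats every other letter as joining on both sides. All names are illustrative.

```python
# Letters that connect to the preceding letter only (hand-picked subset:
# alef variants, waw with hamza, teh marbuta, dal, thal, reh, zain, waw).
RIGHT_JOINING = set(
    "\u0622\u0623\u0624\u0625\u0627\u0629\u062F\u0630\u0631\u0632\u0648"
)
NON_JOINING = {"\u0621"}  # hamza never connects


def joins_forward(ch: str) -> bool:
    """True if the letter can connect to the letter that follows it."""
    return ch not in RIGHT_JOINING and ch not in NON_JOINING


def joins_backward(ch: str) -> bool:
    """True if the letter can connect to the letter that precedes it."""
    return ch not in NON_JOINING


def positional_forms(word: str) -> list:
    """Return 'isolated', 'initial', 'medial' or 'final' for each letter."""
    forms = []
    for i, ch in enumerate(word):
        prev = word[i - 1] if i > 0 else None
        nxt = word[i + 1] if i + 1 < len(word) else None
        linked_prev = prev is not None and joins_forward(prev) and joins_backward(ch)
        linked_next = nxt is not None and joins_forward(ch) and joins_backward(nxt)
        if linked_prev and linked_next:
            forms.append("medial")
        elif linked_prev:
            forms.append("final")
        elif linked_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms


marhaba = "\u0645\u0631\u062D\u0628\u0627"  # meem, reh, hah, beh, alef
print(positional_forms(marhaba))
# ['initial', 'final', 'initial', 'medial', 'final']
```

Reh breaks the cursive chain, which is why the hah after it starts again in its initial form. Real shapers take the joining data from the Unicode Character Database and also skip over combining marks.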

How OpenType Shaping Works

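Modern text rendering uses the OpenType arab script tag and applies a series of shaping features. A typical subset is listed here; the exact set and order are determined by the shaping engine and the font: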

ccmp — Compose and decompose characters
isol — Select isolated forms
init — Select initial forms
medi — Select medial forms
fina — Select final forms
rlig — Apply required ligatures (like Lam-Alef لا)
calt — Contextual alternates
liga — Standard ligatures
mkmk — Mark-to-mark positioning (stacking diacritics)

The Lam-Alef ligature (لا) is mandatory in Arabic typography — the combination of Lam (ل) followed by Alef (ا) must always render as a single connected form.
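At the encoding level the ligature is invisible: the text still holds two separate code points, and the single glyph exists only at render time. The short check below illustrates this with Python's standard `unicodedata` module, and shows that the legacy precomposed ligature code point folds back to the same two letters under NFKC.

```python
import unicodedata

LAM, ALEF = "\u0644", "\u0627"
word = LAM + ALEF  # rendered as one ligature glyph, stored as two letters

print(len(word))  # 2

# Legacy presentation form that encodes the ligature as one code point.
legacy = "\uFEFB"
print(unicodedata.name(legacy))  # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM

folded = unicodedata.normalize("NFKC", legacy)
print(folded == word)  # True
```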

Why You Should Avoid Presentation Forms

The Arabic Presentation Forms-A (U+FB50–U+FDFF) and Forms-B (U+FE70–U+FEFF) blocks contain pre-shaped positional forms. For example:

Abstract	Isolated	Initial	Medial	Final
U+0628 ب	U+FE8F	U+FE91	U+FE92	U+FE90
U+062A ت	U+FE95	U+FE97	U+FE98	U+FE96

These exist for backward compatibility with older encoding standards. In modern Unicode text:

Always use abstract characters (the core U+0600–U+06FF block, plus the Supplement and Extended blocks where a language needs them)
Never store presentation forms in data — they defeat searching, sorting, and normalization
If you encounter presentation forms, normalize them back to abstract characters

import unicodedata

# Normalize presentation form back to abstract character
text_with_pf = "\uFE91"  # ARABIC LETTER BEH INITIAL FORM
normalized = unicodedata.normalize("NFKC", text_with_pf)
print(f"U+{ord(normalized):04X}")  # U+0628 — the abstract BEH
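Applying NFKC to a whole string also rewrites unrelated compatibility characters (for example, circled digits become plain digits). Where that is unwanted, one option is to fold only the characters that fall in the two Presentation Forms blocks. The helper names below are hypothetical, and the sketch assumes per-character folding is acceptable for the data at hand.

```python
import unicodedata

# Arabic Presentation Forms-A and Forms-B, as given in the block table.
PRESENTATION_RANGES = ((0xFB50, 0xFDFF), (0xFE70, 0xFEFF))


def is_presentation_form(ch: str) -> bool:
    """True if the character lies in an Arabic Presentation Forms block."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in PRESENTATION_RANGES)


def fold_presentation_forms(text: str) -> str:
    """Apply NFKC only to Arabic presentation forms; leave the rest as-is."""
    return "".join(
        unicodedata.normalize("NFKC", ch) if is_presentation_form(ch) else ch
        for ch in text
    )


# Initial, medial and final BEH, then a space and CIRCLED DIGIT ONE.
sample = "\uFE91\uFE92\uFE90 \u2460"
folded = fold_presentation_forms(sample)

print([f"U+{ord(c):04X}" for c in folded])
# ['U+0628', 'U+0628', 'U+0628', 'U+0020', 'U+2460']
```

The circled digit survives here, whereas `unicodedata.normalize("NFKC", sample)` would turn it into a plain `1`.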

Bidirectional Text (Bidi)

Arabic is written right-to-left (RTL), but Arabic text frequently contains left-to-right (LTR) content — numbers, Latin words, URLs, code snippets. The Unicode Bidirectional Algorithm (UBA, UAX #9) governs how these directions are mixed.

How Bidi Works

Every Unicode character has an intrinsic Bidi_Class property:

Class	Direction	Examples
R	Right-to-left	Hebrew letters, RLM (U+200F)
AL	Arabic Letter	Arabic, Syriac, and Thaana letters
L	Left-to-right	Latin letters, LRM (U+200E)
EN	European Number	0-9 digits, Extended Arabic-Indic digits (۰-۹)
AN	Arabic Number	Arabic-Indic digits (٠-٩)
WS	Whitespace	Space, tab
ON	Other Neutral	Punctuation, symbols

The UBA processes text in these phases:

Determine the paragraph embedding level (RTL if the first strong character has class R or AL, as Arabic letters do)
Resolve explicit embedding levels (from embedding, override, and isolate characters)
Resolve weak types (numbers, separators)
Resolve neutral types (punctuation, spaces)
Reorder characters for display
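The first phase is simple enough to sketch directly. The function below approximates rules P2 and P3 of UAX #9 using `unicodedata.bidirectional`: it scans for the first strong character and returns level 1 (RTL) or 0 (LTR). It is a simplification, since the real rule also skips text inside isolates; `paragraph_level` is an illustrative name.

```python
import unicodedata


def paragraph_level(text: str) -> int:
    """Return 1 for an RTL paragraph, 0 for LTR, based on the first strong char."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return 0
        if bidi_class in ("R", "AL"):
            return 1
    return 0  # no strong character found: default to LTR


arabic_first = "\u0645\u0631\u062D\u0628\u0627 Unicode"
latin_first = "Unicode \u0645\u0631\u062D\u0628\u0627"
digits_first = "123 \u0645\u0631\u062D\u0628\u0627"

print(paragraph_level(arabic_first))  # 1
print(paragraph_level(latin_first))   # 0
print(paragraph_level(digits_first))  # 1 (digits are weak, not strong)
```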

Common Bidi Pitfalls

Punctuation placement: In Arabic text, a period at the end of a sentence should appear on the left side. But if the sentence ends with a Latin word and the paragraph's base direction is LTR (for example, because dir="rtl" was never set), the period can "jump" to the wrong side:

Expected:   .هذا هو Unicode   (period on left)
Bug:        هذا هو Unicode.   (period attached to Latin word)

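Numbers and parentheses: Arabic uses two number systems, Western (0-9) and Arabic-Indic (٠-٩). Neither is a strong LTR type: Western digits have the weak class EN and Arabic-Indic digits the weak class AN. The digits within a number always run left to right, but where the number and its adjacent separators land depends on the surrounding text, which can cause unexpected reordering near RTL text. Parentheses and brackets are mirrored characters, so the glyph that is displayed depends on the resolved direction.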
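The Bidi classes of digits and the mirroring property of parentheses can be inspected with the standard `unicodedata` module, which is a quick way to check what the algorithm will see. The variable names are illustrative.

```python
import unicodedata

digits = {
    "western five": "5",
    "arabic-indic five": "\u0665",
    "extended arabic-indic five": "\u06F5",
}

# Map each label to the digit's Bidi class.
digit_classes = {label: unicodedata.bidirectional(ch) for label, ch in digits.items()}
for label, bidi_class in digit_classes.items():
    print(f"{label}: {bidi_class}")
# western five: EN
# arabic-indic five: AN
# extended arabic-indic five: EN

# Parentheses are flagged as mirrored, so the renderer swaps the glyph in RTL runs.
print(unicodedata.mirrored("("))  # 1
print(unicodedata.mirrored("a"))  # 0
```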

Bidi Control Characters

Unicode provides explicit control characters for cases where the algorithm produces wrong results:

Character	Code Point	Purpose
RLM	U+200F	Right-to-left mark (invisible RTL strong character)
LRM	U+200E	Left-to-right mark (invisible LTR strong character)
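RLE	U+202B	Right-to-left embedding (discouraged; prefer isolates)
RLO	U+202E	Right-to-left override (discouraged; prefer isolates)
LRI	U+2066	Left-to-right isolate
RLI	U+2067	Right-to-left isolate (recommended)
FSI	U+2068	First strong isolate (direction taken from the content)
PDI	U+2069	Pop directional isolate (closes LRI, RLI, or FSI)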

The isolate characters (RLI, LRI, FSI, PDI) from Unicode 6.3 are preferred over the older embedding/override characters because they do not affect surrounding text.
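In plain-text contexts where markup is unavailable, a common pattern is to wrap inserted values (user names, file names, version strings) in an isolate pair so they cannot disturb the surrounding text. The helper below is a minimal sketch; `isolate` and its `direction` parameter are hypothetical names chosen for this example.

```python
LRI = "\u2066"  # left-to-right isolate
RLI = "\u2067"  # right-to-left isolate
FSI = "\u2068"  # first strong isolate
PDI = "\u2069"  # pop directional isolate


def isolate(text: str, direction: str = "auto") -> str:
    """Wrap text in a directional isolate: 'ltr', 'rtl', or 'auto' (FSI)."""
    openers = {"ltr": LRI, "rtl": RLI, "auto": FSI}
    return f"{openers[direction]}{text}{PDI}"


version = isolate("Unicode 15.1", "ltr")
sentence = "\u0627\u0644\u0646\u0635 " + version + " \u0641\u064A \u0627\u0644\u0646\u0635."

print([f"U+{ord(c):04X}" for c in version[:1] + version[-1:]])
# ['U+2066', 'U+2069']
```

In HTML the `<bdi>` element or `dir` attribute is the better tool; the control characters are meant for plain text.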

HTML and CSS for Arabic

<!-- Set document direction -->
<html dir="rtl" lang="ar">

<!-- Isolate a specific passage -->
<p>النص العربي <bdi>Unicode 15.1</bdi> في النص.</p>

<!-- CSS logical properties (preferred over left/right) -->
<style>
  .arabic-text {
    direction: rtl;
    unicode-bidi: isolate;
    text-align: start;      /* Instead of "right" */
    margin-inline-start: 1rem; /* Instead of margin-left */
  }
</style>

Arabic Digits

Unicode includes two sets of digit characters used in Arabic-script contexts:

Name	Characters	Range	Used In
Arabic-Indic Digits	٠١٢٣٤٥٦٧٨٩	U+0660–U+0669	Egypt, Middle East
Extended Arabic-Indic Digits	۰۱۲۳۴۵۶۷۸۹	U+06F0–U+06F9	Iran, Pakistan (Persian, Urdu)

Many Arabic-speaking countries use Western digits (0-9) in digital contexts, while the Arabic-Indic forms appear in traditional and formal text.
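Since the same number can arrive in any of the three digit sets, input handling often maps everything to Western digits before parsing or comparing. The following is a small sketch using `str.translate`; the constant and function names are illustrative. Python's built-in `int()` already accepts Unicode decimal digits, so the mapping matters mainly when the normalized string itself has to be stored or matched.

```python
# Build the two digit runs from their code point ranges.
ARABIC_INDIC = "".join(chr(0x0660 + i) for i in range(10))
EXTENDED_ARABIC_INDIC = "".join(chr(0x06F0 + i) for i in range(10))

# Map both sets onto ASCII 0-9.
TO_ASCII = str.maketrans(ARABIC_INDIC + EXTENDED_ARABIC_INDIC, "0123456789" * 2)


def to_ascii_digits(text: str) -> str:
    """Replace Arabic-Indic and Extended Arabic-Indic digits with 0-9."""
    return text.translate(TO_ASCII)


print(to_ascii_digits("\u0664\u0662"))  # 42
print(to_ascii_digits("\u06F4\u06F2"))  # 42
print(int("\u0664\u0662"))              # 42
```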

Working with Arabic Text in Code

Python

import unicodedata

# Arabic text is fully supported in Python 3 strings
greeting = "\u0645\u0631\u062D\u0628\u0627"  # مرحبا (Marhaba)
print(len(greeting))  # 5: one code point per letter

# Inspect each character's Unicode name and Bidi class
for char in greeting:
    print(f"U+{ord(char):04X} {unicodedata.name(char)} "
          f"bidi={unicodedata.bidirectional(char)}")

# Strings are stored in logical (reading) order. Do not reverse Arabic
# strings for display; the Bidi algorithm handles visual direction.

JavaScript

// Sample string, stored in logical order: مرحبا (Marhaba)
const text = "\u0645\u0631\u062D\u0628\u0627";

// Intl.Segmenter for proper grapheme handling
const segmenter = new Intl.Segmenter("ar", { granularity: "grapheme" });
const graphemes = [...segmenter.segment(text)];
console.log(graphemes.length); // 5

// Regex: match Arabic script characters
const arabicPattern = /\p{Script=Arabic}/u;
console.log(arabicPattern.test(text)); // true

Summary

Arabic script is one of the most beautiful and complex writing systems in Unicode. Its contextual shaping, right-to-left directionality, and extensive use across dozens of languages demand careful handling from developers and designers. The key takeaways are:

Use abstract characters (the core U+0600–U+06FF block and, where needed, the Supplement and Extended blocks), never presentation forms
Let the shaping engine handle contextual forms — do not manually select positional glyphs
Test Bidi behavior with mixed Arabic and Latin text, especially around punctuation and numbers
Use CSS logical properties (inline-start, inline-end) instead of left/right for RTL layouts
Normalize text to NFC to ensure consistent representation of characters with diacritics; note that NFC leaves presentation forms untouched, so folding those requires NFKC
Support harakat (vowel marks) even if they are optional — they are essential for Quranic text, educational materials, and disambiguation