Arabic Script Deep Dive
Arabic is the third most widely used writing system in the world, written right-to-left with letters that change shape depending on their position in a word — a complexity that Unicode handles with contextual shaping. This guide explores the Arabic script in Unicode, covering the Arabic block, presentation forms, bidirectional handling, and software support.
Arabic is one of the great writing systems of human civilization. Used by more than 400 million native speakers and over 1.8 billion Muslims who read the Quran, the Arabic script stretches across dozens of languages — from Arabic and Persian to Urdu, Pashto, Kurdish, Sindhi, and Malay (in the Jawi tradition). Its right-to-left directionality, contextual letter shaping, and rich typographic tradition make it one of the most complex scripts that Unicode must represent. This guide explores how Unicode encodes the Arabic script, how contextual shaping works, and what developers need to know to handle Arabic text correctly.
History and Reach
Arabic script descends from the Nabataean alphabet, itself derived from Aramaic. Beginning in the 7th century, the rapid spread of Islam carried the Arabic script across North Africa, the Middle East, Central Asia, and Southeast Asia. At its peak, Arabic script was used to write languages as distant as Ottoman Turkish, Swahili, and Bosnian.
Today, Arabic is the third most widely used script in the world (after Latin and Chinese). Languages that use Arabic script include:
| Language | Speakers | Region |
|---|---|---|
| Arabic | 400M+ | Middle East, North Africa |
| Persian (Farsi) | 110M+ | Iran, Afghanistan, Tajikistan |
| Urdu | 230M+ | Pakistan, India |
| Pashto | 60M+ | Afghanistan, Pakistan |
| Kurdish (Sorani) | 8M+ | Iraq, Iran |
| Sindhi | 30M+ | Pakistan, India |
| Uyghur | 12M+ | China (Xinjiang) |
| Malay (Jawi) | 15M+ | Malaysia, Brunei |
Each language adds its own letters to the base Arabic alphabet. Persian adds four letters (پ, چ, ژ, گ), Urdu adds more, and Sindhi has one of the largest Arabic-based alphabets with 52 letters.
The Arabic Alphabet
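The base Arabic alphabet has 28 letters, all primarily consonants (three of them, alef, waw, and yeh, also mark long vowels). Short vowels are optionally indicated by diacritics called harakat (حركات). The three short-vowel marks, together with sukun and shaddah, are: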
| Mark | Name | Unicode | Sound |
|---|---|---|---|
| َ | Fathah | U+064E | /a/ |
| ِ | Kasrah | U+0650 | /i/ |
| ُ | Dammah | U+064F | /u/ |
| ْ | Sukun | U+0652 | (no vowel) |
| ّ | Shaddah | U+0651 | (gemination) |
In everyday writing — newspapers, books, street signs — these diacritics are omitted. Native readers infer the vowels from context. They appear fully in the Quran, children's textbooks, and dictionaries.
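Because unvocalized and vocalized spellings of the same word differ at the code point level, search and comparison code often strips the marks first. The following is a minimal sketch, assuming that removing every nonspacing mark (general category Mn) is acceptable for the use case; the helper name `strip_harakat` is hypothetical.

```python
import unicodedata

def strip_harakat(text: str) -> str:
    """Drop nonspacing combining marks (category Mn), which covers the harakat.

    Assumption: the caller is happy to lose ALL nonspacing marks, not only
    the Arabic ones, so this is unsuitable for text where marks carry meaning.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# kaf, teh, beh, each followed by a fatha (U+064E)
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
plain = strip_harakat(vocalized)

print(len(vocalized), len(plain))
```

The stripped string keeps only the three base letters, so it compares equal to the everyday unvocalized spelling.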
Unicode Blocks for Arabic
Unicode allocates several blocks for the Arabic script:
| Block | Range | Code points | Purpose |
|---|---|---|---|
| Arabic | U+0600–U+06FF | 256 | Core letters, harakat, digits |
| Arabic Supplement | U+0750–U+077F | 48 | Additional letters for African languages |
| Arabic Extended-A | U+08A0–U+08FF | 96 | Quranic annotations, letters for Rohingya and African languages |
| Arabic Extended-B | U+0870–U+089F | 48 | Additional characters |
| Arabic Extended-C | U+10EC0–U+10EFF | 64 | Additional characters |
| Arabic Presentation Forms-A | U+FB50–U+FDFF | 688 | Ligatures and contextual forms |
| Arabic Presentation Forms-B | U+FE70–U+FEFF | 144 | Positional forms |
The core Arabic block (U+0600–U+06FF) is where almost all practical text encoding happens. The Presentation Forms blocks are a legacy from older encoding systems and should generally not be used in new text — more on that below.
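The ranges in the table can be turned into a small lookup, which is handy for auditing stored text. This is only a sketch: the table `ARABIC_BLOCKS` and the function `arabic_block` are illustrative names, and the ranges are copied from the table above rather than read from the Unicode Character Database.

```python
from typing import Optional

# (first code point, last code point, block name), copied from the table above
ARABIC_BLOCKS = [
    (0x0600, 0x06FF, "Arabic"),
    (0x0750, 0x077F, "Arabic Supplement"),
    (0x0870, 0x089F, "Arabic Extended-B"),
    (0x08A0, 0x08FF, "Arabic Extended-A"),
    (0xFB50, 0xFDFF, "Arabic Presentation Forms-A"),
    (0xFE70, 0xFEFF, "Arabic Presentation Forms-B"),
    (0x10EC0, 0x10EFF, "Arabic Extended-C"),
]

def arabic_block(ch: str) -> Optional[str]:
    """Return the Arabic block containing ch, or None if it is in none of them."""
    cp = ord(ch)
    for first, last, name in ARABIC_BLOCKS:
        if first <= cp <= last:
            return name
    return None

print(arabic_block("\u0628"))  # BEH, in the core block
print(arabic_block("\uFE91"))  # BEH INITIAL FORM, a presentation form
print(arabic_block("a"))       # not an Arabic-script block
```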
Key Code Points
U+0627 ا ARABIC LETTER ALEF
U+0628 ب ARABIC LETTER BEH
U+062A ت ARABIC LETTER TEH
U+062B ث ARABIC LETTER THEH
U+062C ج ARABIC LETTER JEEM
U+062D ح ARABIC LETTER HAH
U+062E خ ARABIC LETTER KHAH
U+062F د ARABIC LETTER DAL
U+0631 ر ARABIC LETTER REH
U+0633 س ARABIC LETTER SEEN
U+0634 ش ARABIC LETTER SHEEN
U+0635 ص ARABIC LETTER SAD
U+0639 ع ARABIC LETTER AIN
U+063A غ ARABIC LETTER GHAIN
U+0641 ف ARABIC LETTER FEH
U+0642 ق ARABIC LETTER QAF
U+0643 ك ARABIC LETTER KAF
U+0644 ل ARABIC LETTER LAM
U+0645 م ARABIC LETTER MEEM
U+0646 ن ARABIC LETTER NOON
U+0647 ه ARABIC LETTER HEH
U+0648 و ARABIC LETTER WAW
U+064A ي ARABIC LETTER YEH
Contextual Shaping: The Heart of Arabic
The defining feature of Arabic script is that letters change shape based on their position in a word. A letter can have up to four forms:
| Position | Description | Example (Beh ب) |
|---|---|---|
| Isolated | Not connected to any letter | ب |
| Initial | Connected to the following letter only | بـ |
| Medial | Connected on both sides | ـبـ |
| Final | Connected to the preceding letter only | ـب |
Some letters (like Dal د, Reh ر, Waw و, Alef ا) are non-joining on the left — they connect to the preceding letter but never to the following one. This means they only have isolated and final forms, and they break the cursive chain.
In Unicode, you store only the abstract letter (e.g., U+0628 for Beh). The rendering engine (via OpenType shaping) picks the correct glyph form based on context. This is fundamentally different from the legacy Presentation Forms approach, where each positional form had its own code point.
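To make the joining rules concrete, here is a toy sketch of how a form could be chosen for each letter. It is not how a real shaping engine is implemented: the joining data below is a small hand-picked subset (the authoritative data is the Joining_Type property in the Unicode Character Database), it assumes the input contains only base Arabic letters without diacritics, and the names `RIGHT_JOINING`, `NON_JOINING`, and `positional_forms` are hypothetical.

```python
# Letters that join only to the preceding letter (illustrative subset):
# alef variants, waw with hamza, teh marbuta, dal, thal, reh, zain, waw.
RIGHT_JOINING = set("\u0622\u0623\u0624\u0625\u0627\u0629\u062F\u0630\u0631\u0632\u0648")
# Letters that never join (illustrative subset): hamza.
NON_JOINING = {"\u0621"}

def joins_to_next(ch: str) -> bool:
    """True if ch can connect to the letter that follows it (dual-joining)."""
    return ch not in RIGHT_JOINING and ch not in NON_JOINING

def joins_to_prev(ch: str) -> bool:
    """True if ch can connect to the letter that precedes it."""
    return ch not in NON_JOINING

def positional_forms(word: str) -> list:
    """Return 'isolated', 'initial', 'medial', or 'final' for each letter."""
    forms = []
    for i, ch in enumerate(word):
        prev_joined = i > 0 and joins_to_next(word[i - 1]) and joins_to_prev(ch)
        next_joined = (i + 1 < len(word)
                       and joins_to_next(ch)
                       and joins_to_prev(word[i + 1]))
        if prev_joined and next_joined:
            forms.append("medial")
        elif prev_joined:
            forms.append("final")
        elif next_joined:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms

kitab = "\u0643\u062A\u0627\u0628"  # kaf, teh, alef, beh
print(positional_forms(kitab))
```

In this word the alef takes its final form and then breaks the chain, so the last beh stands isolated even though it ends the word.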
How OpenType Shaping Works
Modern text rendering uses the OpenType arab script tag and applies shaping features such as the following, roughly in this order (the exact sequence depends on the shaping engine):
- ccmp — Compose and decompose characters
- isol — Select isolated forms
- init — Select initial forms
- medi — Select medial forms
- fina — Select final forms
- rlig — Apply required ligatures (like Lam-Alef لا)
- calt — Contextual alternates
- liga — Standard ligatures
- mkmk — Mark-to-mark positioning (stacking diacritics)
The Lam-Alef ligature (لا) is mandatory in Arabic typography — the combination of Lam (ل) followed by Alef (ا) must always render as a single connected form.
Why You Should Avoid Presentation Forms
The Arabic Presentation Forms-A (U+FB50–U+FDFF) and Forms-B (U+FE70–U+FEFF) blocks contain pre-shaped positional forms. For example:
| Abstract | Isolated | Initial | Medial | Final |
|---|---|---|---|---|
| U+0628 ب | U+FE8F | U+FE91 | U+FE92 | U+FE90 |
| U+062A ت | U+FE95 | U+FE97 | U+FE98 | U+FE96 |
These exist for backward compatibility with older encoding standards. In modern Unicode text:
- Always use abstract characters (the core U+0600–U+06FF block, plus the supplement and extended blocks where needed)
- Never store presentation forms in data — they defeat searching, sorting, and normalization
- If you encounter presentation forms, normalize them back to abstract characters
import unicodedata
# Normalize presentation form back to abstract character
text_with_pf = "\uFE91" # ARABIC LETTER BEH INITIAL FORM
normalized = unicodedata.normalize("NFKC", text_with_pf)
print(f"U+{ord(normalized):04X}") # U+0628 — the abstract BEH
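One caveat with running NFKC over a whole string is that it also rewrites unrelated compatibility characters, such as Latin ligatures and fullwidth forms. If only the Arabic presentation forms should change, a narrower pass is possible. The sketch below assumes the two block ranges listed earlier; the helper names `is_presentation_form` and `normalize_presentation_forms` are hypothetical.

```python
import unicodedata

def is_presentation_form(ch: str) -> bool:
    """True if ch lies in Arabic Presentation Forms-A or Forms-B."""
    cp = ord(ch)
    return 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFF

def normalize_presentation_forms(text: str) -> str:
    """Apply NFKC only to Arabic presentation forms, leaving the rest as-is."""
    return "".join(
        unicodedata.normalize("NFKC", ch) if is_presentation_form(ch) else ch
        for ch in text
    )

# BEH INITIAL FORM + BEH MEDIAL FORM become two abstract BEH letters
print([f"U+{ord(c):04X}" for c in normalize_presentation_forms("\uFE91\uFE92")])
# The LAM WITH ALEF ligature expands to LAM followed by ALEF
print([f"U+{ord(c):04X}" for c in normalize_presentation_forms("\uFEFB")])
```

Characters outside the two blocks, including the Latin "fi" ligature at U+FB01, pass through unchanged.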
Bidirectional Text (Bidi)
Arabic is written right-to-left (RTL), but Arabic text frequently contains left-to-right (LTR) content — numbers, Latin words, URLs, code snippets. The Unicode Bidirectional Algorithm (UBA, UAX #9) governs how these directions are mixed.
How Bidi Works
Every Unicode character has an intrinsic Bidi_Class property:
| Class | Name | Examples |
|---|---|---|
| R | Right-to-Left | Hebrew letters |
| AL | Arabic Letter (right-to-left) | Arabic, Syriac, and Thaana letters |
| L | Left-to-Right | Latin letters |
| EN | European Number | 0-9, Extended Arabic-Indic digits (۰-۹) |
| AN | Arabic Number | Arabic-Indic digits (٠-٩) |
| WS | Whitespace | Space |
| ON | Other Neutral | Most punctuation and symbols |
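These classes can be inspected directly with Python's standard `unicodedata.bidirectional`. The sample characters below are chosen for illustration; the values come from whatever Unicode database ships with the Python build in use.

```python
import unicodedata

samples = [
    ("Hebrew alef", "\u05D0"),
    ("Arabic beh", "\u0628"),
    ("Latin a", "a"),
    ("Western digit 5", "5"),
    ("Arabic-Indic digit 5", "\u0665"),
    ("Extended Arabic-Indic digit 5", "\u06F5"),
    ("Space", " "),
    ("Exclamation mark", "!"),
]

# Map each label to the Bidi_Class reported for its character
bidi_classes = {label: unicodedata.bidirectional(ch) for label, ch in samples}

for label, cls in bidi_classes.items():
    print(f"{label}: {cls}")
```

Note that Arabic letters report AL rather than R, and that the two Arabic digit sets fall into different classes.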
The UBA processes text in these phases:
- Determine the paragraph embedding level (RTL if the first strong character is Arabic or another RTL script)
- Resolve explicit embedding levels (from embedding, override, and isolate characters)
- Resolve weak types (numbers, separators)
- Resolve neutral types (punctuation, spaces)
- Reorder characters for display
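The first phase, finding the paragraph direction from the first strong character, is simple enough to sketch. The version below is a simplification and not a conforming UAX #9 implementation: it does not skip over text enclosed in directional isolates, and the function name `base_direction` and its `default` parameter are hypothetical.

```python
import unicodedata

def base_direction(text: str, default: str = "ltr") -> str:
    """Guess the paragraph direction from the first strong character.

    Simplified first-strong heuristic: L means LTR, R or AL means RTL.
    Weak and neutral characters (digits, spaces, punctuation) are skipped.
    """
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls == "L":
            return "ltr"
        if cls in ("R", "AL"):
            return "rtl"
    return default

print(base_direction("\u0645\u0631\u062D\u0628\u0627 Unicode"))  # starts with Arabic
print(base_direction("Unicode \u0645\u0631\u062D\u0628\u0627"))  # starts with Latin
print(base_direction("123 \u0645\u0631\u062D\u0628\u0627"))      # digits are skipped
```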
Common Bidi Pitfalls
Punctuation placement: In an RTL paragraph, the period that ends an Arabic sentence should appear on the left side. But if the sentence ends with a Latin word and the paragraph's base direction has been left as LTR, the period can "jump" to the wrong side:
Expected: .هذا هو Unicode (period on left)
Bug: هذا هو Unicode. (period attached to Latin word)
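Numbers and parentheses: Arabic text uses two sets of digits, Western (0-9) and Arabic-Indic (٠-٩). Neither is a strong LTR type; the Bidi algorithm classifies them as weak types (EN and AN, respectively). The digits within a number always run left-to-right, but where the number lands relative to its neighbors depends on the surrounding strong characters, which can cause unexpected reordering near RTL text.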
Bidi Control Characters
Unicode provides explicit control characters for cases where the algorithm produces wrong results:
| Character | Code Point | Purpose |
|---|---|---|
| RLM | U+200F | Right-to-left mark (invisible RTL strong character) |
| LRM | U+200E | Left-to-right mark (invisible LTR strong character) |
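| RLE | U+202B | Right-to-left embedding (legacy; prefer isolates) |
| RLO | U+202E | Right-to-left override (legacy; prefer isolates) |
| LRI | U+2066 | Left-to-right isolate (recommended) |
| RLI | U+2067 | Right-to-left isolate (recommended) |
| FSI | U+2068 | First strong isolate (direction taken from the content) |
| PDI | U+2069 | Pop directional isolate (closes LRI, RLI, or FSI) |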
The isolate characters (RLI, LRI, FSI, PDI) from Unicode 6.3 are preferred over the older embedding/override characters because they do not affect surrounding text.
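In plain-text contexts where markup is unavailable (log lines, notifications, interpolated UI strings), a value of unknown direction can be wrapped in isolates before it is inserted into a sentence. This is a minimal sketch; the helper name `isolate` and its `direction` parameter are hypothetical.

```python
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

def isolate(text: str, direction: str = "auto") -> str:
    """Wrap text in a directional isolate so it cannot reorder its surroundings.

    direction: "ltr", "rtl", or "auto" (FSI, direction taken from the content).
    """
    openers = {"ltr": LRI, "rtl": RLI, "auto": FSI}
    return f"{openers[direction]}{text}{PDI}"

# Insert a Latin product name into an Arabic sentence without disturbing it
name = isolate("Unicode 15.1")
sentence = f"\u0627\u0644\u0646\u0635 {name} \u0647\u0646\u0627"
print([f"U+{ord(c):04X}" for c in name[:1] + name[-1:]])
```

Using FSI by default mirrors what the HTML bdi element does: the wrapped content picks its own direction from its first strong character.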
HTML and CSS for Arabic
<!-- Set document direction -->
<html dir="rtl" lang="ar">
<!-- Isolate a specific passage -->
<p>النص العربي <bdi>Unicode 15.1</bdi> في النص.</p>
<!-- CSS logical properties (preferred over left/right) -->
<style>
.arabic-text {
  direction: rtl;
  unicode-bidi: isolate;
  text-align: start; /* Instead of "right" */
  margin-inline-start: 1rem; /* Instead of margin-left */
}
</style>
Arabic Digits
Unicode includes two sets of digit characters used in Arabic-script contexts:
| Name | Characters | Range | Used In |
|---|---|---|---|
| Arabic-Indic Digits | ٠١٢٣٤٥٦٧٨٩ | U+0660–U+0669 | Egypt, Middle East |
| Extended Arabic-Indic Digits | ۰۱۲۳۴۵۶۷۸۹ | U+06F0–U+06F9 | Iran, Pakistan (Persian, Urdu) |
Many Arabic-speaking countries use Western digits (0-9) in digital contexts, while the Arabic-Indic forms appear in traditional and formal text.
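Since user input may arrive in any of the three digit sets, it is common to fold them to one before parsing or comparing. The sketch below maps both Arabic digit ranges to Western digits with `str.translate`; the names `DIGIT_TABLE` and `to_western_digits` are hypothetical. Python's built-in `int()` already accepts Unicode decimal digits, so explicit folding matters mainly for storage, display, and string comparison.

```python
# Map U+0660..U+0669 and U+06F0..U+06F9 to the ASCII digits 0..9
DIGIT_TABLE = {}
for i in range(10):
    DIGIT_TABLE[0x0660 + i] = str(i)  # Arabic-Indic digits
    DIGIT_TABLE[0x06F0 + i] = str(i)  # Extended Arabic-Indic digits

def to_western_digits(text: str) -> str:
    """Replace Arabic-Indic and Extended Arabic-Indic digits with 0-9."""
    return text.translate(DIGIT_TABLE)

arabic_indic = "\u0663\u0664\u0665"  # three, four, five
persian = "\u06F1\u06F2"             # one, two

print(to_western_digits(arabic_indic))
print(to_western_digits(persian))
print(int(arabic_indic))  # int() parses Unicode decimal digits directly
```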
Working with Arabic Text in Code
Python
import unicodedata
# Arabic text is fully supported in Python 3 strings
greeting = "\u0645\u0631\u062D\u0628\u0627"  # مرحبا (Marhaba)
print(len(greeting))  # 5: one code point per letter
# Inspect each character's code point, name, and Bidi class
for char in greeting:
    print(f"U+{ord(char):04X} {unicodedata.name(char)} "
          f"bidi={unicodedata.bidirectional(char)}")
# Strings are stored in logical (reading) order. Do not reverse Arabic strings
# for display purposes; the Bidi algorithm handles visual direction.
JavaScript
// Sample string: مرحبا (Marhaba)
const text = "\u0645\u0631\u062D\u0628\u0627";
// Intl.Segmenter for proper grapheme handling
const segmenter = new Intl.Segmenter("ar", { granularity: "grapheme" });
const graphemes = [...segmenter.segment(text)];
console.log(graphemes.length); // 5
// Regex: match Arabic script characters
const arabicPattern = /\p{Script=Arabic}/u;
console.log(arabicPattern.test(text)); // true
Summary
Arabic script is one of the most beautiful and complex writing systems in Unicode. Its contextual shaping, right-to-left directionality, and extensive use across dozens of languages demand careful handling from developers and designers. The key takeaways are:
- Use abstract characters (U+0600–U+06FF), never presentation forms
- Let the shaping engine handle contextual forms — do not manually select positional glyphs
- Test Bidi behavior with mixed Arabic and Latin text, especially around punctuation and numbers
- Use CSS logical properties (inline-start, inline-end) instead of left/right for RTL layouts
- Normalize text to NFC to ensure consistent representation of characters with diacritics
- Support harakat (vowel marks) even if they are optional — they are essential for Quranic text, educational materials, and disambiguation
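As a brief illustration of the NFC point, the sketch below shows two effects of normalization on Arabic text: a base letter plus combining mark composes into a single precomposed letter, and stacked harakat are put into a canonical order so that equivalent spellings compare equal. The variable names are illustrative only.

```python
import unicodedata

# ALEF (U+0627) + MADDAH ABOVE (U+0653) composes to ALEF WITH MADDA ABOVE (U+0622)
decomposed = "\u0627\u0653"
composed = unicodedata.normalize("NFC", decomposed)
print([f"U+{ord(c):04X}" for c in composed])

# BEH followed by shadda then fatha, versus fatha then shadda:
# different code point sequences, same text after normalization.
shadda_first = "\u0628\u0651\u064E"
fatha_first = "\u0628\u064E\u0651"
same = (unicodedata.normalize("NFC", shadda_first)
        == unicodedata.normalize("NFC", fatha_first))
print(same)
```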