
Hebrew Script

Hebrew is an abjad script written right-to-left, used for Biblical Hebrew, Modern Hebrew, and Yiddish, with optional vowel diacritics called nikkud that are encoded as combining characters. This guide covers the Hebrew Unicode block, how the bidirectional algorithm handles Hebrew text, and the history of this ancient script.


Hebrew is one of the oldest writing systems still in daily use. Its alphabet has been in continuous use for over 3,000 years — from the ancient inscriptions of Iron Age Israel to the smartphones of modern Tel Aviv. Hebrew script is used for both Modern Hebrew (spoken by 9 million people) and Yiddish (with its distinct set of orthographic conventions), as well as Ladino and Judeo-Arabic. For Unicode, Hebrew presents a fascinating combination of challenges: right-to-left directionality, optional vowel points (nikkud), cantillation marks for biblical text, and a tradition of calligraphic and typographic complexity. This guide explores how Unicode encodes Hebrew script, how the bidirectional algorithm handles it, and what developers need to know.

History

Hebrew script descends from the Phoenician alphabet (c. 1050 BCE), one of the earliest alphabetic writing systems. The earliest known Hebrew inscriptions date to the 10th century BCE. The modern "square" letter forms (Ktav Ashuri) were adopted during the Babylonian exile (6th century BCE) and have remained largely unchanged for over 2,000 years.

Period | Script Form | Example
10th c. BCE | Paleo-Hebrew | Gezer Calendar inscription
6th c. BCE | Square script (Ktav Ashuri) | Dead Sea Scrolls
2nd c. CE | Standardized square forms | Mishna, Talmud manuscripts
10th c. CE | Tiberian vocalization system | Masoretic Text of the Bible
1880s | Modern Hebrew revival | Eliezer Ben-Yehuda
Today | Modern Hebrew (Israel) | 9M+ speakers

The revival of Hebrew as a spoken language in the late 19th and early 20th centuries is one of the most remarkable linguistic achievements in history. Hebrew went from being primarily a liturgical and scholarly language to the everyday tongue of an entire nation.

The Hebrew Alphabet

Hebrew has 22 consonant letters. It is an abjad — a writing system that primarily represents consonants, with vowels optionally indicated by diacritical marks:

Letter | Name | Unicode | Transliteration | Final Form
א | Alef | U+05D0 | (glottal stop) | -
ב | Bet | U+05D1 | b/v | -
ג | Gimel | U+05D2 | g | -
ד | Dalet | U+05D3 | d | -
ה | He | U+05D4 | h | -
ו | Vav | U+05D5 | v/o/u | -
ז | Zayin | U+05D6 | z | -
ח | Het | U+05D7 | ch | -
ט | Tet | U+05D8 | t | -
י | Yod | U+05D9 | y/i | -
כ | Kaf | U+05DB | k/kh | ך U+05DA
ל | Lamed | U+05DC | l | -
מ | Mem | U+05DE | m | ם U+05DD
נ | Nun | U+05E0 | n | ן U+05DF
ס | Samekh | U+05E1 | s | -
ע | Ayin | U+05E2 | (pharyngeal) | -
פ | Pe | U+05E4 | p/f | ף U+05E3
צ | Tsadi | U+05E6 | ts | ץ U+05E5
ק | Qof | U+05E7 | q | -
ר | Resh | U+05E8 | r | -
ש | Shin | U+05E9 | sh/s | -
ת | Tav | U+05EA | t | -

Final Forms (Sofit)

Five Hebrew letters have alternate forms used when they appear at the end of a word: Kaf (ך), Mem (ם), Nun (ן), Pe (ף), and Tsadi (ץ). These are encoded as separate characters in Unicode — they are not contextual variants like Arabic positional forms.

# Final forms are distinct code points
FINAL_FORMS: dict[str, str] = {
    "\u05DB": "\u05DA",  # Kaf → Final Kaf
    "\u05DE": "\u05DD",  # Mem → Final Mem
    "\u05E0": "\u05DF",  # Nun → Final Nun
    "\u05E4": "\u05E3",  # Pe → Final Pe
    "\u05E6": "\u05E5",  # Tsadi → Final Tsadi
}
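As one way to put this mapping to use, the sketch below swaps a trailing base letter for its final form. The function name `apply_final_form` is hypothetical, and the mapping is repeated so the snippet runs on its own. It assumes unpointed input and ignores real-world exceptions, such as loanwords that keep a non-final Pe at the end.

```python
# Base letter -> final (sofit) form; the same five pairs listed above.
FINAL_FORMS = {
    "\u05DB": "\u05DA",  # Kaf -> Final Kaf
    "\u05DE": "\u05DD",  # Mem -> Final Mem
    "\u05E0": "\u05DF",  # Nun -> Final Nun
    "\u05E4": "\u05E3",  # Pe -> Final Pe
    "\u05E6": "\u05E5",  # Tsadi -> Final Tsadi
}


def apply_final_form(word: str) -> str:
    """Replace the last letter of an unpointed word with its final form, if it has one."""
    if not word:
        return word
    last = word[-1]
    return word[:-1] + FINAL_FORMS.get(last, last)


# Shin, Lamed, Vav, Mem: the trailing Mem becomes Final Mem (U+05DD).
fixed = apply_final_form("\u05E9\u05DC\u05D5\u05DE")
print([f"U+{ord(c):04X}" for c in fixed])
```

Because the final forms are ordinary code points, the swap is a plain string replacement; no shaping engine is involved.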

Unicode Blocks for Hebrew

Block | Range | Characters | Purpose
Hebrew | U+0590–U+05FF | 88 | Consonants, vowels, accents
Alphabetic Presentation Forms | U+FB00–U+FB4F | 58 (46 in the Hebrew subset, U+FB1D–U+FB4F) | Ligatures, wide/alternative letters

The Main Hebrew Block (U+0590–U+05FF)

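This block is organized into three main sections, with a few punctuation marks (such as maqaf U+05BE and sof pasuq U+05C3) interleaved among the points, and the Yiddish digraphs at U+05F0–U+05F2: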

  1. Cantillation marks (U+0591–U+05AF): Accents used in biblical text
  2. Points and vowels (U+05B0–U+05BD, U+05BF, U+05C1–U+05C2, U+05C4–U+05C5): Nikkud
  3. Letters (U+05D0–U+05EA): The 22 consonants + 5 final forms
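The layout above can be turned into a rough classifier. This is a minimal sketch: the function name `hebrew_section` and its labels are made up for illustration, and the "point" label lumps in every nonspacing mark from U+05B0 to U+05C7, which is slightly broader than the nikkud ranges listed above.

```python
def hebrew_section(ch: str) -> str:
    """Return a rough label for where a code point sits in the Hebrew block."""
    cp = ord(ch)
    if 0x0591 <= cp <= 0x05AF:
        return "cantillation"
    # Punctuation interleaved with the points: maqaf, paseq, sof pasuq, nun hafukha.
    if cp in (0x05BE, 0x05C0, 0x05C3, 0x05C6):
        return "punctuation"
    if 0x05B0 <= cp <= 0x05C7:
        return "point"
    if 0x05D0 <= cp <= 0x05EA:
        return "letter"
    if 0x0590 <= cp <= 0x05FF:
        return "other"  # e.g. Yiddish digraphs, geresh, gershayim
    return "not-hebrew-block"


for sample in ("\u05D0", "\u05B8", "\u0591", "\u05BE", "A"):
    print(f"U+{ord(sample):04X}", hebrew_section(sample))
```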

Alphabetic Presentation Forms (Hebrew Subset)

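The Hebrew portion of the Alphabetic Presentation Forms block (U+FB1D–U+FB4F) contains:

  • Wide letters for justified text
  • Alternative letter forms (e.g., alternative Ayin)
  • Ligatures (e.g., Yiddish Yod-Yod-Patah at U+FB1F, Alef-Lamed at U+FB4F)
  • Precomposed letter + dagesh combinations

The Yiddish digraphs double Vav, Vav-Yod, and double Yod are not part of this block; they sit in the main Hebrew block at U+05F0–U+05F2.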

Like Arabic Presentation Forms, these are primarily for compatibility. New text should use the base characters from the main Hebrew block.
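To see what folding presentation forms back to base characters looks like in practice, here is a small sketch using Python's standard `unicodedata` module. The two sample characters are chosen for illustration: U+FB2A (Shin with shin dot) has a canonical decomposition, while U+FB21 (wide Alef) only has a compatibility mapping.

```python
import unicodedata

SHIN_WITH_SHIN_DOT = "\uFB2A"  # precomposed letter + point
WIDE_ALEF = "\uFB21"           # wide variant for justified text

# Letter + point forms decompose canonically and are excluded from
# recomposition, so even NFC returns the base letter plus combining mark.
decomposed = unicodedata.normalize("NFC", SHIN_WITH_SHIN_DOT)
print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+05E9', 'U+05C1']

# Wide letters only have a compatibility mapping: NFC leaves them alone,
# and NFKC is needed to fold them to the base letter.
print(unicodedata.normalize("NFC", WIDE_ALEF) == WIDE_ALEF)  # True
print(unicodedata.normalize("NFKC", WIDE_ALEF) == "\u05D0")  # True
```

Whether to apply NFKC is a design choice: it also folds unrelated compatibility characters elsewhere in the text, so some pipelines may prefer a targeted mapping instead.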

Nikkud: The Vowel System

Everyday Modern Hebrew is written without vowel marks. Instead, the standard unpointed spelling, ktiv maleh ("full writing"), uses matres lectionis: the letters Vav and Yod double as vowel indicators. The full vowel system, called nikkud (ניקוד, "dotting"), is used in:

  • The Torah and other religious texts
  • Children's books and educational materials
  • Dictionaries and poetry
  • Disambiguation of homographs
  • Texts for Hebrew language learners

The Vowel Marks

Mark | Name | Unicode | Sound | Position
ַ | Patach | U+05B7 | /a/ | Below
ָ | Qamats | U+05B8 | /a/ or /o/ | Below
ֶ | Segol | U+05B6 | /e/ | Below
ֵ | Tsere | U+05B5 | /e/ | Below
ִ | Hiriq | U+05B4 | /i/ | Below
ֹ | Holam | U+05B9 | /o/ | Above
ֻ | Qubuts | U+05BB | /u/ | Below
ְ | Shva | U+05B0 | /e/ or silent | Below
ֲ | Hataf Patach | U+05B2 | /a/ (reduced) | Below
ֳ | Hataf Qamats | U+05B3 | /o/ (reduced) | Below
ֱ | Hataf Segol | U+05B1 | /e/ (reduced) | Below

The Dagesh

The dagesh (דגש, U+05BC) is a dot placed inside a consonant that changes its pronunciation. There are two types:

  • Dagesh Kal (light): Changes fricative to plosive (e.g., ב /v/ → בּ /b/)
  • Dagesh Chazak (strong): Indicates gemination (doubling of the consonant)

Historically, six letters change pronunciation with dagesh: Bet (בּ/ב), Gimel (גּ/ג), Dalet (דּ/ד), Kaf (כּ/כ), Pe (פּ/פ), Tav (תּ/ת). These are known as the BeGeD KeFeT letters. In Modern Hebrew pronunciation, only Bet, Kaf, and Pe still show the contrast.

The Shin Dot and Sin Dot

The letter Shin (ש) represents two different sounds, distinguished by a dot:

Form | Name | Unicode Sequence | Sound
שׁ | Shin | U+05E9 + U+05C1 | /sh/
שׂ | Sin | U+05E9 + U+05C2 | /s/

The shin dot (U+05C1) sits above the right arm of the letter and the sin dot (U+05C2) above the left arm. Both are combining marks that follow the base letter in the text stream.
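Since the distinction lives entirely in a combining mark, code can tell the two readings apart by looking at the marks after the base letter. The helper below is a hypothetical sketch (the name `shin_or_sin` is not from any library) and expects a single letter followed by its marks.

```python
SHIN = "\u05E9"
SHIN_DOT = "\u05C1"
SIN_DOT = "\u05C2"


def shin_or_sin(cluster: str) -> str:
    """Classify a Shin letter plus its combining marks as 'shin', 'sin', or 'unmarked'."""
    if not cluster.startswith(SHIN):
        raise ValueError("cluster must start with U+05E9")
    marks = cluster[1:]
    if SHIN_DOT in marks:
        return "shin"
    if SIN_DOT in marks:
        return "sin"
    return "unmarked"  # unpointed text: the reader decides from context


print(shin_or_sin("\u05E9\u05C1\u05B8"))  # shin
print(shin_or_sin("\u05E9\u05C2"))        # sin
print(shin_or_sin("\u05E9"))              # unmarked
```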

Encoding Order for Pointed Text

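When a consonant has multiple marks (vowel, dagesh, cantillation), the order in which they are stored affects binary string comparison. A commonly recommended input order is:

Base consonant + Shin/Sin dot + Dagesh + Vowel + Cantillation marks

This is not the order Unicode normalization produces. NFC and NFD sort marks by canonical combining class: vowels (classes 10–20) come first, then dagesh (21), then the shin dot (24) or sin dot (25), then cantillation marks (mostly 220 or 230). Normalize both sides before comparing pointed strings.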

Example: שָׁלוֹם (shalom) is encoded as:

U+05E9  SHIN          ש
U+05C1  SHIN DOT      ׁ  (marks shin, not sin)
U+05B8  QAMATS        ָ  (vowel /a/)
U+05DC  LAMED         ל
U+05D5  VAV           ו
U+05B9  HOLAM         ֹ  (vowel /o/)
U+05DD  FINAL MEM     ם

The same sequence can be built and inspected in Python:

import unicodedata

shalom = "\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD"
print(shalom)  # שָׁלוֹם

# Inspect each code point
for ch in shalom:
    print(f"  U+{ord(ch):04X} {unicodedata.name(ch)} "
          f"cat={unicodedata.category(ch)}")
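One practical consequence of mark ordering is that normalization may reorder the marks on a consonant. The sketch below, using only the standard `unicodedata` module, shows the combining classes involved and how NFC moves the shin dot after the vowel; the variable names are illustrative.

```python
import unicodedata

typed = "\u05E9\u05C1\u05B8"  # Shin + shin dot + qamats, as in the example above

# Canonical combining classes decide the order after normalization.
for ch in typed[1:]:
    print(unicodedata.name(ch), unicodedata.combining(ch))
# HEBREW POINT SHIN DOT 24
# HEBREW POINT QAMATS 18

normalized = unicodedata.normalize("NFC", typed)
print([f"U+{ord(c):04X}" for c in normalized])  # ['U+05E9', 'U+05B8', 'U+05C1']

# Two spellings that differ only in mark order compare equal once normalized.
other = "\u05E9\u05B8\u05C1"
print(typed == other)                                     # False
print(unicodedata.normalize("NFC", other) == normalized)  # True
```

Comparing normalized forms rather than raw strings avoids false mismatches between texts produced by different keyboards or editors.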

Cantillation Marks (Te'amim)

For biblical Hebrew text, Unicode provides a comprehensive set of cantillation marks (טעמים, te'amim) — accent marks that indicate melodic patterns for liturgical reading. These occupy U+0591–U+05AF in the Hebrew block:

Mark | Name | Unicode | Position
֑ | Etnahta | U+0591 | Below
֒ | Segol (accent) | U+0592 | Above
֓ | Shalshelet | U+0593 | Above
֔ | Zaqef Qatan | U+0594 | Above
֕ | Zaqef Gadol | U+0595 | Above
֖ | Tipeha | U+0596 | Below
֗ | Revia | U+0597 | Above
֚ | Yetiv | U+059A | Below
֛ | Tevir | U+059B | Below
֣ | Munah | U+05A3 | Below
֤ | Mahapakh | U+05A4 | Below
֥ | Merkha | U+05A5 | Below

A fully pointed and accented biblical text can have three or more combining marks on a single consonant — a vowel, a dagesh, and one or more cantillation marks.
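To get a feel for how heavily marks stack, the following sketch pairs each base character with the number of combining marks attached to it. The function name `mark_counts` is hypothetical, and the sample sequence is constructed for illustration rather than quoted from a biblical verse.

```python
import unicodedata


def mark_counts(text: str) -> list[tuple[str, int]]:
    """Pair each base character with the number of combining marks that follow it."""
    result: list[tuple[str, int]] = []
    for ch in text:
        # A nonzero combining class means the character attaches to the previous base.
        if unicodedata.combining(ch) and result:
            base, count = result[-1]
            result[-1] = (base, count + 1)
        else:
            result.append((ch, 0))
    return result


# Bet + dagesh + shva + etnahta: one consonant carrying three marks.
sample = "\u05D1\u05BC\u05B0\u0591"
print(mark_counts(sample)[0][1])  # 3
```

A count like this can help decide, for example, when a line needs the extra vertical space that heavily accented text requires.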

Bidirectional Text

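Hebrew, like Arabic, is written right-to-left (RTL). The Unicode Bidirectional Algorithm (UBA) handles Hebrew text alongside LTR content. Hebrew letters have Bidi_Class R (Right-to-Left); the combining points and accents are NSM (Nonspacing Mark) and take on the direction of their base letter.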

Common Bidi Challenges with Hebrew

Mixing Hebrew and English:

<!-- Proper isolation of embedded LTR text -->
<p dir="rtl">הפרוטוקול <bdi>HTTP/2</bdi> הוא מהיר יותר.</p>

Numbers in Hebrew text: Hebrew uses Western digits (0-9), which are classified as European Number (EN) in the Bidi algorithm. They generally render correctly, but punctuation adjacent to numbers can jump to unexpected positions.

Parentheses and brackets: These are neutral characters whose direction is resolved by context. They are stored in logical order, with the opening character first, and the renderer mirrors the glyphs in an RTL run:

English: Hello (world)
Hebrew:  שלום (עולם)
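The character properties behind these behaviors can be queried directly. The sketch below uses Python's standard `unicodedata` module to print the bidi class and mirrored flag for a few representative characters; the sample labels are only for illustration.

```python
import unicodedata

samples = {
    "alef": "\u05D0",    # Hebrew letter
    "qamats": "\u05B8",  # combining vowel point
    "digit": "7",        # Western digit
    "paren": "(",        # neutral punctuation with a mirrored glyph
}

for label, ch in samples.items():
    print(label, unicodedata.bidirectional(ch), unicodedata.mirrored(ch))
# alef R 0
# qamats NSM 0
# digit EN 0
# paren ON 1
```

The mirrored flag is why a stored "(" is drawn as ")" inside a right-to-left run: the code point stays the same and only the glyph changes.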

HTML/CSS for Hebrew

<html dir="rtl" lang="he">
<head>
  <style>
    body {
      direction: rtl;
      unicode-bidi: isolate;
      text-align: start;
      font-family: "Frank Ruhl Libre", "David", serif;
    }

    /* CSS logical properties */
    .indent {
      margin-inline-start: 2rem;  /* Right margin in RTL */
      padding-inline-end: 1rem;   /* Left padding in RTL */
    }

    /* Pointed text needs extra line-height for marks */
    .nikkud {
      line-height: 2;
    }
  </style>
</head>
<body>
  <p class="nikkud">שָׁלוֹם</p>
</body>
</html>

Yiddish in Unicode

Yiddish uses Hebrew script but with significant orthographic differences. While Hebrew uses consonant letters with optional vowels, Yiddish uses certain Hebrew letters as full vowels:

Hebrew Letter | Yiddish Use | Sound
א (Alef) | Silent or /a/ | Depends on context
אַ (Alef + Patach) | /a/ | Always /a/
אָ (Alef + Qamats) | /o/ | Always /o/
ו (Vav) | /u/ | Always /u/
וּ (Vav + dagesh) | /u/ (explicitly marked) | /u/
וו (double Vav) | /v/ | Consonant
י (Yod) | /i/ | Always /i/
יי (double Yod) | /ey/ | Diphthong
ײַ (double Yod + Patach) | /ay/ | Diphthong

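The Alphabetic Presentation Forms block includes several precomposed and variant forms that appear in Yiddish text. The Yiddish digraphs themselves (double Vav U+05F0, Vav-Yod U+05F1, double Yod U+05F2) are in the main Hebrew block.

Character | Unicode | Name
ﬠ | U+FB20 | HEBREW LETTER ALTERNATIVE AYIN
ﬡ | U+FB21 | HEBREW LETTER WIDE ALEF
ײַ | U+FB1F | HEBREW LIGATURE YIDDISH YOD YOD PATAH
וּ | U+FB35 | HEBREW LETTER VAV WITH DAGESH (for Yiddish /u/)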
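Yiddish text may spell a digraph either as two separate letters or as one of the single digraph code points in the main Hebrew block (U+05F0–U+05F2). For search and comparison it can help to fold one spelling into the other. The sketch below is one possible approach; the names `YIDDISH_DIGRAPHS` and `fold_yiddish_digraphs` are hypothetical, and the sample word is only an illustration.

```python
# U+05F0..U+05F2 have no Unicode decomposition, so normalization leaves
# them untouched and an explicit mapping is needed.
YIDDISH_DIGRAPHS = {
    "\u05F0": "\u05D5\u05D5",  # double Vav -> Vav + Vav
    "\u05F1": "\u05D5\u05D9",  # Vav Yod    -> Vav + Yod
    "\u05F2": "\u05D9\u05D9",  # double Yod -> Yod + Yod
}

_FOLD_TABLE = str.maketrans(YIDDISH_DIGRAPHS)


def fold_yiddish_digraphs(text: str) -> str:
    """Expand single-code-point Yiddish digraphs into two-letter sequences."""
    return text.translate(_FOLD_TABLE)


# A word beginning with the double-Vav digraph, followed by Alef + qamats and Samekh.
word = "\u05F0\u05D0\u05B8\u05E1"
print([f"U+{ord(c):04X}" for c in fold_yiddish_digraphs(word)])
```

Folding toward the two-letter spelling is a design choice made here because it matches what a standard Hebrew keyboard produces; folding the other way would be ambiguous across word boundaries.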

Gematria: Numerical Values

Hebrew letters have traditional numerical values, a system called gematria (גימטריה). This is used in religious texts, dates on Jewish gravestones, and page numbering in some Hebrew books:

Letters | Values
א-ט | 1–9
י-צ | 10–90
ק-ת | 100–400

The mapping in Python, with final forms included because a word such as שלום ends in Final Mem:

GEMATRIA: dict[str, int] = {
    "\u05D0": 1, "\u05D1": 2, "\u05D2": 3, "\u05D3": 4,
    "\u05D4": 5, "\u05D5": 6, "\u05D6": 7, "\u05D7": 8,
    "\u05D8": 9, "\u05D9": 10, "\u05DB": 20, "\u05DC": 30,
    "\u05DE": 40, "\u05E0": 50, "\u05E1": 60, "\u05E2": 70,
    "\u05E4": 80, "\u05E6": 90, "\u05E7": 100, "\u05E8": 200,
    "\u05E9": 300, "\u05EA": 400,
    # Final forms carry the same value as their base letters in standard gematria.
    "\u05DA": 20, "\u05DD": 40, "\u05DF": 50, "\u05E3": 80, "\u05E5": 90,
}

def gematria_value(word: str) -> int:
    """Calculate the gematria value of a Hebrew word; other characters count as 0."""
    return sum(GEMATRIA.get(ch, 0) for ch in word)

# שלום (shalom) = 300 + 30 + 6 + 40 = 376
print(gematria_value("\u05E9\u05DC\u05D5\u05DD"))  # 376
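Page numbers and dates go the other way, from a number to letters. The sketch below is a simplified illustration under stated assumptions: it covers 1–999 only, omits the geresh and gershayim punctuation that printed numerals usually carry, and applies the customary substitution that writes 15 and 16 as 9+6 and 9+7. The names `VALUES` and `to_hebrew_numeral` are hypothetical.

```python
# Letter values in descending order, base forms only.
VALUES = [
    (400, "\u05EA"), (300, "\u05E9"), (200, "\u05E8"), (100, "\u05E7"),
    (90, "\u05E6"), (80, "\u05E4"), (70, "\u05E2"), (60, "\u05E1"),
    (50, "\u05E0"), (40, "\u05DE"), (30, "\u05DC"), (20, "\u05DB"),
    (10, "\u05D9"), (9, "\u05D8"), (8, "\u05D7"), (7, "\u05D6"),
    (6, "\u05D5"), (5, "\u05D4"), (4, "\u05D3"), (3, "\u05D2"),
    (2, "\u05D1"), (1, "\u05D0"),
]


def to_hebrew_numeral(n: int) -> str:
    """Write 1..999 additively in Hebrew letters, largest values first."""
    if not 1 <= n <= 999:
        raise ValueError("this sketch only handles 1..999")
    out: list[str] = []
    for value, letter in VALUES:
        while n >= value:
            if n in (15, 16):
                # Customary spelling: Tet + Vav (9+6) or Tet + Zayin (9+7).
                out.append("\u05D8" + ("\u05D5" if n == 15 else "\u05D6"))
                n = 0
                break
            out.append(letter)
            n -= value
    return "".join(out)


print(to_hebrew_numeral(376))  # Shin, Ayin, Vav
print(to_hebrew_numeral(15))   # Tet, Vav
```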

Working with Hebrew in Code

Python

import unicodedata

# Modern Hebrew (unpointed)
text = "\u05E9\u05DC\u05D5\u05DD"  # שלום (shalom)
print(len(text))  # 4 — one code point per letter

# Strip nikkud from pointed text
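def strip_nikkud(text: str) -> str:
    """Remove vowel points and cantillation marks (nonspacing marks in U+0591–U+05C7)."""
    return "".join(
        ch for ch in text
        # Drop a character only if it is both a nonspacing mark and in the Hebrew mark range.
        if not (unicodedata.category(ch) == "Mn" and 0x0591 <= ord(ch) <= 0x05C7)
    )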

pointed = "\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD"  # שָׁלוֹם
print(strip_nikkud(pointed))  # שלום
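Stripping marks is most useful as a search step, so that an unpointed query matches pointed text. The sketch below builds on the same idea; `search_key` and `contains_ignoring_points` are hypothetical names, and the NFD step is an added assumption so that precomposed presentation forms are decomposed before the marks are removed.

```python
import unicodedata


def search_key(text: str) -> str:
    """Decompose, then drop every nonspacing mark in U+0591..U+05C7."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        if not (unicodedata.category(ch) == "Mn" and 0x0591 <= ord(ch) <= 0x05C7)
    )


def contains_ignoring_points(haystack: str, needle: str) -> bool:
    """Substring test that ignores nikkud and cantillation on both sides."""
    return search_key(needle) in search_key(haystack)


pointed = "\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD"  # pointed shalom
query = "\u05E9\u05DC\u05D5\u05DD"                      # unpointed shalom
print(contains_ignoring_points(pointed, query))  # True
```

Applying the same key function to both the query and the indexed text keeps the comparison symmetric.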

JavaScript

// Match Hebrew characters
const hebrewPattern = /\p{Script=Hebrew}/u;
const text = "\u05E9\u05DC\u05D5\u05DD";
console.log(hebrewPattern.test(text)); // true

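// Strip nikkud and cantillation (combining marks in the Hebrew block).
// The character class skips the punctuation interleaved in this range:
// maqaf U+05BE, paseq U+05C0, sof pasuq U+05C3, nun hafukha U+05C6.
function stripNikkud(text) {
    return text
        .normalize("NFD")
        .replace(/[\u0591-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7]/g, "");
}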

console.log(stripNikkud("\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD"));
// שלום

Summary

Hebrew script combines ancient tradition with modern practicality. Its Unicode encoding handles everything from casual Modern Hebrew text messages to fully pointed and accented biblical manuscripts. Key takeaways:

  1. Hebrew is an abjad — consonant-only writing with optional vowels (nikkud) encoded as combining marks
  2. Final forms are separate code points — unlike Arabic contextual shaping, Hebrew final letters (ך, ם, ן, ף, ץ) have their own code points
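  3. Nikkud order matters: a common input order is base letter, shin/sin dot, dagesh, vowel, cantillation, but Unicode normalization reorders marks by combining class (vowel, dagesh, shin/sin dot, cantillation), so normalize before comparing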
  4. Right-to-left handling requires proper dir="rtl" attributes and CSS logical properties
  5. Strip nikkud for search — Modern Hebrew text is usually unpointed, so search logic should normalize by removing combining marks
  6. Yiddish uses Hebrew letters differently: certain letters serve as vowels, and Yiddish has dedicated digraph characters in the main Hebrew block (U+05F0–U+05F2) plus compatibility forms in the Presentation Forms block
  7. Biblical Hebrew adds cantillation marks on top of nikkud, potentially stacking 3+ combining marks per consonant — ensure adequate line-height
