📜 Script Stories

Hebrew Script

Hebrew is an abjad script written right-to-left, used for Biblical Hebrew, Modern Hebrew, and Yiddish, with optional vowel diacritics called niqqud that are encoded as combining characters. This guide covers the Hebrew Unicode block, how the bidirectional algorithm handles Hebrew text, and the history of this ancient script.

Published 2023-07-03 · Updated 2025-04-21

Hebrew is one of the oldest writing systems still in daily use. Its alphabet has been in continuous use for over 3,000 years — from the ancient inscriptions of Iron Age Israel to the smartphones of modern Tel Aviv. Hebrew script is used for both Modern Hebrew (spoken by 9 million people) and Yiddish (with its distinct set of orthographic conventions), as well as Ladino and Judeo-Arabic. For Unicode, Hebrew presents a fascinating combination of challenges: right-to-left directionality, optional vowel points (nikkud), cantillation marks for biblical text, and a tradition of calligraphic and typographic complexity. This guide explores how Unicode encodes Hebrew script, how the bidirectional algorithm handles it, and what developers need to know.

History

Hebrew script descends from the Phoenician alphabet (c. 1050 BCE), one of the earliest alphabetic writing systems. The earliest known Hebrew inscriptions date to the 10th century BCE. The modern "square" letter forms (Ktav Ashuri) were adopted during the Babylonian exile (6th century BCE) and have remained largely unchanged for over 2,000 years.

Period	Script Form	Example
10th c. BCE	Paleo-Hebrew	Gezer Calendar inscription
6th c. BCE	Square script (Ktav Ashuri)	Dead Sea Scrolls
2nd c. CE	Mishna, Talmud manuscripts	Standardized square forms
10th c. CE	Tiberian vocalization system	Masoretic Text of the Bible
1880s	Modern Hebrew revival	Eliezer Ben-Yehuda
Today	Modern Hebrew (Israel)	9M+ speakers

The revival of Hebrew as a spoken language in the late 19th and early 20th centuries is one of the most remarkable linguistic achievements in history. Hebrew went from being primarily a liturgical and scholarly language to the everyday tongue of an entire nation.

The Hebrew Alphabet

Hebrew has 22 consonant letters. It is an abjad — a writing system that primarily represents consonants, with vowels optionally indicated by diacritical marks:

Letter	Name	Unicode	Transliteration	Final Form
א	Alef	U+05D0	(glottal stop)	—
ב	Bet	U+05D1	b/v	—
ג	Gimel	U+05D2	g	—
ד	Dalet	U+05D3	d	—
ה	He	U+05D4	h	—
ו	Vav	U+05D5	v/o/u	—
ז	Zayin	U+05D6	z	—
ח	Het	U+05D7	ch	—
ט	Tet	U+05D8	t	—
י	Yod	U+05D9	y/i	—
כ	Kaf	U+05DB	k/kh	ך U+05DA
ל	Lamed	U+05DC	l	—
מ	Mem	U+05DE	m	ם U+05DD
נ	Nun	U+05E0	n	ן U+05DF
ס	Samekh	U+05E1	s	—
ע	Ayin	U+05E2	(pharyngeal)	—
פ	Pe	U+05E4	p/f	ף U+05E3
צ	Tsadi	U+05E6	ts	ץ U+05E5
ק	Qof	U+05E7	q	—
ר	Resh	U+05E8	r	—
ש	Shin	U+05E9	sh/s	—
ת	Tav	U+05EA	t	—

Final Forms (Sofit)

Five Hebrew letters have alternate forms used when they appear at the end of a word: Kaf (ך), Mem (ם), Nun (ן), Pe (ף), and Tsadi (ץ). These are encoded as separate characters in Unicode — they are not contextual variants like Arabic positional forms.

# Final forms are distinct code points
FINAL_FORMS: dict[str, str] = {
    "\u05DB": "\u05DA",  # Kaf → Final Kaf
    "\u05DE": "\u05DD",  # Mem → Final Mem
    "\u05E0": "\u05DF",  # Nun → Final Nun
    "\u05E4": "\u05E3",  # Pe → Final Pe
    "\u05E6": "\u05E5",  # Tsadi → Final Tsadi
}

Unicode Blocks for Hebrew

Block	Range	Characters	Purpose
Hebrew	U+0590–U+05FF	88	Consonants, vowels, accents
Alphabetic Presentation Forms	U+FB00–U+FB4F	58 (Hebrew subset)	Ligatures, wide/alternative letters

The Main Hebrew Block (U+0590–U+05FF)

This block is organized into three sections:

Cantillation marks (U+0591–U+05AF): Accents used in biblical text
Points and vowels (U+05B0–U+05BD, U+05BF, U+05C1–U+05C2, U+05C4–U+05C5): Nikkud
Letters (U+05D0–U+05EA): The 22 consonants + 5 final forms

Alphabetic Presentation Forms (Hebrew Subset)

The Alphabetic Presentation Forms block (U+FB1D–U+FB4F) contains:

Wide letters for justified text
Alternative letter forms (e.g., alternative Ayin)
Yiddish ligatures (e.g., double Vav, Vav-Yod, double Yod)
Precomposed letter + dagesh combinations

Like Arabic Presentation Forms, these are primarily for compatibility. New text should use the base characters from the main Hebrew block.

Nikkud: The Vowel System

In everyday Modern Hebrew, text is written without vowel marks (ktiv maleh — "full writing" uses matres lectionis: Vav and Yod as vowel indicators). The full vowel system, called nikkud (ניקוד, "dotting"), is used in:

The Torah and other religious texts
Children's books and educational materials
Dictionaries and poetry
Disambiguation of homographs
Texts for Hebrew language learners

The Vowel Marks

Mark	Name	Unicode	Sound	Position
ַ	Patach	U+05B7	/a/	Below
ָ	Qamats	U+05B8	/a/ or /o/	Below
ֶ	Segol	U+05B6	/e/	Below
ֵ	Tsere	U+05B5	/e/	Below
ִ	Hiriq	U+05B4	/i/	Below
ֹ	Holam	U+05B9	/o/	Above
ֻ	Qubuts	U+05BB	/u/	Below
ְ	Shva	U+05B0	/e/ or silent	Below
ֲ	Hataf Patach	U+05B2	/a/ (reduced)	Below
ֳ	Hataf Qamats	U+05B3	/o/ (reduced)	Below
ֱ	Hataf Segol	U+05B1	/e/ (reduced)	Below

The Dagesh

The dagesh (דגש, U+05BC) is a dot placed inside a consonant that changes its pronunciation. There are two types:

Dagesh Kal (light): Changes fricative to plosive (e.g., ב /v/ → בּ /b/)
Dagesh Chazak (strong): Indicates gemination (doubling of the consonant)

Six letters change pronunciation with dagesh: Bet (בּ/ב), Gimel (גּ/ג), Dalet (דּ/ד), Kaf (כּ/כ), Pe (פּ/פ), Tav (תּ/ת). These are known as the BeGeD KeFeT letters.

The Shin Dot and Sin Dot

The letter Shin (ש) represents two different sounds, distinguished by a dot:

Form	Name	Unicode Sequence	Sound
שׁ	Shin	U+05E9 + U+05C1	/sh/
שׂ	Sin	U+05E9 + U+05C2	/s/

The dot (shin dot or sin dot) is a combining mark placed above-right or above-left of the letter.

Encoding Order for Pointed Text

When a consonant has multiple marks (vowel, dagesh, cantillation), they must be stored in a specific order. Unicode's canonical ordering for Hebrew combining marks follows this pattern:

Base consonant + Shin/Sin dot + Dagesh + Vowel + Cantillation marks

Example: שָׁלוֹם (shalom) is encoded as:

U+05E9  SHIN          ש
U+05C1  SHIN DOT      ׁ  (marks shin, not sin)
U+05B8  QAMATS        ָ  (vowel /a/)
U+05DC  LAMED         ל
U+05D5  VAV           ו
U+05B9  HOLAM         ֹ  (vowel /o/)
U+05DD  FINAL MEM     ם

import unicodedata

shalom = "\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD"
print(shalom)  # שָׁלוֹם

# Inspect each code point
for ch in shalom:
    print(f"  U+{ord(ch):04X} {unicodedata.name(ch)} "
          f"cat={unicodedata.category(ch)}")

Cantillation Marks (Te'amim)

For biblical Hebrew text, Unicode provides a comprehensive set of cantillation marks (טעמים, te'amim) — accent marks that indicate melodic patterns for liturgical reading. These occupy U+0591–U+05AF in the Hebrew block:

Mark	Name	Unicode	Position
֑	Etnahta	U+0591	Below
֒	Segol (accent)	U+0592	Above
֓	Shalshelet	U+0593	Above
֔	Zaqef Qatan	U+0594	Above
֕	Zaqef Gadol	U+0595	Above
֖	Tipeha	U+0596	Below
֗	Revia	U+0597	Above
֚	Yetiv	U+059A	Below
֛	Tevir	U+059B	Below
֣	Munah	U+05A3	Below
֤	Mahapakh	U+05A4	Below
֥	Merkha	U+05A5	Below

A fully pointed and accented biblical text can have three or more combining marks on a single consonant — a vowel, a dagesh, and one or more cantillation marks.

Bidirectional Text

Hebrew, like Arabic, is written right-to-left (RTL). The Unicode Bidirectional Algorithm (UBA) handles Hebrew text alongside LTR content. Hebrew characters have Bidi_Class R (Right-to-Left).

Common Bidi Challenges with Hebrew

Mixing Hebrew and English:

<!-- Proper isolation of embedded LTR text -->
<p dir="rtl">הפרוטוקול <bdi>HTTP/2</bdi> הוא מהיר יותר.</p>

Numbers in Hebrew text: Hebrew uses Western digits (0-9), which are classified as European Number (EN) in the Bidi algorithm. They generally render correctly, but punctuation adjacent to numbers can jump to unexpected positions.

Parentheses and brackets: These are neutral characters whose direction is resolved by context. In Hebrew text, parentheses are automatically mirrored:

English: Hello (world)
Hebrew:  (שלום (עולם    — parentheses mirror in RTL context

HTML/CSS for Hebrew

<html dir="rtl" lang="he">
<head>
  <style>
    body {
      direction: rtl;
      unicode-bidi: isolate;
      text-align: start;
      font-family: "Frank Ruhl Libre", "David", serif;
    }

    /* CSS logical properties */
    .indent {
      margin-inline-start: 2rem;  /* Right margin in RTL */
      padding-inline-end: 1rem;   /* Left padding in RTL */
    }

    /* Pointed text needs extra line-height for marks */
    .nikkud {
      line-height: 2;
    }
  </style>
</head>

Yiddish in Unicode

Yiddish uses Hebrew script but with significant orthographic differences. While Hebrew uses consonant letters with optional vowels, Yiddish uses certain Hebrew letters as full vowels:

Hebrew Letter	Yiddish Use	Sound
א (Alef)	Silent or /a/	Depends on context
אַ (Alef + Patach)	/a/	Always /a/
אָ (Alef + Qamats)	/o/	Always /o/
ו (Vav)	/u/	Always /u/
וּ (Vav + dagesh)	/u/ (explicitly marked)	/u/
וו (double Vav)	/v/	Consonant
י (Yod)	/i/	Always /i/
יי (double Yod)	/ey/	Diphthong
ײַ (double Yod + Patach)	/ay/	Diphthong

The Alphabetic Presentation Forms block includes Yiddish ligatures:

Character	Unicode	Name
ﬠ	U+FB20	ALTERNATIVE AYIN
ﬡ	U+FB21	WIDE ALEF
ײ	U+FB1F	YIDDISH YOD YOD PATACH
וו	U+FB35	VAV WITH DAGESH (for Yiddish /u/)

Gematria: Numerical Values

Hebrew letters have traditional numerical values, a system called gematria (גימטריה). This is used in religious texts, dates on Jewish gravestones, and page numbering in some Hebrew books:

Letters	Values
א-ט	1–9
י-צ	10–90
ק-ת	100–400

GEMATRIA: dict[str, int] = {
    "\u05D0": 1, "\u05D1": 2, "\u05D2": 3, "\u05D3": 4,
    "\u05D4": 5, "\u05D5": 6, "\u05D6": 7, "\u05D7": 8,
    "\u05D8": 9, "\u05D9": 10, "\u05DB": 20, "\u05DC": 30,
    "\u05DE": 40, "\u05E0": 50, "\u05E1": 60, "\u05E2": 70,
    "\u05E4": 80, "\u05E6": 90, "\u05E7": 100, "\u05E8": 200,
    "\u05E9": 300, "\u05EA": 400,
}

def gematria_value(word: str) -> int:
    # Calculate the gematria value of a Hebrew word.
    return sum(GEMATRIA.get(ch, 0) for ch in word)

# שלום (shalom) = 300 + 30 + 6 + 40 = 376
print(gematria_value("\u05E9\u05DC\u05D5\u05DD"))  # 376

Working with Hebrew in Code

Python

import unicodedata

# Modern Hebrew (unpointed)
text = "\u05E9\u05DC\u05D5\u05DD"  # שלום (shalom)
print(len(text))  # 4 — one code point per letter

# Strip nikkud from pointed text
def strip_nikkud(text: str) -> str:
    # Remove vowel points and cantillation marks.
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Mn"
        or not (0x0591 <= ord(ch) <= 0x05C7)
    )

pointed = "\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD"  # שָׁלוֹם
print(strip_nikkud(pointed))  # שלום

JavaScript

// Match Hebrew characters
const hebrewPattern = /\p{Script=Hebrew}/u;
const text = "\u05E9\u05DC\u05D5\u05DD";
console.log(hebrewPattern.test(text)); // true

// Strip nikkud (combining marks in Hebrew range)
function stripNikkud(text) {
    return text.normalize("NFD").replace(/[\u0591-\u05C7]/g, "");
}

console.log(stripNikkud("\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD"));
// שלום

Summary

Hebrew script combines ancient tradition with modern practicality. Its Unicode encoding handles everything from casual Modern Hebrew text messages to fully pointed and accented biblical manuscripts. Key takeaways:

Hebrew is an abjad — consonant-only writing with optional vowels (nikkud) encoded as combining marks
Final forms are separate code points — unlike Arabic contextual shaping, Hebrew final letters (ך, ם, ן, ף, ץ) have their own code points
Nikkud order matters — follow Unicode canonical ordering: base letter, shin/sin dot, dagesh, vowel, cantillation
Right-to-left handling requires proper dir="rtl" attributes and CSS logical properties
Strip nikkud for search — Modern Hebrew text is usually unpointed, so search logic should normalize by removing combining marks
Yiddish uses Hebrew letters differently — certain letters serve as vowels, and Yiddish has its own ligatures in the Presentation Forms block
Biblical Hebrew adds cantillation marks on top of nikkud, potentially stacking 3+ combining marks per consonant — ensure adequate line-height