Rendering Complex Scripts — The Developer's Unicode Handbook

Text rendering looks simple from the outside: given a string, draw glyphs on screen. But for many of the world's scripts, there's a complex pipeline between the characters in memory and the pixels on screen. Arabic letters change shape depending on their position in a word. Devanagari consonant clusters form ligatures. Thai vowels float above and below the baseline. Emoji sequences combine multiple code points into a single glyph. When this pipeline fails, you get empty boxes known as tofu (□), or visually broken text that looks nothing like what users expect. This chapter explains the pipeline and where it can go wrong.

The Text Rendering Pipeline

Before a character reaches the screen, it passes through several stages:

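BiDi analysis: Run the Unicode Bidirectional Algorithm to determine the direction (embedding level) of every character.
Itemization: Split the string into runs that share the same script, direction, and style.
Font selection: Choose a font for each run and walk the fallback chain when a glyph is missing. This has to happen before shaping, because the shaping rules live in the font.
Shaping: Map code points to glyphs, then apply the font's OpenType GSUB (glyph substitution) and GPOS (glyph positioning) rules.
Reordering and rasterization: Place RTL runs in visual order so they flow right-to-left, then render the glyphs at the target size with antialiasing.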

Most application developers interact with this through high-level APIs (CoreText on Apple, DirectWrite on Windows, Pango on Linux) which handle everything internally. But when text looks wrong, you need to understand which stage failed.
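
To make the early stages concrete, here is a minimal sketch of direction itemization in Python. It is a deliberate simplification and not the Unicode Bidirectional Algorithm: it looks only at strong character types from the standard unicodedata module and attaches neutral characters (spaces, digits, punctuation) to the preceding run. The names direction_of and split_direction_runs are invented for this illustration.

```python
import unicodedata
from typing import Optional


def direction_of(ch: str) -> Optional[str]:
    """Return 'rtl' or 'ltr' for strongly directional characters, else None."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class in ("R", "AL"):  # Hebrew-style and Arabic-style letters
        return "rtl"
    if bidi_class == "L":
        return "ltr"
    return None  # neutral or weak: spaces, digits, punctuation, marks


def split_direction_runs(text: str, base: str = "ltr") -> list[tuple[str, str]]:
    """Group text into (direction, substring) runs.

    Simplification: neutral characters inherit the direction of the run
    they follow. A real BiDi implementation resolves them by context.
    """
    runs: list[tuple[str, str]] = []
    current_dir = base
    current_chars: list[str] = []
    for ch in text:
        ch_dir = direction_of(ch) or current_dir
        if ch_dir != current_dir and current_chars:
            runs.append((current_dir, "".join(current_chars)))
            current_chars = []
        current_dir = ch_dir
        current_chars.append(ch)
    if current_chars:
        runs.append((current_dir, "".join(current_chars)))
    return runs


if __name__ == "__main__":
    for direction, run in split_direction_runs("abc \u0643\u062A\u0628 def"):
        print(direction, repr(run))
```

Each run produced this way would then be handed to font selection and shaping as a unit, which is why the later stages never see isolated characters.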

OpenType Shaping: GSUB and GPOS

OpenType fonts contain lookup tables that specify how sequences of glyphs should be transformed. The font's cmap table first maps each character to a default glyph. GSUB (Glyph SUBstitution) then replaces glyphs or glyph sequences with others, such as ligatures and contextual forms. GPOS (Glyph POSitioning) adjusts glyph positions for mark attachment and kerning.

Arabic: Contextual Forms

Arabic letters have four forms depending on position in a word: isolated, initial, medial, and final. The shaping engine selects the correct form:

ك (isolated) + ت (isolated) = ك + ت  ← wrong, not shaped
Shaped: كت  ← engine selected initial ك and final ت

If shaping fails (as it does when rendering Arabic characters one at a time rather than as a run), you get disconnected isolated forms instead of properly joined text.

# This is why you must never manipulate Arabic text character by character
# for display purposes

# Wrong: iterate characters and render individually
arabic = "\u0643\u062A\u0628"  # كتب (he wrote)
for char in arabic:
    print(char)  # ك ت ب — isolated forms, looks wrong

# Right: pass the entire string to the rendering engine
# In Python (web context), just output the whole string in HTML
print(arabic)  # كتب — browser shapes it correctly

# The lesson: never split Arabic/Hebrew text into individual characters for layout
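
The selection of contextual forms can be sketched in a few lines of Python. This is a toy model under stated assumptions: the joining table covers only seven letters and is hard-coded here for illustration (real engines take joining types from the Unicode Character Database and then apply the font's own substitutions), and contextual_forms is a hypothetical helper, not part of any library. Letters missing from the table are treated as non-joining.

```python
# Toy joining-type table (illustrative subset only).
DUAL_JOINING = set("\u0643\u062A\u0628")         # kaf, teh, beh: join on both sides
RIGHT_JOINING = set("\u0627\u062F\u0631\u0648")  # alef, dal, reh, waw: join only to the previous letter


def joins_to_next(ch: str) -> bool:
    """True if the letter can connect to the letter that follows it."""
    return ch in DUAL_JOINING


def joins_to_previous(ch: str) -> bool:
    """True if the letter can connect to the letter that precedes it."""
    return ch in DUAL_JOINING or ch in RIGHT_JOINING


def contextual_forms(word: str) -> list[str]:
    """Return the positional form name for each letter, in logical order."""
    forms = []
    for i, ch in enumerate(word):
        linked_before = (
            i > 0 and joins_to_next(word[i - 1]) and joins_to_previous(ch)
        )
        linked_after = (
            i + 1 < len(word) and joins_to_next(ch) and joins_to_previous(word[i + 1])
        )
        if linked_before and linked_after:
            forms.append("medial")
        elif linked_before:
            forms.append("final")
        elif linked_after:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms


if __name__ == "__main__":
    print(contextual_forms("\u0643\u062A\u0628"))  # ['initial', 'medial', 'final']
    print(contextual_forms("\u0643"))              # ['isolated']
```

The second call shows why per-character rendering breaks: a letter passed alone has no neighbours, so the only form available is the isolated one.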

Devanagari: Conjunct Consonants

Indic scripts like Devanagari form conjunct consonants (ligatures) from consonant + virama (halant) + consonant sequences. The virama (U+094D, ्) is a combining character that signals the following consonant should form a conjunct with the preceding one:

क + ् + ष = क् + ष = क्ष  (ligature for ksha)

If the shaping engine doesn't support the script, you'll see the individual consonants with visible virama marks instead of the correct ligature forms.
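
When debugging conjuncts it helps to locate the consonant + virama + consonant sequences in the underlying string. The sketch below is a simplification: it treats only the basic consonant range U+0915 to U+0939 as consonants and ignores nukta forms and longer clusters. The helper names are hypothetical.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA


def is_devanagari_consonant(ch: str) -> bool:
    """True for the basic Devanagari consonants KA (U+0915) through HA (U+0939)."""
    return "\u0915" <= ch <= "\u0939"


def find_conjunct_candidates(text: str) -> list[tuple[int, str]]:
    """Return (index, substring) for every consonant + virama + consonant triple."""
    found = []
    for i in range(len(text) - 2):
        first, middle, last = text[i], text[i + 1], text[i + 2]
        if (
            is_devanagari_consonant(first)
            and middle == VIRAMA
            and is_devanagari_consonant(last)
        ):
            found.append((i, text[i : i + 3]))
    return found


if __name__ == "__main__":
    ksha = "\u0915\u094D\u0937"  # ka + virama + ssa
    print(find_conjunct_candidates(ksha))
    print(unicodedata.name(VIRAMA))  # DEVANAGARI SIGN VIRAMA
```

If a string contains such a triple but the screen shows a visible virama, the text data is fine and the problem lies in the font or the shaping engine.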

Emoji: The ZWJ Sequence Pipeline

Emoji rendering demonstrates the full complexity of the pipeline. A family emoji 👨‍👩‍👧‍👦 is a sequence of individual emoji connected by Zero Width Joiner (U+200D):

👨 U+1F468 (man)
U+200D (ZWJ)
👩 U+1F469 (woman)
U+200D (ZWJ)
👧 U+1F467 (girl)
U+200D (ZWJ)
👦 U+1F466 (boy)

The shaping engine maps this exact sequence to a single glyph through a ligature substitution in the emoji font's GSUB table. The color artwork for that glyph lives in a color table such as COLR/CPAL, SVG, CBDT/CBLC, or sbix. If the font defines the sequence, a single composite glyph is rendered. If not (because the font doesn't support it), the renderer falls back to rendering each emoji individually, with the invisible ZWJ characters between them.

This is why the same emoji sequence can look like one glyph on iOS and like a sequence of four individual emoji on older Android:

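// Detecting rendering support (approximate) in JavaScript.
// Caveat: a fallback tofu box also paints pixels, so a true result only
// means that something was drawn, not that it was the intended glyph.
function isEmojiSupported(emoji) {
    const canvas = document.createElement("canvas");
    canvas.width = canvas.height = 1;
    const ctx = canvas.getContext("2d");
    // Offset the text so part of the glyph body covers the single pixel
    ctx.fillText(emoji, -4, 4);
    // Non-zero alpha means something was painted at that pixel
    return ctx.getImageData(0, 0, 1, 1).data[3] > 0;
}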

// For ZWJ sequences: check if width differs from individual emoji
function isZWJSequenceRenderedAsSingle(sequence) {
    const canvas = document.createElement("canvas");
    const ctx = canvas.getContext("2d");
    ctx.font = "16px sans-serif";
    const seqWidth = ctx.measureText(sequence).width;
    // Compare to width of first emoji alone
    const firstEmoji = [...sequence][0];
    const singleWidth = ctx.measureText(firstEmoji).width;
    return Math.abs(seqWidth - singleWidth) < 2;  // Approximately same width
}
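
Outside the browser, a small Python sketch can take a sequence apart for inspection. Splitting on ZWJ yields the same components a renderer ends up drawing separately when the font has no composite glyph. The helper names zwj_components and code_points are hypothetical.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER


def zwj_components(sequence: str) -> list[str]:
    """Split an emoji ZWJ sequence into the parts between the joiners."""
    return [part for part in sequence.split(ZWJ) if part]


def code_points(sequence: str) -> list[str]:
    """Format every code point in the string as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in sequence]


if __name__ == "__main__":
    # man, woman, girl, boy joined by ZWJ
    family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
    print(len(family))               # 7 code points: 4 emoji + 3 joiners
    print(code_points(family))
    print(len(zwj_components(family)))  # 4
```

Note that len() counts code points here, so string length is not a reliable measure of how many glyphs the user sees.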

Font Fallback Chains

No single font covers all of Unicode. Operating systems maintain a fallback chain: when the primary font doesn't have a glyph, the system tries the next font in the chain. The order of the fallback chain determines which font renders which characters, which affects visual consistency.

/* CSS font-family with a good fallback chain */
body {
  font-family:
    "Inter",                /* Primary Latin font */
    "Noto Sans",            /* Broad Unicode coverage */
    "Noto Sans CJK SC",     /* Simplified Chinese */
    "Noto Sans Arabic",     /* Arabic */
    "Noto Sans Devanagari", /* Hindi/Sanskrit */
    "Noto Emoji",           /* Emoji fallback */
    sans-serif;             /* System fallback */
}

The Google Noto font family (the name is short for "no more tofu") aims to cover every script encoded in Unicode, which makes it a common choice for filling out a fallback chain.

# In Python, fonttools can inspect which glyphs a font supports
from fontTools.ttLib import TTFont

def get_cmap_coverage(font_path: str) -> set[int]:
    # Get all Unicode codepoints covered by a font.
    font = TTFont(font_path)
    cmap = font.getBestCmap()
    if cmap is None:
        return set()
    return set(cmap.keys())

def font_has_glyph(font_path: str, char: str) -> bool:
    # Check if a font has a glyph for a specific character.
    coverage = get_cmap_coverage(font_path)
    return ord(char) in coverage

# Example usage
# has_arabic = font_has_glyph("/System/Library/Fonts/Arial.ttf", "\u0643")
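
The fallback walk itself is simple to model once coverage sets are available. The sketch below uses made-up font names and hand-written coverage ranges as stand-ins; in practice each set could come from a font's cmap. Real systems also weigh script, language, and style when choosing a fallback, which this illustration ignores. The function names pick_font and assign_fonts are hypothetical.

```python
from typing import Optional

# Each chain entry pairs a font name with the set of code points it covers.
FontChain = list[tuple[str, set[int]]]


def pick_font(char: str, chain: FontChain) -> Optional[str]:
    """Return the first font in the chain that covers the character.

    None means no font matched, which is the case that renders as tofu.
    """
    code_point = ord(char)
    for name, coverage in chain:
        if code_point in coverage:
            return name
    return None


def assign_fonts(text: str, chain: FontChain) -> list[tuple[str, Optional[str]]]:
    """Pair every character of the text with the font chosen for it."""
    return [(ch, pick_font(ch, chain)) for ch in text]


if __name__ == "__main__":
    # Hypothetical coverage data for illustration only.
    chain: FontChain = [
        ("ExampleLatin", set(range(0x20, 0x7F))),
        ("ExampleArabic", set(range(0x600, 0x700))),
    ]
    for ch, font in assign_fonts("a\u0643\u0915", chain):
        print(f"U+{ord(ch):04X}", font)
```

Because the first match wins, moving a broad-coverage font earlier in the chain changes which font draws shared characters such as digits and punctuation.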

HarfBuzz: The Open Source Shaping Engine

HarfBuzz is the shaping engine used by Chrome, Firefox, Android, and most Linux applications. It handles GSUB/GPOS lookup, OpenType features, and complex script requirements.

# Using HarfBuzz from Python via uharfbuzz
# pip install uharfbuzz
import uharfbuzz as hb

def shape_text(font_path: str, text: str) -> list[dict]:
    # Shape text and return glyph IDs and positions.
    blob = hb.Blob.from_file_path(font_path)
    face = hb.Face(blob)
    font = hb.Font(face)

    buf = hb.Buffer()
    buf.add_str(text)
    buf.guess_segment_properties()

    hb.shape(font, buf)

    infos = buf.glyph_infos
    positions = buf.glyph_positions

    return [
        {
            "glyph_id": info.codepoint,  # after shaping this holds a glyph ID, not a Unicode code point
            "cluster": info.cluster,
            "x_advance": pos.x_advance,
            "y_advance": pos.y_advance,
            "x_offset": pos.x_offset,
            "y_offset": pos.y_offset,
        }
        for info, pos in zip(infos, positions)
    ]
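
The list returned by a function like shape_text can be post-processed without calling HarfBuzz again. The sketch below works on hand-written sample data (hypothetical glyph IDs and advances, not real shaping output) and assumes positions are in font units, which is what HarfBuzz produces when no font scale has been set. The helpers line_width_px and glyphs_per_cluster are invented for this example.

```python
from collections import Counter


def line_width_px(glyphs: list[dict], units_per_em: int, font_size_px: float) -> float:
    """Sum the horizontal advances and convert from font units to pixels."""
    total_units = sum(glyph["x_advance"] for glyph in glyphs)
    return total_units * font_size_px / units_per_em


def glyphs_per_cluster(glyphs: list[dict]) -> dict[int, int]:
    """Count how many output glyphs map back to each input cluster."""
    return dict(Counter(glyph["cluster"] for glyph in glyphs))


if __name__ == "__main__":
    # Hypothetical shaping result: a base glyph, then a base plus a mark
    # that share cluster 1. The mark has zero advance.
    sample = [
        {"glyph_id": 10, "cluster": 0, "x_advance": 500},
        {"glyph_id": 27, "cluster": 1, "x_advance": 600},
        {"glyph_id": 98, "cluster": 1, "x_advance": 0},
    ]
    print(line_width_px(sample, units_per_em=1000, font_size_px=16))
    print(glyphs_per_cluster(sample))  # {0: 1, 1: 2}
```

Cluster values are the link back to the input string: several glyphs sharing one cluster, or fewer glyphs than input characters, are signs that the shaper attached marks or formed a ligature.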

Vertical Layout for CJK

East Asian text can be laid out vertically (top to bottom, right to left for columns). OpenType vert and vrt2 features substitute horizontal forms with vertical equivalents for characters that change orientation.

/* CSS for vertical CJK text */
.vertical-text {
  writing-mode: vertical-rl;    /* Vertical, right-to-left columns */
  text-orientation: mixed;       /* Rotate Latin, keep CJK upright */
}

.vertical-text-upright {
  writing-mode: vertical-lr;    /* Vertical, left-to-right columns */
  text-orientation: upright;    /* Keep all characters upright */
}

Diagnosing Rendering Failures

When text renders incorrectly, here's the diagnostic approach:

Seeing squares/tofu (□): Font fallback failed. The system has no font for this script. Install the relevant Noto font or specify it explicitly in your font stack.

Seeing isolated Arabic letters: Shaping failed. Text was split into individual characters before being passed to the renderer. Pass complete Arabic words/sentences as atomic units.

Seeing virama marks in Devanagari: Conjunct formation failed, either because the shaping engine doesn't support the script or because the font lacks the conjunct glyphs. Use a properly configured layout engine (HarfBuzz) with a Devanagari-capable font.

Emoji showing as sequences instead of single glyphs: The emoji font doesn't support this ZWJ sequence. The sequence may be newer than the font. Either update the font or gracefully degrade.

Inconsistent appearance across platforms: Different OS font stacks are selecting different fallback fonts. Explicitly include the fonts you need in your application bundle (web fonts, embedded assets) rather than relying on system fonts.
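
A useful first step for any of these symptoms is to confirm what the string actually contains. The following sketch dumps each code point with its name, general category, and bidirectional class using only Python's standard unicodedata module. The function name dump_codepoints is hypothetical.

```python
import unicodedata


def dump_codepoints(text: str) -> list[dict]:
    """Describe every code point in the string for debugging."""
    rows = []
    for ch in text:
        rows.append(
            {
                "codepoint": f"U+{ord(ch):04X}",
                "name": unicodedata.name(ch, "<unnamed>"),
                "category": unicodedata.category(ch),
                "bidi": unicodedata.bidirectional(ch),
            }
        )
    return rows


if __name__ == "__main__":
    # Arabic kaf, a zero width joiner, and the Devanagari virama
    for row in dump_codepoints("\u0643\u200D\u094D"):
        print(row["codepoint"], row["name"], row["category"], row["bidi"])
```

If the dump shows the expected sequence, the data is intact and the fault is in font selection or shaping. If joiners or viramas are missing, the string was damaged before it reached the renderer.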

The golden rule of complex script rendering: never split text into individual characters for processing or layout purposes. Always work with complete words, at minimum. The shaping engine needs the full sequence context to produce correct output.