📜 Script Stories

Thai Script

Thai is an abugida script with no spaces between words, complex vowel placement above and below consonants, and tone marks that affect meaning — all of which require sophisticated Unicode rendering. This guide explores the Thai Unicode block, how Thai text is encoded and segmented, and the challenges of Thai natural language processing.

Thai script is one of the most distinctive writing systems encoded in Unicode. Used by over 60 million people in Thailand and recognized as one of the world's major scripts, Thai presents unique challenges for digital text processing: words are written without spaces, vowels can appear above, below, before, or after their consonant, and tone marks stacked above vowels alter meaning entirely. This guide explores the Thai Unicode block, how Thai text is encoded and rendered, and the computational challenges of working with Thai strings.

A Brief History of Thai Script

Thai script descended from the Khmer script, which itself traces back through Old Khmer to the South Indian Pallava script and ultimately to the Brahmi script of ancient India. King Ramkhamhaeng the Great is traditionally credited with creating the Thai writing system in 1283 CE, as described in the famous Ramkhamhaeng Inscription (the oldest surviving Thai text). Whether Ramkhamhaeng invented the script outright or formalized an existing system is debated, but the inscription establishes Thai as a distinct script by the late 13th century.

Over seven centuries, the script has remained remarkably stable. Modern Thai readers can still decipher the Ramkhamhaeng Inscription, though spelling conventions and some character forms have evolved. The script spread beyond Thailand to influence the development of Lao, which shares a common ancestor and looks visibly similar to Thai.

How Thai Script Works

Thai is an abugida (also called an alphasyllabary) — a writing system where consonant letters carry an inherent vowel that is modified by diacritical marks to indicate different vowels and tones.

Consonants

Thai has 44 consonant letters representing 21 distinct consonant sounds. The surplus exists because multiple letters can represent the same sound but belong to different consonant classes (high, mid, low), which determine the tone of the syllable:

| Class | Count | Example Letters | Effect on Tone |
| --- | --- | --- | --- |
| High class | 11 | ข (kho khai), ฉ (cho ching), ถ (tho thung) | Rising tone in live syllables |
| Mid class | 9 | ก (ko kai), จ (cho chan), ด (do dek) | Mid tone in live syllables |
| Low class | 24 | ค (kho khwai), ง (ngo ngu), ช (cho chang) | Mid tone in live syllables |

This three-class system is central to Thai phonology and has no parallel in Latin-based writing. The class of the initial consonant, combined with the vowel length, syllable type (live or dead), and any explicit tone mark, determines which of the five tones (mid, low, falling, high, rising) a syllable carries.
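The interaction described above can be sketched as a small lookup. This is a simplified illustration rather than a complete phonological model: the names `consonant_class` and `tone` are made up for this example, the caller has to say whether the syllable is live or dead and whether its vowel is long, and special cases such as a silent leading ห or irregular spellings are ignored.

```python
from typing import Optional

# Consonant classes: 11 high, 9 mid, and every other consonant is low.
HIGH_CLASS = set("ขฃฉฐถผฝศษสห")
MID_CLASS = set("กจฎฏดตบปอ")

MAI_EK, MAI_THO, MAI_TRI, MAI_CHATTAWA = "\u0E48", "\u0E49", "\u0E4A", "\u0E4B"


def consonant_class(ch: str) -> str:
    """Return 'high', 'mid', or 'low' for a Thai consonant letter."""
    if not ("\u0E01" <= ch <= "\u0E2E") or ch in "ฤฦ":
        raise ValueError(f"not a Thai consonant: {ch!r}")
    if ch in HIGH_CLASS:
        return "high"
    if ch in MID_CLASS:
        return "mid"
    return "low"


def tone(initial: str, live: bool, long_vowel: bool,
         mark: Optional[str] = None) -> str:
    """Simplified tone rules for a syllable with a single initial consonant."""
    cls = consonant_class(initial)
    if mark == MAI_EK:
        return "falling" if cls == "low" else "low"
    if mark == MAI_THO:
        return "high" if cls == "low" else "falling"
    if mark == MAI_TRI:
        return "high"
    if mark == MAI_CHATTAWA:
        return "rising"
    # No tone mark: the class and the syllable type decide.
    if live:
        return "rising" if cls == "high" else "mid"
    if cls == "low":
        return "falling" if long_vowel else "high"
    return "low"


print(tone("ก", live=True, long_vowel=True))                # mid
print(tone("ข", live=True, long_vowel=True))                # rising
print(tone("ค", live=True, long_vowel=True, mark=MAI_THO))  # high
```

Keeping the class lookup separate from the tone rules mirrors the way the rules are usually taught: first classify the initial consonant, then apply the tone mark or the live/dead default.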

Vowels

Thai has approximately 32 vowel forms that are written in four different positions relative to the consonant:

| Position | Example | Description |
| --- | --- | --- |
| Before (left) | เ- | The vowel symbol appears before the consonant it follows phonetically |
| After (right) | -า | Written after the consonant, read after |
| Above | -ิ, -ี, -ึ, -ื | Written above the consonant |
| Below | -ุ, -ู | Written below the consonant |

Some vowels are composite, combining symbols from multiple positions. For example, เ-ือ (uea) uses a symbol before the consonant, one above, and one after. The key insight is that the visual order of Thai text does not always match the phonological order: a vowel written to the left of a consonant is pronounced after it.

Tone Marks

Thai has four explicit tone marks that appear above the consonant (or above a vowel that sits above the consonant):

| Mark | Name | Unicode | Effect |
| --- | --- | --- | --- |
| -่ | Mai ek | U+0E48 | Low tone (with mid-class consonant) |
| -้ | Mai tho | U+0E49 | Falling tone (with mid-class consonant) |
| -๊ | Mai tri | U+0E4A | High tone (with mid-class consonant) |
| -๋ | Mai chattawa | U+0E4B | Rising tone (with mid-class consonant) |

When a syllable has both an above-vowel and a tone mark, the tone mark stacks on top of the vowel, creating a three-level vertical cluster: consonant at the base, vowel above, tone mark on top. This stacking is critical for rendering engines.
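One way to see these clusters in code is to group each base character with the nonspacing marks (general category `Mn`) that follow it. The sketch below assumes that this simple rule is good enough for illustration. It is not the full Unicode grapheme cluster algorithm, and the helper name `clusters` is hypothetical.

```python
import unicodedata


def clusters(text: str) -> list[str]:
    """Group each character with the nonspacing marks that follow it."""
    result: list[str] = []
    for ch in text:
        if result and unicodedata.category(ch) == "Mn":
            result[-1] += ch   # above/below vowel or tone mark: attach to the base
        else:
            result.append(ch)  # consonant, spacing vowel, digit, Latin letter, ...
    return result


word = "\u0E17\u0E35\u0E48\u0E19\u0E35\u0E48"  # ที่นี่: two consonants, each with vowel + tone mark
print(len(word))            # 6 code points
print(len(clusters(word)))  # 2 clusters
```

Counting clusters rather than code points is closer to what a reader perceives as "characters", which matters for cursor movement and truncation.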

No Spaces Between Words

Unlike English, Thai does not use spaces to separate words. A Thai sentence is a continuous stream of characters, and word boundaries must be inferred from context. Spaces appear only between sentences or clauses, where they work loosely like punctuation.

For example, the phrase "Thai language" is written:

ภาษาไทย

This is two words (ภาษา = language, ไทย = Thai) written without any separator. A reader or algorithm must know the vocabulary to identify the boundary.
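The practical consequence is easy to demonstrate: whitespace tokenization, which works for English, returns Thai text as a single unsplit token. A tiny sketch using only built-in string methods:

```python
english = "Thai language"
thai = "ภาษาไทย"

print(english.split())  # ['Thai', 'language']
print(thai.split())     # ['ภาษาไทย']
```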

The Unicode Thai Block

Thai script is encoded in a single contiguous Unicode block:

| Block | Range | Characters |
| --- | --- | --- |
| Thai | U+0E00 – U+0E7F | 87 assigned characters |

The block is organized logically:

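| Range | Content | Count |
| --- | --- | --- |
| U+0E01 – U+0E2E | Consonants (44) plus the letters ฤ and ฦ | 46 |
| U+0E2F | Paiyannoi (abbreviation mark) | 1 |
| U+0E30 – U+0E3A | Vowel signs and phinthu (virama) | 11 |
| U+0E3F | Baht currency sign (฿) | 1 |
| U+0E40 – U+0E45 | Leading vowels (U+0E40 – U+0E44) and lakkhangyao (U+0E45) | 6 |
| U+0E46 | Maiyamok (repetition mark) | 1 |
| U+0E47 – U+0E4E | Above marks: maitaikhu, four tone marks, thanthakhat, nikhahit, yamakkan | 8 |
| U+0E4F | Fongman (ornamental mark) | 1 |
| U+0E50 – U+0E59 | Thai digits 0–9 | 10 |
| U+0E5A – U+0E5B | Angkhankhu, Khomut (punctuation) | 2 |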
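The size of the block can be checked against the Unicode data that ships with Python. The sketch below counts the assigned code points in U+0E00 – U+0E7F and groups them by general category. It relies only on the standard `unicodedata` module, and the variable names are chosen for this example.

```python
import unicodedata
from collections import Counter

THAI_BLOCK = range(0x0E00, 0x0E80)

# Category "Cn" means the code point is unassigned.
assigned = [chr(cp) for cp in THAI_BLOCK
            if unicodedata.category(chr(cp)) != "Cn"]
by_category = Counter(unicodedata.category(ch) for ch in assigned)

print(len(assigned))  # 87
print(by_category)    # letters (Lo, Lm), marks (Mn), digits (Nd), punctuation, currency
```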

Thai Digits

Thai has its own set of digits, though Arabic (Western) digits are also widely used in modern Thailand:

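| Thai | Arabic | Code Point |
| --- | --- | --- |
| ๐ | 0 | U+0E50 |
| ๑ | 1 | U+0E51 |
| ๒ | 2 | U+0E52 |
| ๓ | 3 | U+0E53 |
| ๔ | 4 | U+0E54 |
| ๕ | 5 | U+0E55 |
| ๖ | 6 | U+0E56 |
| ๗ | 7 | U+0E57 |
| ๘ | 8 | U+0E58 |
| ๙ | 9 | U+0E59 |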
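Because the ten Thai digits occupy consecutive code points, converting between the two digit sets is a simple character mapping. The following sketch uses `str.translate`; the helper names `to_western` and `to_thai` are assumptions for this example. Python's built-in `int()` also accepts Thai digits directly, since they are Unicode decimal digits.

```python
# Build the digit string from the code points U+0E50..U+0E59.
THAI_DIGITS = "".join(chr(0x0E50 + d) for d in range(10))
WESTERN_DIGITS = "0123456789"

_TO_WESTERN = str.maketrans(THAI_DIGITS, WESTERN_DIGITS)
_TO_THAI = str.maketrans(WESTERN_DIGITS, THAI_DIGITS)


def to_western(text: str) -> str:
    """Replace Thai digits with Western digits; other characters are kept."""
    return text.translate(_TO_WESTERN)


def to_thai(text: str) -> str:
    """Replace Western digits with Thai digits; other characters are kept."""
    return text.translate(_TO_THAI)


print(to_western("๒๕๖๗"))  # 2567
print(int("๒๕๖๗"))         # 2567
print(to_thai("2024"))
```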

Character Encoding and Rendering

Storage Order vs. Visual Order

Thai text is stored in logical order — consonant first, then any above/below vowels and tone marks. The rendering engine is responsible for placing the marks in the correct visual positions:

Stored:     ก (U+0E01) + ุ (U+0E38) + ่ (U+0E48)
Rendered:   กุ่  (consonant with vowel below and tone mark above)

For leading vowels (those written to the left of the consonant), the vowel is stored before the consonant in the character stream, matching visual order:

Stored:     เ (U+0E40) + ก (U+0E01)
Rendered:   เก  (vowel appears left of consonant)

This design means that caret movement and text selection in Thai follow the storage order, which generally matches left-to-right visual order.
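To check the storage order of a particular string, it helps to list its code points one by one. A minimal sketch using only the standard library; `dump_code_points` is a name chosen for this example:

```python
import unicodedata


def dump_code_points(text: str) -> list[str]:
    """Describe each code point of `text` in storage order."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# Leading vowel: stored before the consonant, as it is written.
for line in dump_code_points("\u0E40\u0E01"):
    print(line)
# U+0E40 THAI CHARACTER SARA E
# U+0E01 THAI CHARACTER KO KAI

# Below vowel and tone mark: stored after the consonant.
for line in dump_code_points("\u0E01\u0E38\u0E48"):
    print(line)
# U+0E01 THAI CHARACTER KO KAI
# U+0E38 THAI CHARACTER SARA U
# U+0E48 THAI CHARACTER MAI EK
```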

Stacking Rules

Thai rendering requires handling up to three vertical levels: the base consonant and up to two levels of marks above it:

  1. Level 0: Base consonant
  2. Level 1: Above-vowel (e.g., -ิ U+0E34, -ี U+0E35)
  3. Level 2: Tone mark (e.g., ่ U+0E48) or thanthakhat (U+0E4C)

Below the consonant, there is one level for below-vowels (-ุ U+0E38, -ู U+0E39).
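The levels listed above can be made concrete by sorting the code points of one cluster into slots. This is only an illustrative sketch: the slot names and the function `stack_levels` are invented here, the mark sets cover just the above vowels, below vowels, tone marks, and thanthakhat, and real positioning is done by the font and shaping engine, not by code like this.

```python
import unicodedata

ABOVE_VOWELS = {"\u0E31", "\u0E34", "\u0E35", "\u0E36", "\u0E37"}  # level 1
BELOW_VOWELS = {"\u0E38", "\u0E39"}                                # below the base
TOP_MARKS = {"\u0E48", "\u0E49", "\u0E4A", "\u0E4B", "\u0E4C"}     # level 2


def stack_levels(cluster: str) -> dict[str, str]:
    """Sort the code points of one cluster into vertical slots."""
    levels = {"base": "", "below": "", "above_1": "", "above_2": "", "other": ""}
    for ch in cluster:
        if ch in BELOW_VOWELS:
            slot = "below"
        elif ch in ABOVE_VOWELS:
            slot = "above_1"
        elif ch in TOP_MARKS:
            slot = "above_2"
        elif unicodedata.category(ch) == "Mn":
            slot = "other"  # marks this sketch does not classify
        else:
            slot = "base"
        levels[slot] += ch
    return levels


levels = stack_levels("\u0E17\u0E35\u0E48")  # ที่ = consonant + sara ii + mai ek
print({name: [f"U+{ord(c):04X}" for c in chars] for name, chars in levels.items()})
```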

Rendering engines (such as HarfBuzz or the Windows Uniscribe shaper) use OpenType GPOS (Glyph Positioning) tables to correctly position these stacked marks. Without proper shaping, marks may overlap or appear at incorrect positions.

Common Rendering Issues

| Problem | Cause | Solution |
| --- | --- | --- |
| Marks overlap consonant | Missing GPOS data in font | Use a font with Thai OpenType tables |
| Tone mark misplaced | Shaping engine not Thai-aware | Use HarfBuzz or equivalent shaper |
| Leading vowel detached | Line break inserted between vowel and consonant | Apply Thai line-breaking rules |
| Characters display as boxes | Font lacks Thai glyphs | Install a Thai-supporting font (Noto Sans Thai, Sarabun) |

Word Segmentation: The Central Challenge

The absence of spaces between words makes Thai word segmentation one of the hardest NLP problems for the language. Every operation that depends on word boundaries — search, spell checking, line breaking, text-to-speech — requires a segmentation algorithm.

Dictionary-Based Segmentation

The most common approach uses a dictionary of known Thai words to find the maximal matching segmentation:

# Using PyThaiNLP, a widely used Thai NLP library
from pythainlp.tokenize import word_tokenize

text = "ภาษาไทยไม่มีช่องว่างระหว่างคำ"
words = word_tokenize(text, engine="newmm")  # newmm: dictionary-based maximal matching
print(words)
# Typical output (the exact split depends on the installed dictionary):
# ['ภาษาไทย', 'ไม่', 'มี', 'ช่องว่าง', 'ระหว่าง', 'คำ']
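To show the idea behind dictionary matching without any third-party library, here is a greedy longest-match segmenter over a toy dictionary. It is a deliberately simplified stand-in, not the algorithm PyThaiNLP uses: it never backtracks, the word list is a made-up sample, and unknown characters are emitted one at a time (which can separate a combining mark from its base).

```python
def longest_match_segment(text: str, dictionary: set[str]) -> list[str]:
    """Greedy left-to-right segmentation: always take the longest known word."""
    max_len = max(map(len, dictionary), default=1)
    tokens: list[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
        else:
            # No dictionary word starts here: emit a single character and move on.
            tokens.append(text[i])
            i += 1
    return tokens


# Toy data: the sentence is built from the sample words so the two always agree.
SAMPLE_WORDS = ["ภาษา", "ไทย", "ไม่", "มี", "ช่องว่าง", "ระหว่าง", "คำ"]
DICTIONARY = set(SAMPLE_WORDS) | {"ช่อง", "ว่าง"}

sentence = "".join(SAMPLE_WORDS)
print(longest_match_segment(sentence, DICTIONARY))
```

Because the longest candidate is tried first, ช่องว่าง is kept whole even though the shorter entries ช่อง and ว่าง are also in the dictionary. Greedy matching can still pick a wrong split on ambiguous input, which is why production tokenizers consider alternative paths.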

ICU and Line Breaking

ICU (International Components for Unicode), maintained by the Unicode Consortium, ships dictionary-based break iterators for Thai. The word iterator below locates word boundaries, which also serve as safe line-break positions within Thai text:

import icu  # PyICU

bi = icu.BreakIterator.createWordInstance(icu.Locale("th_TH"))
bi.setText("ภาษาไทยไม่มีช่องว่างระหว่างคำ")

boundaries = []
pos = bi.nextBoundary()
while pos != icu.BreakIterator.DONE:
    boundaries.append(pos)
    pos = bi.nextBoundary()

print(boundaries)
# Word boundary positions in the string

Web Browsers and CSS

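Modern browsers implement Thai line-breaking internally, so no manual segmentation is needed for display. The CSS word-break and overflow-wrap properties control what happens when a run of Thai text still does not fit on a line:

/* The browser determines Thai word boundaries; mark the content with lang="th" */
.thai-text {
  word-break: normal;        /* keep the default dictionary-based breaking (break-word is deprecated) */
  overflow-wrap: break-word; /* break inside a run only when it would otherwise overflow */
  line-height: 1.8;          /* extra line height for above/below marks */
}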

Working with Thai Text in Code

Python

import unicodedata

# Thai character properties
char = "\u0E01"  # ก (ko kai)
print(unicodedata.name(char))       # THAI CHARACTER KO KAI
print(unicodedata.category(char))   # Lo (Letter, other)
# A Script property lookup (which would report "Thai") is not available from
# unicodedata on many Python versions, so the range check below is used instead.

# Check if a character is Thai (assigned part of the block, U+0E01..U+0E5B)
def is_thai(ch: str) -> bool:
    return "\u0E01" <= ch <= "\u0E5B"

# Count Thai characters in a string
text = "Hello สวัสดี World"
thai_chars = [c for c in text if is_thai(c)]
print(len(thai_chars))  # 6 code points: 4 base letters + 2 combining vowel marks
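The same range check extends naturally to pulling whole Thai runs out of mixed-script text. A short sketch with the standard `re` module; the pattern covers U+0E01 – U+0E5B as in `is_thai` above, and the function name `thai_runs` is an assumption for this example:

```python
import re

# One or more consecutive code points from the assigned part of the Thai block.
THAI_RUN = re.compile(r"[\u0E01-\u0E5B]+")


def thai_runs(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, substring) for every run of Thai characters."""
    return [(m.start(), m.end(), m.group()) for m in THAI_RUN.finditer(text)]


print(thai_runs("Hello สวัสดี World"))
# [(6, 12, 'สวัสดี')]
```

Returning offsets together with the matched text makes it easy to hand only the Thai spans to a word segmenter and leave the rest of the string untouched.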

JavaScript

// Regex for Thai characters
const thaiPattern = /[\u0E00-\u0E7F]/;

function containsThai(text) {
  return thaiPattern.test(text);
}

// Extract Thai portions
const text = "Hello สวัสดี World";
const thaiOnly = text.replace(/[^\u0E00-\u0E7F]/g, "");
console.log(thaiOnly); // "สวัสดี"

// String length — combining marks count as separate code points
console.log("กุ่".length); // 3 (consonant + below vowel + tone mark)
console.log([..."กุ่"].length); // 3

Sorting Thai Text

Thai sorting follows the Royal Institute of Thailand dictionary order. The standard collation sorts consonants in their traditional order (ก ข ฃ ค ... ฮ), then by vowel, then by tone mark. ICU provides a Thai-aware collator:

import icu

collator = icu.Collator.createInstance(icu.Locale("th_TH"))
words = ["ไก่", "กา", "ข้าว", "กุ้ง"]
sorted_words = sorted(words, key=collator.getSortKey)
print(sorted_words)  # Thai dictionary order
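Where ICU is not available, the two most visible rules of Thai ordering can be approximated with a custom sort key: a leading vowel is compared after the consonant it precedes, and tone marks only break ties. The sketch below is a rough approximation under those two assumptions, not a replacement for a real collator; `thai_sort_key` and the constant names are made up for this example.

```python
LEADING_VOWELS = {chr(cp) for cp in range(0x0E40, 0x0E45)}   # เ แ โ ใ ไ
# Maitaikhu, the four tone marks, and thanthakhat: ignored on the first pass.
SECONDARY_MARKS = {chr(cp) for cp in range(0x0E47, 0x0E4D)}


def _is_consonant(ch: str) -> bool:
    return "\u0E01" <= ch <= "\u0E2E"


def thai_sort_key(word: str) -> tuple[str, str]:
    """Approximate dictionary order: swap leading vowels, demote tone marks."""
    chars = list(word)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in LEADING_VOWELS and _is_consonant(chars[i + 1]):
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2
        else:
            i += 1
    primary = "".join(ch for ch in chars if ch not in SECONDARY_MARKS)
    return (primary, "".join(chars))  # the second element only breaks ties


words = ["ไก่", "กา", "ข้าว", "กุ้ง"]
print(sorted(words))                     # plain code point order puts ไก่ last
print(sorted(words, key=thai_sort_key))  # ไก่ now sorts with the other ก words
```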

Key Takeaways

  • Thai script is an abugida with 44 consonants, 32 vowel forms, and 4 tone marks, all in a single Unicode block (U+0E00–U+0E7F, 87 characters).
  • Vowels can appear above, below, before, or after their consonant — rendering engines must handle multi-level stacking.
  • Thai text has no spaces between words, making word segmentation a core challenge for search, line breaking, and NLP.
  • The consonant class system (high, mid, low) interacts with tone marks, vowel length, and syllable type to determine one of five tones, a mechanism with no parallel in Latin-based writing.
  • Libraries like PyThaiNLP and ICU provide dictionary-based word segmentation that is essential for any application processing Thai text.
  • Thai characters are stored with each consonant followed by its above, below, and tone marks, while leading vowels are stored before the consonant as written; rendering engines use OpenType shaping to position the marks correctly.
