Thai Script
Thai is an abugida script with no spaces between words, complex vowel placement above and below consonants, and tone marks that affect meaning — all of which require sophisticated Unicode rendering. This guide explores the Thai Unicode block, how Thai text is encoded and segmented, and the challenges of Thai natural language processing.
Thai script is one of the most distinctive writing systems encoded in Unicode. Used by over 60 million people in Thailand and recognized as one of the world's major scripts, Thai presents unique challenges for digital text processing: words are written without spaces, vowels can appear above, below, before, or after their consonant, and tone marks stacked above vowels alter meaning entirely. This guide explores the Thai Unicode block, how Thai text is encoded and rendered, and the computational challenges of working with Thai strings.
A Brief History of Thai Script
Thai script descended from the Khmer script, which itself traces back through Old Khmer to the South Indian Pallava script and ultimately to the Brahmi script of ancient India. King Ramkhamhaeng the Great is traditionally credited with creating the Thai writing system in 1283 CE, as described in the famous Ramkhamhaeng Inscription (the oldest surviving Thai text). Whether Ramkhamhaeng invented the script outright or formalized an existing system is debated, but the inscription establishes Thai as a distinct script by the late 13th century.
Over seven centuries, the script has remained remarkably stable. Modern Thai readers can still decipher the Ramkhamhaeng Inscription, though spelling conventions and some character forms have evolved. The script spread beyond Thailand to influence the development of Lao, which shares a common ancestor and looks visibly similar to Thai.
How Thai Script Works
Thai is an abugida (also called an alphasyllabary) — a writing system where consonant letters carry an inherent vowel that is modified by diacritical marks to indicate different vowels and tones.
Consonants
Thai has 44 consonant letters representing 21 distinct consonant sounds. The surplus exists because multiple letters can represent the same sound but belong to different consonant classes (high, mid, low), which determine the tone of the syllable:
| Class | Count | Example Letters | Effect on Tone |
|---|---|---|---|
| High class | 11 | ข (kho khai), ฉ (cho ching), ถ (tho thung) | Rising tone in live syllables |
| Mid class | 9 | ก (ko kai), จ (cho chan), ด (do dek) | Mid tone in live syllables |
| Low class | 24 | ค (kho khwai), ง (ngo ngu), ช (cho chang) | Mid tone in live syllables |
This three-class system is central to Thai phonology and has no parallel in Latin-based writing. The class of the initial consonant, combined with the vowel length, syllable type (live or dead), and any explicit tone mark, determines which of the five tones (mid, low, falling, high, rising) a syllable carries.
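The interaction between consonant class, tone mark, and syllable type can be sketched as a small lookup. This is a simplified illustration of the standard textbook rules, not a full pronunciation engine: the names `consonant_class` and `syllable_tone` are hypothetical, the caller is assumed to already know whether the syllable is live or dead and whether its vowel is long, and irregular spellings and silent leading consonants are ignored.

```python
from typing import Optional

# Textbook consonant classes; any other consonant is low class.
HIGH_CLASS = set("ขฃฉฐถผฝศษสห")  # 11 letters
MID_CLASS = set("กจฎฏดตบปอ")      # 9 letters
NOT_CONSONANTS = set("ฤฦ")         # vowel letters inside the consonant range

MAI_EK, MAI_THO, MAI_TRI, MAI_CHATTAWA = "\u0E48", "\u0E49", "\u0E4A", "\u0E4B"


def consonant_class(ch: str) -> str:
    """Return 'high', 'mid', or 'low' for a single Thai consonant."""
    if ch in HIGH_CLASS:
        return "high"
    if ch in MID_CLASS:
        return "mid"
    if "\u0E01" <= ch <= "\u0E2E" and ch not in NOT_CONSONANTS:
        return "low"
    raise ValueError(f"{ch!r} is not a Thai consonant")


def syllable_tone(initial: str, mark: Optional[str] = None,
                  live: bool = True, long_vowel: bool = True) -> str:
    """Simplified tone rules (assumption: regular spelling only)."""
    cls = consonant_class(initial)
    if mark == MAI_EK:
        return "falling" if cls == "low" else "low"
    if mark == MAI_THO:
        return "high" if cls == "low" else "falling"
    if mark == MAI_TRI:        # normally written only on mid-class consonants
        return "high"
    if mark == MAI_CHATTAWA:   # normally written only on mid-class consonants
        return "rising"
    # No tone mark: class and syllable type decide.
    if live:
        return "rising" if cls == "high" else "mid"
    if cls == "low":
        return "falling" if long_vowel else "high"
    return "low"


print(syllable_tone("ก", MAI_EK))    # low     (as in ไก่)
print(syllable_tone("ข", MAI_THO))   # falling (as in ข้าว)
print(syllable_tone("ข"))            # rising  (as in ขา)
print(syllable_tone("ร", live=False, long_vowel=False))  # high (as in รัก)
```

Note how the same mark gives different tones depending on class: mai ek yields a low tone on a mid-class or high-class initial but a falling tone on a low-class one.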
Vowels
Thai has approximately 32 vowel forms that are written in four different positions relative to the consonant:
| Position | Example | Description |
|---|---|---|
| Before (left) | เ- | The vowel symbol appears before the consonant it follows phonetically |
| After (right) | -า | Written after the consonant, read after |
| Above | -ิ, -ี, -ึ, -ื | Written above the consonant |
| Below | -ุ, -ู | Written below the consonant |
Some vowels are composite, combining symbols from multiple positions. For example, เ-ือ (uea) uses a symbol before the consonant, one above, and one after. The key insight is that the visual order of Thai text does not always match the phonological order: a vowel written to the left of a consonant is actually pronounced after it.
Tone Marks
Thai has four explicit tone marks that appear above the consonant (or above a vowel that sits above the consonant):
| Mark | Name | Unicode | Effect |
|---|---|---|---|
| ่ | Mai ek | U+0E48 | Low tone (with mid-class consonant) |
| ้ | Mai tho | U+0E49 | Falling tone (with mid-class consonant) |
| ๊ | Mai tri | U+0E4A | High tone (with mid-class consonant) |
| ๋ | Mai chattawa | U+0E4B | Rising tone (with mid-class consonant) |
When a syllable has both an above-vowel and a tone mark, the tone mark stacks on top of the vowel, creating a three-level vertical cluster: consonant at the base, vowel above, tone mark on top. This stacking is critical for rendering engines.
No Spaces Between Words
Unlike English, Thai does not use spaces to separate words. A Thai sentence is a continuous stream of characters, and word boundaries must be inferred from context. Spaces in Thai are used only between sentences or clauses, where they work loosely like punctuation.
For example, the phrase "Thai language" is written:
ภาษาไทย
This is two words (ภาษา = language, ไทย = Thai) written without any separator. A reader or algorithm must know the vocabulary to identify the boundary.
The Unicode Thai Block
Thai script is encoded in a single contiguous Unicode block:
| Block | Range | Characters |
|---|---|---|
| Thai | U+0E00 – U+0E7F | 87 assigned characters |
The block is organized logically:
| Range | Content | Count |
|---|---|---|
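| U+0E01 – U+0E2E | Consonants (44) plus the vowel letters ฤ (U+0E24) and ฦ (U+0E26) | 46 |
| U+0E2F | Character Pai Yan Noi (abbreviation) | 1 |
| U+0E30 – U+0E3A | Vowels and virama (phinthu) | 11 |
| U+0E3F | Baht currency symbol (฿) | 1 |
| U+0E40 – U+0E45 | Leading vowels (U+0E40 – U+0E44) and lakkhangyao (U+0E45) | 6 |
| U+0E46 | Character Mai Yamok (repetition) | 1 |
| U+0E47 – U+0E4E | Maitaikhu, tone marks, and other above-base signs | 8 |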
| U+0E4F | Fongman (ornamental mark) | 1 |
| U+0E50 – U+0E59 | Thai digits 0–9 | 10 |
| U+0E5A – U+0E5B | Angkhankhu, Khomut (punctuation) | 2 |
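The block layout can be checked against the Unicode Character Database that ships with Python. The sketch below assumes only the standard `unicodedata` module; it counts code points in U+0E00 – U+0E7F that have a character name and groups them by general category.

```python
import unicodedata
from collections import Counter

THAI_BLOCK = range(0x0E00, 0x0E80)

# A code point is treated as assigned if the database knows a name for it.
assigned = [chr(cp) for cp in THAI_BLOCK if unicodedata.name(chr(cp), None)]
by_category = Counter(unicodedata.category(ch) for ch in assigned)

print(len(assigned))  # 87
for category, count in sorted(by_category.items()):
    print(category, count)
```

The category breakdown separates letters (`Lo`, `Lm`) from nonspacing marks (`Mn`), decimal digits (`Nd`), punctuation (`Po`), and the currency symbol (`Sc`), which is often more useful in code than the positional ranges in the table.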
Thai Digits
Thai has its own set of digits, though Arabic (Western) digits are also widely used in modern Thailand:
| Thai | Arabic | Code Point |
|---|---|---|
| ๐ | 0 | U+0E50 |
| ๑ | 1 | U+0E51 |
| ๒ | 2 | U+0E52 |
| ๓ | 3 | U+0E53 |
| ๔ | 4 | U+0E54 |
| ๕ | 5 | U+0E55 |
| ๖ | 6 | U+0E56 |
| ๗ | 7 | U+0E57 |
| ๘ | 8 | U+0E58 |
| ๙ | 9 | U+0E59 |
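Because the ten Thai digits occupy consecutive code points, converting between the two digit sets is a simple translation. The sketch below uses hypothetical helper names (`to_ascii_digits`, `to_thai_digits`) and relies on the fact that Python's `int()` accepts any Unicode decimal digit.

```python
# Build the digit string from code points U+0E50..U+0E59.
THAI_DIGITS = "".join(chr(0x0E50 + i) for i in range(10))
ASCII_DIGITS = "0123456789"

TO_ASCII = str.maketrans(THAI_DIGITS, ASCII_DIGITS)
TO_THAI = str.maketrans(ASCII_DIGITS, THAI_DIGITS)


def to_ascii_digits(text: str) -> str:
    """Replace Thai digits with ASCII digits, leaving other characters alone."""
    return text.translate(TO_ASCII)


def to_thai_digits(text: str) -> str:
    """Replace ASCII digits with Thai digits, leaving other characters alone."""
    return text.translate(TO_THAI)


print(to_ascii_digits("๒๕๖๗"))  # 2567
print(to_thai_digits("2567"))
print(int("๒๕๖๗"))              # 2567, int() parses Thai digits directly
```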
Character Encoding and Rendering
Storage Order vs. Visual Order
Thai text is stored in logical order — consonant first, then any above/below vowels and tone marks. The rendering engine is responsible for placing the marks in the correct visual positions:
Stored: ก (U+0E01) + ุ (U+0E38) + ่ (U+0E48)
Rendered: กุ่ (consonant with vowel below and tone mark above)
For leading vowels (those written to the left of the consonant), the vowel is stored before the consonant in the character stream, matching visual order:
Stored: เ (U+0E40) + ก (U+0E01)
Rendered: เก (vowel appears left of consonant)
This design means that caret movement and text selection in Thai follow the storage order, which generally matches left-to-right visual order.
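The stored sequences above can be inspected directly. This small sketch, with a hypothetical `describe` helper, lists the code points of a string in storage order using the standard `unicodedata` module.

```python
import unicodedata


def describe(text: str) -> list:
    """Return one 'U+XXXX NAME' entry per code point, in storage order."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


for sample in ("กุ่", "เก"):
    print(sample)
    for entry in describe(sample):
        print("  ", entry)
```

For กุ่ the consonant comes first and the marks follow; for เก the leading vowel SARA E comes first, before KO KAI, even though it is pronounced after the consonant.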
Stacking Rules
Thai rendering requires handling up to three vertical levels above a consonant:
- Level 0: Base consonant
- Level 1: Above-vowel (e.g., -ิ U+0E34, -ี U+0E35)
- Level 2: Tone mark (e.g., ่ U+0E48) or thanthakhat (U+0E4C)
Below the consonant, there is one level for below-vowels (-ุ U+0E38, -ู U+0E39).
Rendering engines (such as HarfBuzz or the Windows Uniscribe shaper) use OpenType GPOS (Glyph Positioning) tables to correctly position these stacked marks. Without proper shaping, marks may overlap or appear at incorrect positions.
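To get a feel for the stacking levels described above, the following sketch groups each base character with the nonspacing marks that follow it and counts how many marks sit below and above. It is an approximation under stated assumptions: the helper names are hypothetical, the mark sets cover only the vowels and tone marks listed in this section, and the grouping is a simplification rather than full Unicode grapheme cluster segmentation.

```python
import unicodedata

BELOW_VOWELS = {"\u0E38", "\u0E39"}                          # sara u, sara uu
ABOVE_VOWELS = {"\u0E31", "\u0E34", "\u0E35", "\u0E36", "\u0E37"}
TOP_MARKS = {"\u0E48", "\u0E49", "\u0E4A", "\u0E4B", "\u0E4C"}  # tone marks, thanthakhat


def clusters(text: str) -> list:
    """Attach each nonspacing mark (category Mn) to the preceding character."""
    groups = []
    for ch in text:
        if groups and unicodedata.category(ch) == "Mn":
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups


def stack_levels(cluster: str) -> tuple:
    """Return (marks below, marks above) for one cluster."""
    marks = cluster[1:]
    below = sum(ch in BELOW_VOWELS for ch in marks)
    above = sum(ch in ABOVE_VOWELS or ch in TOP_MARKS for ch in marks)
    return below, above


for cluster in clusters("ที่กุ่"):
    print(cluster, stack_levels(cluster))
```

A cluster that reports two marks above, such as ที่, is the three-level case (consonant, vowel, tone mark) that a font's positioning tables must handle.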
Common Rendering Issues
| Problem | Cause | Solution |
|---|---|---|
| Marks overlap consonant | Missing GPOS data in font | Use a font with Thai OpenType tables |
| Tone mark misplaced | Shaping engine not Thai-aware | Use HarfBuzz or equivalent shaper |
| Leading vowel detached | Line break inserted between vowel and consonant | Apply Thai line-breaking rules |
| Characters display as boxes | Font lacks Thai glyphs | Install a Thai-supporting font (Noto Sans Thai, Sarabun) |
Word Segmentation: The Central Challenge
The absence of spaces between words makes Thai word segmentation one of the hardest NLP problems for the language. Every operation that depends on word boundaries — search, spell checking, line breaking, text-to-speech — requires a segmentation algorithm.
Dictionary-Based Segmentation
The most common approach uses a dictionary of known Thai words to find the maximal matching segmentation:
```python
# Using PyThaiNLP — the standard Thai NLP library
from pythainlp.tokenize import word_tokenize

text = "ภาษาไทยไม่มีช่องว่างระหว่างคำ"
words = word_tokenize(text, engine="newmm")
print(words)
# ['ภาษาไทย', 'ไม่', 'มี', 'ช่องว่าง', 'ระหว่าง', 'คำ']
```
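To show what maximal matching means without any library, here is a toy greedy longest-match segmenter. It is a sketch only: the function name `longest_match_segment` and the tiny vocabulary are made up for illustration, and production tokenizers add large dictionaries and rules for resolving ambiguity and unknown words.

```python
def longest_match_segment(text: str, vocabulary: set) -> list:
    """Greedy left-to-right segmentation: always take the longest known word."""
    max_len = max(map(len, vocabulary))
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # fall back to a single character if nothing matches
        for size in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + size] in vocabulary:
                match = text[i:i + size]
                break
        tokens.append(match)
        i += len(match)
    return tokens


words = ["ภาษา", "ไทย", "ไม่", "มี", "ช่องว่าง", "ระหว่าง", "คำ"]
text = "".join(words)
tokens = longest_match_segment(text, set(words))
print(tokens)

# Greedy matching can pick the wrong split when a string is ambiguous:
# ตากลม can be read as ตา + กลม or as ตาก + ลม.
print(longest_match_segment("ตากลม", {"ตา", "กลม", "ตาก", "ลม"}))  # ['ตาก', 'ลม']
```

The second call illustrates why dictionary lookup alone is not enough: the greedy strategy commits to ตาก because it is the longest prefix, regardless of which reading the sentence intends.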
ICU and Line Breaking
The Unicode Consortium provides the Thai line-breaking algorithm as part of ICU (International Components for Unicode), which uses a dictionary-based approach to identify safe line-break positions within Thai text:
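```python
import icu  # PyICU

# Word-boundary iterator for the Thai locale; ICU segments Thai with a dictionary.
# (createLineInstance gives line-break opportunities instead of word boundaries.)
bi = icu.BreakIterator.createWordInstance(icu.Locale("th_TH"))
bi.setText("ภาษาไทยไม่มีช่องว่างระหว่างคำ")

boundaries = []
pos = bi.nextBoundary()
while pos != icu.BreakIterator.DONE:
    boundaries.append(pos)
    pos = bi.nextBoundary()

print(boundaries)
# Word boundary positions in the string
```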
Web Browsers and CSS
Modern browsers implement Thai line-breaking internally. The CSS word-break and overflow-wrap properties interact with Thai text:
```css
/* The browser determines Thai word boundaries; these rules control the fallback */
.thai-text {
  word-break: normal;         /* keep the browser's own break opportunities */
  overflow-wrap: break-word;  /* break inside a run only if it would overflow */
  line-height: 1.8;           /* Extra line height for above/below marks */
}
```
Working with Thai Text in Code
Python
```python
import unicodedata

# Thai character properties
char = "\u0E01"  # ก (ko kai)
print(unicodedata.name(char))      # THAI CHARACTER KO KAI
print(unicodedata.category(char))  # Lo (Letter, other)
# unicodedata.script(char) is not available in older Python releases;
# the character name prefix is a portable stand-in for the script property.
print(unicodedata.name(char).startswith("THAI "))  # True

# Check if a character is Thai
def is_thai(ch: str) -> bool:
    return "\u0E01" <= ch <= "\u0E5B"

# Count Thai characters in a string
text = "Hello สวัสดี World"
thai_chars = [c for c in text if is_thai(c)]
print(len(thai_chars))  # 6 (four letters plus two combining vowel signs)
```
JavaScript
```javascript
// Regex for Thai characters
const thaiPattern = /[\u0E00-\u0E7F]/;

function containsThai(text) {
  return thaiPattern.test(text);
}

// Extract Thai portions
const text = "Hello สวัสดี World";
const thaiOnly = text.replace(/[^\u0E00-\u0E7F]/g, "");
console.log(thaiOnly); // "สวัสดี"

// String length: combining marks count as separate code points
console.log("กุ่".length); // 3 (consonant + below vowel + tone mark)
console.log([..."กุ่"].length); // 3 (spread iterates code points, not clusters)
```
Sorting Thai Text
Thai sorting follows the Royal Institute of Thailand dictionary order. The standard collation sorts consonants in their traditional order (ก ข ฃ ค ... ฮ), then by vowel, then by tone mark. A leading vowel is compared after the consonant it precedes, so ไก่ files under ก rather than under ไ. ICU provides a Thai-aware collator:
```python
import icu  # PyICU

collator = icu.Collator.createInstance(icu.Locale("th_TH"))
words = ["ไก่", "กา", "ข้าว", "กุ้ง"]
sorted_words = sorted(words, key=collator.getSortKey)
print(sorted_words)  # Thai dictionary order
```
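When ICU is not available, the two most visible effects of Thai collation can be roughly approximated with the standard library: moving each leading vowel behind its consonant, and treating tone marks as a tie-breaker rather than a primary difference. The sketch below is a rough stand-in under those assumptions, with a hypothetical `thai_sort_key` function; it does not implement the full Unicode Collation Algorithm and should not replace a real collator.

```python
import re

# A leading vowel (U+0E40..U+0E44) followed by a consonant (U+0E01..U+0E2E).
LEADING_VOWEL_THEN_CONSONANT = re.compile("([\u0E40-\u0E44])([\u0E01-\u0E2E])")
# Maitaikhu, tone marks, and other above-base signs (U+0E47..U+0E4E).
TONE_AND_SIGNS = re.compile("[\u0E47-\u0E4E]")


def thai_sort_key(word: str) -> tuple:
    """Approximate dictionary order: consonant before leading vowel, tones last."""
    reordered = LEADING_VOWEL_THEN_CONSONANT.sub(r"\2\1", word)
    primary = TONE_AND_SIGNS.sub("", reordered)
    return (primary, reordered)


words = ["ไก่", "กา", "ข้าว", "กุ้ง"]
print(sorted(words))                     # raw code point order
print(sorted(words, key=thai_sort_key))  # approximate dictionary order
```

With raw code point order, ไก่ lands at the end because its first stored character is the vowel ไ; with the reordering key it sorts among the other ก words, ahead of ข้าว.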
Key Takeaways
- Thai script is an abugida with 44 consonants, 32 vowel forms, and 4 tone marks, all in a single Unicode block (U+0E00–U+0E7F, 87 characters).
- Vowels can appear above, below, before, or after their consonant — rendering engines must handle multi-level stacking.
- Thai text has no spaces between words, making word segmentation a core challenge for search, line breaking, and NLP.
- The consonant class system (high, mid, low) interacts with tone marks and vowel length to determine one of five tones, a layer of complexity with no parallel in Latin-based writing.
- Libraries like PyThaiNLP and ICU provide dictionary-based word segmentation that is essential for any application processing Thai text.
- Thai characters are stored consonant first, then marks, with leading vowels stored before their consonant in visual order; rendering engines use OpenType shaping to position marks correctly.