Tamil Script
Tamil is one of the longest-surviving classical languages in the world, with a literary tradition spanning over two millennia. The Tamil script, used by more than 80 million people primarily in Tamil Nadu (India), Sri Lanka, Singapore, and the Tamil diaspora worldwide, has a distinctive structure that sets it apart from other Indian scripts — most notably, it lacks aspirated consonants and uses a relatively small character inventory that combines to produce a large syllabic repertoire. This guide explores the Tamil script's unique structure, its Unicode encoding, and the challenges of digital Tamil text processing.
Historical Background
Tamil script evolved from the ancient Brahmi script through a southern variant known as Tamil-Brahmi (Tamili), attested from the 3rd century BCE in cave inscriptions. The script underwent significant development:
| Period | Script Form | Key Development |
|---|---|---|
| 3rd century BCE | Tamil-Brahmi | Earliest inscriptions (cave writing) |
| 4th–6th century | Vatteluttu | Rounded cursive form develops |
| 7th–9th century | Pallava/Chola script | Used alongside Vatteluttu |
| 12th century | Modern Tamil emerges | Simplified, current form takes shape |
| 19th century | Print standardization | Typeface conventions established |
A pivotal moment came in the 1970s–1980s when the Tamil Nadu government simplified the script by reducing the number of compound character forms, making it one of the more streamlined Indic scripts.
Script Structure
Tamil is an abugida — consonant letters carry an inherent vowel /a/ that is modified by diacritical vowel marks. However, Tamil differs from most other Indic scripts in several important ways.
Vowels (உயிரெழுத்து)
Tamil has 12 vowels (5 short + 5 long pairs, plus 2 diphthongs):
| Short | Long | Name (Short) | Unicode |
|---|---|---|---|
| அ (a) | ஆ (aa) | a | U+0B85, U+0B86 |
| இ (i) | ஈ (ii) | i | U+0B87, U+0B88 |
| உ (u) | ஊ (uu) | u | U+0B89, U+0B8A |
| எ (e) | ஏ (ee) | e | U+0B8E, U+0B8F |
| ஒ (o) | ஓ (oo) | o | U+0B92, U+0B93 |
| — | ஐ (ai) | ai (diphthong) | U+0B90 |
| — | ஔ (au) | au (diphthong) | U+0B94 |
The vowel length distinction (short vs. long) is phonemically significant — changing vowel length changes meaning. The letter ஃ (aytham, U+0B83) is a special character sometimes classified as a vowel, representing a voiceless sound.
Consonants (மெய்யெழுத்து)
Tamil has only 18 consonants — far fewer than most Indic scripts. The critical difference is that Tamil does not distinguish between voiced and unvoiced stops or between aspirated and unaspirated stops in its writing system:
| Category | Letters | Count | Contrast with Hindi |
|---|---|---|---|
| Vallinam (hard) | க ச ட த ப ற | 6 | Hindi has 4x as many stops |
| Mellinam (soft/nasal) | ங ஞ ண ந ம ன | 6 | Similar nasal inventory |
| Idaiyinam (medial) | ய ர ல வ ழ ள | 6 | Includes ழ (retroflex approximant), shared with Malayalam but absent from Devanagari |
In Hindi's Devanagari script, each stop consonant position has four variants (unvoiced, unvoiced aspirated, voiced, voiced aspirated). Tamil collapses these into a single letter — the actual pronunciation (voiced or unvoiced) is determined by phonological context, not by spelling. For example, the letter க represents /k/ at the start of a word but /g/ between vowels.
This means Tamil needs far fewer Unicode code points for its consonant inventory but requires contextual phonological knowledge for correct pronunciation.
The Grantha Consonants
For writing Sanskrit loanwords that contain sounds not native to Tamil, the script includes six Grantha letters: five separately encoded consonants plus the conjunct க்ஷ (ksha):
| Letter | Sound | Unicode | Usage |
|---|---|---|---|
| ஜ | ja | U+0B9C | Sanskrit/Hindi loanwords |
| ஶ | sha | U+0BB6 | Sanskrit loanwords |
| ஷ | ssa | U+0BB7 | Sanskrit loanwords |
| ஸ | sa | U+0BB8 | Sanskrit loanwords |
| ஹ | ha | U+0BB9 | Sanskrit loanwords |
| க்ஷ | ksha | U+0B95 U+0BCD U+0BB7 | Conjunct of க + ் + ஷ (no code point of its own) |
These are used primarily in proper nouns, technical terms, and religious texts.
Vowel Signs (உயிர்மெய்யெழுத்து)
When a vowel follows a consonant, it is written as a vowel sign (dependent form) attached to the consonant:
| Vowel | Sign on க | Result | Unicode (Sign) |
|---|---|---|---|
| a | க | ka | (inherent, no sign) |
| aa | கா | kaa | U+0BBE |
| i | கி | ki | U+0BBF |
| ii | கீ | kii | U+0BC0 |
| u | கு | ku | U+0BC1 |
| uu | கூ | kuu | U+0BC2 |
| e | கெ | ke | U+0BC6 |
| ee | கே | kee | U+0BC7 |
| ai | கை | kai | U+0BC8 |
| o | கொ | ko | U+0BCA |
| oo | கோ | koo | U+0BCB |
| au | கௌ | kau | U+0BCC |
The vowel signs in கொ (o), கோ (oo), and கௌ (au) are two-part signs: a left component and a right component wrap around the consonant. Unicode encodes each as a single code point (U+0BCA, U+0BCB, U+0BCC) that has a canonical decomposition into its two parts, so the same syllable can be stored in two canonically equivalent ways:
கொ = க + ொ → U+0B95 + U+0BCA (composed)
கொ = க + ெ + ா → U+0B95 + U+0BC6 + U+0BBE (decomposed: left part + right part)
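Because both spellings are canonically equivalent, code that compares or searches Tamil text should normalize first. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `same_text` is an illustrative choice, not part of any library.

```python
import unicodedata

# The three two-part vowel signs: O, OO, AU.
TWO_PART_SIGNS = ["\u0BCA", "\u0BCB", "\u0BCC"]

# Show the canonical decomposition of each sign.
for sign in TWO_PART_SIGNS:
    parts = unicodedata.normalize("NFD", sign)
    codes = " + ".join(f"U+{ord(p):04X}" for p in parts)
    print(f"U+{ord(sign):04X} {unicodedata.name(sign)} -> {codes}")

composed = "\u0B95\u0BCA"          # ka + vowel sign O
decomposed = "\u0B95\u0BC6\u0BBE"  # ka + vowel sign E + vowel sign AA


def same_text(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)          # raw comparison differs
print(same_text(composed, decomposed))  # normalized comparison matches
```

Normalizing to NFC at input boundaries (file reads, form submissions, database writes) is usually simpler than normalizing at every comparison.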
Pulli (புள்ளி) — The Vowel Suppressor
To indicate a pure consonant without the inherent /a/ vowel, Tamil uses the pulli (a dot above the consonant, analogous to the virama in other Indic scripts):
க = ka (with inherent vowel)
க் = k (pure consonant, pulli above)
The pulli is encoded as U+0BCD (TAMIL SIGN VIRAMA). In modern Tamil, the pulli is always visible when a consonant stands alone — unlike some North Indian scripts where the virama may trigger a conjunct ligature instead.
Combination Grid
The full Tamil syllabary is generated by combining 18 consonants with 12 vowels, plus the pure consonant form:
18 consonants x (12 vowels + 1 pure form) = 18 x 13 = 234 combinations
Add the 12 independent vowels, ஃ (aytham), and the Grantha consonants with their combinations, and the total practical inventory is around 300 distinct syllabic forms — all generated from fewer than 60 Unicode code points.
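The arithmetic above can be checked by generating the grid directly. This sketch lists the 18 native consonants and the 13 modifiers (inherent vowel, 11 vowel signs, pulli) by code point; the variable names are illustrative.

```python
# 18 native consonants in traditional order (ka ... nnna).
CONSONANTS = [chr(cp) for cp in (
    0x0B95, 0x0B99, 0x0B9A, 0x0B9E, 0x0B9F, 0x0BA3,
    0x0BA4, 0x0BA8, 0x0BAA, 0x0BAE, 0x0BAF, 0x0BB0,
    0x0BB2, 0x0BB5, 0x0BB4, 0x0BB3, 0x0BB1, 0x0BA9,
)]

# Inherent /a/ (no sign), the 11 dependent vowel signs, then the pulli.
MODIFIERS = [""] + [chr(cp) for cp in (
    0x0BBE, 0x0BBF, 0x0BC0, 0x0BC1, 0x0BC2, 0x0BC6,
    0x0BC7, 0x0BC8, 0x0BCA, 0x0BCB, 0x0BCC,
)] + ["\u0BCD"]

# One row per consonant, one column per modifier.
grid = [[consonant + modifier for modifier in MODIFIERS] for consonant in CONSONANTS]
total = sum(len(row) for row in grid)

print(f"{len(CONSONANTS)} x {len(MODIFIERS)} = {total}")
print(" ".join(grid[0]))  # the ka row
```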
The Unicode Tamil Block
| Block | Range | Characters |
|---|---|---|
| Tamil | U+0B80 – U+0BFF | 72 assigned |
The Tamil block is notably sparse compared to other Indic blocks — many code points in the range are unassigned, reflecting Tamil's smaller character inventory:
| Range | Content | Count |
|---|---|---|
| U+0B82 – U+0B83 | Anusvara, Aytham | 2 |
| U+0B85 – U+0B94 | Independent vowels | 12 |
| U+0B95 – U+0BB9 | Consonants | 18 + 5 Grantha |
| U+0BBE – U+0BCC | Vowel signs | 11 |
| U+0BCD | Pulli (virama) | 1 |
| U+0BD0 | Tamil OM | 1 |
| U+0BD7 | AU length mark | 1 |
| U+0BE6 – U+0BEF | Tamil digits | 10 |
| U+0BF0 – U+0BF2 | Tamil numbers (10, 100, 1000) | 3 |
| U+0BF3 – U+0BFA | Tamil symbols (day, month, year, etc.) | 8 |
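The sparseness of the block can be observed from Python's `unicodedata` module. This is a small sketch; the exact count depends on the Unicode version bundled with the interpreter, so treat the 72 figure as what recent versions are expected to report.

```python
import unicodedata
from collections import Counter

TAMIL_BLOCK = range(0x0B80, 0x0C00)  # U+0B80 through U+0BFF

# A code point counts as assigned if it has a character name.
assigned = [chr(cp) for cp in TAMIL_BLOCK if unicodedata.name(chr(cp), None)]

# Group assigned characters by general category (Lo, Mn, Mc, Nd, ...).
by_category = Counter(unicodedata.category(ch) for ch in assigned)

print(f"{len(assigned)} assigned of {len(TAMIL_BLOCK)} code points")
for category, count in sorted(by_category.items()):
    print(f"  {category}: {count}")
```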
Tamil Digits
Tamil has its own digit system:
| Tamil | Value | Code Point |
|---|---|---|
| ௦ | 0 | U+0BE6 |
| ௧ | 1 | U+0BE7 |
| ௨ | 2 | U+0BE8 |
| ௩ | 3 | U+0BE9 |
| ௪ | 4 | U+0BEA |
| ௫ | 5 | U+0BEB |
| ௬ | 6 | U+0BEC |
| ௭ | 7 | U+0BED |
| ௮ | 8 | U+0BEE |
| ௯ | 9 | U+0BEF |
Tamil also has special number signs for 10 (௰, U+0BF0), 100 (௱, U+0BF1), and 1000 (௲, U+0BF2), remnants of an older non-positional number system.
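Since the Tamil digits are ordinary decimal digits in Unicode, Python's built-in `int()` parses them directly, and `unicodedata.numeric()` reports the values of the three number signs. The helper names below are illustrative.

```python
import unicodedata

TAMIL_ZERO = 0x0BE6
# Translation table from ASCII digits to Tamil digits.
TO_TAMIL = {ord("0") + i: TAMIL_ZERO + i for i in range(10)}


def to_tamil_digits(n: int) -> str:
    """Render an integer with Tamil digits (positional notation)."""
    return str(n).translate(TO_TAMIL)


def from_tamil_digits(text: str) -> int:
    """Parse a string of Tamil digits; int() accepts any Unicode decimal digit."""
    return int(text)


# Numeric values of the signs for ten, hundred, and thousand.
SPECIAL = {ch: unicodedata.numeric(ch) for ch in "\u0BF0\u0BF1\u0BF2"}

print(to_tamil_digits(2024))
print(from_tamil_digits(to_tamil_digits(2024)))
print(SPECIAL)
```

The number signs are not decimal digits, so `int()` rejects them; a parser for the older non-positional notation would need its own logic.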
Tamil Symbols
The Tamil block includes unique symbols not found in other Indic blocks:
| Symbol | Name | Unicode | Usage |
|---|---|---|---|
| ௳ | Day sign | U+0BF3 | Traditional calendar |
| ௴ | Month sign | U+0BF4 | Traditional calendar |
| ௵ | Year sign | U+0BF5 | Traditional calendar |
| ௶ | Debit sign | U+0BF6 | Accounting |
| ௷ | Credit sign | U+0BF7 | Accounting |
| ௸ | As-above sign | U+0BF8 | Ditto mark |
| ௹ | Rupee sign | U+0BF9 | Currency |
| ௺ | Number sign | U+0BFA | Numbering |
Rendering Tamil Text
Simpler Than Other Indic Scripts
Tamil rendering is considerably simpler than Bengali or Devanagari because modern Tamil has very few conjunct consonant forms. After the script reforms, most consonant clusters are written with an explicit pulli rather than forming ligatures:
Standard: ந + ் + த = ந்த (nt — pulli visible on ந, then த)
Contrast: In Devanagari, न + ् + त → न्त (a conjunct ligature)
The main exceptions where special rendering is needed:
| Cluster | Visual | Notes |
|---|---|---|
| க + ் + ஷ | க்ஷ | ksha — traditional conjunct |
| ஸ + ் + ரீ | ஸ்ரீ | Shri — common in proper nouns |
Left-Position Vowel Signs
Like Bengali, Tamil has vowel signs that appear to the left of the consonant but are stored after it in Unicode:
Stored: க (U+0B95) + ெ (U+0BC6)
Rendered: கெ (vowel sign appears to the left of க)
The rendering engine must reorder these for correct display.
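The logical (stored) order is easy to inspect in code. This sketch prints the code points of கெ in storage order and locates left-side vowel signs; `describe` and `left_side_signs` are hypothetical helper names.

```python
import unicodedata

# Vowel signs drawn entirely to the left of the consonant: E, EE, AI.
LEFT_SIDE_SIGNS = {"\u0BC6", "\u0BC7", "\u0BC8"}


def describe(text: str) -> list[str]:
    """List each code point in storage order with its Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


def left_side_signs(text: str) -> list[int]:
    """Return the string indices of left-side vowel signs."""
    return [i for i, ch in enumerate(text) if ch in LEFT_SIDE_SIGNS]


ke = "\u0B95\u0BC6"  # stored as consonant first, vowel sign second
for line in describe(ke):
    print(line)
print(left_side_signs(ke))
```

The consonant always comes first in the string even though the sign is drawn first on screen, which is why string operations such as prefix matching on க still work for கெ.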
Working with Tamil in Code
Python
import unicodedata
# Tamil character properties
char = "\u0B95" # க (ka)
print(unicodedata.name(char)) # TAMIL LETTER KA
print(unicodedata.category(char)) # Lo (Letter, other)
# Enumerate the Tamil consonant + vowel grid
CONSONANTS = [chr(c) for c in range(0x0B95, 0x0BBA) if unicodedata.name(chr(c), None)]
VOWEL_SIGNS = {
    "a": "",  # inherent vowel, no sign
    "aa": "\u0BBE",
    "i": "\u0BBF",
    "ii": "\u0BC0",
    "u": "\u0BC1",
    "uu": "\u0BC2",
}

# Generate ka-row
ka = "\u0B95"
for name, sign in VOWEL_SIGNS.items():
    syllable = ka + sign
    print(f" {name}: {syllable}")
JavaScript
// Tamil Unicode range detection
const tamilPattern = /[\u0B80-\u0BFF]/;
function containsTamil(text) {
  return tamilPattern.test(text);
}
// Grapheme segmentation for Tamil
const segmenter = new Intl.Segmenter("ta", { granularity: "grapheme" });
const text = "தமிழ்"; // "Tamil"
const graphemes = [...segmenter.segment(text)];
console.log(graphemes.map(g => g.segment));
// Individual grapheme clusters (handles pulli correctly)
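Python's standard library has no equivalent of `Intl.Segmenter`, but for simple Tamil text the clustering can be approximated by attaching each combining mark to the preceding base character. This is a simplified sketch, not an implementation of the full Unicode segmentation rules (it ignores joiners and other edge cases); `approximate_clusters` is a hypothetical name.

```python
import unicodedata


def approximate_clusters(text: str) -> list[str]:
    """Group each base character with the marks (Mn, Mc) that follow it."""
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Mc"):
            clusters[-1] += ch  # vowel sign or pulli joins the previous cluster
        else:
            clusters.append(ch)  # a new base character starts a cluster
    return clusters


word = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"  # தமிழ்
print(len(word))                    # number of code points
print(approximate_clusters(word))   # user-perceived characters
```

For production use, a library that implements the Unicode text segmentation algorithm is a safer choice than this approximation.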
Sorting Tamil Text
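Tamil dictionary order follows the traditional sequence: independent vowels first (அ ஆ இ ஈ ...), then consonants in their systematic order (க ங ச ஞ ...), with each consonant's vowel combinations grouped under that consonant. Raw code point order does not fully match this sequence (for example, ன is encoded before ப but sorts last among the consonants), so use a locale-aware collator. ICU provides Tamil collation, shown here through the PyICU bindings: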
import icu
collator = icu.Collator.createInstance(icu.Locale("ta_IN"))
words = ["தமிழ்", "அம்மா", "நன்றி", "கடல்"]
sorted_words = sorted(words, key=collator.getSortKey)
print(sorted_words) # Tamil dictionary order
TSCII and Legacy Encodings
Before Unicode, Tamil computing was plagued by numerous incompatible encodings:
| Encoding | Origin | Status |
|---|---|---|
| TSCII (Tamil Script Code for Information Interchange) | Tamil Internet community, 1999 | Still used in some contexts |
| TAB (Tamil Bilingual) | Tamil Nadu government | Deprecated |
| TAM (Tamil Monolingual) | Tamil Nadu government | Rarely used |
| Bamini | Sri Lankan Tamil community | Still used in diaspora |
| Various font encodings | Individual font vendors | Widespread legacy data |
The proliferation of font-based encodings — where Tamil text was stored as Latin characters mapped to Tamil glyphs by specific fonts — created massive interoperability problems. Converting legacy Tamil data to Unicode remains an active challenge:
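# Conceptual: TSCII to Unicode conversion (simplified).
# Byte values here are illustrative; verify them against the TSCII
# specification before use.
TSCII_TO_UNICODE = {
    0xA1: "\u0B85",  # அ
    0xA2: "\u0B86",  # ஆ
    0xA3: "\u0B87",  # இ
    # ... hundreds of mappings
}

def tscii_to_unicode(data: bytes) -> str:
    # Unmapped bytes fall back to the same-numbered code point,
    # which is only correct for the ASCII range.
    return "".join(TSCII_TO_UNICODE.get(b, chr(b)) for b in data)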
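A byte-to-character table is only part of the job. Glyph-based legacy encodings commonly store left-side vowel signs before the consonant (visual order), whereas Unicode stores them after it (logical order). The sketch below assumes the input has already been mapped to Unicode characters but is still in visual order; it swaps each prefix sign with the following consonant and then applies NFC so that split two-part signs are recomposed. The function names are hypothetical, and the consonant check is a simple range test.

```python
import unicodedata

# Vowel signs that legacy visual-order data places before the consonant.
PREFIX_SIGNS = {"\u0BC6", "\u0BC7", "\u0BC8"}  # E, EE, AI


def is_tamil_consonant(ch: str) -> bool:
    """Range test over U+0B95..U+0BB9 (assumption: input has no unassigned code points)."""
    return "\u0B95" <= ch <= "\u0BB9"


def visual_to_logical(text: str) -> str:
    """Move prefix vowel signs after their consonant, then recompose with NFC."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in PREFIX_SIGNS and i + 1 < len(text) and is_tamil_consonant(text[i + 1]):
            out.append(text[i + 1])  # consonant first
            out.append(ch)           # then the vowel sign
            i += 2
        else:
            out.append(ch)
            i += 1
    # NFC merges E + AA into O, EE + AA into OO, and E + AU length mark into AU.
    return unicodedata.normalize("NFC", "".join(out))


print(visual_to_logical("\u0BC6\u0B95"))        # sign E, ka -> ka + E
print(visual_to_logical("\u0BC6\u0B95\u0BBE"))  # sign E, ka, sign AA -> ka + O
```

Real converters also have to handle encoding-specific ligature glyphs and malformed input, which this sketch leaves out.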
Tamil in Modern Digital Context
Web Typography
Tamil text requires extra line height due to vowel signs above and below consonants:
.tamil-text {
  font-family: "Noto Sans Tamil", "Latha", "Vijaya", sans-serif;
  line-height: 1.8;
  font-size: 1.1em; /* Tamil glyphs benefit from slightly larger size */
}
Tamil Domain Names
Tamil is supported in Internationalized Domain Names. The .இந்தியா (India in Tamil) and .சிங்கப்பூர் (Singapore in Tamil) TLDs exist, though adoption remains limited.
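On the wire, an internationalized label travels as an ASCII string with the `xn--` prefix, produced by Punycode. The sketch below uses Python's built-in `punycode` codec to show the round trip for a single label. It deliberately skips the normalization and validation steps that full IDNA processing requires, and the helper names `to_ace` and `from_ace` are illustrative.

```python
ACE_PREFIX = "xn--"


def to_ace(label: str) -> str:
    """Encode one Unicode label to its ASCII-compatible form (no IDNA validation)."""
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def from_ace(ace: str) -> str:
    """Decode an ASCII-compatible label back to Unicode."""
    return ace[len(ACE_PREFIX):].encode("ascii").decode("punycode")


india = "\u0B87\u0BA8\u0BCD\u0BA4\u0BBF\u0BAF\u0BBE"  # இந்தியா
ace = to_ace(india)
print(ace)
print(from_ace(ace) == india)
```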
Tamil Unicode Consortium Representation
The Tamil Virtual Academy and the Tamil Nadu government have been active participants in Unicode standards work, successfully advocating for the inclusion of Tamil-specific symbols (calendar signs, traditional numerals) and ensuring the Tamil block meets the needs of modern Tamil computing.
Key Takeaways
- Tamil script has only 18 consonants (no aspirated/voiced distinction in writing) and 12 vowels, making it one of the most compact Indic scripts — encoded in just 72 characters in the Unicode Tamil block (U+0B80–U+0BFF).
- The pulli (U+0BCD, virama) creates pure consonants and is always visually displayed in modern Tamil, unlike viramas in many other Indic scripts that trigger ligatures.
- Tamil's consonant-vowel grid generates ~300 syllabic forms from fewer than 60 code points — a compact and systematic encoding.
- Two-part vowel signs (கொ, கோ, கௌ) appear on both sides of the consonant and are encoded as single code points that decompose into left and right components.
- Tamil rendering is simpler than most Indic scripts due to minimal conjunct forms after the 20th-century script reforms.
- Legacy encoding conversion (TSCII, Bamini, font-encoded data) remains a significant challenge for Tamil digital preservation and data migration.