📜 Script Stories

Tamil Script

Tamil is one of the oldest living writing systems, with a literary tradition spanning over 2,000 years, and its script encodes a relatively small set of characters that combine to form a large syllabic inventory. This guide explores the Tamil Unicode block, its classical and modern character sets, and considerations for Tamil text in digital applications.

Published 2023-09-25 · Updated 2025-05-12

Tamil is one of the longest-surviving classical languages in the world, with a literary tradition spanning over two millennia. The Tamil script, used by more than 80 million people primarily in Tamil Nadu (India), Sri Lanka, Singapore, and the Tamil diaspora worldwide, has a distinctive structure that sets it apart from other Indian scripts — most notably, it lacks aspirated consonants and uses a relatively small character inventory that combines to produce a large syllabic repertoire. This guide explores the Tamil script's unique structure, its Unicode encoding, and the challenges of digital Tamil text processing.

Historical Background

Tamil script evolved from the ancient Brahmi script through a southern variant known as Tamil-Brahmi (Tamili), attested from the 3rd century BCE in cave inscriptions. The script underwent significant development:

Period	Script Form	Key Development
3rd century BCE	Tamil-Brahmi	Earliest inscriptions (cave writing)
4th–6th century	Vatteluttu	Rounded cursive form develops
7th–9th century	Pallava/Chola script	Used alongside Vatteluttu
12th century	Modern Tamil emerges	Simplified, current form takes shape
19th century	Print standardization	Typeface conventions established

A pivotal moment came in the 1970s–1980s when the Tamil Nadu government simplified the script by reducing the number of compound character forms, making it one of the more streamlined Indic scripts.

Script Structure

Tamil is an abugida — consonant letters carry an inherent vowel /a/ that is modified by diacritical vowel marks. However, Tamil differs from most other Indic scripts in several important ways.

Vowels (உயிரெழுத்து)

Tamil has 12 vowels (5 short vowels, their 5 long counterparts, and 2 diphthongs):

Short	Long	Name (Short)	Unicode
அ (a)	ஆ (aa)	a	U+0B85, U+0B86
இ (i)	ஈ (ii)	i	U+0B87, U+0B88
உ (u)	ஊ (uu)	u	U+0B89, U+0B8A
எ (e)	ஏ (ee)	e	U+0B8E, U+0B8F
ஒ (o)	ஓ (oo)	o	U+0B92, U+0B93
—	ஐ (ai)	ai (diphthong)	U+0B90
—	ஔ (au)	au (diphthong)	U+0B94

The vowel length distinction (short vs. long) is phonemically significant: changing vowel length changes meaning. The letter ஃ (aytham, U+0B83, named TAMIL SIGN VISARGA in Unicode) is a special character sometimes classified with the vowels; it represents a voiceless sound.

Consonants (மெய்யெழுத்து)

Tamil has only 18 consonants — far fewer than most Indic scripts. The critical difference is that Tamil does not distinguish between voiced and unvoiced stops or between aspirated and unaspirated stops in its writing system:

Category	Letters	Count	Contrast with Hindi
Vallinam (hard)	க ச ட த ப ற	6	Hindi has 4x as many stops
Mellinam (soft/nasal)	ங ஞ ண ந ம ன	6	Similar nasal inventory
Idaiyinam (medial)	ய ர ல வ ழ ள	6	Includes ழ (retroflex approximant), rare among Indic scripts (shared with Malayalam)

In Hindi's Devanagari script, each stop consonant position has four variants (unvoiced, unvoiced aspirated, voiced, voiced aspirated). Tamil collapses these into a single letter, and the actual pronunciation (voiced or unvoiced) is determined by phonological context, not by spelling. For example, the letter க is pronounced /k/ at the start of a word or when doubled, but is voiced to /g/ after a nasal (as in தங்கம், "gold") and is softened between vowels.

This means Tamil needs far fewer Unicode code points for its consonant inventory but requires contextual phonological knowledge for correct pronunciation.

The Grantha Consonants

For writing Sanskrit loanwords that contain sounds not native to Tamil, the script includes five Grantha letters plus the conjunct க்ஷ, six forms in all:

Letter	Sound	Unicode	Usage
ஜ	ja	U+0B9C	Sanskrit/Hindi loanwords
ஶ	sha	U+0BB6	Sanskrit loanwords
ஷ	ssa	U+0BB7	Sanskrit loanwords
ஸ	sa	U+0BB8	Sanskrit loanwords
ஹ	ha	U+0BB9	Sanskrit loanwords
க்ஷ	ksha	—	Conjunct of க + ஷ

These are used primarily in proper nouns, technical terms, and religious texts.

Vowel Signs (உயிர்மெய்யெழுத்து)

When a vowel follows a consonant, it is written as a vowel sign (dependent form) attached to the consonant:

Vowel	Sign on க	Result	Unicode (Sign)
a	(inherent)	க	—
aa	கா	kaa	U+0BBE
i	கி	ki	U+0BBF
ii	கீ	kii	U+0BC0
u	கு	ku	U+0BC1
uu	கூ	kuu	U+0BC2
e	கெ	ke	U+0BC6
ee	கே	kee	U+0BC7
ai	கை	kai	U+0BC8
o	கொ	ko	U+0BCA
oo	கோ	koo	U+0BCB
au	கௌ	kau	U+0BCC

The vowel signs in கொ (o), கோ (oo), and கௌ (au) are two-part signs: a left component and a right component that wrap around the consonant. Unicode encodes each as a single code point (U+0BCA, U+0BCB, U+0BCC), and each has a canonical decomposition into its left and right parts, so the same syllable can be stored in two canonically equivalent ways:

கொ = க + ொ  →  U+0B95 + U+0BCA  (composed)
கொ = க + ெ + ா  →  U+0B95 + U+0BC6 + U+0BBE  (decomposed: left part + right part)
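The equivalence of the composed and decomposed spellings can be checked with Python's standard `unicodedata` module. This is a minimal sketch; the variable names are illustrative only.

```python
import unicodedata

composed = "\u0B95\u0BCA"          # KA + VOWEL SIGN O (one code point for the sign)
decomposed = "\u0B95\u0BC6\u0BBE"  # KA + VOWEL SIGN E + VOWEL SIGN AA

# A plain comparison sees two different code point sequences.
print(composed == decomposed)  # False

# NFD splits the two-part sign; NFC recombines it.
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# Canonical decompositions of the three two-part vowel signs.
for sign in "\u0BCA\u0BCB\u0BCC":
    print(f"U+{ord(sign):04X} -> {unicodedata.decomposition(sign)}")
```

In practice this means Tamil strings should be normalized to one form (NFC is the usual choice) before they are compared, hashed, or used as dictionary keys.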

Pulli (புள்ளி) — The Vowel Suppressor

To indicate a pure consonant without the inherent /a/ vowel, Tamil uses the pulli (a dot above the consonant, analogous to the virama in other Indic scripts):

க = ka (with inherent vowel)
க் = k  (pure consonant, pulli above)

The pulli is encoded as U+0BCD (TAMIL SIGN VIRAMA). In modern Tamil the pulli is almost always rendered visibly, unlike in some North Indian scripts where the virama typically triggers a conjunct ligature instead. The few exceptions, such as க்ஷ, are covered in the rendering section.
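The interplay of inherent vowel, vowel sign, and pulli can be sketched as a toy romanizer. The mapping tables below are deliberately tiny and hypothetical (they do not follow any standard transliteration scheme), and the `romanize` function is an illustration rather than a complete implementation.

```python
PULLI = "\u0BCD"

# Hypothetical, partial mapping tables for illustration only.
CONSONANTS = {"\u0B95": "k", "\u0BA4": "t", "\u0BAE": "m", "\u0BB4": "zh"}
VOWEL_SIGNS = {"\u0BBE": "aa", "\u0BBF": "i", "\u0BC1": "u"}

def romanize(text: str) -> str:
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CONSONANTS and nxt == PULLI:
            out.append(CONSONANTS[ch])                     # pulli suppresses /a/
            i += 2
        elif ch in CONSONANTS and nxt in VOWEL_SIGNS:
            out.append(CONSONANTS[ch] + VOWEL_SIGNS[nxt])  # sign replaces /a/
            i += 2
        elif ch in CONSONANTS:
            out.append(CONSONANTS[ch] + "a")               # inherent /a/
            i += 1
        else:
            out.append(ch)                                 # pass through unknowns
            i += 1
    return "".join(out)

print(romanize("\u0B95"))                        # ka
print(romanize("\u0B95\u0BCD"))                  # k
print(romanize("\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"))  # tamizh
```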

Combination Grid

The full Tamil syllabary is generated by combining 18 consonants with 12 vowels, plus the pure consonant form:

18 consonants x (12 vowels + 1 pure form) = 18 x 13 = 234 combinations

Add the 12 independent vowels, ஃ (aytham), and the Grantha consonants with their combinations, and the total practical inventory is around 300 distinct syllabic forms — all generated from fewer than 60 Unicode code points.
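The 18 x 13 arithmetic can be reproduced directly from the code points. The sketch below assumes the five Grantha letters are excluded by code point and lists consonants in code point order, which differs slightly from the traditional alphabetical order.

```python
import unicodedata

GRANTHA = {0x0B9C, 0x0BB6, 0x0BB7, 0x0BB8, 0x0BB9}

# Native consonants: every assigned letter in U+0B95..U+0BB9 except Grantha.
consonants = [
    chr(cp) for cp in range(0x0B95, 0x0BBA)
    if unicodedata.name(chr(cp), None) and cp not in GRANTHA
]

# 13 endings: inherent /a/ (no sign), 11 dependent vowel signs, and the pulli.
vowel_signs = [chr(cp) for cp in (
    0x0BBE, 0x0BBF, 0x0BC0, 0x0BC1, 0x0BC2,
    0x0BC6, 0x0BC7, 0x0BC8, 0x0BCA, 0x0BCB, 0x0BCC,
)]
endings = [""] + vowel_signs + ["\u0BCD"]

grid = [[c + e for e in endings] for c in consonants]
total = sum(len(row) for row in grid)

print(len(consonants), len(endings), total)  # 18 13 234
print(" ".join(grid[0]))                     # the ka row
```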

The Unicode Tamil Block

Block	Range	Characters
Tamil	U+0B80 – U+0BFF	72 assigned

The Tamil block is notably sparse compared to other Indic blocks — many code points in the range are unassigned, reflecting Tamil's smaller character inventory:

Range	Content	Count
U+0B82 – U+0B83	Anusvara, Aytham	2
U+0B85 – U+0B94	Independent vowels	12
U+0B95 – U+0BB9	Consonants	18 + 5 Grantha
U+0BBE – U+0BCC	Vowel signs	11
U+0BCD	Pulli (virama)	1
U+0BD0	Tamil OM	1
U+0BD7	AU length mark	1
U+0BE6 – U+0BEF	Tamil digits	10
U+0BF0 – U+0BF2	Tamil numbers (10, 100, 1000)	3
U+0BF3 – U+0BFA	Tamil symbols (day, month, year, etc.)	8
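The sparseness of the block can be inspected with `unicodedata`. This is a small sketch: it treats a code point as assigned when it has a character name, and the exact total depends on the Unicode version bundled with the Python interpreter (recent versions should report 72).

```python
import unicodedata
from collections import Counter

BLOCK_START, BLOCK_END = 0x0B80, 0x0BFF

# A code point counts as assigned here if it has a character name.
assigned = [
    chr(cp) for cp in range(BLOCK_START, BLOCK_END + 1)
    if unicodedata.name(chr(cp), None)
]
by_category = Counter(unicodedata.category(ch) for ch in assigned)

block_size = BLOCK_END - BLOCK_START + 1
print(f"{len(assigned)} assigned of {block_size} code points")
for category, count in sorted(by_category.items()):
    print(category, count)
```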

Tamil Digits

Tamil has its own digit system:

Tamil	Value	Code Point
௦	0	U+0BE6
௧	1	U+0BE7
௨	2	U+0BE8
௩	3	U+0BE9
௪	4	U+0BEA
௫	5	U+0BEB
௬	6	U+0BEC
௭	7	U+0BED
௮	8	U+0BEE
௯	9	U+0BEF

Tamil also has special number signs for 10 (௰, U+0BF0), 100 (௱, U+0BF1), and 1000 (௲, U+0BF2), remnants of an older non-positional number system.
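Because the ten Tamil digits are ordinary decimal digits in Unicode, Python's `int()` parses them directly, while the signs for 10, 100, and 1000 carry a numeric value but are not decimal digits. The sketch below illustrates the difference; `TO_TAMIL` is a helper name chosen for this example.

```python
import unicodedata

tamil_number = "\u0BE7\u0BE8\u0BE9"  # Tamil digits 1, 2, 3
print(int(tamil_number))             # 123

# Translate ASCII digits to Tamil digits (U+0BE6..U+0BEF).
TO_TAMIL = str.maketrans("0123456789", "".join(chr(0x0BE6 + d) for d in range(10)))
print("2025".translate(TO_TAMIL))

# The number signs have numeric values but are not positional digits.
for sign in "\u0BF0\u0BF1\u0BF2":
    print(unicodedata.name(sign), unicodedata.numeric(sign), sign.isdecimal())
```

Validation code that relies on `str.isdigit()` or a regular expression's `\d` will therefore accept Tamil digits but not the older number signs.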

Tamil Symbols

The Tamil block includes unique symbols not found in other Indic blocks:

Symbol	Name	Unicode	Usage
௳	Day sign	U+0BF3	Traditional calendar
௴	Month sign	U+0BF4	Traditional calendar
௵	Year sign	U+0BF5	Traditional calendar
௶	Debit sign	U+0BF6	Accounting
௷	Credit sign	U+0BF7	Accounting
௸	As-above sign	U+0BF8	Ditto mark
௹	Rupee sign	U+0BF9	Currency
௺	Number sign	U+0BFA	Numbering

Rendering Tamil Text

Simpler Than Other Indic Scripts

Tamil rendering is considerably simpler than Bengali or Devanagari because modern Tamil has very few conjunct consonant forms. After the script reforms, most consonant clusters are written with an explicit pulli rather than forming ligatures:

Standard:  ந + ் + த = ந்த  (nt — pulli visible on ந, then த)
Contrast:  In Devanagari, न + ् + त → न्त (a conjunct ligature)

The main exceptions where special rendering is needed:

Cluster	Visual	Notes
க + ் + ஷ	க்ஷ	ksha — traditional conjunct
ஸ + ் + ரீ	ஸ்ரீ	Shri — common in proper nouns

Left-Position Vowel Signs

Like Bengali, Tamil has vowel signs that appear to the left of the consonant but are stored after it in Unicode:

Stored:  க (U+0B95) + ெ (U+0BC6)
Rendered: கெ  (vowel sign appears to the left of க)

The rendering engine must reorder these for correct display.
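The reordering can be illustrated with a simplified sketch. Real shaping engines work on glyphs and font tables, so the `visual_order` function below is only a conceptual model: it assumes each left-side sign directly follows its base consonant, and it decomposes two-part signs first so that their left halves are moved as well.

```python
import unicodedata

# Vowel signs drawn to the left of the consonant: e, ee, ai.
PREBASE_SIGNS = {"\u0BC6", "\u0BC7", "\u0BC8"}

def visual_order(text: str) -> list[str]:
    """Return code points in approximate left-to-right visual order."""
    out: list[str] = []
    for ch in unicodedata.normalize("NFD", text):
        if ch in PREBASE_SIGNS and out:
            out.insert(len(out) - 1, ch)  # place the sign before its consonant
        else:
            out.append(ch)
    return out

for syllable in ("\u0B95\u0BC6", "\u0B95\u0BCA"):  # ke, ko
    names = [unicodedata.name(ch) for ch in visual_order(syllable)]
    print(" | ".join(names))
```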

Working with Tamil in Code

Python

import unicodedata

# Tamil character properties
char = "\u0B95"  # க (ka)
print(unicodedata.name(char))      # TAMIL LETTER KA
print(unicodedata.category(char))  # Lo (Letter, other)

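# List the consonant letters assigned in the block (18 native + 5 Grantha)
CONSONANTS = [chr(c) for c in range(0x0B95, 0x0BBA) if unicodedata.name(chr(c), None)]
print(len(CONSONANTS))  # 23

# Dependent vowel signs; "a" is inherent, so it adds no code point
VOWEL_SIGNS = {
    "a": "",
    "aa": "\u0BBE",
    "i": "\u0BBF",
    "ii": "\u0BC0",
    "u": "\u0BC1",
    "uu": "\u0BC2",
    "e": "\u0BC6",
    "ee": "\u0BC7",
    "ai": "\u0BC8",
    "o": "\u0BCA",
    "oo": "\u0BCB",
    "au": "\u0BCC",
}

# Generate ka-row
ka = "\u0B95"
for name, sign in VOWEL_SIGNS.items():
    syllable = ka + sign
    print(f"  {name}: {syllable}")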

JavaScript

// Tamil Unicode range detection
const tamilPattern = /[\u0B80-\u0BFF]/;

function containsTamil(text) {
  return tamilPattern.test(text);
}

// Grapheme segmentation for Tamil
const segmenter = new Intl.Segmenter("ta", { granularity: "grapheme" });
const text = "தமிழ்";  // "Tamil"
const graphemes = [...segmenter.segment(text)];
console.log(graphemes.map(g => g.segment));
// ["த", "மி", "ழ்"]: vowel signs and the pulli stay attached to their consonant
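Outside the browser, a similar segmentation can be approximated in Python with only the standard library. The sketch below is not a full implementation of the Unicode grapheme cluster rules; it simply attaches every combining mark (general category M*) to the preceding character, which is enough for typical Tamil words. The function name `tamil_clusters` is made up for this example.

```python
import unicodedata

def tamil_clusters(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it."""
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch   # vowel sign or pulli joins the previous base
        else:
            clusters.append(ch)  # a new base character starts a cluster
    return clusters

word = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"  # the word "Tamil" in Tamil script
print(len(word))                   # 5 code points
print(len(tamil_clusters(word)))   # 3 user-perceived characters
print(tamil_clusters(word))
```

The gap between the two lengths is why cursor movement, truncation, and character limits should operate on clusters rather than on code points.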

Sorting Tamil Text

Tamil sorting follows the traditional vowel-consonant order: vowels first (அ ஆ இ ஈ ...), then the consonants in their systematic order (க ங ச ஞ ...), with each consonant's vowel combinations grouped under that consonant. ICU provides proper Tamil collation:

import icu

collator = icu.Collator.createInstance(icu.Locale("ta_IN"))
words = ["தமிழ்", "அம்மா", "நன்றி", "கடல்"]
sorted_words = sorted(words, key=collator.getSortKey)
print(sorted_words)  # Tamil dictionary order

TSCII and Legacy Encodings

Before Unicode, Tamil computing was plagued by numerous incompatible encodings:

Encoding	Origin	Status
TSCII (Tamil Script Code for Information Interchange)	Tamil Internet community, 1999	Still used in some contexts
TAB (Tamil Bilingual)	Tamil Nadu government	Deprecated
TAM (Tamil Monolingual)	Tamil Nadu government	Rarely used
ISCII (Tamil code page)	Indian government standard	Rarely used
Bamini	Sri Lankan Tamil community	Still used in diaspora
Various font encodings	Individual font vendors	Widespread legacy data

The proliferation of font-based encodings — where Tamil text was stored as Latin characters mapped to Tamil glyphs by specific fonts — created massive interoperability problems. Converting legacy Tamil data to Unicode remains an active challenge:

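# Conceptual: TSCII to Unicode conversion (simplified).
# Byte values follow the TSCII 1.7 layout as commonly documented;
# verify them against the official table before relying on them.
TSCII_TO_UNICODE = {
    0xAB: "\u0B85",  # அ
    0xAC: "\u0B86",  # ஆ
    0xAD: "\u0B87",  # இ
    # ... hundreds of mappings
}

def tscii_to_unicode(data: bytes) -> str:
    # ASCII passes through; unmapped high bytes become U+FFFD rather than
    # being misread as Latin-1. A real converter must also reorder left-side
    # vowel signs, which TSCII stores before the consonant (visual order).
    return "".join(
        TSCII_TO_UNICODE.get(b, chr(b) if b < 0x80 else "\uFFFD") for b in data
    )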

Tamil in Modern Digital Context

Web Typography

Tamil text usually benefits from extra line height, because vowel signs and the pulli extend above and below the consonants:

.tamil-text {
  font-family: "Noto Sans Tamil", "Latha", "Vijaya", sans-serif;
  line-height: 1.8;
  font-size: 1.1em; /* Tamil glyphs benefit from slightly larger size */
}

Tamil Domain Names

Tamil is supported in Internationalized Domain Names. The .இந்தியா (India in Tamil) and .சிங்கப்பூர் (Singapore in Tamil) TLDs exist, though adoption remains limited.
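On the wire, a Tamil label travels as an ASCII-compatible string produced by Punycode. The sketch below shows only that encoding step using Python's built-in `punycode` codec; it is not a complete IDNA implementation, which would also apply the mapping and validation rules of IDNA2008 or UTS #46.

```python
# The Tamil label for "India", written with explicit code points.
label = "\u0B87\u0BA8\u0BCD\u0BA4\u0BBF\u0BAF\u0BBE"

# Punycode-encode the label and add the ACE prefix used by IDNs.
ace = "xn--" + label.encode("punycode").decode("ascii")
print(ace)

# Decoding the part after the prefix restores the original label.
restored = ace[len("xn--"):].encode("ascii").decode("punycode")
print(restored == label)  # True
```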

Tamil Unicode Consortium Representation

The Tamil Virtual Academy and the Tamil Nadu government have been active participants in Unicode standards work, successfully advocating for the inclusion of Tamil-specific symbols (calendar signs, traditional numerals) and ensuring the Tamil block meets the needs of modern Tamil computing.

Key Takeaways

Tamil script has only 18 consonants (no aspirated/voiced distinction in writing) and 12 vowels, making it one of the most compact Indic scripts — encoded in just 72 characters in the Unicode Tamil block (U+0B80–U+0BFF).
The pulli (U+0BCD, virama) creates pure consonants and is almost always rendered visibly in modern Tamil, unlike viramas in many other Indic scripts that trigger ligatures.
Tamil's consonant-vowel grid generates ~300 syllabic forms from fewer than 60 code points — a compact and systematic encoding.
Two-part vowel signs (கொ, கோ, கௌ) appear on both sides of the consonant and are encoded as single code points that decompose into left and right components.
Tamil rendering is simpler than most Indic scripts due to minimal conjunct forms after the 20th-century script reforms.
Legacy encoding conversion (TSCII, Bamini, font-encoded data) remains a significant challenge for Tamil digital preservation and data migration.