📜 Script Stories

Tamil Script

Tamil has one of the oldest living literary traditions, spanning over 2,000 years, and its script uses a relatively small set of characters that combine to form a large syllabic inventory. This guide explores the Tamil Unicode block, its classical and modern character sets, and considerations for Tamil text in digital applications.

Tamil is one of the longest-surviving classical languages in the world, with a literary tradition spanning over two millennia. The Tamil script, used by more than 80 million people primarily in Tamil Nadu (India), Sri Lanka, Singapore, and the Tamil diaspora worldwide, has a distinctive structure that sets it apart from other Indian scripts — most notably, it lacks aspirated consonants and uses a relatively small character inventory that combines to produce a large syllabic repertoire. This guide explores the Tamil script's unique structure, its Unicode encoding, and the challenges of digital Tamil text processing.

Historical Background

Tamil script evolved from the ancient Brahmi script through a southern variant known as Tamil-Brahmi (Tamili), attested from the 3rd century BCE in cave inscriptions. The script underwent significant development:

Period | Script Form | Key Development
3rd century BCE | Tamil-Brahmi | Earliest inscriptions (cave writing)
4th–6th century | Vatteluttu | Rounded cursive form develops
7th–9th century | Pallava/Chola script | Used alongside Vatteluttu
12th century | Modern Tamil emerges | Simplified, current form takes shape
19th century | Print standardization | Typeface conventions established

A pivotal moment came in the 1970s–1980s when the Tamil Nadu government simplified the script by reducing the number of compound character forms, making it one of the more streamlined Indic scripts.

Script Structure

Tamil is an abugida — consonant letters carry an inherent vowel /a/ that is modified by diacritical vowel marks. However, Tamil differs from most other Indic scripts in several important ways.

Vowels (உயிரெழுத்து)

Tamil has 12 vowels (5 short and 5 long, in pairs, plus 2 diphthongs):

Short | Long | Name (Short) | Unicode
அ (a) | ஆ (aa) | a | U+0B85, U+0B86
இ (i) | ஈ (ii) | i | U+0B87, U+0B88
உ (u) | ஊ (uu) | u | U+0B89, U+0B8A
எ (e) | ஏ (ee) | e | U+0B8E, U+0B8F
ஒ (o) | ஓ (oo) | o | U+0B92, U+0B93
ஐ (ai) | - | ai (diphthong) | U+0B90
ஔ (au) | - | au (diphthong) | U+0B94

The vowel length distinction (short vs. long) is phonemically significant: changing vowel length changes meaning. The letter ஃ (aytham, U+0B83, named TAMIL SIGN VISARGA in Unicode) is a special character sometimes classified as a vowel, representing a voiceless sound.

Consonants (மெய்யெழுத்து)

Tamil has only 18 consonants — far fewer than most Indic scripts. The critical difference is that Tamil does not distinguish between voiced and unvoiced stops or between aspirated and unaspirated stops in its writing system:

Category | Letters | Count | Contrast with Hindi
Vallinam (hard) | க ச ட த ப ற | 6 | Hindi has 4x as many stops
Mellinam (soft/nasal) | ங ஞ ண ந ம ன | 6 | Similar nasal inventory
Idaiyinam (medial) | ய ர ல வ ழ ள | 6 | Includes ழ (retroflex approximant), which Hindi lacks; it is shared with Malayalam but rare elsewhere

In Hindi's Devanagari script, each stop consonant position has four variants (unvoiced, unvoiced aspirated, voiced, voiced aspirated). Tamil collapses these into a single letter: the actual pronunciation (voiced or unvoiced) is determined by phonological context, not by spelling. For example, the letter க represents /k/ at the start of a word and when doubled, but it is voiced to /g/ after a nasal (as in ங்க) and is softened between vowels.

This means Tamil needs far fewer Unicode code points for its consonant inventory but requires contextual phonological knowledge for correct pronunciation.

The Grantha Consonants

For writing Sanskrit loanwords that contain sounds not native to Tamil, the script includes six Grantha consonants:

Letter | Sound | Unicode | Usage
ஜ | ja | U+0B9C | Sanskrit/Hindi loanwords
ஶ | sha | U+0BB6 | Sanskrit loanwords
ஷ | ssa | U+0BB7 | Sanskrit loanwords
ஸ | sa | U+0BB8 | Sanskrit loanwords
ஹ | ha | U+0BB9 | Sanskrit loanwords
க்ஷ | ksha | U+0B95 U+0BCD U+0BB7 | Conjunct of க + ஷ

These are used primarily in proper nouns, technical terms, and religious texts.

Vowel Signs and Combined Letters (உயிர்மெய்யெழுத்து)

When a vowel follows a consonant, it is written as a vowel sign (dependent form) attached to the consonant:

Vowel | Sign on க | Result | Unicode (Sign)
a | க | ka | (inherent, no sign)
aa | கா | kaa | U+0BBE
i | கி | ki | U+0BBF
ii | கீ | kii | U+0BC0
u | கு | ku | U+0BC1
uu | கூ | kuu | U+0BC2
e | கெ | ke | U+0BC6
ee | கே | kee | U+0BC7
ai | கை | kai | U+0BC8
o | கொ | ko | U+0BCA
oo | கோ | koo | U+0BCB
au | கௌ | kau | U+0BCC

The vowel signs in கொ (o), கோ (oo), and கௌ (au) are two-part signs: they consist of a left component and a right component that wrap around the consonant. In Unicode, each is encoded as a single code point (U+0BCA, U+0BCB, U+0BCC) that has a canonical decomposition into its two parts:

கொ = க + ொ      →  U+0B95 + U+0BCA  (composed)
கொ = க + ெ + ா  →  U+0B95 + U+0BC6 + U+0BBE  (decomposed: left part + right part)
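
The relationship between the two spellings can be checked with Python's standard `unicodedata` module. The snippet below is a minimal sketch; the helper name `codepoints` is made up for this illustration.

```python
import unicodedata

def codepoints(text: str) -> list[str]:
    """Return the code points of text as U+XXXX strings (hypothetical helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]

composed = "\u0B95\u0BCA"          # KA + two-part vowel sign O
decomposed = "\u0B95\u0BC6\u0BBE"  # KA + left part (E) + right part (AA)

# NFD splits the two-part sign; NFC recombines it.
nfd = unicodedata.normalize("NFD", composed)
nfc = unicodedata.normalize("NFC", decomposed)

print(codepoints(nfd))  # ['U+0B95', 'U+0BC6', 'U+0BBE']
print(codepoints(nfc))  # ['U+0B95', 'U+0BCA']

# The two spellings differ as raw strings but match after normalization.
print(composed == decomposed)  # False
print(nfc == composed)         # True
```

In practice this suggests normalizing Tamil text to a single form (commonly NFC) before comparing, searching, or indexing it.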

Pulli (புள்ளி) — The Vowel Suppressor

To indicate a pure consonant without the inherent /a/ vowel, Tamil uses the pulli (a dot above the consonant, analogous to the virama in other Indic scripts):

க = ka (with inherent vowel)
க் = k  (pure consonant, pulli above)

The pulli is encoded as U+0BCD (TAMIL SIGN VIRAMA). In modern Tamil, the pulli is almost always drawn visibly on a vowelless consonant, unlike some North Indian scripts where the virama usually triggers a conjunct ligature instead. The few exceptions are traditional conjuncts such as க்ஷ.
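
One way to see how the inherent vowel, the vowel signs, and the pulli interact is to split a word into (consonant, vowel) pairs. The following is a rough sketch under simplifying assumptions: the names `split_syllables` and `is_consonant` and the transliteration labels are invented here, the input is assumed to be in NFC form, and independent vowels, ஃ, and non-Tamil characters are passed through with an empty label.

```python
PULLI = "\u0BCD"
# Dependent vowel signs mapped to informal transliterations (labels are ad hoc).
SIGN_TO_VOWEL = {
    "\u0BBE": "aa", "\u0BBF": "i", "\u0BC0": "ii", "\u0BC1": "u",
    "\u0BC2": "uu", "\u0BC6": "e", "\u0BC7": "ee", "\u0BC8": "ai",
    "\u0BCA": "o", "\u0BCB": "oo", "\u0BCC": "au",
}

def is_consonant(ch: str) -> bool:
    """True for code points in the Tamil consonant range U+0B95..U+0BB9."""
    return "\u0B95" <= ch <= "\u0BB9"

def split_syllables(word: str) -> list[tuple[str, str]]:
    """Pair each consonant with its vowel: a sign, inherent 'a', or '' after pulli."""
    pairs = []
    i = 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if is_consonant(ch) and nxt == PULLI:
            pairs.append((ch, ""))                  # pure consonant
            i += 2
        elif is_consonant(ch) and nxt in SIGN_TO_VOWEL:
            pairs.append((ch, SIGN_TO_VOWEL[nxt]))  # explicit vowel sign
            i += 2
        elif is_consonant(ch):
            pairs.append((ch, "a"))                 # inherent vowel
            i += 1
        else:
            pairs.append((ch, ""))                  # anything else, unlabelled
            i += 1
    return pairs

print(split_syllables("\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"))  # the word "Tamil"
# [('த', 'a'), ('ம', 'i'), ('ழ', '')]
```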

Combination Grid

The full Tamil syllabary is generated by combining 18 consonants with 12 vowels, plus the pure consonant form:

18 consonants x (12 vowels + 1 pure form) = 18 x 13 = 234 combinations

Add the 12 independent vowels, ஃ (aytham), and the Grantha consonants with their combinations, and the total practical inventory is around 300 distinct syllabic forms — all generated from fewer than 60 Unicode code points.
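
The arithmetic above can be reproduced by generating the grid directly. This is a small sketch, with `build_grid` as a hypothetical helper name; it covers only the 18 native consonants and leaves out the Grantha letters.

```python
# The 18 native consonants in traditional order (Grantha letters excluded).
CONSONANTS = [chr(cp) for cp in (
    0x0B95, 0x0B99, 0x0B9A, 0x0B9E, 0x0B9F, 0x0BA3,  # க ங ச ஞ ட ண
    0x0BA4, 0x0BA8, 0x0BAA, 0x0BAE, 0x0BAF, 0x0BB0,  # த ந ப ம ய ர
    0x0BB2, 0x0BB5, 0x0BB4, 0x0BB3, 0x0BB1, 0x0BA9,  # ல வ ழ ள ற ன
)]

# 11 dependent vowel signs; the empty string stands for the inherent /a/.
VOWEL_SIGNS = ["", "\u0BBE", "\u0BBF", "\u0BC0", "\u0BC1", "\u0BC2",
               "\u0BC6", "\u0BC7", "\u0BC8", "\u0BCA", "\u0BCB", "\u0BCC"]
PULLI = "\u0BCD"

def build_grid() -> dict[str, list[str]]:
    """Map each consonant to its 13 forms: pure consonant first, then 12 vowels."""
    return {c: [c + PULLI] + [c + sign for sign in VOWEL_SIGNS] for c in CONSONANTS}

grid = build_grid()
total = sum(len(row) for row in grid.values())
print(len(CONSONANTS), len(VOWEL_SIGNS), total)  # 18 12 234

# Distinct code points needed to spell every form in the grid.
code_points = set("".join(form for row in grid.values() for form in row))
print(len(code_points))  # 30 = 18 consonants + 11 signs + pulli
```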

The Unicode Tamil Block

Block | Range | Characters
Tamil | U+0B80 – U+0BFF | 72 assigned

The Tamil block is notably sparse compared to other Indic blocks — many code points in the range are unassigned, reflecting Tamil's smaller character inventory:

Range | Content | Count
U+0B82 – U+0B83 | Anusvara, Aytham | 2
U+0B85 – U+0B94 | Independent vowels | 12
U+0B95 – U+0BB9 | Consonants | 18 + 5 Grantha
U+0BBE – U+0BCC | Vowel signs | 11
U+0BCD | Pulli (virama) | 1
U+0BD0 | Tamil OM | 1
U+0BD7 | AU length mark | 1
U+0BE6 – U+0BEF | Tamil digits | 10
U+0BF0 – U+0BF2 | Tamil numbers (10, 100, 1000) | 3
U+0BF3 – U+0BFA | Tamil symbols (day, month, year, etc.) | 8
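
The counts in this table can be cross-checked against the Unicode database bundled with Python. The sketch below treats a code point as assigned when `unicodedata.name` knows a name for it; the exact figures depend on the Unicode version shipped with the interpreter, so the 72 in the comment is what recent Python versions are expected to report rather than a guarantee.

```python
import unicodedata
from collections import Counter

TAMIL_BLOCK = range(0x0B80, 0x0C00)  # U+0B80..U+0BFF inclusive

# A code point counts as assigned if the Unicode database has a name for it.
assigned = [chr(cp) for cp in TAMIL_BLOCK if unicodedata.name(chr(cp), None)]
by_category = Counter(unicodedata.category(ch) for ch in assigned)

print(len(assigned))                     # expected: 72
print(len(TAMIL_BLOCK) - len(assigned))  # unassigned slots in the block
for category, count in sorted(by_category.items()):
    print(category, count)               # e.g. Nd 10 for the ten digits
```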

Tamil Digits

Tamil has its own digit system:

Tamil | Value | Code Point
௦ | 0 | U+0BE6
௧ | 1 | U+0BE7
௨ | 2 | U+0BE8
௩ | 3 | U+0BE9
௪ | 4 | U+0BEA
௫ | 5 | U+0BEB
௬ | 6 | U+0BEC
௭ | 7 | U+0BED
௮ | 8 | U+0BEE
௯ | 9 | U+0BEF

Tamil also has special number signs for 10 (௰, U+0BF0), 100 (௱, U+0BF1), and 1000 (௲, U+0BF2), remnants of an older non-positional number system.
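
Both kinds of numerals are visible to Python's standard library. The following is a minimal sketch; `to_tamil_digits` and `TO_TAMIL_DIGITS` are names chosen for this example, and the function only handles the modern positional digits, not the older additive notation.

```python
import unicodedata

# Translation table from ASCII digits to Tamil digits U+0BE6..U+0BEF.
TO_TAMIL_DIGITS = {ord("0") + i: 0x0BE6 + i for i in range(10)}

def to_tamil_digits(n: int) -> str:
    """Render an integer positionally with Tamil digits."""
    return str(n).translate(TO_TAMIL_DIGITS)

tamil_2024 = to_tamil_digits(2024)
print(tamil_2024)
print(int(tamil_2024))  # 2024: int() accepts any Unicode decimal digit

# The signs for 10, 100, and 1000 carry numeric values but are not digits.
for sign in "\u0BF0\u0BF1\u0BF2":
    print(unicodedata.name(sign), unicodedata.numeric(sign), sign.isdecimal())
```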

Tamil Symbols

The Tamil block includes unique symbols not found in other Indic blocks:

Symbol | Name | Unicode | Usage
௳ | Day sign | U+0BF3 | Traditional calendar
௴ | Month sign | U+0BF4 | Traditional calendar
௵ | Year sign | U+0BF5 | Traditional calendar
௶ | Debit sign | U+0BF6 | Accounting
௷ | Credit sign | U+0BF7 | Accounting
௸ | As-above sign | U+0BF8 | Ditto mark
௹ | Rupee sign | U+0BF9 | Currency
௺ | Number sign | U+0BFA | Numbering

Rendering Tamil Text

Simpler Than Other Indic Scripts

Tamil rendering is considerably simpler than Bengali or Devanagari because modern Tamil has very few conjunct consonant forms. After the script reforms, most consonant clusters are written with an explicit pulli rather than forming ligatures:

Standard:  ந + ் + த = ந்த  (nt — pulli visible on ந, then த)
Contrast:  In Devanagari, न + ् + त → न्त (a conjunct ligature)

The main exceptions where special rendering is needed:

Cluster | Visual | Notes
க + ் + ஷ | க்ஷ | ksha (traditional conjunct)
ஸ + ் + ரீ | ஸ்ரீ | Shri (common in proper nouns)

Left-Position Vowel Signs

Like Bengali, Tamil has vowel signs that appear to the left of the consonant but are stored after it in Unicode:

Stored:  க (U+0B95) + ெ (U+0BC6)
Rendered: கெ  (vowel sign appears to the left of க)

The rendering engine must reorder these for correct display.
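
To make the reordering concrete, the sketch below lists the pieces of a single syllable in the order they appear on the page. It is only an illustration: real reordering happens inside the font shaping engine, `visual_pieces` is a hypothetical name, and every sign other than the left-side and two-part ones is simply treated as following the consonant.

```python
# Signs drawn entirely to the left of the consonant.
LEFT_SIGNS = {"\u0BC6", "\u0BC7", "\u0BC8"}
# Two-part signs mapped to their (left, right) pieces.
TWO_PART = {
    "\u0BCA": ("\u0BC6", "\u0BBE"),
    "\u0BCB": ("\u0BC7", "\u0BBE"),
    "\u0BCC": ("\u0BC6", "\u0BD7"),
}

def visual_pieces(consonant: str, sign: str) -> list[str]:
    """Return the pieces of one syllable in left-to-right visual order."""
    if sign in LEFT_SIGNS:
        return [sign, consonant]
    if sign in TWO_PART:
        left, right = TWO_PART[sign]
        return [left, consonant, right]
    return [consonant, sign] if sign else [consonant]

KA = "\u0B95"
for sign in ("\u0BBE", "\u0BC6", "\u0BCA"):
    pieces = visual_pieces(KA, sign)
    print([f"U+{ord(p):04X}" for p in pieces])
# ['U+0B95', 'U+0BBE']
# ['U+0BC6', 'U+0B95']
# ['U+0BC6', 'U+0B95', 'U+0BBE']
```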

Working with Tamil in Code

Python

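import unicodedata

# Tamil character properties
char = "\u0B95"  # க (ka)
print(unicodedata.name(char))      # TAMIL LETTER KA
print(unicodedata.category(char))  # Lo (Letter, other)

# Consonant letters in the block (18 native + 5 Grantha); the name lookup
# skips the unassigned code points inside the range
CONSONANTS = [chr(c) for c in range(0x0B95, 0x0BBA) if unicodedata.name(chr(c), None)]
print(len(CONSONANTS))  # 23

# Dependent vowel signs, keyed by transliteration ("a" is the inherent vowel)
VOWEL_SIGNS = {
    "a": "",
    "aa": "\u0BBE",
    "i": "\u0BBF",
    "ii": "\u0BC0",
    "u": "\u0BC1",
    "uu": "\u0BC2",
    "e": "\u0BC6",
    "ee": "\u0BC7",
    "ai": "\u0BC8",
    "o": "\u0BCA",
    "oo": "\u0BCB",
    "au": "\u0BCC",
}

# Generate the ka-row of the consonant + vowel grid
ka = "\u0B95"
for name, sign in VOWEL_SIGNS.items():
    syllable = ka + sign
    print(f"  {name}: {syllable}")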

JavaScript

// Tamil Unicode range detection
const tamilPattern = /[\u0B80-\u0BFF]/;

function containsTamil(text) {
  return tamilPattern.test(text);
}

// Grapheme segmentation for Tamil
const segmenter = new Intl.Segmenter("ta", { granularity: "grapheme" });
const text = "தமிழ்";  // "Tamil"
const graphemes = [...segmenter.segment(text)];
console.log(graphemes.map(g => g.segment));
// ["த", "மி", "ழ்"] (each vowel sign and pulli stays attached to its consonant)
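
Python's standard library has no direct counterpart to `Intl.Segmenter`. For simple Tamil text, a rough approximation is to attach every combining mark to the base character before it. The sketch below does only that; it is not an implementation of the full Unicode segmentation rules, and `tamil_clusters` is a name made up for this example.

```python
import unicodedata

def tamil_clusters(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it."""
    clusters: list[str] = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch   # vowel sign or pulli joins the previous base
        else:
            clusters.append(ch)  # new base character (or a stray leading mark)
    return clusters

word = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"  # the word "Tamil" in Tamil script
print(len(word))             # 5 code points
print(tamil_clusters(word))  # ['த', 'மி', 'ழ்']
```

Counting clusters rather than code points is usually what a character limit or a cursor movement should be based on.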

Sorting Tamil Text

Tamil sorting follows the traditional order: vowels first (அ ஆ இ ஈ ...), then consonants in their systematic order (க ங ச ஞ ...), with each consonant's vowel combinations grouped under that consonant. ICU provides proper Tamil collation:

import icu

collator = icu.Collator.createInstance(icu.Locale("ta_IN"))
words = ["தமிழ்", "அம்மா", "நன்றி", "கடல்"]
sorted_words = sorted(words, key=collator.getSortKey)
print(sorted_words)  # Tamil dictionary order
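
Where ICU is unavailable, a rough fallback is to rank letters by the traditional order instead of by code point. The sketch below is only an approximation and makes assumptions that are not taken from any standard: the `TRADITIONAL_ORDER` list, the `tamil_sort_key` name, and the treatment of vowel signs, pulli, and Grantha letters (left in code point order after the native letters) are illustrative choices. It does show why plain code point sorting falls short: ன (U+0BA9) sits directly after ந (U+0BA8) in Unicode but comes last among the 18 consonants in the traditional sequence.

```python
# Independent vowels, aytham, then the 18 consonants in traditional order.
TRADITIONAL_ORDER = [chr(cp) for cp in (
    0x0B85, 0x0B86, 0x0B87, 0x0B88, 0x0B89, 0x0B8A,  # அ ஆ இ ஈ உ ஊ
    0x0B8E, 0x0B8F, 0x0B90, 0x0B92, 0x0B93, 0x0B94,  # எ ஏ ஐ ஒ ஓ ஔ
    0x0B83,                                          # ஃ
    0x0B95, 0x0B99, 0x0B9A, 0x0B9E, 0x0B9F, 0x0BA3,  # க ங ச ஞ ட ண
    0x0BA4, 0x0BA8, 0x0BAA, 0x0BAE, 0x0BAF, 0x0BB0,  # த ந ப ம ய ர
    0x0BB2, 0x0BB5, 0x0BB4, 0x0BB3, 0x0BB1, 0x0BA9,  # ல வ ழ ள ற ன
)]
RANK = {letter: i for i, letter in enumerate(TRADITIONAL_ORDER)}

def tamil_sort_key(word: str) -> list[tuple[int, int]]:
    """Rank known letters by tradition; everything else by code point, after them."""
    return [(0, RANK[ch]) if ch in RANK else (1, ord(ch)) for ch in word]

letters = ["\u0BAA", "\u0BA9", "\u0BA8"]    # ப ன ந
print(sorted(letters))                      # code point order: ந ன ப
print(sorted(letters, key=tamil_sort_key))  # traditional order: ந ப ன
```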

TSCII and Legacy Encodings

Before Unicode, Tamil computing was plagued by numerous incompatible encodings:

Encoding | Origin | Status
TSCII (Tamil Script Code for Information Interchange) | Tamil Internet community, 1999 | Still used in some contexts
TAB (Tamil Bilingual) | Tamil Nadu government | Deprecated
TAM (Tamil Monolingual) | Tamil Nadu government | Rarely used
ISCII (Tamil portion) | Indian government standard | Rarely used
Bamini | Sri Lankan Tamil community | Still used in diaspora
Various font encodings | Individual font vendors | Widespread legacy data

The proliferation of font-based encodings — where Tamil text was stored as Latin characters mapped to Tamil glyphs by specific fonts — created massive interoperability problems. Converting legacy Tamil data to Unicode remains an active challenge:

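# Conceptual: TSCII to Unicode conversion (simplified)
# NOTE: treat the byte values below as illustrative placeholders and check
# them against the published TSCII chart before converting real data.
TSCII_TO_UNICODE = {
    0xA1: "\u0B85",  # அ
    0xA2: "\u0B86",  # ஆ
    0xA3: "\u0B87",  # இ
    # ... hundreds of mappings
}

def tscii_to_unicode(data: bytes) -> str:
    """Map each byte through the table; ASCII passes through unchanged."""
    out = []
    for b in data:
        if b in TSCII_TO_UNICODE:
            out.append(TSCII_TO_UNICODE[b])
        elif b < 0x80:
            out.append(chr(b))    # plain ASCII
        else:
            out.append("\uFFFD")  # unmapped high byte: flag it rather than guess
    return "".join(out)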
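
A byte-for-byte mapping is usually not the whole job. Glyph-oriented legacy encodings typically store text in visual order, so a left-side vowel sign comes before its consonant, while Unicode expects it after. The sketch below shows one possible reordering pass. It assumes the byte mapping has already produced Unicode characters in visual order, and `visual_to_logical` is a hypothetical helper; it must not be run on text that is already in Unicode logical order.

```python
import unicodedata

LEFT_SIGNS = {"\u0BC6", "\u0BC7", "\u0BC8"}  # signs drawn left of the consonant

def is_consonant(ch: str) -> bool:
    """True for code points in the Tamil consonant range U+0B95..U+0BB9."""
    return "\u0B95" <= ch <= "\u0BB9"

def visual_to_logical(text: str) -> str:
    """Move each left-side vowel sign after the consonant it visually precedes."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in LEFT_SIGNS and i + 1 < len(text) and is_consonant(text[i + 1]):
            out.append(text[i + 1])  # consonant first (logical order)
            out.append(ch)           # then the vowel sign
            i += 2
        else:
            out.append(ch)
            i += 1
    # NFC merges left + right pieces (e.g. U+0BC6 U+0BBE) into one two-part sign.
    return unicodedata.normalize("NFC", "".join(out))

visual = "\u0BC6\u0B95\u0BBE"  # left part, KA, right part, as laid out on the page
logical = visual_to_logical(visual)
print([f"U+{ord(c):04X}" for c in logical])  # ['U+0B95', 'U+0BCA']
```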

Tamil in Modern Digital Context

Web Typography

Tamil text requires extra line height due to vowel signs above and below consonants:

.tamil-text {
  font-family: "Noto Sans Tamil", "Latha", "Vijaya", sans-serif;
  line-height: 1.8;
  font-size: 1.1em; /* Tamil glyphs benefit from slightly larger size */
}

Tamil Domain Names

Tamil is supported in Internationalized Domain Names. The .இந்தியா (India in Tamil) and .சிங்கப்பூர் (Singapore in Tamil) TLDs exist, though adoption remains limited.
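
On the wire, a Tamil label travels as an ASCII string with the `xn--` prefix. The sketch below uses Python's built-in `punycode` codec to show the round trip for a single label. It deliberately skips the normalization and validity checks that full IDNA processing performs, and `to_ace` and `from_ace` are names invented for this example.

```python
ACE_PREFIX = "xn--"  # prefix that marks a Punycode-encoded DNS label

def to_ace(label: str) -> str:
    """Encode one Unicode label as ASCII using the standard punycode codec."""
    return ACE_PREFIX + label.encode("punycode").decode("ascii")

def from_ace(ace: str) -> str:
    """Decode an ASCII-compatible label back to Unicode."""
    return ace[len(ACE_PREFIX):].encode("ascii").decode("punycode")

label = "\u0B87\u0BA8\u0BCD\u0BA4\u0BBF\u0BAF\u0BBE"  # "India" in Tamil
ace = to_ace(label)
print(ace)                     # ASCII form of the label
print(from_ace(ace) == label)  # True
```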

Tamil Representation in the Unicode Consortium

The Tamil Virtual Academy and the Tamil Nadu government have been active participants in Unicode standards work, successfully advocating for the inclusion of Tamil-specific symbols (calendar signs, traditional numerals) and ensuring the Tamil block meets the needs of modern Tamil computing.

Key Takeaways

  • Tamil script has only 18 consonants (no aspirated/voiced distinction in writing) and 12 vowels, making it one of the most compact Indic scripts — encoded in just 72 characters in the Unicode Tamil block (U+0B80–U+0BFF).
  • The pulli (U+0BCD, virama) creates pure consonants and is almost always displayed visibly in modern Tamil, unlike viramas in many other Indic scripts that trigger ligatures.
  • Tamil's consonant-vowel grid generates ~300 syllabic forms from fewer than 60 code points — a compact and systematic encoding.
  • Two-part vowel signs (கொ, கோ, கௌ) appear on both sides of the consonant and are encoded as single code points that decompose into left and right components.
  • Tamil rendering is simpler than most Indic scripts due to minimal conjunct forms after the 20th-century script reforms.
  • Legacy encoding conversion (TSCII, Bamini, font-encoded data) remains a significant challenge for Tamil digital preservation and data migration.
