🇮🇳 TLMR: Tamil Language Model for Representation Learning

TLMR (Tamil Language Model for Representation Learning) is a lightweight, Tamil-specific encoder built on the DeBERTa-V3 architecture and designed to learn contextual representations for Tamil. The model is pretrained exclusively on Tamil text (about 525M tokens) using a custom Tamil Byte-Pair Encoding (BPE) tokenizer, which is intended to tokenize morphologically rich Tamil words efficiently and support strong semantic representations.

Unlike multilingual language models that share their vocabulary and parameters across many languages, TLMR focuses specifically on Tamil, allowing it to better capture the linguistic characteristics of the language and provide more effective representations for downstream NLP tasks.


Model Details

  • Model Architecture: DeBERTa-V3
  • Encoder Layers: 6
  • Hidden Size: 768
  • Attention Heads: 8
  • Tokenizer: Custom Tamil BPE
  • Vocabulary Size: 32K
  • Pretraining Corpus: 525M Tamil tokens
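
The hyperparameters listed above imply a per-head dimension of 768 / 8 = 96. The following is a minimal sketch that records them and checks this arithmetic. The class name `TLMRSpec` is hypothetical and not part of the released model; the exact vocabulary count behind "32K" is not stated in this card, so it is left out rather than guessed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TLMRSpec:
    """Hyperparameters as listed in this model card (hypothetical container)."""
    num_layers: int = 6
    hidden_size: int = 768
    num_attention_heads: int = 8

    @property
    def head_dim(self) -> int:
        # Multi-head attention splits the hidden vector evenly across heads,
        # so the hidden size must be divisible by the number of heads.
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be divisible by num_attention_heads")
        return self.hidden_size // self.num_attention_heads


spec = TLMRSpec()
print(spec.head_dim)  # 96
```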

Pretraining Data

TLMR was pretrained on a curated Tamil corpus of about 525M tokens collected from multiple sources, including:

  • Tamil Wikipedia
  • Project Madurai
  • IndicCorp v2
  • OSCAR
  • Tamil news articles

The corpus was cleaned through text normalization, deduplication, script consistency filtering, and noise removal before tokenizer training and model pretraining.
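
The cleaning steps named above could look roughly like the sketch below. This is an illustration only, not the pipeline actually used for TLMR: the function names, the 0.8 Tamil-character threshold, and the minimum length of 10 characters are all assumptions chosen for the example.

```python
import hashlib
import re
import unicodedata

TAMIL_BLOCK = (0x0B80, 0x0BFF)  # Unicode code point range of the Tamil script


def normalize(line: str) -> str:
    """Text normalization: apply Unicode NFC and collapse whitespace."""
    line = unicodedata.normalize("NFC", line)
    return re.sub(r"\s+", " ", line).strip()


def tamil_ratio(line: str) -> float:
    """Fraction of letter and mark characters that fall in the Tamil block."""
    letters = [ch for ch in line if unicodedata.category(ch)[0] in ("L", "M")]
    if not letters:
        return 0.0
    tamil = sum(TAMIL_BLOCK[0] <= ord(ch) <= TAMIL_BLOCK[1] for ch in letters)
    return tamil / len(letters)


def clean_corpus(lines, min_tamil_ratio=0.8, min_chars=10):
    """Yield normalized, mostly-Tamil, non-duplicate lines (assumed thresholds)."""
    seen = set()
    for raw in lines:
        line = normalize(raw)
        if len(line) < min_chars:                # noise removal: very short fragments
            continue
        if tamil_ratio(line) < min_tamil_ratio:  # script consistency filtering
            continue
        digest = hashlib.sha1(line.encode("utf-8")).hexdigest()
        if digest in seen:                       # exact deduplication
            continue
        seen.add(digest)
        yield line


sample = [
    "தமிழ்நாட்டின்  வளர்ச்சி தொடர்கிறது.",   # extra space, duplicate after normalization
    "தமிழ்நாட்டின் வளர்ச்சி தொடர்கிறது.",
    "This is an English sentence.",
    "ok",
]
cleaned = list(clean_corpus(sample))
print(len(cleaned))  # 1
```

Exact-match hashing only catches identical lines after normalization; near-duplicate detection would need a different technique.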


Intended Uses

TLMR is designed as a general-purpose Tamil language encoder and can be used for a wide range of Tamil NLP applications, including:

  • Sentence Embedding
  • Semantic Similarity
  • Paraphrase Detection
  • Text Classification
  • Information Retrieval
  • Question Answering
  • Named Entity Recognition
  • Token Classification
  • Research on Tamil Language Understanding
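
For the similarity and retrieval uses listed above, a common approach is to rank candidate sentences by the cosine similarity of their embeddings to a query embedding. The sketch below shows only that ranking step on toy vectors; the function names are hypothetical, and real inputs would be sentence vectors derived from the encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_similarity(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors stand in for real sentence embeddings.
query = [1.0, 0.0, 1.0]
docs = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.9], [0.5, 0.5, 0.5]]
ranking = rank_by_similarity(query, docs)
print([i for i, _ in ranking])  # [1, 2, 0]
```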

Why TLMR?

Tamil is a morphologically rich and agglutinative language, making effective tokenization and representation learning particularly important. TLMR is designed to address these challenges through Tamil-specific pretraining and tokenizer design, providing strong contextual representations while maintaining a lightweight architecture suitable for research and downstream applications.
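
One common way to quantify how well a tokenizer handles an agglutinative language is fertility: the average number of subword tokens produced per whitespace-separated word, where lower values mean words are split into fewer pieces. The sketch below is a generic measurement helper, not a reported result for TLMR; `char_tokenize` and `word_tokenize` are hypothetical stand-ins used only to show the two extremes.

```python
def fertility(sentences, tokenize):
    """Average number of tokens per whitespace-separated word."""
    n_words = 0
    n_tokens = 0
    for sentence in sentences:
        n_words += len(sentence.split())
        n_tokens += len(tokenize(sentence))
    return n_tokens / n_words if n_words else 0.0


def char_tokenize(text):
    # Hypothetical worst case: one token per non-space character.
    return [ch for ch in text if not ch.isspace()]


def word_tokenize(text):
    # Hypothetical best case: one token per word.
    return text.split()


print(fertility(["a bc"], word_tokenize))  # 1.0
print(fertility(["a bc"], char_tokenize))  # 1.5
```

To measure a Hugging Face tokenizer, pass its `tokenize` method as the second argument, for example `fertility(sentences, tokenizer.tokenize)`.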


Example Usage

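import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and encoder from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("viswadarshan06/TLMR")
model = AutoModel.from_pretrained("viswadarshan06/TLMR")

text = "தமிழ்நாட்டின் வளர்ச்சி தொடர்கிறது."

# Tokenize and run a forward pass without tracking gradients.
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Token-level contextual embeddings, shape (batch, sequence_length, hidden_size).
embeddings = outputs.last_hidden_state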
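
The example above returns one vector per token. To get a single sentence vector, one common choice is masked mean pooling over the token vectors; this card does not state which pooling strategy the authors use, so the sketch below is an assumption. The helper name `mean_pool` is hypothetical, and synthetic tensors are used so the snippet runs without downloading the model.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sequence, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


# Synthetic stand-in for model outputs: batch of 2, 3 tokens each, hidden size 4.
hidden = torch.ones(2, 3, 4)
hidden[1, 2] = 100.0  # value at a padded position; it must not affect the result
mask = torch.tensor([[1, 1, 1], [1, 1, 0]])
pooled = mean_pool(hidden, mask)
print(pooled.shape)  # torch.Size([2, 4])
```

With the real model, the call would be `mean_pool(outputs.last_hidden_state, inputs["attention_mask"])`.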

Acknowledgements

TLMR was developed as part of our research on Tamil semantic understanding and representation learning, and was published at *SEM 2026 (ACL conference).

Model size: 92.7M parameters (F32, Safetensors)