🇮🇳 TLMR: Tamil Language Model for Representation Learning

TLMR (Tamil Language Model for Representation Learning) is a lightweight Tamil-specific encoder built on the DeBERTa-V3 architecture and designed to learn high-quality contextual representations for Tamil. The model is pretrained exclusively on large-scale Tamil text using a custom Tamil Byte-Pair Encoding (BPE) tokenizer, enabling efficient tokenization and strong semantic representation for morphologically rich Tamil.

Unlike multilingual language models that share their vocabulary and parameters across many languages, TLMR focuses specifically on Tamil, allowing it to better capture the linguistic characteristics of the language and provide more effective representations for downstream NLP tasks.

Model Details

Model Architecture: DeBERTa-V3
Encoder Layers: 6
Hidden Size: 768
Attention Heads: 8
Tokenizer: Custom Tamil BPE
Vocabulary Size: 32K
Pretraining Corpus: 525M Tamil tokens
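
A few derived quantities follow directly from the figures above. The sketch below is only an illustration: the `EncoderSpec` class is a hypothetical helper, not part of the released model, and it assumes "32K" means exactly 32,000 vocabulary entries (it could also be 32,768).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderSpec:
    """Hyperparameters as listed under Model Details (hypothetical helper)."""
    num_layers: int = 6
    hidden_size: int = 768
    num_heads: int = 8
    vocab_size: int = 32_000  # assumption: "32K" read as 32,000

    @property
    def head_dim(self) -> int:
        """Width of each attention head; hidden size must divide evenly."""
        if self.hidden_size % self.num_heads:
            raise ValueError("hidden_size must be divisible by num_heads")
        return self.hidden_size // self.num_heads

    @property
    def token_embedding_params(self) -> int:
        """Parameters in the token embedding matrix alone."""
        return self.vocab_size * self.hidden_size


spec = EncoderSpec()
print(spec.head_dim)                # 96
print(spec.token_embedding_params)  # 24576000
```

Under that assumption, each attention head is 96 dimensions wide and the token embedding matrix accounts for roughly 24.6M parameters.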

Pretraining Data

TLMR was pretrained on a curated Tamil corpus of 525M tokens drawn from the following sources:

Tamil Wikipedia
Project Madurai
IndicCorp v2
OSCAR
Tamil news articles

The corpus was cleaned through text normalization, deduplication, script consistency filtering, and noise removal before tokenizer training and model pretraining.
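
The cleaning steps named above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual pipeline, which is not described here: the function names, NFC normalization, the 0.8 Tamil-character threshold, and exact-match deduplication are all assumptions, and noise removal is not covered.

```python
import unicodedata

TAMIL_START, TAMIL_END = 0x0B80, 0x0BFF  # Unicode Tamil block


def normalize(line: str) -> str:
    """NFC-normalize and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", line).split())


def tamil_ratio(line: str) -> float:
    """Fraction of letter and combining-mark characters in the Tamil block."""
    letters = [ch for ch in line if unicodedata.category(ch)[0] in ("L", "M")]
    if not letters:
        return 0.0
    tamil = sum(TAMIL_START <= ord(ch) <= TAMIL_END for ch in letters)
    return tamil / len(letters)


def clean_corpus(lines, min_tamil_ratio=0.8):
    """Yield normalized, mostly-Tamil lines, dropping exact duplicates."""
    seen = set()
    for raw in lines:
        line = normalize(raw)
        # Script consistency filter (threshold is an assumed value)
        if not line or tamil_ratio(line) < min_tamil_ratio:
            continue
        # Exact-match deduplication on the normalized text
        if line in seen:
            continue
        seen.add(line)
        yield line
```

A real pipeline at this scale would likely use near-duplicate detection and document-level filtering rather than exact line matching.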

Intended Uses

TLMR is designed as a general-purpose Tamil language encoder and can be used for a wide range of Tamil NLP applications, including:

Sentence Embedding
Semantic Similarity
Paraphrase Detection
Text Classification
Information Retrieval
Question Answering
Named Entity Recognition
Token Classification
Research on Tamil Language Understanding
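
For sentence embedding and semantic similarity, the token-level outputs need to be reduced to one vector per sentence. The sketch below uses masked mean pooling followed by cosine similarity; this is one common choice and an assumption here, not necessarily the pooling strategy used by the authors. The helper names `mean_pool` and `cosine_similarity_matrix` are hypothetical.

```python
import torch
import torch.nn.functional as F


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sequence, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (B, T, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (B, H)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # (B, 1)
    return summed / counts


def cosine_similarity_matrix(embeddings: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity between all rows of `embeddings`."""
    normed = F.normalize(embeddings, p=2, dim=1)
    return normed @ normed.T
```

With the variables from the Example Usage section, `mean_pool(outputs.last_hidden_state, inputs["attention_mask"])` would give one 768-dimensional vector per input sentence. When encoding several sentences at once, the tokenizer call would also need `padding=True` so the batch forms a rectangular tensor.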

Why TLMR?

Tamil is a morphologically rich and agglutinative language, making effective tokenization and representation learning particularly important. TLMR is designed to address these challenges through Tamil-specific pretraining and tokenizer design, providing strong contextual representations while maintaining a lightweight architecture suitable for research and downstream applications.
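
One way to check how well a tokenizer handles agglutinative word forms is its fertility, the average number of subword tokens produced per word. The sketch below is an illustrative measurement helper, not something reported in this card; the `fertility` function is hypothetical and treats whitespace-separated chunks as words, which is a simplification.

```python
from typing import Callable, Iterable, List


def fertility(tokenize: Callable[[str], List[str]], sentences: Iterable[str]) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    n_words = 0
    n_tokens = 0
    for sentence in sentences:
        n_words += len(sentence.split())
        n_tokens += len(tokenize(sentence))
    return n_tokens / n_words if n_words else 0.0
```

With a loaded Hugging Face tokenizer, `fertility(tokenizer.tokenize, sentences)` could be compared across TLMR and a multilingual tokenizer on the same Tamil sentences; lower values mean words are split into fewer pieces.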

Example Usage

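import torch
from transformers import AutoTokenizer, AutoModel

# Load the Tamil BPE tokenizer and the encoder weights from the Hub
tokenizer = AutoTokenizer.from_pretrained("viswadarshan06/TLMR")
model = AutoModel.from_pretrained("viswadarshan06/TLMR")

text = "தமிழ்நாட்டின் வளர்ச்சி தொடர்கிறது."

# Tokenize and run a forward pass without tracking gradients
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Token-level contextual embeddings, shape (batch, seq_len, hidden_size)
embeddings = outputs.last_hidden_state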

Acknowledgements

TLMR was developed as part of our research on Tamil semantic understanding and representation learning, which was published at *SEM 2026 - ACL Conference.

Safetensors

Model size: 92.7M params
Tensor type: F32

Model tree for viswadarshan06/TLMR

Base model: microsoft/deberta-v3-base