🇮🇳 TLMR: Tamil Language Model for Representation Learning
TLMR (Tamil Language Model for Representation Learning) is a lightweight Tamil-specific encoder built on the DeBERTa-V3 architecture and designed to learn high-quality contextual representations for Tamil. The model is pretrained exclusively on large-scale Tamil text using a custom Tamil Byte-Pair Encoding (BPE) tokenizer, enabling efficient tokenization and strong semantic representation for morphologically rich Tamil.
Unlike multilingual language models that share their vocabulary and parameters across many languages, TLMR focuses specifically on Tamil, allowing it to better capture the linguistic characteristics of the language and provide more effective representations for downstream NLP tasks.
Model Details
- Model Architecture: DeBERTa-V3
- Encoder Layers: 6
- Hidden Size: 768
- Attention Heads: 8
- Tokenizer: Custom Tamil BPE
- Vocabulary Size: 32K
- Pretraining Corpus: 525M Tamil tokens
Pretraining Data
TLMR was pretrained on a large, carefully curated Tamil corpus collected from multiple high-quality sources, including:
- Tamil Wikipedia
- Project Madurai
- IndicCorp v2
- OSCAR
- Tamil news articles
The corpus was cleaned through text normalization, deduplication, script consistency filtering, and noise removal before tokenizer training and model pretraining.
Intended Uses
TLMR is designed as a general-purpose Tamil language encoder and can be used for a wide range of Tamil NLP applications, including:
- Sentence Embedding
- Semantic Similarity
- Paraphrase Detection
- Text Classification
- Information Retrieval
- Question Answering
- Named Entity Recognition
- Token Classification
- Research on Tamil Language Understanding
Why TLMR?
Tamil is a morphologically rich and agglutinative language, making effective tokenization and representation learning particularly important. TLMR is designed to address these challenges through Tamil-specific pretraining and tokenizer design, providing strong contextual representations while maintaining a lightweight architecture suitable for research and downstream applications.
Example Usage
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("viswadarshan06/TLMR")
model = AutoModel.from_pretrained("viswadarshan06/TLMR")
text = "தமிழ்நாட்டின் வளர்ச்சி தொடர்கிறது."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
embeddings = outputs.last_hidden_state
Acknowledgements
TLMR was developed as part of our research on Tamil semantic understanding and representation learning and published in *SEM 2026 - ACL Conference.
- Downloads last month
- -
Model tree for viswadarshan06/TLMR
Base model
microsoft/deberta-v3-base