Morgan-Tanglish-v7
The first sentence embedding model specifically designed for Romanised Indian code-mixed text.
This is Tanglish: Tamil and English mixed in Roman script. It's how 80+ million Tamil speakers communicate on WhatsApp, YouTube, Instagram, and Reddit every day.
Existing sentence transformers fail on this text because they were trained on formal multilingual corpora with no exposure to code-switching patterns. We built the model that understands how Indians actually type.
TanglishSTS Benchmark
We introduce TanglishSTS, the first human-annotated semantic similarity benchmark for Romanised Tamil-English code-mixed text: 325 sentence pairs scored 0-5 by a native Tamil speaker across 4 similarity levels.
Main Results (Spearman Correlation, higher is better)
| Model | TanglishSTS | Nuanced STS | Domain Avg | Parameters |
|---|---|---|---|---|
| Morgan-Tanglish-v7 (ours) | 0.8689 | 0.5451 | 0.7446 | 118M |
| L3Cube-IndicSBERT | 0.7642 | 0.1514 | 0.5729 | - |
| BAAI/bge-m3 | 0.7583 | 0.1204 | 0.6804 | 570M |
| all-MiniLM-L6-v2 | 0.7510 | 0.2062 | 0.6094 | 22.7M |
| paraphrase-multilingual-MiniLM-L12-v2 | 0.7279 | 0.2788 | 0.5930 | 118M |
Key results:
- +11.79 points over all-MiniLM-L6-v2 on TanglishSTS (0.8689 vs 0.7510)
- +11.06 points over BAAI/bge-m3 (570M params) with only 118M params
- +10.47 points over L3Cube-IndicSBERT (existing Indian model)
- 4.5x better nuanced similarity understanding than BAAI/bge-m3
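The point gaps in this list are differences in Spearman correlation multiplied by 100, and the nuanced-similarity figure is a ratio of Nuanced STS scores. A minimal sketch that recomputes them from the Main Results table (the dictionary layout and variable names are illustrative, not part of the released code):

```python
# Spearman scores copied from the Main Results table: (TanglishSTS, Nuanced STS)
results = {
    "Morgan-Tanglish-v7": (0.8689, 0.5451),
    "L3Cube-IndicSBERT": (0.7642, 0.1514),
    "BAAI/bge-m3": (0.7583, 0.1204),
    "all-MiniLM-L6-v2": (0.7510, 0.2062),
    "paraphrase-multilingual-MiniLM-L12-v2": (0.7279, 0.2788),
}

ours_sts, ours_nuanced = results["Morgan-Tanglish-v7"]

# Gap in points = difference in Spearman correlation x 100
point_gaps = {
    name: round((ours_sts - sts) * 100, 2)
    for name, (sts, _) in results.items()
    if name != "Morgan-Tanglish-v7"
}

# Ratio of Nuanced STS scores against each baseline
nuanced_ratios = {
    name: round(ours_nuanced / nuanced, 2)
    for name, (_, nuanced) in results.items()
    if name != "Morgan-Tanglish-v7"
}

for name in point_gaps:
    print(f"{name}: +{point_gaps[name]} points, {nuanced_ratios[name]}x nuanced")
```

Recomputing from the table this way keeps the headline numbers in sync if any score is updated.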
Domain-Specific Performance
| Domain | all-MiniLM | bge-m3 | IndicSBERT | Morgan-tanglish-st-1 |
|---|---|---|---|---|
| Gym/Fitness | 0.667 | 0.660 | 0.478 | 0.782 |
| Movies | 0.631 | 0.810 | 0.660 | 0.861 |
| College | 0.732 | 0.773 | 0.591 | 0.866 |
| Work/Salary | 0.709 | 0.743 | 0.597 | 0.825 |
| Tech (GPU/Phone) | 0.628 | 0.728 | 0.635 | 0.701 |
| Social Media | 0.298 | 0.485 | 0.443 | 0.385 |
| Food/Family | 0.624 | 0.668 | 0.698 | 0.702 |
| Cricket | 0.635 | 0.657 | 0.462 | 0.781 |
| Startups | 0.602 | 0.614 | 0.545 | 0.750 |
The model leads in 7 of the 9 domains; bge-m3 scores higher on Tech (0.728 vs 0.701) and Social Media (0.485 vs 0.385). Social Media improvements are ongoing for v2.
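The per-domain comparison can be tallied directly from the table. A minimal sketch, with the scores copied from the rows above (the variable names are illustrative):

```python
# Per-domain Spearman scores copied from the table above.
# Column order matches the table header.
models = ["all-MiniLM", "bge-m3", "IndicSBERT", "Morgan-tanglish-st-1"]
domain_scores = {
    "Gym/Fitness": [0.667, 0.660, 0.478, 0.782],
    "Movies": [0.631, 0.810, 0.660, 0.861],
    "College": [0.732, 0.773, 0.591, 0.866],
    "Work/Salary": [0.709, 0.743, 0.597, 0.825],
    "Tech (GPU/Phone)": [0.628, 0.728, 0.635, 0.701],
    "Social Media": [0.298, 0.485, 0.443, 0.385],
    "Food/Family": [0.624, 0.668, 0.698, 0.702],
    "Cricket": [0.635, 0.657, 0.462, 0.781],
    "Startups": [0.602, 0.614, 0.545, 0.750],
}

# Highest-scoring model in each domain
best_per_domain = {
    domain: models[scores.index(max(scores))]
    for domain, scores in domain_scores.items()
}

# Margin of the last column over the strongest baseline
# (a negative margin means a baseline leads in that domain)
margins = {
    domain: round(scores[3] - max(scores[:3]), 3)
    for domain, scores in domain_scores.items()
}

for domain, best in best_per_domain.items():
    print(f"{domain:18s} best={best:22s} margin={margins[domain]:+.3f}")
```

The margin column makes it easy to see which domains are close calls and which are clear leads.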
Visualizations
Embedding Space (t-SNE)
Sentences cluster by semantic domain without any explicit domain labels: the model learns domain structure purely from meaning.
Benchmark Comparison
Ablation Study: Every Component's Contribution
Similarity Score Distribution
How well each model separates human similarity scores 0-5. Better models show clear separation between score levels.
Ranking Quality
IDENTICAL-SIMILAR gap determines RAG retrieval accuracy. Larger gap = better document retrieval.
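One way to put numbers on the separation and gap described above is to average the model's cosine similarity per human score level and compare levels. The sketch below assumes IDENTICAL means human score 5 and SIMILAR means score 4 (an assumption; the card does not define the gap numerically), and the scores are made up purely for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (human_score, model_cosine) pairs, for illustration only
scored_pairs = [
    (5, 0.91), (5, 0.87),
    (4, 0.74), (4, 0.70),
    (2, 0.41), (3, 0.52),
    (0, 0.05), (1, 0.12),
]

def mean_cosine_per_level(pairs):
    """Average model cosine similarity for each human score level."""
    by_level = defaultdict(list)
    for human_score, cosine in pairs:
        by_level[human_score].append(cosine)
    return {level: mean(values) for level, values in sorted(by_level.items())}

level_means = mean_cosine_per_level(scored_pairs)

# Assumed definition: mean cosine at level 5 minus mean cosine at level 4
identical_similar_gap = level_means[5] - level_means[4]

# Levels are cleanly separated if the means increase with the human score
ordered_means = list(level_means.values())
is_monotonic = all(a < b for a, b in zip(ordered_means, ordered_means[1:]))

print(level_means)
print(f"IDENTICAL-SIMILAR gap: {identical_similar_gap:.2f}")
print(f"Monotonic separation: {is_monotonic}")
```

A larger gap leaves more room to pick a retrieval threshold between exact matches and merely similar text.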
Usage
Basic Sentence Similarity
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
model = SentenceTransformer("vishnuexe/Morgan-Tanglish-v7")
sentences = [
"bro office ku late ah vanthen romba tired ah iruku",
"machan office reach aaga late aachu semma tiredness ah iruku",
"nalla movie da climax twist semma mass irundhuchu",
]
embeddings = model.encode(sentences)
# embeddings[0] and embeddings[1] -> HIGH similarity (same meaning)
# embeddings[0] and embeddings[2] -> LOW similarity (different topic)
print(cos_sim(embeddings[0], embeddings[1])) # ~0.78
print(cos_sim(embeddings[0], embeddings[2])) # ~0.03
Semantic Search / RAG
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search
model = SentenceTransformer("vishnuexe/Morgan-Tanglish-v7")
# knowledge base in Tanglish
corpus = [
"office la overtime pannanum nu manager sonna",
"salary hike this quarter expect panna vendam nu HR sonnanga",
"work from home policy change aagudhu next month",
"team outing next friday plan pannirukanga",
"new project kick off monday la iruku prepare pannikonga",
]
query = "boss enna extra work kudutharu da"
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)
results = semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in results[0]:
    print(f"Score: {hit['score']:.4f} | {corpus[hit['corpus_id']]}")
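Conceptually, `semantic_search` ranks corpus embeddings by cosine similarity to the query and returns the `top_k` hits. A minimal pure-Python sketch of that ranking step, using made-up 3-dimensional vectors in place of real embeddings (`normalize` and `top_k_search` are illustrative helpers, not library functions):

```python
import heapq
import math

def normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k_search(query_vec, corpus_vecs, top_k=2):
    """Return the top_k corpus entries ranked by cosine similarity."""
    q = normalize(query_vec)
    scored = [
        {"corpus_id": i, "score": sum(a * b for a, b in zip(q, normalize(v)))}
        for i, v in enumerate(corpus_vecs)
    ]
    return heapq.nlargest(top_k, scored, key=lambda hit: hit["score"])

# Toy vectors standing in for model.encode(...) output
corpus_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query_vec = [0.9, 0.1, 0.0]

hits = top_k_search(query_vec, corpus_vecs, top_k=2)
for hit in hits:
    print(f"Score: {hit['score']:.4f} | corpus_id={hit['corpus_id']}")
```

For real corpora the library call is preferable, since it batches the computation on tensors.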
Batch Encoding
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("vishnuexe/Morgan-Tanglish-v7")
sentences = [
"dei gymku poganum but lazy ah iruku",
"machan exam nerungudhu prepare pannala tension",
"bro salary vandhuchu semma happy ah iruku",
"amma cooking taste yarum beat pannala da",
]
# Encode all at once for efficient batch processing
embeddings = model.encode(sentences, batch_size=32, show_progress_bar=True)
print(embeddings.shape) # (4, 384)
Duplicate Detection
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
model = SentenceTransformer("vishnuexe/Morgan-Tanglish-v7")
def are_duplicates(text1, text2, threshold=0.75):
    emb1 = model.encode([text1])
    emb2 = model.encode([text2])
    similarity = cos_sim(emb1, emb2)[0][0].item()
    return similarity > threshold, similarity
# same complaint, different wording
t1 = "internet speed romba slow ah iruku work pannala"
t2 = "wifi connection mosam ah iruku productivity poiduchu"
is_dup, score = are_duplicates(t1, t2)
print(f"Duplicate: {is_dup} | Score: {score:.4f}") # True | ~0.76
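For more than two texts, the same threshold can be applied to every pair and the matches merged into groups. The sketch below does this with a small union-find over precomputed similarity scores; the scores and the `group_duplicates` helper are hypothetical and stand in for `cos_sim` output.

```python
def group_duplicates(num_texts, pair_scores, threshold=0.75):
    """Group text indices whose pairwise similarity exceeds the threshold."""
    parent = list(range(num_texts))

    def find(i):
        # Follow parent links to the group representative (with path halving)
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the groups of every pair scoring above the threshold
    for (i, j), score in pair_scores.items():
        if score > threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(num_texts):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical cosine scores between 4 texts (index pair -> similarity)
pair_scores = {
    (0, 1): 0.76, (0, 2): 0.10, (0, 3): 0.05,
    (1, 2): 0.12, (1, 3): 0.08, (2, 3): 0.81,
}
duplicate_groups = group_duplicates(4, pair_scores)
print(duplicate_groups)
```

Note that this grouping is transitive: if A matches B and B matches C, all three land in one group even when A and C score below the threshold.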
Training
Data Pipeline
Stage 1: Corpus Construction
Built the largest Romanised Tanglish corpus from scratch:
- YouTube comments from 11 domain-diverse channels (cooking, cricket, tech, comedy, news)
- Reddit (r/TamilNaduDiscussion)
- DravidianCodeMix (Zenodo)
- Raw: 1,806,024 comments -> Filtered: 185,973 clean sentences
- Custom morphological filter (27 patterns, 1,018-token Tamil lexicon), no langdetect
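The filter itself is not published in this card, so the following is only a rough sketch of how a lexicon-plus-suffix-pattern filter could look. The three patterns and the tiny lexicon are invented for illustration (the real filter uses 27 patterns and a 1,018-token lexicon), and `is_tanglish` and `min_hits` are hypothetical names.

```python
import re

# Illustrative stand-ins: NOT the actual 27 patterns or 1,018-token lexicon
TAMIL_LEXICON = {"romba", "iruku", "machan", "semma", "nalla", "enna", "dei"}
SUFFIX_PATTERNS = [
    re.compile(r"\w{3,}(ku|la)$"),      # case-like suffixes, e.g. "gymku"
    re.compile(r"\w{2,}(then|chu)$"),   # verb endings, e.g. "vanthen", "aachu"
    re.compile(r"^pann\w*$"),           # forms of "pannu" (to do)
]

def is_tanglish(sentence, min_hits=2):
    """Keep a sentence if enough tokens look like Romanised Tamil."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    hits = sum(
        1
        for tok in tokens
        if tok in TAMIL_LEXICON or any(p.search(tok) for p in SUFFIX_PATTERNS)
    )
    return hits >= min_hits

comments = [
    "bro office ku late ah vanthen romba tired ah iruku",
    "The weather is lovely today.",
]
clean = [c for c in comments if is_tanglish(c)]
print(clean)
```

Requiring at least two hits is one simple guard against English words that happen to match a suffix pattern; the actual threshold used for the corpus is not stated.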
Related Resources
- TanglishSTS Benchmark: vishnuexe/TanglishSTS (325-pair human-annotated STS evaluation benchmark)
- Training Corpus: vishnuexe/Tanglish-Corpus-185k (185,973-sentence Tanglish corpus used for training)
Stage 2: Pair Construction
| Source | Pairs | Method |
|---|---|---|
| Groq Llama 3.3 70B paraphrases | 45,062 | LLM generation from corpus |
| Manual (ChatGPT/Gemini/MetaAI) | 4,254 | Human curated |
| Hard negatives (mined) | 15,000 | Cosine similarity 0.45-0.70 |
| Targeted domain pairs | 3,416 | 8 domains × 15 scenarios |
| Concept cluster pairs | 17,068 | C(6,2) combinations per concept |
| Situation pairs | 16,138 | Same situation, different angle |
| Total | 86,438 | |
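The hard-negative row can be read as: keep pairs whose cosine similarity falls inside the 0.45-0.70 band, presumably close enough to be confusing without being paraphrases. A minimal sketch of that selection over toy vectors (which model produced the similarities used for mining is not stated here, so the embeddings and the `mine_hard_negatives` helper are assumptions):

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mine_hard_negatives(embeddings, low=0.45, high=0.70):
    """Return index pairs whose cosine similarity lies in [low, high]."""
    return [
        (i, j)
        for i, j in combinations(range(len(embeddings)), 2)
        if low <= cosine(embeddings[i], embeddings[j]) <= high
    ]

# Toy 2-d vectors standing in for sentence embeddings
embeddings = [
    [1.0, 0.0],   # anchor
    [1.0, 0.1],   # near-duplicate of the anchor, above the band
    [0.6, 0.8],   # moderately similar to the anchor, inside the band
    [0.0, 1.0],   # unrelated to the anchor, below the band
]
hard_negatives = mine_hard_negatives(embeddings)
print(hard_negatives)
```

The all-pairs loop is quadratic, so at corpus scale an approximate nearest-neighbour index would be the usual substitute.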
Novel contribution: Concept Cluster Training. Generated 6 maximally different Tanglish expressions of the same concept, then paired all C(6,2)=15 combinations. This directly addresses the lexical gap problem, where models fail when sentences mean the same thing but share zero vocabulary.
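The pairing step amounts to taking every unordered pair from a cluster of six expressions. A small sketch with placeholder labels (the real clusters contain Tanglish sentences, which are not reproduced here):

```python
from itertools import combinations
from math import comb

# Placeholder labels; in practice these would be six Tanglish phrasings of one concept
expressions = [f"expression_{i}" for i in range(1, 7)]

# Every unordered pair of distinct expressions becomes a positive training pair
positive_pairs = list(combinations(expressions, 2))

print(len(positive_pairs), comb(6, 2))
```

Because the count grows quadratically with cluster size, six expressions already yield fifteen positives per concept.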
Stage 3: Training
Loss: CachedMultipleNegativesRankingLoss + CoSENTLoss (multi-task)
Base: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
Epochs: 4
Batch size: 64
Warmup steps: 401
FP16: Yes
Hard negatives: 15,000 pairs (cosine 0.45-0.70)
CoSENTLoss pairs: 1,375 human-scored (0-5 scale)
TanglishSTS Dataset
We release TanglishSTS: 325 human-annotated sentence pairs for Romanised Tamil-English semantic similarity evaluation.
- Scored 0-5 by a native Tamil speaker
- Covers 4 similarity levels: identical (5), near-identical (4), related (2-3), unrelated (0-1)
- Domains: college, work, cricket, food, relationships, tech, social media, daily life
- First benchmark of its kind for code-mixed Tamil-English
Use it to evaluate your own models:
import json
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
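# Load one JSON object per line, skipping blank lines
with open("tanglish_sts.jsonl", encoding="utf-8") as f:
    pairs = [json.loads(line) for line in f if line.strip()]
model = SentenceTransformer("your-model")
human_scores = [p["human_score"] for p in pairs]
# Cosine similarity between the two sentences of each pair
model_scores = [
    cos_sim(model.encode([p["s1"]]), model.encode([p["s2"]]))[0][0].item()
    for p in pairs
]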
spearman = spearmanr(human_scores, model_scores)[0]
print(f"TanglishSTS Spearman: {spearman:.4f}")
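Spearman correlation compares rankings rather than raw values, which is why cosine similarities can be scored directly against 0-5 human labels. A dependency-free sketch of the computation (average ranks for ties, then Pearson correlation on the ranks), with made-up scores; scipy's `spearmanr` remains the practical choice:

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks

def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Made-up scores: human labels (0-5) and model cosine similarities
human = [5, 4, 2, 0, 3]
model_cosines = [0.91, 0.70, 0.35, 0.02, 0.55]
print(f"{spearman(human, model_cosines):.4f}")
```

Only the ordering matters here, so a model can score well even if its cosine values are compressed into a narrow range.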
Limitations
- Hinglish coverage is limited: trained primarily on Tanglish. Hinglish (Hindi-English) performance is untested. v2 will include L3Cube HingCorpus fine-tuning.
- Formal Tamil: not the target domain. Use IndicSBERT for native-script formal Tamil.
- Social media lexicon drift: internet slang evolves. The model may underperform on very recent slang not present in the training data.
- Short texts: optimised for 6-50 word sentences typical of social media. Performance on documents has not been evaluated.
Intended Use
Recommended for:
- Semantic search over Tanglish/Indian English text corpora
- RAG systems for Indian language chatbots
- Duplicate detection in Indian social media content
- Customer support ticket matching for Indian companies
- Resume and job description matching for Indian job portals
Not recommended for:
- Formal Tamil native script (use IndicSBERT)
- Pure English tasks (use all-MiniLM-L6-v2)
- Cross-lingual retrieval at scale (use BGE-M3)
About
Built by Vishnu N, 2nd year B.E. CSE student at Sri Sairam Institute of Technology, Chennai.
Previously built TamilSense โ fine-tuned MuRIL for Tamil/Tanglish sentiment analysis (94.7% accuracy, 65+ organic downloads).
This project fills a gap that neither academic NLP research nor industry has addressed: a production-grade sentence embedding model for how real Tamil speakers actually communicate online.
- HuggingFace: vishnuexe
- GitHub: vishnu3105
Live Demo
Try the model instantly, with no installation needed.
Enter any two Tanglish sentences and see the similarity score in real time.
Citation
@misc{vishnu2026morgantanglish,
title={Morgan-Tanglish-v7: Sentence Embeddings for Romanised Tamil-English Code-Mixed Text},
author={Vishnu N},
year={2026},
publisher={Zenodo},
doi={10.5281/zenodo.20792177},
url={https://doi.org/10.5281/zenodo.20792177},
note={Model: huggingface.co/vishnuexe/Morgan-Tanglish-v7}
}
Acknowledgements
Training data sources: DravidianCodeMix (Zenodo, CC BY 4.0), YouTube public comments, Reddit public posts. Paraphrase generation: Groq API (Llama 3.3 70B), ChatGPT, Gemini, Meta AI. Base model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 (Apache 2.0).




