Morgan-Tanglish-v7

The first sentence embedding model specifically designed for Romanised Indian code-mixed text.

Tanglish is Tamil and English mixed in Roman script. It is how 80+ million Tamil speakers communicate on WhatsApp, YouTube, Instagram, and Reddit every day.

Existing sentence transformers fail on this text because they were trained on formal multilingual corpora with no exposure to code-switching patterns. We built the model that understands how Indians actually type.

TanglishSTS Benchmark

We introduce TanglishSTS, the first human-annotated semantic similarity benchmark for Romanised Tamil-English code-mixed text. It contains 325 sentence pairs, each scored 0-5 by a native Tamil speaker and spanning 4 similarity levels.

Main Results (Spearman Correlation ↑)

Model	TanglishSTS	Nuanced STS	Domain Avg	Parameters
Morgan-Tanglish-v7 (ours)	0.8689	0.5451	0.7446	118M
L3Cube-IndicSBERT	0.7642	0.1514	0.5729	—
BAAI/bge-m3	0.7583	0.1204	0.6804	570M
all-MiniLM-L6-v2	0.7510	0.2062	0.6094	22.7M
paraphrase-multilingual-MiniLM-L12-v2	0.7279	0.2788	0.5930	118M

Key results:

+11.79 points over all-MiniLM-L6-v2 on TanglishSTS
+11.06 points over BAAI/bge-m3 (570M params) with only 118M params
+10.47 points over L3Cube-IndicSBERT (existing Indian model)
4.5x better nuanced similarity understanding than BAAI/bge-m3
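The gains in this list can be recomputed from the Main Results table. The sketch below does that arithmetic, treating one point as 0.01 of Spearman correlation. The dictionary layout and variable names are illustrative and are not part of the released code.

```python
# Spearman scores copied from the Main Results table: (TanglishSTS, Nuanced STS)
scores = {
    "Morgan-Tanglish-v7": (0.8689, 0.5451),
    "L3Cube-IndicSBERT": (0.7642, 0.1514),
    "BAAI/bge-m3": (0.7583, 0.1204),
    "all-MiniLM-L6-v2": (0.7510, 0.2062),
    "paraphrase-multilingual-MiniLM-L12-v2": (0.7279, 0.2788),
}

ours_sts, ours_nuanced = scores["Morgan-Tanglish-v7"]

# Gain in points = difference in Spearman correlation x 100
gains = {
    name: round((ours_sts - sts) * 100, 2)
    for name, (sts, _) in scores.items()
    if name != "Morgan-Tanglish-v7"
}

# Ratio of Nuanced STS scores against bge-m3
nuanced_ratio = round(ours_nuanced / scores["BAAI/bge-m3"][1], 1)

for name, gain in gains.items():
    print(f"+{gain:.2f} points over {name}")
print(f"{nuanced_ratio}x the Nuanced STS score of BAAI/bge-m3")
```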

Domain-Specific Performance

Domain	all-MiniLM	bge-m3	IndicSBERT	Morgan-tanglish-st-1
Gym/Fitness	0.667	0.660	0.478	0.782
Movies	0.631	0.810	0.660	0.861
College	0.732	0.773	0.591	0.866
Work/Salary	0.709	0.743	0.597	0.825
Tech (GPU/Phone)	0.628	0.728	0.635	0.701
Social Media	0.298	0.485	0.443	0.385
Food/Family	0.624	0.668	0.698	0.702
Cricket	0.635	0.657	0.462	0.781
Startups	0.602	0.614	0.545	0.750

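Our model leads in 7 of 9 domains. It trails bge-m3 on Tech (GPU/Phone) (0.701 vs 0.728) and trails both bge-m3 and IndicSBERT on Social Media (0.385 vs 0.485 and 0.443), where improvements are planned for v2.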
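A quick way to check the per-domain picture is to tally which model has the highest score in each row of the table above. The sketch below does that. The short column labels, including "ours" for the last column, are illustrative choices.

```python
# Per-domain Spearman scores copied from the table above.
# Column order: all-MiniLM, bge-m3, IndicSBERT, ours (last table column)
models = ["all-MiniLM", "bge-m3", "IndicSBERT", "ours"]
domains = {
    "Gym/Fitness": (0.667, 0.660, 0.478, 0.782),
    "Movies": (0.631, 0.810, 0.660, 0.861),
    "College": (0.732, 0.773, 0.591, 0.866),
    "Work/Salary": (0.709, 0.743, 0.597, 0.825),
    "Tech (GPU/Phone)": (0.628, 0.728, 0.635, 0.701),
    "Social Media": (0.298, 0.485, 0.443, 0.385),
    "Food/Family": (0.624, 0.668, 0.698, 0.702),
    "Cricket": (0.635, 0.657, 0.462, 0.781),
    "Startups": (0.602, 0.614, 0.545, 0.750),
}


def best_model(row):
    """Return the name of the model with the highest score in a row."""
    return models[max(range(len(row)), key=row.__getitem__)]


leaders = {domain: best_model(row) for domain, row in domains.items()}
wins = sum(1 for m in leaders.values() if m == "ours")
trailing = [d for d, m in leaders.items() if m != "ours"]

print(f"ours leads {wins} of {len(domains)} domains; trails on: {trailing}")
```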

Visualizations

Embedding Space (t-SNE)

Sentences cluster by semantic domain without any explicit domain labels — the model learns domain structure purely from meaning.

Benchmark Comparison

Ablation Study — Every Component's Contribution

Similarity Score Distribution

How well each model separates human similarity scores 0-5. Better models show clear separation between score levels.

Ranking Quality

The IDENTICAL-SIMILAR gap is a proxy for RAG retrieval accuracy: the larger the gap, the more reliably true matches rank above merely related documents.
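The card does not define how the gap is calculated. One plausible reading, assumed here, is the mean model similarity over pairs labelled IDENTICAL minus the mean over pairs labelled SIMILAR. The function name `level_gap` and the toy scores are invented for illustration and are not benchmark results.

```python
from statistics import mean


def level_gap(scored_pairs, upper="IDENTICAL", lower="SIMILAR"):
    """Mean model similarity of `upper` pairs minus that of `lower` pairs."""
    by_level = {}
    for level, score in scored_pairs:
        by_level.setdefault(level, []).append(score)
    return mean(by_level[upper]) - mean(by_level[lower])


# Toy (level, cosine similarity) values, invented for illustration only.
toy = [
    ("IDENTICAL", 0.92), ("IDENTICAL", 0.88),
    ("SIMILAR", 0.61), ("SIMILAR", 0.55),
]

print(f"gap = {level_gap(toy):.2f}")
```

Under this reading, a model that scores paraphrases and loosely related sentences almost the same would show a gap near zero, so a top-k cut-off could not tell them apart.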

Usage

Basic Sentence Similarity

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("vishnuexe/Morgan-Tanglish-v7")

sentences = [
    "bro office ku late ah vanthen romba tired ah iruku",
    "machan office reach aaga late aachu semma tiredness ah iruku",
    "nalla movie da climax twist semma mass irundhuchu",
]

embeddings = model.encode(sentences)

# embeddings[0] and embeddings[1] → HIGH similarity (same meaning)
# embeddings[0] and embeddings[2] → LOW similarity (different topic)
print(cos_sim(embeddings[0], embeddings[1]))  # ~0.78
print(cos_sim(embeddings[0], embeddings[2]))  # ~0.03

Semantic Search / RAG

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search

model = SentenceTransformer("vishnuexe/Morgan-Tanglish-v7")

# knowledge base in Tanglish
corpus = [
    "office la overtime pannanum nu manager sonna",
    "salary hike this quarter expect panna vendam nu HR sonnanga",
    "work from home policy change aagudhu next month",
    "team outing next friday plan pannirukanga",
    "new project kick off monday la iruku prepare pannikonga",
]

query = "boss enna extra work kudutharu da"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

results = semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in results[0]:
    print(f"Score: {hit['score']:.4f} | {corpus[hit['corpus_id']]}")

Batch Encoding

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("vishnuexe/Morgan-Tanglish-v7")

sentences = [
    "dei gymku poganum but lazy ah iruku",
    "machan exam nerungudhu prepare pannala tension",
    "bro salary vandhuchu semma happy ah iruku",
    "amma cooking taste yarum beat pannala da",
]

# Encode all at once — efficient batch processing
embeddings = model.encode(sentences, batch_size=32, show_progress_bar=True)
print(embeddings.shape)  # (4, 384)

Duplicate Detection

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("vishnuexe/Morgan-Tanglish-v7")

def are_duplicates(text1, text2, threshold=0.75):
    emb1 = model.encode([text1])
    emb2 = model.encode([text2])
    similarity = cos_sim(emb1, emb2)[0][0].item()
    return similarity > threshold, similarity

# same complaint, different wording
t1 = "internet speed romba slow ah iruku work pannala"
t2 = "wifi connection mosam ah iruku productivity poiduchu"
is_dup, score = are_duplicates(t1, t2)
print(f"Duplicate: {is_dup} | Score: {score:.4f}")  # True | ~0.76

Training

Data Pipeline

Stage 1 — Corpus Construction

Built the largest Romanised Tanglish corpus from scratch:

YouTube comments from 11 domain-diverse channels (cooking, cricket, tech, comedy, news)
Reddit (r/TamilNaduDiscussion)
DravidianCodeMix (Zenodo)
Raw: 1,806,024 comments → Filtered: 185,973 clean sentences
Custom morphological filter (27 patterns, 1,018-token Tamil lexicon) — no langdetect
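The morphological filter itself is not published in this card. The sketch below only illustrates the general idea of combining a lexicon with suffix patterns. The seven-word lexicon, the two regex patterns, and the `min_markers` threshold are all hypothetical stand-ins, not the actual 27 patterns or 1,018-token lexicon.

```python
import re

# Hypothetical stand-ins: the real filter uses 27 patterns and a
# 1,018-token lexicon, neither of which is reproduced here.
TAMIL_LEXICON = {"romba", "semma", "machan", "iruku", "enna", "nalla", "dei"}
SUFFIX_PATTERNS = [
    re.compile(r"\w+(?:num|nga|chu|dhu)$"),  # e.g. poganum, sonnanga, aachu
    re.compile(r"^pann\w*"),                 # forms of "pannu" (to do)
]


def tamil_marker_count(sentence):
    """Count tokens that hit the lexicon or match a pattern."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return sum(
        1 for tok in tokens
        if tok in TAMIL_LEXICON or any(p.search(tok) for p in SUFFIX_PATTERNS)
    )


def is_tanglish(sentence, min_markers=2):
    """Keep a sentence only if it has enough Tamil markers (assumed rule)."""
    return tamil_marker_count(sentence) >= min_markers


comments = [
    "bro office ku late ah vanthen romba tired ah iruku",
    "this video is really good, thanks for sharing",
]
kept = [c for c in comments if is_tanglish(c)]
print(kept)
```

A rule-based check like this works at the token level, which may be why it was preferred over langdetect for short Romanised text, though the card does not state the reason.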

Related Resources

TanglishSTS Benchmark: vishnuexe/TanglishSTS — 325-pair human-annotated STS evaluation benchmark
Training Corpus: vishnuexe/Tanglish-Corpus-185k — 185,973 sentence Tanglish corpus used for training

Stage 2 — Pair Construction

Source	Pairs	Method
Groq Llama 3.3 70B paraphrases	45,062	LLM generation from corpus
Manual (ChatGPT/Gemini/MetaAI)	4,254	Human curated
Hard negatives (mined)	15,000	Cosine similarity 0.45-0.70
Targeted domain pairs	3,416	8 domains × 15 scenarios
Concept cluster pairs	17,068	C(6,2) combinations per concept
Situation pairs	16,138	Same situation, different angle
Total	86,438

Novel contribution — Concept Cluster Training: Generated 6 maximally different Tanglish expressions of the same concept, then paired all C(6,2)=15 combinations. This directly addresses the lexical gap problem — where models fail when sentences mean the same thing but share zero vocabulary.
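The pairing step can be sketched with `itertools.combinations`, which yields every unordered pair from a cluster. The helper name `cluster_pairs` and the placeholder strings are illustrative. In practice each cluster would hold six real Tanglish phrasings of one concept.

```python
from itertools import combinations


def cluster_pairs(expressions):
    """Return every unordered pair of expressions in one concept cluster."""
    return list(combinations(expressions, 2))


# Placeholder cluster: six Tanglish phrasings of one concept would go here.
cluster = [f"expression_{i}" for i in range(1, 7)]
pairs = cluster_pairs(cluster)

print(len(pairs))  # 15, i.e. C(6,2)
```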

Stage 3 — Training

Loss: CachedMultipleNegativesRankingLoss + CoSENTLoss (multi-task)

Base: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

Epochs: 4

Batch size: 64

Warmup steps: 401

FP16: Yes

Hard negatives: 15,000 pairs (cosine 0.45-0.70)

CoSENTLoss pairs: 1,375 human-scored (0-5 scale)
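The card gives the cosine band for the hard negatives (0.45-0.70) but not the mining procedure. The sketch below shows one way such a band filter could work. It assumes the candidates are already known not to be paraphrases of the anchor, and it uses toy 2-D vectors in place of real sentence embeddings. The names `cosine` and `mine_hard_negatives` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mine_hard_negatives(anchor, candidates, low=0.45, high=0.70):
    """Return (index, similarity) for candidates inside the [low, high] band."""
    hits = []
    for i, cand in enumerate(candidates):
        sim = cosine(anchor, cand)
        if low <= sim <= high:
            hits.append((i, sim))
    return hits


# Toy 2-D vectors standing in for sentence embeddings.
anchor = [1.0, 0.0]
candidates = [
    [1.0, 0.1],  # almost the same direction: too similar to be a safe negative
    [0.6, 0.8],  # cosine 0.6: inside the band
    [0.0, 1.0],  # orthogonal: too easy to be a useful negative
]

print(mine_hard_negatives(anchor, candidates))
```

The upper bound keeps likely paraphrases out of the negative set, while the lower bound drops negatives the model already separates easily.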

TanglishSTS Dataset

We release TanglishSTS — 325 human-annotated sentence pairs for Romanised Tamil-English semantic similarity evaluation.

Scored 0-5 by a native Tamil speaker
Covers 4 similarity levels: identical (5), near-identical (4), related (2-3), unrelated (0-1)
Domains: college, work, cricket, food, relationships, tech, social media, daily life
First benchmark of its kind for code-mixed Tamil-English
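The score-to-level mapping in the list above can be written as a small helper, for example to group pairs by level before analysing model scores. The function name is hypothetical. The list gives whole-number scores only, so the handling of fractional scores here (4.5 counts as near-identical, 1.5 as unrelated) is an assumption.

```python
def similarity_level(score):
    """Map a 0-5 human score to one of the four TanglishSTS levels."""
    if not 0 <= score <= 5:
        raise ValueError(f"score must be in [0, 5], got {score}")
    if score == 5:
        return "identical"
    if score >= 4:
        return "near-identical"
    if score >= 2:
        return "related"
    return "unrelated"  # scores 0-1; fractional values below 2 land here too


for s in range(6):
    print(s, similarity_level(s))
```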

Use it to evaluate your own models:

import json
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

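# Read one JSON object per line; the context manager closes the file afterwards
with open("tanglish_sts.jsonl", encoding="utf-8") as f:
    pairs = [json.loads(line) for line in f if line.strip()]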
model = SentenceTransformer("your-model")

human_scores = [p["human_score"] for p in pairs]
model_scores = [
    cos_sim(model.encode([p["s1"]]), model.encode([p["s2"]]))[0][0].item()
    for p in pairs
]

spearman = spearmanr(human_scores, model_scores)[0]
print(f"TanglishSTS Spearman: {spearman:.4f}")

Limitations

Hinglish coverage is limited — trained primarily on Tanglish. Hinglish (Hindi-English) performance untested. v2 will include L3Cube HingCorpus fine-tuning.
Formal Tamil — not the target domain. Use IndicSBERT for native-script formal Tamil.
Social media lexicon drift — internet slang evolves. Model may underperform on very recent slang not present in training data.
Short texts — optimised for 6-50 word sentences typical of social media. Performance on documents not evaluated.

Intended Use

Recommended for:

Semantic search over Tanglish/Indian English text corpora
RAG systems for Indian language chatbots
Duplicate detection in Indian social media content
Customer support ticket matching for Indian companies
Resume and job description matching for Indian job portals

Not recommended for:

Formal Tamil native script (use IndicSBERT)
Pure English tasks (use all-MiniLM-L6-v2)
Cross-lingual retrieval at scale (use BGE-M3)

About

Built by Vishnu N, 2nd year B.E. CSE student at Sri Sairam Institute of Technology, Chennai.

Previously built TamilSense — fine-tuned MuRIL for Tamil/Tanglish sentiment analysis (94.7% accuracy, 65+ organic downloads).

This project fills a gap that neither academic NLP research nor industry has addressed: a production-grade sentence embedding model for how real Tamil speakers actually communicate online.

HuggingFace: vishnuexe
GitHub: vishnu3105

Live Demo

Try the model instantly — no installation needed:

Morgan-Tanglish-Demo

Enter any two Tanglish sentences and see the similarity score in real time.

Citation

@misc{vishnu2026morgantanglish,
  title={Morgan-Tanglish-v7: Sentence Embeddings for Romanised Tamil-English Code-Mixed Text},
  author={Vishnu N},
  year={2026},
  publisher={Zenodo},
  doi={10.5281/zenodo.20792177},
  url={https://doi.org/10.5281/zenodo.20792177},
  note={Model: huggingface.co/vishnuexe/Morgan-Tanglish-v7}
}

Acknowledgements

Training data sources: DravidianCodeMix (Zenodo, CC BY 4.0), YouTube public comments, Reddit public posts. Paraphrase generation: Groq API (Llama 3.3 70B), ChatGPT, Gemini, Meta AI. Base model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 (Apache 2.0).
