Morgan-Tanglish-v7

The first sentence embedding model specifically designed for Romanised Indian code-mixed text.

This is Tanglish: Tamil and English mixed in Roman script. It's how 80+ million Tamil speakers communicate on WhatsApp, YouTube, Instagram, and Reddit every day.

Existing sentence transformers fail on this text because they were trained on formal multilingual corpora with no exposure to code-switching patterns. We built the model that understands how Indians actually type.


TanglishSTS Benchmark

We introduce TanglishSTS, the first human-annotated semantic similarity benchmark for Romanised Tamil-English code-mixed text. It contains 325 sentence pairs scored 0-5 by a native Tamil speaker across 4 similarity levels.

Main Results (Spearman Correlation ↑)

| Model | TanglishSTS | Nuanced STS | Domain Avg | Parameters |
| --- | --- | --- | --- | --- |
| Morgan-Tanglish-v7 (ours) | 0.8689 | 0.5451 | 0.7446 | 118M |
| L3Cube-IndicSBERT | 0.7642 | 0.1514 | 0.5729 | not listed |
| BAAI/bge-m3 | 0.7583 | 0.1204 | 0.6804 | 570M |
| all-MiniLM-L6-v2 | 0.7510 | 0.2062 | 0.6094 | 22.7M |
| paraphrase-multilingual-MiniLM-L12-v2 | 0.7279 | 0.2788 | 0.5930 | 118M |

Key results:

  • +11.79 points over all-MiniLM-L6-v2 on TanglishSTS
  • +11.06 points over BAAI/bge-m3 (570M params) with only 118M params
  • +10.47 points over L3Cube-IndicSBERT (existing Indian model)
  • 4.5x better nuanced similarity understanding than BAAI/bge-m3
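The gaps can be recomputed from the results table. The following is a minimal sketch of that arithmetic, with the scores copied from the table; the variable names are illustrative only.

```python
# TanglishSTS Spearman scores copied from the results table.
tanglish_sts = {
    "Morgan-Tanglish-v7": 0.8689,
    "L3Cube-IndicSBERT": 0.7642,
    "BAAI/bge-m3": 0.7583,
    "all-MiniLM-L6-v2": 0.7510,
}
# Nuanced STS scores used for the ratio against bge-m3.
nuanced_sts = {"Morgan-Tanglish-v7": 0.5451, "BAAI/bge-m3": 0.1204}

ours = tanglish_sts["Morgan-Tanglish-v7"]

# Gap in points (Spearman x 100) between this model and each baseline.
gains = {
    name: round((ours - score) * 100, 2)
    for name, score in tanglish_sts.items()
    if name != "Morgan-Tanglish-v7"
}
nuanced_ratio = round(
    nuanced_sts["Morgan-Tanglish-v7"] / nuanced_sts["BAAI/bge-m3"], 1
)

for name, gain in gains.items():
    print(f"+{gain:.2f} points over {name}")
print(f"{nuanced_ratio}x the Nuanced STS score of BAAI/bge-m3")
```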

Domain-Specific Performance

| Domain | all-MiniLM | bge-m3 | IndicSBERT | Morgan-Tanglish-v7 |
| --- | --- | --- | --- | --- |
| Gym/Fitness | 0.667 | 0.660 | 0.478 | 0.782 |
| Movies | 0.631 | 0.810 | 0.660 | 0.861 |
| College | 0.732 | 0.773 | 0.591 | 0.866 |
| Work/Salary | 0.709 | 0.743 | 0.597 | 0.825 |
| Tech (GPU/Phone) | 0.628 | 0.728 | 0.635 | 0.701 |
| Social Media | 0.298 | 0.485 | 0.443 | 0.385 |
| Food/Family | 0.624 | 0.668 | 0.698 | 0.702 |
| Cricket | 0.635 | 0.657 | 0.462 | 0.781 |
| Startups | 0.602 | 0.614 | 0.545 | 0.750 |

Morgan-Tanglish-v7 leads in 7 of 9 domains. bge-m3 scores higher on Tech (GPU/Phone) and Social Media; improving social media performance is planned for v2.
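The per-domain leaders can be checked mechanically. This is a small sketch that copies the scores from the domain table and picks the top model in each row; the dictionary layout and names are assumptions made for illustration.

```python
# Column order of the domain table; the last column is this model.
MODELS = ("all-MiniLM", "bge-m3", "IndicSBERT", "Morgan-Tanglish-v7")

# Per-domain Spearman scores copied from the domain table.
DOMAIN_SCORES = {
    "Gym/Fitness": (0.667, 0.660, 0.478, 0.782),
    "Movies": (0.631, 0.810, 0.660, 0.861),
    "College": (0.732, 0.773, 0.591, 0.866),
    "Work/Salary": (0.709, 0.743, 0.597, 0.825),
    "Tech (GPU/Phone)": (0.628, 0.728, 0.635, 0.701),
    "Social Media": (0.298, 0.485, 0.443, 0.385),
    "Food/Family": (0.624, 0.668, 0.698, 0.702),
    "Cricket": (0.635, 0.657, 0.462, 0.781),
    "Startups": (0.602, 0.614, 0.545, 0.750),
}

# Highest-scoring model for each domain.
leaders = {
    domain: MODELS[scores.index(max(scores))]
    for domain, scores in DOMAIN_SCORES.items()
}
wins = sum(1 for model in leaders.values() if model == "Morgan-Tanglish-v7")

for domain, model in leaders.items():
    print(f"{domain}: {model}")
print(f"Morgan-Tanglish-v7 leads in {wins} of {len(DOMAIN_SCORES)} domains")
```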

Visualizations

Embedding Space (t-SNE)

Sentences cluster by semantic domain without any explicit domain labels: the model learns domain structure purely from meaning.

[Figure: t-SNE embedding space]

Benchmark Comparison

[Figure: benchmark comparison]

Ablation Study: Every Component's Contribution

[Figure: ablation study]

Similarity Score Distribution

This plot shows how well each model separates pairs at each human similarity score (0-5). Better models show clear separation between score levels.

[Figure: score distribution]
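One way to quantify this separation without a plot is to average a model's cosine scores within each similarity level. The sketch below assumes the four levels described in the TanglishSTS Dataset section (identical 5, near-identical 4, related 2-3, unrelated 0-1); the helper names and the toy scores are hypothetical and are not benchmark output.

```python
from collections import defaultdict
from statistics import mean


def level_of(human_score):
    """Map a 0-5 human score to one of the four similarity levels."""
    if human_score >= 5:
        return "identical"
    if human_score >= 4:
        return "near-identical"
    if human_score >= 2:
        return "related"
    return "unrelated"


def mean_cosine_by_level(human_scores, model_scores):
    """Average the model's cosine scores within each similarity level."""
    buckets = defaultdict(list)
    for human, cosine in zip(human_scores, model_scores):
        buckets[level_of(human)].append(cosine)
    return {level: mean(values) for level, values in buckets.items()}


# Toy values for illustration only (not real benchmark results).
toy_human = [5, 5, 4, 3, 2, 1, 0]
toy_cosine = [0.92, 0.88, 0.74, 0.55, 0.45, 0.15, 0.05]

level_means = mean_cosine_by_level(toy_human, toy_cosine)
for level, value in level_means.items():
    print(f"{level}: {value:.2f}")
```

A model that separates levels well should show these means decreasing steadily from identical to unrelated.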

Ranking Quality

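The gap between the scores a model assigns to identical pairs and to merely similar pairs drives RAG retrieval accuracy: the larger the gap, the more reliably the best-matching document ranks first.

[Figure: ranking quality]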
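As a rough sketch, the gap can be computed as the difference between mean cosine scores for two groups of human labels. Which labels count as "identical" and "similar" is not specified here, so the defaults below (5 versus 4) are an assumption, and the function name and toy values are hypothetical.

```python
from statistics import mean


def level_gap(human_scores, model_scores, upper=(5,), lower=(4,)):
    """Mean cosine for `upper` labels minus mean cosine for `lower` labels.

    Assumption: "identical" means a human score of 5 and "similar" means 4.
    Pass different tuples if the levels are defined differently.
    """
    upper_values = [m for h, m in zip(human_scores, model_scores) if h in upper]
    lower_values = [m for h, m in zip(human_scores, model_scores) if h in lower]
    if not upper_values or not lower_values:
        raise ValueError("both label groups need at least one pair")
    return mean(upper_values) - mean(lower_values)


# Toy values for illustration only (not real benchmark results).
toy_human = [5, 5, 4, 4]
toy_cosine = [0.90, 0.86, 0.70, 0.66]

gap = level_gap(toy_human, toy_cosine)
print(f"identical-similar gap: {gap:.2f}")
```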


Usage

Basic Sentence Similarity

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("vishnuexe/Morgan-Tanglish-v7")

sentences = [
    "bro office ku late ah vanthen romba tired ah iruku",
    "machan office reach aaga late aachu semma tiredness ah iruku",
    "nalla movie da climax twist semma mass irundhuchu",
]

embeddings = model.encode(sentences)

# embeddings[0] and embeddings[1] -> HIGH similarity (same meaning)
# embeddings[0] and embeddings[2] -> LOW similarity (different topic)
print(cos_sim(embeddings[0], embeddings[1]))  # ~0.78
print(cos_sim(embeddings[0], embeddings[2]))  # ~0.03

Semantic Search / RAG

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search

model = SentenceTransformer("vishnuexe/Morgan-Tanglish-v7")

# knowledge base in Tanglish
corpus = [
    "office la overtime pannanum nu manager sonna",
    "salary hike this quarter expect panna vendam nu HR sonnanga",
    "work from home policy change aagudhu next month",
    "team outing next friday plan pannirukanga",
    "new project kick off monday la iruku prepare pannikonga",
]

query = "boss enna extra work kudutharu da"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

results = semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in results[0]:
    print(f"Score: {hit['score']:.4f} | {corpus[hit['corpus_id']]}")

Batch Encoding

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("vishnuexe/Morgan-Tanglish-v7")

sentences = [
    "dei gymku poganum but lazy ah iruku",
    "machan exam nerungudhu prepare pannala tension",
    "bro salary vandhuchu semma happy ah iruku",
    "amma cooking taste yarum beat pannala da",
]

# Encode all at once for efficient batch processing
embeddings = model.encode(sentences, batch_size=32, show_progress_bar=True)
print(embeddings.shape)  # (4, 384)

Duplicate Detection

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("vishnuexe/Morgan-Tanglish-v7")

def are_duplicates(text1, text2, threshold=0.75):
    emb1 = model.encode([text1])
    emb2 = model.encode([text2])
    similarity = cos_sim(emb1, emb2)[0][0].item()
    return similarity > threshold, similarity

# same complaint, different wording
t1 = "internet speed romba slow ah iruku work pannala"
t2 = "wifi connection mosam ah iruku productivity poiduchu"
is_dup, score = are_duplicates(t1, t2)
print(f"Duplicate: {is_dup} | Score: {score:.4f}")  # True | ~0.76

Training

Data Pipeline

Stage 1: Corpus Construction

Built the largest Romanised Tanglish corpus from scratch:

  • YouTube comments from 11 domain-diverse channels (cooking, cricket, tech, comedy, news)
  • Reddit (r/TamilNaduDiscussion)
  • DravidianCodeMix (Zenodo)
  • Raw: 1,806,024 comments → Filtered: 185,973 clean sentences
  • Custom morphological filter (27 patterns, 1,018-token Tamil lexicon), with no use of langdetect
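The filter itself is not reproduced here, so the following is only a minimal sketch of how a lexicon-plus-pattern filter of this kind might work. The lexicon entries, the single suffix pattern, and the `min_signals` threshold are hypothetical stand-ins for the real 1,018-token lexicon and 27 patterns.

```python
import re

# Hypothetical mini-lexicon; the real filter uses a 1,018-token Tamil lexicon.
TAMIL_LEXICON = {"romba", "semma", "iruku", "machan", "enna", "vendam", "poganum"}

# Hypothetical suffix pattern; the real filter uses 27 morphological patterns.
SUFFIX_PATTERNS = [re.compile(r"[a-z]{3,}(?:nga|chu|num|dhu)")]


def tamil_signal_count(sentence):
    """Count tokens that hit the lexicon or fully match a suffix pattern."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    count = 0
    for token in tokens:
        if token in TAMIL_LEXICON:
            count += 1
        elif any(pattern.fullmatch(token) for pattern in SUFFIX_PATTERNS):
            count += 1
    return count


def looks_tanglish(sentence, min_signals=2):
    """Keep a sentence only if it shows enough Romanised Tamil signals."""
    return tamil_signal_count(sentence) >= min_signals


samples = [
    "bro salary vandhuchu semma happy ah iruku",
    "the new phone has a great camera",
]
kept = [s for s in samples if looks_tanglish(s)]
print(kept)
```

Requiring more than one signal is one way to avoid keeping pure-English sentences that happen to contain a single matching token.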

Stage 2: Pair Construction

| Source | Pairs | Method |
| --- | --- | --- |
| Groq Llama 3.3 70B paraphrases | 45,062 | LLM generation from corpus |
| Manual (ChatGPT/Gemini/MetaAI) | 4,254 | Human curated |
| Hard negatives (mined) | 15,000 | Cosine similarity 0.45-0.70 |
| Targeted domain pairs | 3,416 | 8 domains × 15 scenarios |
| Concept cluster pairs | 17,068 | C(6,2) combinations per concept |
| Situation pairs | 16,138 | Same situation, different angle |
| Total | 86,438 | |

Novel contribution: Concept Cluster Training. We generated 6 maximally different Tanglish expressions of the same concept, then paired all C(6,2) = 15 combinations. This directly addresses the lexical gap problem, where models fail when sentences mean the same thing but share zero vocabulary.
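The pairing step can be sketched with the standard library. This is an illustration, not the project's actual pipeline: the function name is hypothetical, and the six strings are placeholders standing in for real Tanglish expressions of one concept.

```python
from itertools import combinations


def cluster_pairs(expressions):
    """Return every unordered pair of expressions within one concept cluster."""
    return list(combinations(expressions, 2))


# Placeholders for six different Tanglish phrasings of the same concept.
cluster = [f"<expression {i}>" for i in range(1, 7)]

pairs = cluster_pairs(cluster)
print(len(pairs))
for first, second in pairs[:3]:
    print(first, "|", second)
```

Because every pair in a cluster expresses the same meaning, each one can serve as a positive example even when the two sentences share no words.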

Stage 3: Training

  • Loss: CachedMultipleNegativesRankingLoss + CoSENTLoss (multi-task)
  • Base: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  • Epochs: 4
  • Batch size: 64
  • Warmup steps: 401
  • FP16: Yes
  • Hard negatives: 15,000 pairs (cosine 0.45-0.70)
  • CoSENTLoss pairs: 1,375 human-scored (0-5 scale)
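Hard-negative mining by cosine band can be sketched as follows. This is a minimal pure-Python illustration under assumptions: the function names are hypothetical, the toy vectors stand in for real sentence embeddings (for example from `model.encode`), and only the 0.45-0.70 band itself comes from the settings above. How candidate pairs were confirmed to be non-paraphrases is not described, so that step is omitted.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_hard_negatives(embeddings, low=0.45, high=0.70):
    """Return (i, j, score) for pairs whose cosine falls inside [low, high]."""
    hits = []
    for (i, u), (j, v) in combinations(enumerate(embeddings), 2):
        score = cosine(u, v)
        if low <= score <= high:
            hits.append((i, j, score))
    return hits


# Toy 2-D vectors standing in for sentence embeddings.
toy_embeddings = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]

hard_negatives = mine_hard_negatives(toy_embeddings)
for i, j, score in hard_negatives:
    print(f"pair ({i}, {j}) cosine={score:.2f}")
```

The band keeps pairs that look related to the encoder but fall short of a near-duplicate score, which is what makes them informative as negatives. On a full corpus the all-pairs loop would be too slow and a vectorised or approximate search would be needed.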


TanglishSTS Dataset

We release TanglishSTS: 325 human-annotated sentence pairs for Romanised Tamil-English semantic similarity evaluation.

  • Scored 0-5 by a native Tamil speaker
  • Covers 4 similarity levels: identical (5), near-identical (4), related (2-3), unrelated (0-1)
  • Domains: college, work, cricket, food, relationships, tech, social media, daily life
  • First benchmark of its kind for code-mixed Tamil-English

Use it to evaluate your own models:

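import json

from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import pairwise_cos_sim

# Each JSONL line holds one pair with keys "s1", "s2", and "human_score".
with open("tanglish_sts.jsonl", encoding="utf-8") as f:
    pairs = [json.loads(line) for line in f if line.strip()]

model = SentenceTransformer("your-model")

# Encode each side in one batch instead of one call per pair.
emb1 = model.encode([p["s1"] for p in pairs], convert_to_tensor=True)
emb2 = model.encode([p["s2"] for p in pairs], convert_to_tensor=True)

human_scores = [p["human_score"] for p in pairs]
# Cosine similarity between the i-th sentence of each side.
model_scores = pairwise_cos_sim(emb1, emb2).cpu().tolist()

spearman = spearmanr(human_scores, model_scores)[0]
print(f"TanglishSTS Spearman: {spearman:.4f}")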

Limitations

  • Hinglish coverage is limited: the model was trained primarily on Tanglish, and Hinglish (Hindi-English) performance is untested. v2 will include L3Cube HingCorpus fine-tuning.
  • Formal Tamil: not the target domain. Use IndicSBERT for native-script formal Tamil.
  • Social media lexicon drift: internet slang evolves. The model may underperform on very recent slang not present in the training data.
  • Short texts: optimised for the 6-50 word sentences typical of social media. Performance on longer documents has not been evaluated.

Intended Use

Recommended for:

  • Semantic search over Tanglish/Indian English text corpora
  • RAG systems for Indian language chatbots
  • Duplicate detection in Indian social media content
  • Customer support ticket matching for Indian companies
  • Resume and job description matching for Indian job portals

Not recommended for:

  • Formal Tamil native script (use IndicSBERT)
  • Pure English tasks (use all-MiniLM-L6-v2)
  • Cross-lingual retrieval at scale (use BGE-M3)

About

Built by Vishnu N, a second-year B.E. CSE student at Sri Sairam Institute of Technology, Chennai.

Previously built TamilSense, a fine-tuned MuRIL model for Tamil/Tanglish sentiment analysis (94.7% accuracy, 65+ organic downloads).

This project fills a gap that neither academic NLP research nor industry has addressed: a production-grade sentence embedding model for how real Tamil speakers actually communicate online.

Live Demo

Try the model instantly, with no installation needed:

Morgan-Tanglish-Demo

Enter any two Tanglish sentences and see the similarity score in real time.


Citation

@misc{vishnu2026morgantanglish,
  title={Morgan-Tanglish-v7: Sentence Embeddings for Romanised Tamil-English Code-Mixed Text},
  author={Vishnu N},
  year={2026},
  publisher={Zenodo},
  doi={10.5281/zenodo.20792177},
  url={https://doi.org/10.5281/zenodo.20792177},
  note={Model: huggingface.co/vishnuexe/Morgan-Tanglish-v7}
}

Acknowledgements

Training data sources: DravidianCodeMix (Zenodo, CC BY 4.0), YouTube public comments, Reddit public posts. Paraphrase generation: Groq API (Llama 3.3 70B), ChatGPT, Gemini, Meta AI. Base model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 (Apache 2.0).
