NE-Embed

Multilingual Text Embeddings for Northeast India

10 languages · 768 dimensions · CC-BY-4.0

by MWire Labs · Built on LaBSE


What is NE-Embed?

NE-Embed is a multilingual text embedding model purpose-built for Northeast Indian languages. It enables semantic search, retrieval, and RAG across languages that general multilingual models handle poorly, such as Garo, Meitei, Nyishi, and Pnar: raw LaBSE scores under 15% R@1 on the first three and 27.2% on Pnar.

The model was trained on 201,738 parallel pairs across 10 languages, balanced by a cap of 25,000 pairs per language, using bi-encoder fine-tuning with MultipleNegativesRankingLoss.


Supported Languages

| Code | Language | Script       | Tier         | Training Pairs |
|------|----------|--------------|--------------|----------------|
| asm  | Assamese | Bengali      | ✅ Supported | 25,000         |
| brx  | Bodo     | Devanagari   | ✅ Supported | 25,000         |
| grt  | Garo     | Latin        | ✅ Supported | 25,000         |
| kha  | Khasi    | Latin        | ✅ Supported | 25,000         |
| lus  | Mizo     | Latin        | ✅ Supported | 25,000         |
| mni  | Meitei   | Meitei Mayek | ✅ Supported | 25,000         |
| njz  | Nyishi   | Latin        | ✅ Supported | 25,000         |
| trp  | Kokborok | Latin        | ⚠️ Limited   | 12,545         |
| pbv  | Pnar     | Latin        | ⚠️ Limited   | 6,034          |
| nag  | Nagamese | Latin        | ⚠️ Limited   | 1,996          |

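Supported = strong retrieval performance. Limited = the model has coverage but quality is lower; use with caution in production. Check the per-language results under Performance before deploying: Meitei, for example, is in the Supported tier but reaches only 34.2% R@1.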


Performance

Evaluated on 500 samples per language. R@1 = Recall@1, in percent (higher is better). CLRI = Cross-Language Retrieval Interference (lower is better).

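| Language | R@1 % (Base) | R@1 % (NE-Embed) | CLRI % (Base) | CLRI % (NE-Embed) |
|----------|--------------|------------------|---------------|-------------------|
| Assamese | 95.6         | 97.4             | 1.8           | 4.6               |
| Bodo     | 55.8         | 99.8             | 61.0          | 3.0               |
| Garo     | 13.2         | 90.8             | 88.8          | 3.0               |
| Khasi    | 28.6         | 95.6             | 65.0          | 3.4               |
| Mizo     | 46.6         | 91.8             | 58.4          | 9.4               |
| Meitei   | 13.6         | 34.2             | 90.8          | 19.8              |
| Nyishi   | 10.2         | 75.0             | 71.0          | 17.4              |
| Pnar     | 27.2         | 86.2             | 79.6          | 8.0               |
| Kokborok | 26.4         | 71.6             | 63.8          | 11.8              |
| Nagamese | 77.0         | 88.0             | 17.8          | 8.4               |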

Base = raw LaBSE zero-shot. Fine-tuning lowers CLRI for every language except Assamese, where it rises from 1.8% to 4.6%. The reductions represent genuine cross-lingual confusion fixed by fine-tuning.
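For readers who want to reproduce the R@1 column on their own data, the metric can be sketched as below. This is a minimal illustration, not the evaluation code used for this card: it assumes query i has exactly one correct candidate at index i, and that all vectors are L2-normalized so the dot product equals cosine similarity. The function name `recall_at_1` and the toy vectors are hypothetical. CLRI is not reproduced here because this card does not give its formula.

```python
def recall_at_1(query_vecs, candidate_vecs):
    """Percentage of queries whose most similar candidate sits at the same index."""
    hits = 0
    for i, query in enumerate(query_vecs):
        # Dot product equals cosine similarity for L2-normalized vectors.
        scores = [sum(q * c for q, c in zip(query, cand)) for cand in candidate_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == i)
    return 100.0 * hits / len(query_vecs)


# Toy 2-d unit vectors standing in for real 768-d embeddings.
queries = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
r_at_1 = recall_at_1(queries, candidates)
```

With real data, `query_vecs` and `candidate_vecs` would be the outputs of `model.encode(..., normalize_embeddings=True)` for the two sides of a parallel test set.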


Quick Start

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("MWirelabs/ne-embed")

sentences = [
    "Where is the nearest hospital?",                          # English
    "Ngi la pynjot ïa ki shnong baroh bad ngi la pynïap",     # Khasi
    "Pilakchin an·senganiko man·na am·tokenga.",               # Garo
]

embeddings = model.encode(sentences, normalize_embeddings=True)
similarities = model.similarity(embeddings, embeddings)
print(similarities)
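For search rather than a full similarity matrix, the same normalized embeddings can be ranked against a single query. The snippet below is a minimal sketch under that assumption; `top_k` is a hypothetical helper, not part of sentence-transformers, and the toy vectors stand in for real `model.encode(...)` output.

```python
def top_k(query_vec, corpus_vecs, k=3):
    """Return (index, score) pairs for the k corpus vectors most similar to the query."""
    # With normalize_embeddings=True, the dot product is the cosine similarity.
    scores = [sum(q * c for q, c in zip(query_vec, vec)) for vec in corpus_vecs]
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy 2-d unit vectors standing in for real 768-d embeddings.
corpus_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query_vec = [0.8, 0.6]
results = top_k(query_vec, corpus_vecs, k=2)
```

In practice, `corpus_vecs` would be computed once with `model.encode(documents, normalize_embeddings=True)` and reused across queries. For large corpora a vector index is likely a better fit than this linear scan.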

Recommended for RAG / Hybrid Retrieval

# Hybrid: NE-Embed dense + BM25 char 3-gram sparse
score = 0.7 * ne_embed_score + 0.3 * bm25_score
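The one-line formula above leaves the two scores undefined. One way to fill them in is sketched here, using only the standard library. The 0.7 / 0.3 weights come from this card; everything else is an assumption for illustration: the helper names, the BM25 parameters (`k1=1.5`, `b=0.75`), the Lucene-style IDF, and the min-max rescaling of BM25 scores so they are comparable to cosine similarities.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split lowercased text into overlapping character n-grams."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with BM25 over character 3-grams."""
    doc_terms = [Counter(char_ngrams(d)) for d in docs]
    doc_lens = [sum(t.values()) for t in doc_terms]
    avg_len = (sum(doc_lens) / len(docs)) or 1.0  # guard against all-empty docs
    query_grams = set(char_ngrams(query))
    scores = []
    for terms, length in zip(doc_terms, doc_lens):
        score = 0.0
        for gram in query_grams:
            tf = terms[gram]
            if tf == 0:
                continue
            df = sum(1 for t in doc_terms if gram in t)
            idf = math.log((len(docs) - df + 0.5) / (df + 0.5) + 1.0)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * length / avg_len))
        scores.append(score)
    return scores


def min_max(values):
    """Rescale values to [0, 1]; all-equal inputs map to 0 (an assumed convention)."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def hybrid_scores(dense, sparse, w_dense=0.7, w_sparse=0.3):
    """Weighted sum of dense similarities and min-max-normalized sparse scores."""
    return [w_dense * d + w_sparse * s for d, s in zip(dense, min_max(sparse))]


docs = ["nearest hospital", "primary school"]
dense = [0.62, 0.30]  # hypothetical NE-Embed cosine similarities for the query
sparse = bm25_scores("hospital", docs)
combined = hybrid_scores(dense, sparse)
```

Character n-grams avoid the need for a language-specific tokenizer, which may help for languages without established tokenization tools. The weights are worth tuning on a held-out set for each corpus.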

Training

  • Base model: sentence-transformers/LaBSE
  • Loss: MultipleNegativesRankingLoss
  • Data: 201,738 English↔NE language parallel pairs, capped at 25k per language to prevent Assamese attractor bias
  • Epochs: 3 · Batch size: 64 · Max seq length: 128
  • Hardware: 1× NVIDIA A40 (48 GB) · Training time: ~1.3 hours
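
To show what the loss listed above optimizes, here is a minimal pure-Python sketch of the in-batch objective behind MultipleNegativesRankingLoss. Each anchor is scored against every positive in the batch, and cross-entropy pushes the matching pair to the top. Cosine similarity and `scale=20.0` are the sentence-transformers defaults; this card does not say whether they were changed, so treat them as assumptions. The function names are hypothetical, and this is not the training code used for NE-Embed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i]
    and every other positive in the batch serves as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)


# Toy batch: anchors could be English sentences, positives their translations.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = mnr_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = mnr_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
```

Because every other pair in the batch acts as a negative, larger batches give each anchor more negatives to contrast against.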

Intended Uses

  • Semantic search
  • Dense retrieval
  • RAG
  • Cross-lingual retrieval
  • Clustering

Citation

@misc{mwirelabs2026neembed,
  title        = {NE-Embed: Multilingual Text Embeddings for Northeast Indian Languages},
  author       = {MWire Labs},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/MWirelabs/ne-embed}},
  note         = {CC-BY-4.0}
}

Built with ♥ in Shillong, Meghalaya · MWire Labs · Part of the NE-Stack

NE-LID · NE-BERT · NE-Embed · Kren · Aganbo · Klam

Model size: 0.5B params · Tensor type: F32 · Format: Safetensors