NE-Embed

Multilingual Text Embeddings for Northeast India

10 languages · 768 dimensions · CC-BY-4.0

What is NE-Embed?

NE-Embed is a multilingual text embedding model purpose-built for Northeast Indian languages. It enables semantic search, retrieval, and RAG across languages that general multilingual models fail on: Garo, Meitei, and Nyishi, where raw LaBSE achieves under 15% retrieval accuracy (R@1), and Pnar, where it reaches only 27.2%.

The model was trained on 201,738 balanced parallel pairs across 10 languages using bi-encoder fine-tuning with MultipleNegativesRankingLoss.

Supported Languages

Code	Language	Script	Tier	Training Pairs
`asm`	Assamese	Bengali	✅ Supported	25,000
`brx`	Bodo	Devanagari	✅ Supported	25,000
`grt`	Garo	Latin	✅ Supported	25,000
`kha`	Khasi	Latin	✅ Supported	25,000
`lus`	Mizo	Latin	✅ Supported	25,000
`mni`	Meitei	Meitei Mayek	✅ Supported	25,000
`njz`	Nyishi	Latin	✅ Supported	25,000
`trp`	Kokborok	Latin	⚠️ Limited	12,545
`pbv`	Pnar	Latin	⚠️ Limited	6,034
`nag`	Nagamese	Latin	⚠️ Limited	1,996

Supported = strong retrieval performance. Limited = model has coverage but quality is lower; use with caution in production.

Performance

Evaluated on 500 samples per language. R@1 = Recall@1 in % (higher is better). CLRI = Cross-Language Retrieval Interference (lower is better).

Language	R@1 (Base)	R@1 (NE-Embed)	CLRI (Base)	CLRI (NE-Embed)
Assamese	95.6	97.4	1.8%	4.6%
Bodo	55.8	99.8	61.0%	3.0%
Garo	13.2	90.8	88.8%	3.0%
Khasi	28.6	95.6	65.0%	3.4%
Mizo	46.6	91.8	58.4%	9.4%
Meitei	13.6	34.2	90.8%	19.8%
Nyishi	10.2	75.0	71.0%	17.4%
Pnar	27.2	86.2	79.6%	8.0%
Kokborok	26.4	71.6	63.8%	11.8%
Nagamese	77.0	88.0	17.8%	8.4%

Base = raw LaBSE zero-shot. The CLRI reductions reflect genuine cross-lingual confusion fixed by fine-tuning. Assamese is the one exception: its CLRI rises from 1.8% to 4.6%.
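For readers who want to reproduce an R@1 number on their own data, here is a minimal sketch of how Recall@1 is commonly computed from a query-by-candidate similarity matrix. It assumes the correct candidate for query `i` sits at index `i`; the exact evaluation protocol behind the table (candidate pool, query direction) is not stated here, and `recall_at_1` is a hypothetical helper name. CLRI is not sketched because its formula is not given in this card.

```python
from typing import Sequence


def recall_at_1(sim: Sequence[Sequence[float]]) -> float:
    """Percentage of queries whose top-scoring candidate is the correct one.

    sim[i][j] is the similarity of query i to candidate j.
    Assumption: the gold candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)  # index of the highest score
        if best == i:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy similarity matrix (made-up numbers, not model output).
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.1],
    [0.7, 0.2, 0.5],  # query 2 ranks candidate 0 first, so it counts as a miss
]
print(recall_at_1(sim))
```

With real data, `sim` could be the matrix returned by `model.similarity(query_embeddings, candidate_embeddings)`.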

Quick Start

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("MWirelabs/ne-embed")

sentences = [
    "Where is the nearest hospital?",                          # English
    "Ngi la pynjot ïa ki shnong baroh bad ngi la pynïap",     # Khasi
    "Pilakchin an·senganiko man·na am·tokenga.",               # Garo
]

embeddings = model.encode(sentences, normalize_embeddings=True)
similarities = model.similarity(embeddings, embeddings)
print(similarities)
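Because `normalize_embeddings=True` returns unit-length vectors, cosine similarity reduces to a dot product, so ranking a corpus against a query is a sort over dot products. The sketch below illustrates this with toy 2-dimensional vectors standing in for real 768-dimensional embeddings; `top_k` is a hypothetical helper, not part of sentence-transformers.

```python
from typing import List, Sequence, Tuple


def top_k(
    query_vec: Sequence[float],
    corpus_vecs: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (corpus_index, score) for the k highest dot products with the query."""
    scores = [sum(q * c for q, c in zip(query_vec, vec)) for vec in corpus_vecs]
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy unit-length vectors; in practice these would come from
# model.encode(texts, normalize_embeddings=True).
corpus = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
query = [0.8, 0.6]

hits = top_k(query, corpus, k=2)
print(hits)  # corpus index 2 ranks first, then index 0
```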

Recommended for RAG / Hybrid Retrieval

# Hybrid: NE-Embed dense + BM25 char 3-gram sparse
score = 0.7 * ne_embed_score + 0.3 * bm25_score
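The weighted sum above leaves open how `bm25_score` is produced and how the two scores are put on a comparable scale. The following is one possible sketch, not the authors' implementation: a small pure-Python BM25 over character 3-grams, with min-max normalization applied to both score lists before mixing. The normalization step, the BM25 parameters `k1` and `b`, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List


def char_ngrams(text: str, n: int = 3) -> List[str]:
    """Lowercased, overlapping character n-grams."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Okapi BM25 score of each document against the query, over char 3-grams."""
    if not docs:
        return []
    doc_terms = [Counter(char_ngrams(d)) for d in docs]
    doc_lens = [sum(c.values()) for c in doc_terms]
    avg_len = sum(doc_lens) / len(docs)
    query_grams = set(char_ngrams(query))
    scores = []
    for terms, length in zip(doc_terms, doc_lens):
        score = 0.0
        for gram in query_grams:
            tf = terms.get(gram, 0)
            if tf == 0:
                continue
            df = sum(1 for t in doc_terms if gram in t)  # document frequency
            idf = math.log(1 + (len(docs) - df + 0.5) / (df + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * length / avg_len))
        scores.append(score)
    return scores


def min_max(xs: List[float]) -> List[float]:
    """Rescale to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0] * len(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def hybrid_scores(dense: List[float], sparse: List[float],
                  w_dense: float = 0.7, w_sparse: float = 0.3) -> List[float]:
    """0.7 * dense + 0.3 * sparse, after min-max normalizing each list (assumption)."""
    return [w_dense * d + w_sparse * s for d, s in zip(min_max(dense), min_max(sparse))]


docs = ["Shillong civil hospital", "Tura market road", "Aizawl bus stand"]
dense = [0.62, 0.35, 0.30]  # made-up cosine scores standing in for NE-Embed output
sparse = bm25_scores("hospital", docs)
hybrid = hybrid_scores(dense, sparse)
print(hybrid)
```

Character n-grams avoid the need for a language-specific tokenizer, which is convenient for languages with little tooling. The weights are worth tuning on held-out queries.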

Training

Base model: sentence-transformers/LaBSE
Loss: MultipleNegativesRankingLoss
Data: 201,738 English↔NE language parallel pairs, capped at 25k per language to prevent Assamese attractor bias
Epochs: 3 · Batch size: 64 · Max seq length: 128
Hardware: 1× NVIDIA A40 (48 GB) · Training time: ~1.3 hours
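To make the loss concrete, here is a small pure-Python sketch of what MultipleNegativesRankingLoss computes for one batch: each English sentence is scored against every target-language sentence in the batch, its true translation is the positive, and all other in-batch sentences act as negatives. The `scale` of 20 and the use of cosine similarity are assumptions based on common sentence-transformers defaults; the values used for NE-Embed are not stated in this card, and `mnr_loss` is a hypothetical name.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mnr_loss(
    anchors: Sequence[Sequence[float]],
    positives: Sequence[Sequence[float]],
    scale: float = 20.0,  # assumed value, not confirmed for NE-Embed
) -> float:
    """Mean cross-entropy over the batch with in-batch negatives.

    positives[i] is the target for anchors[i]; every positives[j] with
    j != i serves as a negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(anchors)


# Toy 2-dimensional "embeddings" (made-up numbers).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]      # each positive is close to its anchor
misaligned = [[0.1, 0.9], [0.9, 0.1]]   # positives swapped

print(mnr_loss(anchors, aligned))     # near zero
print(mnr_loss(anchors, misaligned))  # much larger
```

Because every other pair in the batch serves as a negative, larger batches give each anchor more negatives to separate from.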

Intended Uses

Semantic search
Dense retrieval
RAG
Cross-lingual retrieval
Clustering

Citation

@misc{mwirelabs2026neembed,
  title        = {NE-Embed: Multilingual Text Embeddings for Northeast Indian Languages},
  author       = {MWire Labs},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/MWirelabs/ne-embed}},
  note         = {CC-BY-4.0}
}

Built with ♥ in Shillong, Meghalaya · MWire Labs · Part of the NE-Stack

NE-LID · NE-BERT · NE-Embed · Kren · Aganbo · Klam


Model size: 0.5B params · Tensor type: F32 · Format: Safetensors
