NE-Embed
Multilingual Text Embeddings for Northeast India
10 languages · 768 dimensions · CC-BY-4.0
by MWire Labs · Built on LaBSE
What is NE-Embed?
NE-Embed is a multilingual text embedding model purpose-built for Northeast Indian languages. It enables semantic search, retrieval, and RAG across languages that general multilingual models handle poorly: raw LaBSE scores under 15% R@1 on Garo, Meitei, and Nyishi, and under 30% on Pnar.
Trained on 201,738 balanced parallel pairs across 10 languages using bi-encoder fine-tuning with MultipleNegativesRankingLoss.
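To make the training objective concrete, here is a minimal pure-Python sketch of what MultipleNegativesRankingLoss computes for one batch: each English sentence is scored against every translation in the batch, and the loss is the cross-entropy of picking its own translation. The scale of 20 and the use of cosine similarity mirror the sentence-transformers defaults; the function names and toy vectors are illustrative and are not taken from the NE-Embed training code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mnr_loss(anchors, positives, scale=20.0):
    """Mean in-batch softmax cross-entropy: anchor i should rank positive i first."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        # Every other positive in the batch acts as a negative for this anchor.
        logits = [scale * cosine(anchor, p) for p in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(anchors)

# Toy 2-D "embeddings" (hypothetical): aligned pairs give a low loss,
# swapped pairs a high one.
english = [[1.0, 0.0], [0.0, 1.0]]
translations = [[0.9, 0.1], [0.1, 0.9]]
aligned = mnr_loss(english, translations)
swapped = mnr_loss(english, translations[::-1])
```

Larger batches supply more in-batch negatives per anchor, which is one reason batch size matters for this loss.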
Supported Languages
| Code | Language | Script | Tier | Training Pairs |
|---|---|---|---|---|
| asm | Assamese | Bengali | ✅ Supported | 25,000 |
| brx | Bodo | Devanagari | ✅ Supported | 25,000 |
| grt | Garo | Latin | ✅ Supported | 25,000 |
| kha | Khasi | Latin | ✅ Supported | 25,000 |
| lus | Mizo | Latin | ✅ Supported | 25,000 |
| mni | Meitei | Meitei Mayek | ✅ Supported | 25,000 |
| njz | Nyishi | Latin | ✅ Supported | 25,000 |
| trp | Kokborok | Latin | ⚠️ Limited | 12,545 |
| pbv | Pnar | Latin | ⚠️ Limited | 6,034 |
| nag | Nagamese | Latin | ⚠️ Limited | 1,996 |
Supported = strong retrieval performance. Limited = model has coverage but quality is lower; use with caution in production.
Performance
Evaluated on 500 samples per language. R@1 = Recall@1 in % (higher is better). CLRI = Cross-Language Retrieval Interference (lower is better).
| Language | R@1 (Base) | R@1 (NE-Embed) | CLRI (Base) | CLRI (NE-Embed) |
|---|---|---|---|---|
| Assamese | 95.6 | 97.4 | 1.8% | 4.6% |
| Bodo | 55.8 | 99.8 | 61.0% | 3.0% |
| Garo | 13.2 | 90.8 | 88.8% | 3.0% |
| Khasi | 28.6 | 95.6 | 65.0% | 3.4% |
| Mizo | 46.6 | 91.8 | 58.4% | 9.4% |
| Meitei | 13.6 | 34.2 | 90.8% | 19.8% |
| Nyishi | 10.2 | 75.0 | 71.0% | 17.4% |
| Pnar | 27.2 | 86.2 | 79.6% | 8.0% |
| Kokborok | 26.4 | 71.6 | 63.8% | 11.8% |
| Nagamese | 77.0 | 88.0 | 17.8% | 8.4% |
Base = raw LaBSE zero-shot. CLRI falls for every language except Assamese, where it rises from 1.8% to 4.6%. The reductions represent genuine cross-lingual confusion fixed by fine-tuning.
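As a rough sketch of how an R@1 figure can be computed: embed the queries and their parallel sentences, then count how often each query's highest-scoring candidate is its own translation. The function name and the toy similarity matrix are illustrative. The exact evaluation script and the precise CLRI definition are not given in this card, so CLRI is not reproduced here.

```python
def recall_at_1(similarity):
    """Percentage of queries whose top-scoring candidate is the correct one.

    similarity[i][j] is the score between query i and candidate j; the
    correct candidate for query i is assumed to sit at index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        best = max(range(len(row)), key=row.__getitem__)
        hits += best == i
    return 100.0 * hits / len(similarity)

# Toy matrix (hypothetical): queries 0 and 2 retrieve their own translation,
# query 1 is pulled towards candidate 0 instead.
toy = [
    [0.9, 0.2, 0.1],
    [0.7, 0.4, 0.3],
    [0.1, 0.2, 0.8],
]
r1 = recall_at_1(toy)
```

With the real model, the matrix would come from something like `model.similarity(query_embeddings, candidate_embeddings).tolist()`.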
Quick Start
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("MWirelabs/ne-embed")

sentences = [
    "Where is the nearest hospital?",  # English
    "Ngi la pynjot ïa ki shnong baroh bad ngi la pynïap",  # Khasi
    "Pilakchin an·senganiko man·na am·tokenga.",  # Garo
]

# Unit-length embeddings, so dot product equals cosine similarity
embeddings = model.encode(sentences, normalize_embeddings=True)
similarities = model.similarity(embeddings, embeddings)
print(similarities)
```
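Because the Quick Start passes `normalize_embeddings=True`, the dot product of two embeddings equals their cosine similarity, so a basic semantic search reduces to sorting dot products. The sketch below uses toy 2-D unit vectors in place of real 768-dimensional embeddings; `top_k` is a hypothetical helper, not part of sentence-transformers.

```python
def top_k(query_vec, doc_vecs, k=3):
    """Return the k best (index, score) pairs, ranked by dot product.

    For unit-length vectors the dot product is the cosine similarity.
    """
    scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]

# Toy unit vectors standing in for model.encode(..., normalize_embeddings=True).
docs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query = [0.8, 0.6]
results = top_k(query, docs, k=2)
```

For larger corpora, a vector index or the library's own search utilities would likely be a better fit than this linear scan.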
Recommended for RAG / Hybrid Retrieval
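```python
# Hybrid: NE-Embed dense + BM25 char 3-gram sparse
# ne_embed_score: cosine similarity between query and document embeddings
# bm25_score: BM25 score of the same document over character 3-grams
score = 0.7 * ne_embed_score + 0.3 * bm25_score
```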
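The weighting above can be sketched end to end with a small pure-Python BM25 over character 3-grams. Several things here are assumptions rather than part of the recommendation: the `k1` and `b` values are common BM25 defaults, the min-max normalization is added because raw BM25 scores are unbounded while cosine scores are not, and all function names are hypothetical.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Overlapping lowercase character n-grams of a string."""
    text = text.lower()
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 over character 3-grams; one score per document."""
    tokenized = [char_ngrams(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = max(sum(len(t) for t in tokenized) / max(n_docs, 1), 1e-9)
    doc_freq = Counter(g for toks in tokenized for g in set(toks))
    query_grams = set(char_ngrams(query))  # each distinct n-gram counted once
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for g in query_grams:
            if g not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[g] + 0.5) / (doc_freq[g] + 0.5))
            length_norm = 1 - b + b * len(toks) / avgdl
            score += idf * tf[g] * (k1 + 1) / (tf[g] + k1 * length_norm)
        scores.append(score)
    return scores

def min_max(values):
    """Rescale to [0, 1]; all zeros if every value is identical."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def hybrid_scores(dense, sparse, dense_weight=0.7):
    """Weighted sum of normalized dense and sparse scores per document."""
    d_norm, s_norm = min_max(dense), min_max(sparse)
    return [dense_weight * d + (1 - dense_weight) * s
            for d, s in zip(d_norm, s_norm)]

# Toy example: dense scores are made up; in practice they would be the
# NE-Embed cosine similarities between the query and each document.
docs = ["khasi hills hospital", "garo market road"]
sparse = bm25_scores("hospital", docs)
hybrid = hybrid_scores([0.2, 0.8], sparse)
```

Character 3-grams avoid the need for a word tokenizer, which can be convenient for languages with limited tooling.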
Training
- Base model: `sentence-transformers/LaBSE`
- Loss: `MultipleNegativesRankingLoss`
- Data: 201,738 English↔NE language parallel pairs, capped at 25k per language to prevent Assamese attractor bias
- Epochs: 3 · Batch size: 64 · Max seq length: 128
- Hardware: 1× NVIDIA A40 (48 GB) · Training time: ~1.3 hours
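The setup listed above could be reproduced roughly as follows. This is a sketch under stated assumptions, not the actual training script: the `(english, target, lang)` tuple layout, the random sampling used for the per-language cap, the `output_dir` name, and both function names are hypothetical. The sentence-transformers calls use the library's `fit` training API.

```python
import random
from collections import defaultdict

def cap_per_language(pairs, cap=25_000, seed=0):
    """Keep at most `cap` (english, target, lang) pairs per language code."""
    by_lang = defaultdict(list)
    for pair in pairs:
        by_lang[pair[2]].append(pair)
    rng = random.Random(seed)
    capped = []
    for lang in sorted(by_lang):
        items = by_lang[lang]
        rng.shuffle(items)  # random subset; the real sampling strategy is unknown
        capped.extend(items[:cap])
    return capped

def train(pairs, output_dir="ne-embed-sketch"):
    """Fine-tune LaBSE on capped pairs with the hyperparameters listed above."""
    # Imported lazily so cap_per_language works without these packages installed.
    from torch.utils.data import DataLoader
    from sentence_transformers import InputExample, SentenceTransformer, losses

    model = SentenceTransformer("sentence-transformers/LaBSE")
    model.max_seq_length = 128
    examples = [InputExample(texts=[en, tgt])
                for en, tgt, _ in cap_per_language(pairs)]
    loader = DataLoader(examples, shuffle=True, batch_size=64)
    loss = losses.MultipleNegativesRankingLoss(model)
    model.fit(train_objectives=[(loader, loss)], epochs=3, output_path=output_dir)
    return model
```

Capping before training keeps any single high-resource language from dominating the in-batch negatives.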
Intended Uses
- Semantic search
- Dense retrieval
- RAG
- Cross-lingual retrieval
- Clustering
Citation
@misc{mwirelabs2026neembed,
title = {NE-Embed: Multilingual Text Embeddings for Northeast Indian Languages},
author = {MWire Labs},
year = {2026},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/MWirelabs/ne-embed}},
note = {CC-BY-4.0}
}
Built with ♥ in Shillong, Meghalaya · MWire Labs · Part of the NE-Stack
NE-LID · NE-BERT · NE-Embed · Kren · Aganbo · Klam