nepali-embedder-v1

Ranked #1 of the 12 open embedding models benchmarked below on Nepali retrieval gap. The comparison includes bge-m3, qwen3-embedding, snowflake-arctic-embed2, nomic-embed-text-v2-moe, and three other Nepali-specific models.

Built natively for Nepali language retrieval, fine-tuned on 56k Nepali Wikipedia pairs using google/muril-base-cased as the base encoder.


Benchmark Results

Evaluated on standard Nepali semantic retrieval and four Nepali-specific stress tests.
Gap = Match Score − Unrelated Score (higher = better discrimination).
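The gap definition can be illustrated with a minimal sketch. It assumes that the match and unrelated scores are mean cosine similarities over matched and unrelated text pairs; the card does not state how the scores are aggregated, so that choice, the helper names, and the toy 2-D vectors are all illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean(values):
    return sum(values) / len(values)

def retrieval_gap(match_pairs, unrelated_pairs):
    """Gap = mean similarity of matched pairs minus mean similarity of unrelated pairs.
    Assumption: mean cosine similarity is used for both scores."""
    match_score = mean([cosine(a, b) for a, b in match_pairs])
    unrelated_score = mean([cosine(a, b) for a, b in unrelated_pairs])
    return match_score - unrelated_score

# Toy vectors standing in for real 768-dimensional embeddings
match_pairs = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 2.0])]
unrelated_pairs = [([1.0, 0.0], [0.0, 1.0]), ([1.0, 0.0], [0.0, 3.0])]
gap = retrieval_gap(match_pairs, unrelated_pairs)
```

A model that scores matched and unrelated pairs alike would have a gap near zero, which is why a larger gap indicates better discrimination.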

Standard Retrieval Gap ↑

| Rank | Model | Gap | Params |
|---|---|---|---|
| 🥇 | nepali-embedder-v1 (this model) | 0.4277 | 238M |
| 🥈 | jangedoo/all-MiniLM-L6-v2-nepali | 0.3382 | 66M |
| 🥉 | universalml/Nepali_Embedding_Model | 0.2784 | 560M |
| 4 | Yunika/sentence-transformer-nepali | 0.2581 | 238M |
| 5 | qwen3-embedding:0.6b | 0.2186 | 600M |
| 6 | bge-m3 | 0.2092 | 567M |
| 7 | embeddinggemma | 0.1924 | 300M |
| 8 | nomic-embed-text-v2-moe | 0.1834 | MoE |
| 9 | paraphrase-multilingual | 0.1779 | 278M |
| 10 | snowflake-arctic-embed2 | 0.1509 | 568M |
| 11 | granite-embedding:278m | 0.1437 | 278M |
| 12 | mxbai-embed-large | 0.0560 | 335M |

Nepali-Specific Stress Tests ↑

| Category | nepali-v1 | bge-m3 | qwen3-0.6b | Yunika | universalml |
|---|---|---|---|---|---|
| Code-Switching (Roman↔Devanagari) | 0.490 | 0.101 | 0.333 | 0.263 | 0.289 |
| Entity Sensitivity | 0.605 | 0.240 | 0.275 | 0.256 | 0.239 |
| Length Robustness | 0.674 | 0.136 | 0.239 | 0.313 | 0.260 |
| Negation | -0.083 | -0.022 | -0.190 | -0.159 | -0.093 |

Key findings: This model has the largest Romanized Nepali ↔ Devanagari code-switching gap (0.490); every other model scores below 0.35 on this task. Its entity sensitivity (0.605) and length robustness (0.674) gaps are also the highest across all 12 models tested. Negation is a known limitation: every model in the table above has a negative gap on this test, this one included (-0.083).


Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("premmm/nepali-embedder-v1")

# Single sentence
embedding = model.encode("नेपालको राजधानी काठमाडौं हो।", normalize_embeddings=True)

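# Semantic similarity
sentences = [
    "नेपालको राजधानी काठमाडौं हो।",
    "काठमाडौं नेपालको सबैभन्दा ठूलो शहर हो।",
]
embeddings = model.encode(sentences, normalize_embeddings=True)
# Embeddings are L2-normalized, so the dot product equals the cosine similarity
similarity = embeddings[0] @ embeddings[1]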

# Retrieval (query vs passages)
from sentence_transformers import util

query = "नेपालको राजधानी कहाँ छ?"
passages = [
    "काठमाडौं नेपालको राजधानी तथा सबैभन्दा ठूलो शहर हो।",
    "पोखरा नेपालको दोस्रो ठूलो शहर हो।",
    "लुम्बिनी गौतम बुद्धको जन्मस्थल हो।",
]

q_emb = model.encode(query, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
scores = util.cos_sim(q_emb, p_emb)
print(scores)  # tensor([[0.7139, 0.4821, 0.3102]])
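For a RAG pipeline, the score matrix is usually turned into a ranked list of passages. The following is a minimal sketch of that step; `rank_passages` and `top_k` are illustrative names, not part of sentence-transformers, and the scores are copied from the example output above rather than recomputed.

```python
def rank_passages(scores, passages, top_k=2):
    """Return the top_k (passage, score) tuples, highest score first."""
    order = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return [(passages[i], scores[i]) for i in order[:top_k]]

passages = [
    "काठमाडौं नेपालको राजधानी तथा सबैभन्दा ठूलो शहर हो।",
    "पोखरा नेपालको दोस्रो ठूलो शहर हो।",
    "लुम्बिनी गौतम बुद्धको जन्मस्थल हो।",
]
# With the real model this list would come from scores[0].tolist();
# here it is the example output shown above.
example_scores = [0.7139, 0.4821, 0.3102]

top = rank_passages(example_scores, passages, top_k=2)
for passage, score in top:
    print(f"{score:.4f}  {passage}")
```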

Use with LangChain / RAG pipelines

from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="premmm/nepali-embedder-v1",
    encode_kwargs={"normalize_embeddings": True}
)

Use with Ollama (self-hosted)

The model can be converted and served locally. If you need ONNX export for production inference, see the sentence-transformers documentation.


Model Details

| Property | Value |
|---|---|
| Base model | google/muril-base-cased |
| Architecture | BERT (transformer encoder + mean pooling) |
| Parameters | 238M |
| Embedding dimension | 768 |
| Max sequence length | 256 tokens |
| Language | Nepali (ne) |
| License | Apache 2.0 |
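The mean pooling step named in the architecture can be sketched in a few lines. This is an illustration of the general technique, not the library's implementation: it averages the encoder's per-token vectors while skipping padding positions, and the toy vectors are 2-dimensional where the real model produces 768 dimensions.

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors where attention_mask is 1; padding (0) is ignored."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            for j in range(dim):
                total[j] += vec[j]
            count += 1
    # max(count, 1) avoids division by zero for an all-padding input
    return [t / max(count, 1) for t in total]

# Two real tokens followed by one padding token
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_vec = mean_pool(tokens, mask)
```

Because the padding token is masked out, it has no influence on the sentence vector regardless of its values.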

Training Details

Data

  • Source: Nepali Wikipedia via wikimedia/wikipedia (config: 20231101.ne)
  • Total pairs: 56,244 (after deduplication)
    • 27,083 title ↔ intro paragraph pairs
    • 29,175 section heading ↔ section body pairs
  • Pair construction: Positive pairs only; in-batch negatives used during training
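The pair construction described above might look roughly like the sketch below. The article layout (`title`, `intro`, `sections`) and the `build_pairs` helper are hypothetical: the card does not say how articles were parsed or how duplicates were detected, so exact-match deduplication is an assumption.

```python
def build_pairs(articles):
    """Build (anchor, positive) pairs: title -> intro and heading -> section body.
    Assumption: duplicates are dropped by exact match after stripping whitespace."""
    seen = set()
    pairs = []
    for article in articles:
        candidates = [(article["title"], article["intro"])]
        candidates += list(article["sections"])
        for anchor, positive in candidates:
            key = (anchor.strip(), positive.strip())
            if not key[0] or not key[1] or key in seen:
                continue  # skip empty fields and exact duplicates
            seen.add(key)
            pairs.append({"anchor": key[0], "positive": key[1]})
    return pairs

# Hypothetical pre-parsed articles (English placeholders for readability)
articles = [
    {
        "title": "Kathmandu",
        "intro": "Kathmandu is the capital of Nepal.",
        "sections": [("History", "The city has a long history."), ("Climate", "")],
    },
    {
        "title": "Kathmandu",
        "intro": "Kathmandu is the capital of Nepal.",  # duplicate of the first pair
        "sections": [("Geography", "It lies in a valley.")],
    },
]
pairs = build_pairs(articles)
```

No explicit negatives are generated here; with in-batch negatives, the positives of the other pairs in a batch play that role during training.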

Training Configuration

  • Loss: MultipleNegativesRankingLoss (in-batch negatives)
  • Epochs: 3
  • Batch size: 16
  • Warmup steps: 10% of total steps
  • Optimizer: AdamW (sentence-transformers default)
  • Hardware: NVIDIA T4 (Google Colab)
  • Training time: ~2.75 hours
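To show what the in-batch negatives objective computes, here is a minimal pure-Python sketch of the idea behind MultipleNegativesRankingLoss. It assumes cosine similarity scaled by 20, which is the sentence-transformers default as far as we know; the card does not state the settings used, and the function below is an illustration rather than the library's code.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mnr_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled similarities: for anchor i, positive i is the
    target and every other positive in the batch acts as a negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = mnr_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])  # each anchor matches its own positive
swapped = mnr_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])  # each anchor matches the wrong positive
```

The loss is near zero when each anchor is closest to its own positive and large when it is closer to another pair's positive. With a batch size of 16, each anchor sees 15 in-batch negatives per step.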

Evaluation (Internal)

Evaluated on a held-out set of 562 pairs + 500 distractor passages using InformationRetrievalEvaluator:

| Metric | Final Value |
|---|---|
| NDCG@10 | 0.9621 |
| MRR@10 | 0.9520 |
| Accuracy@1 | 0.9270 |
| Recall@10 | 0.9929 |
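The four metrics can be reproduced from the rank at which each query's correct passage is retrieved. The sketch below assumes exactly one relevant passage per query, which fits a held-out set of pairs but is not stated explicitly; `ir_metrics` is an illustrative helper, not the InformationRetrievalEvaluator API.

```python
import math

def ir_metrics(ranks, k=10):
    """ranks: 1-based rank of the single relevant passage for each query,
    or None if it was not retrieved at all."""
    n = len(ranks)
    hits = [r for r in ranks if r is not None and r <= k]
    return {
        "accuracy@1": sum(1 for r in hits if r == 1) / n,
        f"recall@{k}": len(hits) / n,
        f"mrr@{k}": sum(1.0 / r for r in hits) / n,
        # With one relevant passage the ideal DCG is 1, so NDCG reduces to 1/log2(rank + 1)
        f"ndcg@{k}": sum(1.0 / math.log2(r + 1) for r in hits) / n,
    }

# Four toy queries: two hits at rank 1, one at rank 2, one miss
metrics = ir_metrics([1, 1, 2, None])
```

Under this assumption Accuracy@1 ≤ MRR@10 ≤ NDCG@10 ≤ Recall@10 always holds, which matches the ordering of the reported values.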

Intended Use

  • Nepali document retrieval: RAG pipelines over Nepali documents
  • Semantic search: search over Nepali text corpora
  • Sentence similarity: clustering and deduplication of Nepali text
  • Legal document retrieval: court rulings and government documents (legal vocabulary is underrepresented in v1; domain fine-tuning is planned for v2)
  • Cross-script retrieval: Romanized Nepali queries against Devanagari passages

Known Limitations

  • Negation: Like all current Nepali embedding models, does not reliably distinguish negated statements (e.g., "X छ" vs "X छैन")
  • Cross-lingual: English → Nepali retrieval works partially, but the model was not explicitly trained for it
  • Domain: Trained on encyclopedic Wikipedia text; may underperform on highly technical or colloquial domains
  • Vocabulary: Legal, medical, and scientific Nepali terminology is underrepresented

Roadmap

| Version | Planned Additions |
|---|---|
| v2 | Romanized Nepali ↔ Devanagari training pairs (code-switching) |
| v2 | Negation-aware hard negative pairs |
| v2 | Synthetic query augmentation (~2k LLM-generated triplets) |
| v2 | Legal domain fine-tuning (10k Nepali court ruling pairs) |
| v3 | Hard negative mining using v2 model |
| v3 | MatryoshkaLoss for variable-dimension embeddings |

Citation

If you use this model in your research or project, please cite:

@misc{pathak2026nepaliembedder,
  author       = {Premanand Pathak},
  title        = {nepali-embedder-v1: A Native Nepali Sentence Embedding Model},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/premmm/nepali-embedder-v1}},
}
