nepali-embedder-v1

Ranked #1 of the 12 open embedding models benchmarked below (this model and 11 others), including bge-m3, qwen3-embedding, snowflake-arctic-embed2, nomic-embed-text-v2-moe, and the existing Nepali-specific models.

Built natively for Nepali language retrieval, fine-tuned on 56k Nepali Wikipedia pairs using google/muril-base-cased as the base encoder.

Benchmark Results

Evaluated on standard Nepali semantic retrieval and four Nepali-specific stress tests.
Gap = Match Score − Unrelated Score (higher = better discrimination).
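
As a rough illustration of the Gap metric, the sketch below assumes that Match Score and Unrelated Score are each a plain mean of cosine similarity over a set of sentence pairs; the aggregation is not spelled out here, so treat that as an assumption. The function name `retrieval_gap` and the toy vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieval_gap(matched_pairs, unrelated_pairs):
    """Mean similarity of matched pairs minus mean similarity of unrelated pairs.

    Assumption: both scores are simple averages of cosine similarity.
    """
    match_score = sum(cosine(a, b) for a, b in matched_pairs) / len(matched_pairs)
    unrelated_score = sum(cosine(a, b) for a, b in unrelated_pairs) / len(unrelated_pairs)
    return match_score - unrelated_score

# Toy 2-d vectors standing in for real embeddings (not real model output).
matched = [([1.0, 0.0], [1.0, 0.0])]    # same direction: similarity 1.0
unrelated = [([1.0, 0.0], [0.0, 1.0])]  # orthogonal: similarity 0.0
gap = retrieval_gap(matched, unrelated)
```

A model that scores matched and unrelated pairs alike would get a gap near zero under this definition, which is why a larger gap is read as better discrimination.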

Standard Retrieval Gap ↑

Rank	Model	Gap	Params
🥇	nepali-embedder-v1 (this model)	0.4277	238M
🥈	jangedoo/all-MiniLM-L6-v2-nepali	0.3382	66M
🥉	universalml/Nepali_Embedding_Model	0.2784	560M
4	Yunika/sentence-transformer-nepali	0.2581	238M
5	qwen3-embedding:0.6b	0.2186	600M
6	bge-m3	0.2092	567M
7	embeddinggemma	0.1924	300M
8	nomic-embed-text-v2-moe	0.1834	MoE
9	paraphrase-multilingual	0.1779	278M
10	snowflake-arctic-embed2	0.1509	568M
11	granite-embedding:278m	0.1437	278M
12	mxbai-embed-large	0.0560	335M

Nepali-Specific Stress Tests ↑

Category	nepali-v1	bge-m3	qwen3-0.6b	Yunika	universalml
Code-Switching (Roman↔Devanagari)	0.490	0.101	0.333	0.263	0.289
Entity Sensitivity	0.605	0.240	0.275	0.256	0.239
Length Robustness	0.674	0.136	0.239	0.313	0.260
Negation	-0.083	-0.022	-0.190	-0.159	-0.093

Key findings: This model scores highest on Romanized Nepali ↔ Devanagari code-switching (0.490); all other models score below 0.35 on this task. Its entity sensitivity (0.605) and length robustness (0.674) scores are also the highest of all 12 models tested. Negation is a known limitation: every model in the table above scores below zero, and bge-m3 (-0.022) is closer to zero than this model (-0.083).

Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("premmm/nepali-embedder-v1")

# Single sentence
embedding = model.encode("नेपालको राजधानी काठमाडौं हो।", normalize_embeddings=True)

# Semantic similarity
sentences = [
    "नेपालको राजधानी काठमाडौं हो।",
    "काठमाडौं नेपालको सबैभन्दा ठूलो शहर हो।",
]
embeddings = model.encode(sentences, normalize_embeddings=True)

# Retrieval (query vs passages)
from sentence_transformers import util

query = "नेपालको राजधानी कहाँ छ?"
passages = [
    "काठमाडौं नेपालको राजधानी तथा सबैभन्दा ठूलो शहर हो।",
    "पोखरा नेपालको दोस्रो ठूलो शहर हो।",
    "लुम्बिनी गौतम बुद्धको जन्मस्थल हो।",
]

q_emb = model.encode(query, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
scores = util.cos_sim(q_emb, p_emb)
print(scores)  # tensor([[0.7139, 0.4821, 0.3102]])
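
To turn the similarity scores from the retrieval example into a ranked result list, a small helper along these lines may be enough. This is a minimal sketch: `rank_passages` is a hypothetical name, and the scores are copied from the printed output above rather than recomputed.

```python
def rank_passages(scores, passages, top_k=3):
    """Return the top_k (score, passage) tuples, highest score first."""
    if len(scores) != len(passages):
        raise ValueError("scores and passages must have the same length")
    ranked = sorted(zip(scores, passages), key=lambda pair: pair[0], reverse=True)
    return ranked[:top_k]

# Scores taken from the printed tensor in the retrieval example.
example_scores = [0.7139, 0.4821, 0.3102]
example_passages = [
    "काठमाडौं नेपालको राजधानी तथा सबैभन्दा ठूलो शहर हो।",
    "पोखरा नेपालको दोस्रो ठूलो शहर हो।",
    "लुम्बिनी गौतम बुद्धको जन्मस्थल हो।",
]
best = rank_passages(example_scores, example_passages, top_k=1)
```

With the tensor from the retrieval example, `rank_passages(scores[0].tolist(), passages)` should work the same way.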

Use with LangChain / RAG pipelines

from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="premmm/nepali-embedder-v1",
    encode_kwargs={"normalize_embeddings": True}
)

Use with Ollama (self-hosted)

The model can be converted and served locally. If ONNX export is needed for production inference, see the sentence-transformers documentation.

Model Details

Property	Value
Base model	google/muril-base-cased
Architecture	BERT (transformer encoder + mean pooling)
Parameters	238M
Embedding dimension	768
Max sequence length	256 tokens
Language	Nepali (ne)
License	Apache 2.0
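
Because of the 256-token limit, longer documents generally need to be split before encoding (sentence-transformers truncates over-length inputs by default). The sketch below is one simple way to do that, assuming overlapping windows of whitespace-separated words as a conservative stand-in for tokens. The name `chunk_words` and the defaults of 120 words with a 20-word overlap are illustrative assumptions, not values from this model's training setup.

```python
def chunk_words(text, max_words=120, overlap=20):
    """Split text into overlapping windows of whitespace-separated words.

    Assumption: max_words is a conservative proxy for the 256-token limit,
    since a single word often maps to more than one subword token.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks

# Synthetic 250-word document used only to show the windowing.
sample = " ".join(f"w{i}" for i in range(250))
chunks = chunk_words(sample)
```

For an exact budget, the word count could be replaced by the length of the model tokenizer's output; the windowing logic stays the same.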

Training Details

Data

Source: Nepali Wikipedia via wikimedia/wikipedia (config: 20231101.ne)
Total pairs: 56,244 (after deduplication)
- 27,083 title ↔ intro paragraph pairs
- 29,175 section heading ↔ section body pairs
Pair construction: Positive pairs only; in-batch negatives used during training
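
The title ↔ intro pairs could be built roughly as follows. This is a sketch under stated assumptions, not the actual preprocessing script: it assumes each article is a dict with "title" and "text" keys (as in the wikimedia/wikipedia dataset rows), takes the first blank-line-separated paragraph as the intro, and uses a hypothetical `min_intro_chars` filter. Only the title ↔ intro case is sketched, since the way section headings were recovered is not described here.

```python
def build_title_intro_pairs(articles, min_intro_chars=50):
    """Build deduplicated (title, intro paragraph) positive pairs.

    Assumptions: each article has "title" and "text" keys, and the intro
    is the first non-empty paragraph of "text".
    """
    seen = set()
    pairs = []
    for article in articles:
        title = article["title"].strip()
        paragraphs = [p.strip() for p in article["text"].split("\n\n") if p.strip()]
        if not title or not paragraphs:
            continue
        intro = paragraphs[0]
        if len(intro) < min_intro_chars:
            continue  # skip stubs with very short intros
        key = (title, intro)
        if key not in seen:  # drop exact duplicate pairs
            seen.add(key)
            pairs.append(key)
    return pairs

# Toy articles: one real-looking entry, an exact duplicate, and an empty one.
toy_articles = [
    {"title": "काठमाडौं", "text": "काठमाडौं नेपालको राजधानी तथा सबैभन्दा ठूलो शहर हो।\n\nदोस्रो अनुच्छेद।"},
    {"title": "काठमाडौं", "text": "काठमाडौं नेपालको राजधानी तथा सबैभन्दा ठूलो शहर हो।\n\nदोस्रो अनुच्छेद।"},
    {"title": "खाली", "text": ""},
]
pairs = build_title_intro_pairs(toy_articles, min_intro_chars=10)
```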

Training Configuration

Loss: MultipleNegativesRankingLoss (in-batch negatives)
Epochs: 3
Batch size: 16
Warmup steps: 10% of total steps
Optimizer: AdamW (sentence-transformers default)
Hardware: NVIDIA T4 (Google Colab)
Training time: ~2.75 hours
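
To make the in-batch negatives idea concrete, here is a dependency-free sketch of the objective: for each anchor, the paired positive is the target and every other positive in the batch serves as a negative, scored with scaled cosine similarity and cross-entropy. It illustrates the technique rather than reproducing the library code; the `scale=20.0` value is assumed here and the function name is hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]
    and all other positives in the batch act as negatives."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

# Toy batch of two pairs: aligned embeddings give a low loss, swapped ones a high loss.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = in_batch_negatives_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = in_batch_negatives_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
```

With a batch size of 16, each anchor in this setup is contrasted against 15 in-batch negatives per step.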

Evaluation (Internal)

Evaluated on a held-out set of 562 pairs + 500 distractor passages using InformationRetrievalEvaluator:

Metric	Final Value
NDCG@10	0.9621
MRR@10	0.9520
Accuracy@1	0.9270
Recall@10	0.9929
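
For readers who want to sanity-check these metrics, the sketch below computes them for the simplified case of exactly one relevant passage per query, which is an assumption about the evaluation setup. Each input is the 1-based rank of the relevant passage, or `None` if it was not retrieved; `single_relevant_metrics` is a hypothetical helper, not part of InformationRetrievalEvaluator.

```python
import math

def single_relevant_metrics(ranks, k=10):
    """Compute retrieval metrics when each query has exactly one relevant passage.

    ranks: 1-based rank of the relevant passage per query, or None if not retrieved.
    """
    n = len(ranks)
    hits = [r for r in ranks if r is not None and r <= k]
    return {
        "accuracy@1": sum(1 for r in ranks if r == 1) / n,
        f"recall@{k}": len(hits) / n,
        f"mrr@{k}": sum(1.0 / r for r in hits) / n,
        # With one relevant passage the ideal DCG is 1, so NDCG reduces to 1 / log2(rank + 1).
        f"ndcg@{k}": sum(1.0 / math.log2(r + 1) for r in hits) / n,
    }

# Four toy queries: two hits at rank 1, one at rank 2, one miss.
metrics = single_relevant_metrics([1, 1, 2, None])
```

Under this single-relevant assumption NDCG@10 is never lower than MRR@10, which in turn is never lower than Accuracy@1, matching the ordering of the values in the table.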

Intended Use

Nepali document retrieval — RAG pipelines for Nepali documents
Semantic search — search over Nepali text corpora
Sentence similarity — clustering and deduplication of Nepali text
Legal document retrieval — court rulings, government documents (v2 will include domain fine-tuning)
Cross-script retrieval — handles Romanized Nepali queries against Devanagari passages

Known Limitations

Negation: Like all current Nepali embedding models, does not reliably distinguish negated statements (e.g., "X छ" vs "X छैन")
Cross-lingual: English → Nepali retrieval works partially, but the model was not explicitly trained for it
Domain: Trained on encyclopedic Wikipedia text; may underperform on highly technical or colloquial domains
Vocabulary: Legal, medical, and scientific Nepali terminology is underrepresented
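
One way to probe the negation limitation on your own data is to compare how similar a statement is to a paraphrase versus to its negation. The sketch below is a hypothetical check, not necessarily how the stress-test score above was computed. `encode` is any callable that maps a list of sentences to a list of vectors (for example `model.encode` from the Usage section), and the toy vectors are made up purely so the example runs without downloading a model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def negation_margin(encode, statement, paraphrase, negation):
    """sim(statement, paraphrase) - sim(statement, negation).

    A positive margin means the negated sentence is scored as less similar
    to the statement than the paraphrase is.
    """
    s, p, n = encode([statement, paraphrase, negation])
    return cosine(s, p) - cosine(s, n)

# Made-up 2-d vectors standing in for real embeddings (not model output).
toy_vectors = {
    "X छ": [1.0, 0.0],
    "X रहेको छ": [1.0, 0.1],
    "X छैन": [0.9, 0.2],
}

def toy_encode(sentences):
    """Stand-in encoder that looks up the made-up vectors above."""
    return [toy_vectors[s] for s in sentences]

margin = negation_margin(toy_encode, "X छ", "X रहेको छ", "X छैन")
```

A margin near zero or below zero on real sentence triples would indicate that the encoder is not separating a statement from its negation.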

Roadmap

Version	Planned Additions
v2	Romanized Nepali ↔ Devanagari training pairs (code-switching)
v2	Negation-aware hard negative pairs
v2	Synthetic query augmentation (~2k LLM-generated triplets)
v2	Legal domain fine-tuning (10k Nepali court ruling pairs)
v3	Hard negative mining using v2 model
v3	MatryoshkaLoss for variable-dimension embeddings

Citation

If you use this model in your research or project, please cite:

@misc{pathak2026nepaliembedder,
  author       = {Premanand Pathak},
  title        = {nepali-embedder-v1: A Native Nepali Sentence Embedding Model},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/premmm/nepali-embedder-v1}},
}

Acknowledgements

Training data: Wikimedia Foundation / Nepali Wikipedia
Base model: Google MuRIL
Training framework: sentence-transformers
Benchmark comparison models: Yunika, universalml, jangedoo, BAAI, Alibaba Qwen, Snowflake, Nomic, Google, IBM, Mixedbread
