hindi-modernBERT
TL;DR: This is a pretrained base MLM checkpoint: a Hindi adaptation of the ModernBERT architecture, trained from scratch on Hindi text. After task fine-tuning it is competitive with the compared baselines on NER and intent classification, and after DPR fine-tuning it scores highest among them on retrieval. Checkpoint ba1157 · 8192 context · Hindi BPE vocab 50,368.
This release uses the ModernBERT architecture and training recipe, adapted for Hindi with a new tokenizer and ~28B tokens of Hindi pretraining.
Model summary
| Field | Value |
|---|---|
| Type | Base MLM checkpoint (pretrained, not task fine-tuned) |
| Architecture | ModernBERT (ModernBertForMaskedLM) |
| Initialization | Megatron init (full_megatron); pretrained from scratch on Hindi |
| Parameters | ~188M |
| Layers | 22 |
| Hidden size | 768 |
| Attention heads | 12 |
| Vocab size | 50,368 |
| Max sequence length | 8192 |
| Languages | Hindi (hi) |
| Pretraining tokens | ~23.6B (Sangraha) + ~4.85B (IndicCorp V2) |
| Hardware | 1× NVIDIA RTX 4090 (24 GB) |
| Training time | 5 days |
| Transformers | >=5.12.0 |
Eval summary
hindi-modernBERT is competitive on supervised Hindi understanding tasks and is the strongest model in this comparison on retrieval after DPR fine-tuning.
| Area | Benchmark | Score |
|---|---|---|
| NER | Naamapadam Hindi F1 | 0.8001 |
| Intent | MASSIVE Hindi Macro-F1 | 0.4731 |
| Retrieval | mMARCO Hindi nDCG@10 | 0.2825 |
| Retrieval | MLDR hi nDCG@10 | 0.2635 |
Checkpoint folders
The repository root contains the main ba1157 release. Earlier checkpoints are also available in subfolders:
| Hub path | What it contains |
|---|---|
| . | Main release, ba1157, 8192 context |
| checkpoints/phase1 | Phase-1 Sangraha checkpoint, 1024 context |
| checkpoints/phase2_ba135 | Phase-2 lowest MLM-loss checkpoint, 8192 context |
Usage
Main release:
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "kkkamur07/hindi-modernbert"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

text = "भारत [MASK] विशाल देश है।"
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits

# Position of the [MASK] token in the single input sequence
mask_idx = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
# Decode the highest-scoring vocabulary entry at that position
print(tokenizer.decode([logits[0, mask_idx].argmax(dim=-1).item()]))
Checkpoint subfolder:
from transformers import AutoModelForMaskedLM, AutoTokenizer
repo_id = "kkkamur07/hindi-modernbert"
subfolder = "checkpoints/phase1"
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
model = AutoModelForMaskedLM.from_pretrained(repo_id, subfolder=subfolder)
Apply Devanagari script normalization and Unicode NFKC normalization before tokenization for best results. See the training repo for preprocess_for_tokenizer().
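The NFKC half of that preprocessing can be sketched with the Python standard library. This is a minimal illustration only: nfkc_normalize is a hypothetical helper name, it is not the training repo's preprocess_for_tokenizer(), and it does not reproduce the Devanagari-specific script normalization, which is defined in the training repo.

```python
import unicodedata


def nfkc_normalize(text: str) -> str:
    """Apply Unicode NFKC normalization only.

    Assumption: hypothetical helper for illustration. It covers the NFKC
    step and omits the Devanagari script normalization done by the
    training repo's preprocess_for_tokenizer().
    """
    return unicodedata.normalize("NFKC", text)


# Fullwidth digits are folded to ASCII digits by NFKC.
print(nfkc_normalize("１２３"))

# The precomposed letter QA (U+0958) is rewritten as KA (U+0915) + NUKTA (U+093C),
# so both spellings of the same letter reach the tokenizer as one code point sequence.
print([hex(ord(ch)) for ch in nfkc_normalize("\u0958")])
```

Running the text through such a step before calling the tokenizer keeps visually identical strings from being split into different token sequences.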
For dense retrieval, fine-tune this base checkpoint with the DPR recipe on Hindi mMARCO triples. The training entrypoint is scripts/run_retrieval_finetune.py:
make retrieval-finetune ARGS="retrieval_ft.backbone=kkkamur07/hindi-modernbert retrieval_ft.max_seq_length=8192"
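The card does not spell out the loss used by the DPR recipe. As a hedged sketch, the objective commonly used with query, positive, negative triples is a softmax cross-entropy over dot-product scores. The function names below are hypothetical, the vectors are toy stand-ins for encoder embeddings, and the actual script may add in-batch negatives or other details.

```python
import math


def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def triple_contrastive_loss(query, positive, negative):
    """Negative log-likelihood of the positive passage for one triple.

    Assumption: a generic DPR-style objective, not necessarily the exact
    loss implemented in scripts/run_retrieval_finetune.py.
    """
    s_pos = dot(query, positive)
    s_neg = dot(query, negative)
    # -log(exp(s_pos) / (exp(s_pos) + exp(s_neg))), in a numerically stable form
    m = max(s_pos, s_neg)
    log_denominator = m + math.log(math.exp(s_pos - m) + math.exp(s_neg - m))
    return log_denominator - s_pos


# Toy embeddings: the query matches `good` and is orthogonal to `bad`.
q, good, bad = [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]
loss_correct = triple_contrastive_loss(q, good, bad)   # positive scores higher
loss_swapped = triple_contrastive_loss(q, bad, good)   # negative scores higher
print(loss_correct, loss_swapped)
```

The loss is smaller when the positive passage scores above the negative, which is what pushes the query and positive embeddings together during fine-tuning.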
Evaluation
The NER and intent numbers below come from fine-tuning this base checkpoint on Naamapadam Hindi NER and MASSIVE Hindi intent. Retrieval numbers come from a DPR fine-tuned checkpoint trained on Hindi mMARCO triples and evaluated on mmarco_hindi and MLDR hi.
Phase 1 → phase 2
| Stage | Pretraining | Max seq | NER F1 | MASSIVE Macro-F1 |
|---|---|---|---|---|
| Phase 1 | Sangraha Hindi | 1024 | 0.7963 | 0.3451 |
| hindi-modernBERT | + IndicCorp V2, 8192 context | 8192 | 0.8001 | 0.4731 |
| Δ | | | +0.0038 | +0.1279 |
Baseline comparison
| Model | Max seq | NER F1 | MASSIVE Macro-F1 |
|---|---|---|---|
| mmBERT-small | 8192 | 0.8347 | 0.5462 |
| xlm-roberta-base | 512 | 0.8214 | 0.0743 |
| muril-base-cased | 512 | 0.8148 | 0.0382 |
| IndicBERTv2-MLM-only | 128 | 0.8053 | 0.0821 |
| hindi-modernBERT | 8192 | 0.8001 | 0.4731 |
Retrieval
Retrieval uses a DPR fine-tuned checkpoint built from this base checkpoint.
After DPR fine-tuning on 1.25M mMARCO Hindi triples, hindi-modernBERT scores highest in this comparison on both full mMARCO Hindi and long-document MLDR hi, although its margin over IndicBERTv2-MLM-only on mMARCO is small (0.2825 vs 0.2821).
| Model | Max seq | mMARCO nDCG@10 | MLDR hi nDCG@10 |
|---|---|---|---|
| hindi-modernBERT | 8192 | 0.2825 | 0.2635 |
| mmBERT-small | 8192 | 0.2714 | 0.2337 |
| IndicBERTv2-MLM-only | 512 | 0.2821 | 0.1707 |
The 8192-token backbone is what separates hindi-modernBERT on long-document MLDR: IndicBERTv2 is close on short mMARCO but falls behind on MLDR hi because it cannot encode full long documents. These are DPR numbers, one vector per document.
Full retrieval benchmarks (DPR, 8192 context):
| Benchmark | What it measures | nDCG@10 | Recall@10 | MRR@10 |
|---|---|---|---|---|
| Selection (1k mMARCO Hindi) | Small held-out Hindi mMARCO validation split used to select the DPR learning rate/checkpoint. Not the final headline benchmark. | 0.8191 | 0.8987 | 0.7980 |
| mmarco_hindi (full) | Full Hindi MS MARCO-style passage retrieval benchmark. Tests standard dense retrieval quality on Hindi queries and passages. | 0.2825 | 0.4398 | 0.2368 |
| MLDR hi (8192 ctx) | Hindi long-document retrieval benchmark. Tests whether the 8192-token context helps retrieve long documents. | 0.2635 | 0.3900 | 0.2252 |
Metric glossary: nDCG@10 measures ranking quality in the top 10, Recall@10 measures whether relevant passages appear in the top 10, and MRR@10 measures how high the first relevant result appears.
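To make the glossary concrete, here is a minimal sketch of the three metrics for a single query. It assumes binary relevance and uses hypothetical function names. It is not the evaluation harness behind the reported numbers, and benchmark toolkits may differ in details such as graded relevance.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k ranks (rank 1 is undiscounted)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """DCG@k divided by the DCG of an ideal ranking (binary relevance assumed)."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_ids]
    ideal = [1.0] * min(len(relevant_ids), k)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


# Toy example: one of two relevant documents is retrieved, at rank 2.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d9"}
print(ndcg_at_k(ranked, relevant), recall_at_k(ranked, relevant), mrr_at_k(ranked, relevant))
```

Benchmark scores such as those in the table are these per-query values averaged over all queries.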
Full report: eval_summary_report.md.
Data
Pretraining
| Corpus | Hub source | Tokens | Phase |
|---|---|---|---|
| Sangraha Hindi (verified + unverified + synthetic) | ai4bharat/sangraha | ~23.6B | Phase 1 |
| IndicCorp V2 Hindi (hi-1 + hi-3) | ai4bharat/IndicCorpV2 | ~4.85B | Phase 2 |
BPE tokenizer trained on Sangraha Hindi. MLM eval holdout: one withheld Sangraha shard (174k rows).
Downstream evaluation (Hindi fine-tune)
| Task | Dataset |
|---|---|
| NER | Naamapadam Hindi (hi) |
| Intent | MASSIVE Hindi (hi-IN) |
Transfer evaluation (source train → Hindi eval)
| Task | Train (source) | Eval (Hindi target) |
|---|---|---|
| Sentiment | amazon_reviews_multi (en) | IndicSentiment (translation-hi) |
| QA | XTREME SQuAD (SQuAD) | IndicQA (indicqa.hi) |
| COPA | Social IQa | IndicCOPA (translation-hi) |
Retrieval fine-tuning + evaluation
| Role | Dataset | Notes |
|---|---|---|
| DPR training | unicamp-dl/mmarco (hindi) | 1.25M query–positive–negative triples |
| DPR selection | mmarco_hindi | 1k-query held-out subset |
| Retrieval eval | mmarco_hindi (full) | Hindi MS MARCO-style passage retrieval |
| Retrieval eval | MLDR (language=hi) | Long-document retrieval benchmark |
Limitations
- Hindi-only pretraining: other Indic languages are not in scope for this release.
- Tokenizer preprocessing: Devanagari script normalization is required for best results and is not applied automatically by AutoTokenizer.
- Not a retriever by itself: dense retrieval requires DPR fine-tuning on top of this base checkpoint.
- Single-seed evals: downstream and retrieval benchmarks use one fine-tuning seed; multi-seed averaging may shift scores slightly.
- Phase-1 holdout: the MLM eval holdout was Sangraha-only; future runs may mix holdout sources for a more representative evaluation signal.
Acknowledgments
- AI4Bharat: Sangraha, IndicCorp V2, Naamapadam, IndicSentiment, IndicQA, IndicCOPA
- Amazon Science: MASSIVE Hindi intent
- Transfer eval sources: amazon_reviews_multi, XTREME SQuAD, Social IQa
- Unicamp DL: mMARCO Hindi DPR training triples
- AIhnIndicRag: mmarco_hindi retrieval benchmark
- BGE / MLDR: MLDR long-document retrieval benchmark
- Answer.AI / ModernBERT team: architecture (arxiv:2412.13663)
Citation
@misc{hindi-modernbert2026,
title = {hindi-modernBERT: A Hindi ModernBERT Encoder with 8192 Context},
author = {Krrish Agarwalla},
year = {2026},
note = {Checkpoint ba1157. Base MLM; trained from scratch on Hindi.}
}
@article{modernbert2024,
title = {Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
author = {Warner, Benjamin and Chaffin, Antoine and Clavi{\'e}, Benjamin and others},
journal = {arXiv preprint arXiv:2412.13663},
year = {2024}
}
License
Apache 2.0.