hindi-modernBERT

TL;DR: hindi-modernBERT is a pretrained base MLM checkpoint: the ModernBERT architecture trained from scratch on Hindi text. After task fine-tuning it is competitive with multilingual and Indic baselines on NER and intent classification, and after DPR fine-tuning it scores highest among the compared models on retrieval. Checkpoint ba1157 · 8192 context · Hindi BPE vocab 50,368.

This release uses the ModernBERT architecture and training recipe, adapted for Hindi with a new tokenizer and ~28B tokens of Hindi pretraining.

Model summary


Type	Base MLM checkpoint (pretrained, not task fine-tuned)
Architecture	ModernBERT (`ModernBertForMaskedLM`)
Initialization	Megatron init (`full_megatron`); pretrained from scratch on Hindi
Parameters	~188M
Layers	22
Hidden size	768
Attention heads	12
Vocab size	50,368
Max sequence length	8192
Languages	Hindi (`hi`)
Pretraining tokens	~23.6B (Sangraha) + ~4.85B (IndicCorp V2)
Hardware	1× NVIDIA RTX 4090 (24 GB)
Training time	5 days
Transformers	`>=5.12.0`

Eval summary

hindi-modernBERT is competitive on supervised Hindi understanding tasks and is the strongest model in this comparison on retrieval after DPR fine-tuning.

Area	Benchmark	Score
NER	Naamapadam Hindi F1	0.8001
Intent	MASSIVE Hindi Macro-F1	0.4731
Retrieval	mMARCO Hindi nDCG@10	0.2825
Retrieval	MLDR hi nDCG@10	0.2635

Checkpoint folders

The repository root contains the main ba1157 release. Earlier checkpoints are also available in subfolders:

Hub path	What it contains
`.`	Main release, ba1157, 8192 context
`checkpoints/phase1`	Phase-1 Sangraha checkpoint, 1024 context
`checkpoints/phase2_ba135`	Phase-2 lowest MLM-loss checkpoint, 8192 context

Usage

Main release:

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "kkkamur07/hindi-modernbert"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

text = "भारत [MASK] विशाल देश है।"
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits

mask_idx = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
print(tokenizer.decode([logits[0, mask_idx].argmax(dim=-1).item()]))

Checkpoint subfolder:

from transformers import AutoModelForMaskedLM, AutoTokenizer

repo_id = "kkkamur07/hindi-modernbert"
subfolder = "checkpoints/phase1"

tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
model = AutoModelForMaskedLM.from_pretrained(repo_id, subfolder=subfolder)

For best results, apply Devanagari script normalization and Unicode NFKC normalization to the text before tokenization; AutoTokenizer does not do this automatically. See `preprocess_for_tokenizer()` in the training repo.
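
As a rough illustration of this preprocessing step, the sketch below applies NFKC with Python's standard `unicodedata` module. The function name `normalize_hindi_text` and the optional `script_normalizer` hook are hypothetical and are not the training repo's `preprocess_for_tokenizer()`; running the script-specific step before NFKC is also an assumption. Use the repo's function when you need to match pretraining exactly.

```python
import unicodedata


def normalize_hindi_text(text, script_normalizer=None):
    """Hypothetical stand-in for the repo's preprocess_for_tokenizer().

    script_normalizer: optional callable (str -> str) for a Devanagari-specific
    normalization step; which rules the training repo applies is not shown here.
    """
    if script_normalizer is not None:
        # Assumption: the script-specific step runs before NFKC.
        text = script_normalizer(text)
    # NFKC folds compatibility characters (e.g. full-width digits) into
    # their standard forms and applies canonical normalization.
    return unicodedata.normalize("NFKC", text)


# U+0958 (precomposed QA) is a composition exclusion, so NFKC rewrites it
# as U+0915 (KA) + U+093C (NUKTA).
sample = "\u0958\u093f\u0932\u093e"
normalized = normalize_hindi_text(sample)
print([hex(ord(ch)) for ch in normalized])
```

The normalized string can then be passed to the tokenizer, for example `tokenizer(normalize_hindi_text(text), return_tensors="pt")`.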

For dense retrieval, fine-tune this base checkpoint with the DPR recipe on Hindi mMARCO triples. The training entrypoint is scripts/run_retrieval_finetune.py:

make retrieval-finetune ARGS="retrieval_ft.backbone=kkkamur07/hindi-modernbert retrieval_ft.max_seq_length=8192"

Evaluation

The NER and intent numbers below come from fine-tuning this base checkpoint on Naamapadam Hindi NER and MASSIVE Hindi intent. The retrieval numbers come from a DPR fine-tuned checkpoint trained on Hindi mMARCO triples and evaluated on mmarco_hindi and MLDR hi.

Phase 1 → phase 2

Stage	Pretraining	Max seq	NER F1	MASSIVE Macro-F1
Phase 1	Sangraha Hindi	1024	0.7963	0.3451
hindi-modernBERT	+ IndicCorp V2, 8192 context	8192	0.8001	0.4731
Δ			+0.0038	+0.1279

Baseline comparison

Model	Max seq	NER F1	MASSIVE Macro-F1
mmBERT-small	8192	0.8347	0.5462
xlm-roberta-base	512	0.8214	0.0743
muril-base-cased	512	0.8148	0.0382
IndicBERTv2-MLM-only	128	0.8053	0.0821
hindi-modernBERT	8192	0.8001	0.4731

Retrieval

Retrieval uses a DPR fine-tuned checkpoint built from this base checkpoint.

After DPR fine-tuning on 1.25M mMARCO Hindi triples, hindi-modernBERT scores highest among the compared models on both full mMARCO Hindi and long-document MLDR hi.

Model	Max seq	mMARCO nDCG@10	MLDR hi nDCG@10
hindi-modernBERT	8192	0.2825	0.2635
mmBERT-small	8192	0.2714	0.2337
IndicBERTv2-MLM-only	512	0.2821	0.1707

The 8192-token backbone is what separates hindi-modernBERT on long-document MLDR: IndicBERTv2 is close on short mMARCO but falls behind on MLDR hi because it cannot encode full long documents. These are DPR numbers, one vector per document.
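
To make "one vector per document" concrete, here is a minimal sketch of single-vector dense retrieval: each query and each document is represented by one embedding, and documents are ranked by dot-product similarity. The toy 3-dimensional vectors and the helper names are illustrative assumptions; real embeddings would come from the DPR fine-tuned encoder, and the similarity function used in this project's recipe is not stated here.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_documents(query_vec, doc_vecs, top_k=10):
    """Return (doc_id, score) pairs sorted by descending dot-product score."""
    scored = [(doc_id, dot(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings: one vector per document (hypothetical values).
doc_vecs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.8, 0.1],
    "doc_c": [0.4, 0.4, 0.2],
}
query_vec = [1.0, 0.0, 0.0]

ranking = rank_documents(query_vec, doc_vecs, top_k=2)
print(ranking)
```

Because a long document is compressed into a single vector, any text beyond the encoder's maximum sequence length never influences that vector, which is why a short context window hurts on MLDR.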

Full retrieval benchmarks (DPR, 8192 context):

Benchmark	What it measures	nDCG@10	Recall@10	MRR@10
Selection (1k mMARCO Hindi)	Small held-out Hindi mMARCO validation split used to select the DPR learning rate/checkpoint. Not the final headline benchmark.	0.8191	0.8987	0.7980
mmarco_hindi (full)	Full Hindi MS MARCO-style passage retrieval benchmark. Tests standard dense retrieval quality on Hindi queries and passages.	0.2825	0.4398	0.2368
MLDR hi (8192 ctx)	Hindi long-document retrieval benchmark. Tests whether the 8192-token context helps retrieve long documents.	0.2635	0.3900	0.2252

Metric glossary: nDCG@10 measures ranking quality in the top 10, Recall@10 measures whether relevant passages appear in the top 10, and MRR@10 measures how high the first relevant result appears.
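
The three metrics can be sketched in a few lines for binary relevance labels. This is an illustrative implementation, not the evaluation code behind the numbers above; the function names and the toy ranking are assumptions.

```python
import math


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


# Toy query: relevant documents sit at ranks 2 and 4.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(ndcg_at_k(ranked, relevant), recall_at_k(ranked, relevant), mrr_at_k(ranked, relevant))
```

Benchmark scores are the mean of these per-query values over all queries.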

Full report: eval_summary_report.md.

Data

Pretraining

Corpus	Hub source	Tokens	Phase
Sangraha Hindi (verified + unverified + synthetic)	`ai4bharat/sangraha`	~23.6B	Phase 1
IndicCorp V2 Hindi (`hi-1` + `hi-3`)	`ai4bharat/IndicCorpV2`	~4.85B	Phase 2

BPE tokenizer trained on Sangraha Hindi. MLM eval holdout: one withheld Sangraha shard (174k rows).

Downstream evaluation (Hindi fine-tune)

Task	Dataset
NER	Naamapadam Hindi (`hi`)
Intent	MASSIVE Hindi (`hi-IN`)

Transfer evaluation (source train → Hindi eval)

Task	Train (source)	Eval (Hindi target)
Sentiment	amazon_reviews_multi (`en`)	IndicSentiment (`translation-hi`)
QA	XTREME SQuAD (`SQuAD`)	IndicQA (`indicqa.hi`)
COPA	Social IQa	IndicCOPA (`translation-hi`)

Retrieval fine-tuning + evaluation

Role	Dataset	Notes
DPR training	unicamp-dl/mmarco (`hindi`)	1.25M query–positive–negative triples
DPR selection	mmarco_hindi	1k-query held-out subset
Retrieval eval	mmarco_hindi (full)	Hindi MS MARCO-style passage retrieval
Retrieval eval	MLDR (`language=hi`)	Long-document retrieval benchmark
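
For orientation, a common way to train on (query, positive, negative) triples is a softmax contrastive loss that pushes the positive passage's score above the negative's. The sketch below shows that loss for a single triple. It is an assumption about a typical DPR-style objective, not a description of the loss in `scripts/run_retrieval_finetune.py`, and `triple_contrastive_loss` is a hypothetical name.

```python
import math


def triple_contrastive_loss(score_pos, score_neg):
    """Negative log-probability of the positive under a softmax over
    {positive, negative} similarity scores for one query."""
    # Stable log-sum-exp: subtract the max before exponentiating.
    m = max(score_pos, score_neg)
    log_z = m + math.log(math.exp(score_pos - m) + math.exp(score_neg - m))
    return log_z - score_pos


# Scores would be query-passage similarities from the encoder (toy values here).
print(triple_contrastive_loss(2.0, 0.0))  # positive well ahead: small loss
print(triple_contrastive_loss(0.0, 2.0))  # negative ahead: large loss
```

The loss equals log 2 when both passages score the same and shrinks toward zero as the positive pulls ahead.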

Limitations

Hindi-only pretraining: other Indic languages are not in scope for this release.
Tokenizer preprocessing: Devanagari script normalization is required for best results and is not applied automatically by AutoTokenizer.
Not a retriever by itself: dense retrieval requires DPR fine-tuning on top of this base checkpoint.
Single-seed evals: downstream and retrieval benchmarks use one fine-tuning seed; multi-seed averaging may shift scores slightly.
Phase-1 holdout: the MLM eval holdout was Sangraha-only; future runs may mix holdout sources for a more representative evaluation signal.

Acknowledgments

AI4Bharat: Sangraha, IndicCorp V2, Naamapadam, IndicSentiment, IndicQA, IndicCOPA
Amazon Science: MASSIVE Hindi intent
Transfer eval sources: amazon_reviews_multi, XTREME SQuAD, Social IQa
Unicamp DL: mMARCO Hindi DPR training triples
AIhnIndicRag: mmarco_hindi retrieval benchmark
BGE / MLDR: MLDR long-document retrieval benchmark
Answer.AI / ModernBERT team: architecture (arxiv:2412.13663)

Citation

@misc{hindi-modernbert2026,
  title  = {hindi-modernBERT: A Hindi ModernBERT Encoder with 8192 Context},
  author = {Krrish Agarwalla},
  year   = {2026},
  note   = {Checkpoint ba1157. Base MLM; trained from scratch on Hindi.}
}

@article{modernbert2024,
  title  = {Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
  author = {Warner, Benjamin and Chizhov, Anton and Ermolaev, Alexander and others},
  journal = {arXiv preprint arXiv:2412.13663},
  year   = {2024}
}

License

Apache 2.0.
