hindi-modernBERT

TL;DR: This is a pretrained base MLM checkpoint: a Hindi extension of the ModernBERT architecture, trained from scratch on Hindi text. Fine-tuned on Hindi NER (Naamapadam) and intent classification (MASSIVE), it is competitive with the baselines compared below; after DPR fine-tuning it scores highest among them on retrieval (mMARCO Hindi and MLDR hi). Checkpoint ba1157 · 8192 context · Hindi BPE vocab 50,368.

This release uses the ModernBERT architecture and training recipe, adapted for Hindi with a new tokenizer and ~28B tokens of Hindi pretraining.

Model summary

| Field | Value |
|---|---|
| Type | Base MLM checkpoint (pretrained, not task fine-tuned) |
| Architecture | ModernBERT (ModernBertForMaskedLM) |
| Initialization | Megatron init (full_megatron); pretrained from scratch on Hindi |
| Parameters | ~188M |
| Layers | 22 |
| Hidden size | 768 |
| Attention heads | 12 |
| Vocab size | 50,368 |
| Max sequence length | 8192 |
| Languages | Hindi (hi) |
| Pretraining tokens | ~23.6B (Sangraha) + ~4.85B (IndicCorp V2) |
| Hardware | 1× NVIDIA RTX 4090 (24 GB) |
| Training time | 5 days |
| Transformers | >=5.12.0 |

Eval summary

hindi-modernBERT is competitive on supervised Hindi understanding tasks and is the strongest model in this comparison on retrieval after DPR fine-tuning.

| Area | Benchmark | Score |
|---|---|---|
| NER | Naamapadam Hindi F1 | 0.8001 |
| Intent | MASSIVE Hindi Macro-F1 | 0.4731 |
| Retrieval | mMARCO Hindi nDCG@10 | 0.2825 |
| Retrieval | MLDR hi nDCG@10 | 0.2635 |

Checkpoint folders

The repository root contains the main ba1157 release. Earlier checkpoints are also available in subfolders:

| Hub path | What it contains |
|---|---|
| . | Main release, ba1157, 8192 context |
| checkpoints/phase1 | Phase-1 Sangraha checkpoint, 1024 context |
| checkpoints/phase2_ba135 | Phase-2 lowest MLM-loss checkpoint, 8192 context |

Usage

Main release:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "kkkamur07/hindi-modernbert"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires the accelerate package
)

# Example sentence with exactly one [MASK] token.
text = "भारत [MASK] विशाल देश है।"
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits

# Position of the [MASK] token in the first (and only) sequence.
mask_idx = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
# Decode the highest-scoring vocabulary entry at that position.
print(tokenizer.decode([logits[0, mask_idx].argmax(dim=-1).item()]))
```

Checkpoint subfolder:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

repo_id = "kkkamur07/hindi-modernbert"
subfolder = "checkpoints/phase1"  # any path from the checkpoint folders table

tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
model = AutoModelForMaskedLM.from_pretrained(repo_id, subfolder=subfolder)
```

Apply Devanagari script normalization and Unicode NFKC normalization before tokenization for best results. See the training repo for preprocess_for_tokenizer().
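As a rough illustration of the NFKC step only, the sketch below uses Python's standard unicodedata module. The helper name nfkc_normalize is hypothetical; it is not the training repo's preprocess_for_tokenizer(), which may apply additional Devanagari-specific rules that are not reproduced here.

```python
import unicodedata


def nfkc_normalize(text: str) -> str:
    """Hypothetical helper: apply Unicode NFKC normalization only.

    This is NOT the repo's preprocess_for_tokenizer(); it covers just the
    NFKC step and no Devanagari-specific script normalization.
    """
    return unicodedata.normalize("NFKC", text)


# The same word spelled with the precomposed letter U+0958 and with the
# base letter U+0915 followed by the nukta sign U+093C.
precomposed = "\u0958\u0932\u092e"
decomposed = "\u0915\u093c\u0932\u092e"

# After NFKC both spellings share one code point sequence, so the
# tokenizer sees identical input for visually identical text.
print(nfkc_normalize(precomposed) == nfkc_normalize(decomposed))
```

For real inputs, prefer the repo's preprocess_for_tokenizer() so that inference-time text matches what the tokenizer saw during training.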

For dense retrieval, fine-tune this base checkpoint with the DPR recipe on Hindi mMARCO triples. The training entrypoint is scripts/run_retrieval_finetune.py:

```shell
make retrieval-finetune ARGS="retrieval_ft.backbone=kkkamur07/hindi-modernbert retrieval_ft.max_seq_length=8192"
```

Evaluation

The NER and intent numbers below come from fine-tuning this base checkpoint on Naamapadam Hindi NER and MASSIVE Hindi intent. Retrieval numbers use a DPR fine-tuned checkpoint trained on Hindi mMARCO triples and evaluated on mmarco_hindi and MLDR hi.

Phase 1 → phase 2

| Stage | Pretraining | Max seq | NER F1 | MASSIVE Macro-F1 |
|---|---|---|---|---|
| Phase 1 | Sangraha Hindi | 1024 | 0.7963 | 0.3451 |
| hindi-modernBERT | + IndicCorp V2, 8192 context | 8192 | 0.8001 | 0.4731 |
| Δ | | | +0.0038 | +0.1279 |

Baseline comparison

| Model | Max seq | NER F1 | MASSIVE Macro-F1 |
|---|---|---|---|
| mmBERT-small | 8192 | 0.8347 | 0.5462 |
| xlm-roberta-base | 512 | 0.8214 | 0.0743 |
| muril-base-cased | 512 | 0.8148 | 0.0382 |
| IndicBERTv2-MLM-only | 128 | 0.8053 | 0.0821 |
| hindi-modernBERT | 8192 | 0.8001 | 0.4731 |

Retrieval

Retrieval uses a DPR fine-tuned checkpoint built from this base checkpoint.

After DPR fine-tuning on 1.25M mMARCO Hindi triples, hindi-modernBERT scores highest among the compared models on both full mMARCO Hindi and long-document MLDR hi.

| Model | Max seq | mMARCO nDCG@10 | MLDR hi nDCG@10 |
|---|---|---|---|
| hindi-modernBERT | 8192 | 0.2825 | 0.2635 |
| mmBERT-small | 8192 | 0.2714 | 0.2337 |
| IndicBERTv2-MLM-only | 512 | 0.2821 | 0.1707 |

The 8192-token backbone is what separates hindi-modernBERT on long-document MLDR: IndicBERTv2 is close on short mMARCO but falls behind on MLDR hi because it cannot encode full long documents. These are DPR numbers, one vector per document.
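To make "one vector per document" concrete, here is a minimal pure-Python sketch of single-vector ranking. It assumes dot-product similarity, as in the original DPR formulation; the function names and the toy vectors are illustrative and are not taken from the training repo, which may score differently (for example with cosine similarity).

```python
from typing import Dict, List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_documents(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    k: int = 10,
) -> List[Tuple[str, float]]:
    """Score each document by its single vector and return the top k.

    Assumption: similarity is a plain dot product between the query
    vector and the document vector.
    """
    scored = [(doc_id, dot(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional embeddings; a real encoder would emit 768-dimensional ones.
query = [1.0, 0.0]
docs = {
    "d1": [0.9, 0.1],
    "d2": [0.2, 0.8],
    "d3": [0.5, 0.5],
}
top = rank_documents(query, docs, k=2)
print([doc_id for doc_id, _ in top])
```

Because each document is compressed into a single vector, a backbone that truncates long inputs can only embed the part of the document it sees, which is consistent with the MLDR hi gap described above.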

Full retrieval benchmarks (DPR, 8192 context):

| Benchmark | What it measures | nDCG@10 | Recall@10 | MRR@10 |
|---|---|---|---|---|
| Selection (1k mMARCO Hindi) | Small held-out Hindi mMARCO validation split used to select the DPR learning rate/checkpoint. Not the final headline benchmark. | 0.8191 | 0.8987 | 0.7980 |
| mmarco_hindi (full) | Full Hindi MS MARCO-style passage retrieval benchmark. Tests standard dense retrieval quality on Hindi queries and passages. | 0.2825 | 0.4398 | 0.2368 |
| MLDR hi (8192 ctx) | Hindi long-document retrieval benchmark. Tests whether the 8192-token context helps retrieve long documents. | 0.2635 | 0.3900 | 0.2252 |

Metric glossary: nDCG@10 measures ranking quality in the top 10, Recall@10 measures whether relevant passages appear in the top 10, and MRR@10 measures how high the first relevant result appears.
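The three metrics can be sketched for a single query as follows. This assumes binary relevance labels and defines Recall@10 as the fraction of relevant passages found in the top 10; the evaluation code behind the reported numbers may use different conventions, and the function names here are illustrative.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the relevant ids that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def mrr_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Reciprocal rank of the first relevant result in the top k, else 0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """nDCG with binary gains: DCG of this ranking over the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# One query with two relevant passages, retrieved at ranks 2 and 4.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(ranked, relevant), mrr_at_k(ranked, relevant), ndcg_at_k(ranked, relevant))
```

Benchmark scores such as those in the table are typically the mean of these per-query values over all queries.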

Full report: eval_summary_report.md.

Data

Pretraining

| Corpus | Hub source | Tokens | Phase |
|---|---|---|---|
| Sangraha Hindi (verified + unverified + synthetic) | ai4bharat/sangraha | ~23.6B | Phase 1 |
| IndicCorp V2 Hindi (hi-1 + hi-3) | ai4bharat/IndicCorpV2 | ~4.85B | Phase 2 |

BPE tokenizer trained on Sangraha Hindi. MLM eval holdout: one withheld Sangraha shard (174k rows).

Downstream evaluation (Hindi fine-tune)

| Task | Dataset |
|---|---|
| NER | Naamapadam Hindi (hi) |
| Intent | MASSIVE Hindi (hi-IN) |

Transfer evaluation (source train → Hindi eval)

| Task | Train (source) | Eval (Hindi target) |
|---|---|---|
| Sentiment | amazon_reviews_multi (en) | IndicSentiment (translation-hi) |
| QA | XTREME SQuAD (SQuAD) | IndicQA (indicqa.hi) |
| COPA | Social IQa | IndicCOPA (translation-hi) |

Retrieval fine-tuning + evaluation

| Role | Dataset | Notes |
|---|---|---|
| DPR training | unicamp-dl/mmarco (hindi) | 1.25M query–positive–negative triples |
| DPR selection | mmarco_hindi | 1k-query held-out subset |
| Retrieval eval | mmarco_hindi (full) | Hindi MS MARCO-style passage retrieval |
| Retrieval eval | MLDR (language=hi) | Long-document retrieval benchmark |
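Each DPR training triple pairs a query with one positive and one negative passage. The sketch below shows the standard DPR-style objective on a single triple: the negative log-likelihood of the positive under a softmax over dot-product scores. It is an assumption that the training repo uses this form; the actual recipe in scripts/run_retrieval_finetune.py may add in-batch negatives, temperature scaling, or a different similarity, and the names below are illustrative.

```python
import math
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def triple_nll(
    query_vec: Sequence[float],
    pos_vec: Sequence[float],
    neg_vec: Sequence[float],
) -> float:
    """Negative log-likelihood of the positive passage for one triple.

    Assumption: scores are dot products and the softmax runs over just
    the positive and the single negative (no in-batch negatives).
    """
    s_pos = dot(query_vec, pos_vec)
    s_neg = dot(query_vec, neg_vec)
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(s_pos, s_neg)
    log_z = m + math.log(math.exp(s_pos - m) + math.exp(s_neg - m))
    return log_z - s_pos


# The loss shrinks as the positive outscores the negative.
q = [1.0, 0.0]
print(triple_nll(q, [1.0, 0.0], [1.0, 0.0]))   # equal scores
print(triple_nll(q, [5.0, 0.0], [-5.0, 0.0]))  # positive clearly ahead
```

With equal scores the model is indifferent between the two passages; training pushes the positive score above the negative so that the loss approaches zero.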

Limitations

  • Hindi-only pretraining: other Indic languages are not in scope for this release.
  • Tokenizer preprocessing: Devanagari script normalization is required for best results and is not applied automatically by AutoTokenizer.
  • Not a retriever by itself: dense retrieval requires DPR fine-tuning on top of this base checkpoint.
  • Single-seed evals: downstream and retrieval benchmarks use one fine-tuning seed; multi-seed averaging may shift scores slightly.
  • Phase-1 holdout: MLM eval holdout was Sangraha-only; future runs may mix holdout sources for a more representative evaluation signal.

Citation

@misc{hindi-modernbert2026,
  title  = {hindi-modernBERT: A Hindi ModernBERT Encoder with 8192 Context},
  author = {Krrish Agarwalla},
  year   = {2026},
  note   = {Checkpoint ba1157. Base MLM; trained from scratch on Hindi.}
}
@article{modernbert2024,
  title  = {Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
  author = {Warner, Benjamin and Chaffin, Antoine and Clavi{\'e}, Benjamin and Weller, Orion and others},
  journal = {arXiv preprint arXiv:2412.13663},
  year   = {2024}
}

License

Apache 2.0.
