MahaPOS-BERT: Marathi POS Tagging Model

Model Description

MahaPOS-BERT is a fine-tuned version of l3cube-pune/marathi-bert-v2 for Part-of-Speech (POS) tagging in Marathi. It is trained on the L3Cube-MahaPOS dataset, one of the first large-scale, manually annotated POS tagging datasets for Marathi, which comprises 32,354 sentences drawn from Marathi news text.

This model is part of the L3Cube-MahaNLP project.
For more details, refer to our MahaPOS paper.


Model Details

| Property | Value |
|---|---|
| Base Model | l3cube-pune/marathi-bert-v2 |
| Model Type | BERT (BertForTokenClassification) |
| Task | Token Classification (POS Tagging) |
| Language | Marathi (mr) |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Hidden Layers | 12 |
| Max Sequence Length | 512 |
| Vocab Size | 197,285 |
| Number of Labels | 16 |
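
The figures in this table are enough for a rough parameter count. The sketch below assumes a standard BERT-base layout: the intermediate (feed-forward) size of 3072 and the two token-type embeddings are assumptions not stated in this card, as is the absence of a pooler layer in the token-classification head. It also assumes float32 storage at 4 bytes per weight.

```python
# Values from the Model Details table.
VOCAB, HIDDEN, LAYERS, MAX_POS, NUM_LABELS = 197_285, 768, 12, 512, 16
# Assumptions (not stated in this card): standard BERT-base settings.
INTERMEDIATE, TYPE_VOCAB = 3072, 2

def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out

layer_norm = 2 * HIDDEN  # scale and shift vectors

# Word, position and token-type embeddings, followed by one LayerNorm.
embeddings = (VOCAB + MAX_POS + TYPE_VOCAB) * HIDDEN + layer_norm
encoder_layer = (
    4 * linear(HIDDEN, HIDDEN)        # query, key, value, attention output
    + layer_norm
    + linear(HIDDEN, INTERMEDIATE)    # feed-forward up-projection
    + linear(INTERMEDIATE, HIDDEN)    # feed-forward down-projection
    + layer_norm
)
classifier = linear(HIDDEN, NUM_LABELS)

total_params = embeddings + LAYERS * encoder_layer + classifier
size_mib = total_params * 4 / 2**20   # 4 bytes per float32 weight

print(f"{total_params:,} parameters, about {size_mib:.0f} MiB in float32")
```

Under these assumptions the estimate comes to roughly 237M parameters and about 904 MiB, in line with the size listed for model.safetensors under Model Files. Most of the weights sit in the word-embedding matrix because of the large vocabulary.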

Label Set

The model uses a 16-tag set aligned with the Universal Dependencies (UD) v2 framework:

| ID | Tag | Description |
|---|---|---|
| 0 | ADJ | Adjective |
| 1 | ADP | Adposition |
| 2 | ADV | Adverb |
| 3 | AUX | Auxiliary verb |
| 4 | CCONJ | Coordinating conjunction |
| 5 | DET | Determiner |
| 6 | INTJ | Interjection |
| 7 | NOUN | Common noun |
| 8 | NUM | Numeral |
| 9 | PART | Particle |
| 10 | POSTP | Postposition (low support; present in training, excluded from primary evaluation) |
| 11 | PRON | Pronoun |
| 12 | PROPN | Proper noun |
| 13 | PUNCT | Punctuation |
| 14 | SCONJ | Subordinating conjunction |
| 15 | VERB | Main verb |

Note on POSTP: The POSTP tag has critically low corpus frequency (23 tokens total; 3 in the test set). It is retained in training but excluded from the primary macro-F1 evaluation metric. The X (foreign/unclassifiable) tag was removed entirely from training.


Performance

| Metric | Value |
|---|---|
| Token-level Accuracy | 88.67% |
| Macro-F1 (15 tags, primary) | 81.67% |
| Macro-F1 (16 tags, incl. POSTP) | 76.57% |

Per-Tag F1 Scores (Test Set)

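| Tag | Precision (%) | Recall (%) | F1 (%) | Support (tokens) |
|---|---|---|---|---|
| PUNCT | 96.96 | 97.96 | 97.45 | 7,679 |
| AUX | 94.73 | 95.97 | 95.35 | 3,503 |
| VERB | 93.65 | 92.43 | 93.04 | 8,816 |
| ADP | 91.87 | 93.57 | 92.71 | 6,051 |
| NOUN | 91.30 | 91.92 | 91.61 | 21,910 |
| NUM | 87.19 | 89.93 | 88.54 | 2,482 |
| CCONJ | 87.56 | 86.88 | 87.22 | 1,555 |
| PRON | 84.65 | 84.34 | 84.49 | 2,216 |
| DET | 83.33 | 83.84 | 83.59 | 2,141 |
| SCONJ | 80.00 | 81.91 | 80.94 | 503 |
| ADV | 75.76 | 76.09 | 75.93 | 4,203 |
| PART | 77.74 | 74.15 | 75.91 | 650 |
| ADJ | 77.68 | 71.27 | 74.34 | 5,426 |
| PROPN | 56.96 | 62.32 | 59.52 | 1,510 |
| INTJ | 48.22 | 41.30 | 44.50 | 230 |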
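
Macro-F1 is the unweighted mean of the per-tag F1 scores, so both headline figures can be approximately reproduced from the table. The sketch below works from the rounded per-tag values, so it matches the reported numbers only to within rounding. The POSTP F1 of 0.0 is an assumption inferred from the gap between the two macro figures; this card does not report it.

```python
# Per-tag F1 scores (percent) copied from the per-tag table.
f1_by_tag = {
    "PUNCT": 97.45, "AUX": 95.35, "VERB": 93.04, "ADP": 92.71, "NOUN": 91.61,
    "NUM": 88.54, "CCONJ": 87.22, "PRON": 84.49, "DET": 83.59, "SCONJ": 80.94,
    "ADV": 75.93, "PART": 75.91, "ADJ": 74.34, "PROPN": 59.52, "INTJ": 44.50,
}

def macro_f1(scores):
    """Unweighted mean of per-tag F1 scores."""
    return sum(scores.values()) / len(scores)

# Primary metric: 15 tags, POSTP excluded.
macro_15 = macro_f1(f1_by_tag)

# Assumption: POSTP F1 is 0.0 (inferred, not reported in this card).
macro_16 = macro_f1({**f1_by_tag, "POSTP": 0.0})

print(f"Macro-F1 (15 tags): {macro_15:.2f}")
print(f"Macro-F1 (16 tags): {macro_16:.2f}")
```

Because every tag carries equal weight, a single tag with three test tokens moves the 16-tag average by about five points, which is the reason for reporting the 15-tag figure as primary.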

Training Details

| Hyperparameter | Value |
|---|---|
| Learning Rate | 5e-5 |
| Batch Size | 16 |
| Max Epochs | 10 |
| Early Stopping Patience | 3 |
| Best Epoch | 3 |
| Warmup Ratio | 0.10 |
| Weight Decay | 0.01 |
| LR Schedule | Cosine with warmup |
| Hardware | NVIDIA Tesla T4 (15.6 GB) |
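
To illustrate what "cosine with warmup" means for these settings, here is a minimal sketch of the schedule shape: a linear ramp to the peak learning rate over the first 10% of steps, then a half-cosine decay to zero. The step counts are assumptions derived from the training-set size (22,652 sentences) and batch size, with one optimizer step per batch and no gradient accumulation; this card does not state the actual number of steps, and the function name is illustrative.

```python
import math

PEAK_LR = 5e-5
WARMUP_RATIO = 0.10

def cosine_lr_with_warmup(step, total_steps, peak_lr=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Linear warmup to peak_lr, then half-cosine decay to zero."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Assumption: one optimizer step per batch of 16, no gradient accumulation.
steps_per_epoch = math.ceil(22_652 / 16)   # training sentences / batch size
total_steps = steps_per_epoch * 10         # scheduled against max epochs

for step in (0, total_steps // 10, total_steps // 2, total_steps):
    print(step, f"{cosine_lr_with_warmup(step, total_steps):.2e}")
```

With early stopping, a run can end well before the schedule reaches zero, so the learning rate at the best epoch may still be close to its peak.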

Subword-to-token alignment uses the first subword strategy: only the first subword token of each word is used for prediction; continuations receive the ignore label (-100).
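
The first-subword strategy described above can be sketched as follows. This is an illustration rather than the project's training code: it assumes word indices in the form returned by a fast tokenizer's `word_ids()` method (one entry per subword, `None` for special tokens), and the function name is hypothetical.

```python
IGNORE_INDEX = -100  # label skipped by PyTorch's cross-entropy loss by default

def align_labels_first_subword(word_ids, word_labels):
    """Map word-level label IDs onto subword positions.

    word_ids: one entry per subword; the index of the word it came from,
              or None for special tokens such as [CLS] and [SEP].
    word_labels: one label ID per word.
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special token
        elif word_id != previous_word:
            aligned.append(word_labels[word_id])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)          # continuation subword
        previous_word = word_id
    return aligned

# Hypothetical example: 3 words, the second split into 3 subwords.
word_ids = [None, 0, 1, 1, 1, 2, None]
word_labels = [12, 7, 15]  # PROPN, NOUN, VERB in the label set
print(align_labels_first_subword(word_ids, word_labels))
# [-100, 12, 7, -100, -100, 15, -100]
```

Each word contributes exactly one term to the loss regardless of how many pieces the tokenizer splits it into, so long or rare words are not over-weighted.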


Dataset: L3Cube-MahaPOS

| Split | Sentences | Tokens | Avg. Length (tokens/sentence) |
|---|---|---|---|
| Train | 22,652 | 332,418 | 14.7 |
| Validation | 4,848 | 71,163 | 14.7 |
| Test | 4,854 | 68,878 | 14.2 |
| Total | 32,354 | 472,459 | 14.6 |
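
As a quick consistency check on the table, the short sketch below recomputes the totals, the average sentence lengths, and each split's share of sentences. It uses only the counts listed above; the split percentages are derived here and are not stated elsewhere in this card.

```python
# (sentences, tokens) per split, copied from the dataset table.
splits = {
    "Train": (22_652, 332_418),
    "Validation": (4_848, 71_163),
    "Test": (4_854, 68_878),
}
total_sentences = sum(s for s, _ in splits.values())
total_tokens = sum(t for _, t in splits.values())

for name, (sentences, tokens) in splits.items():
    share = 100 * sentences / total_sentences
    print(f"{name:<11} {share:5.1f}% of sentences, {tokens / sentences:.1f} tokens/sentence")
print(f"Total       {total_sentences} sentences, {total_tokens} tokens")
```

The counts correspond to roughly a 70/15/15 train/validation/test split by sentence.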

The dataset was manually annotated by a team of Marathi-proficient annotators from PICT, Pune. Raw text was sourced from Marathi news portals covering politics, sports, culture, technology, and local affairs.


Usage

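from transformers import pipeline

# aggregation_strategy="first" tags each word with its first subword's label,
# which mirrors the first-subword alignment used in training.
pipe = pipeline(
    "token-classification",
    model="l3cube-pune/marathi-pos-tagger",
    aggregation_strategy="first",
)

text = "भारत हा एक सुंदर देश आहे."
result = pipe(text)
# With an aggregation strategy set, the tag is returned under "entity_group"
# (the "entity" key is only present when aggregation is turned off).
# Note: adjacent words that receive the same tag may be merged into one group.
for token in result:
    print(f"{token['word']:<15} {token['entity_group']}")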

Loading Manually

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_path = "path/to/marathi_pos_final"  # local path or HF repo

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForTokenClassification.from_pretrained(model_path)

# Maps integer class IDs to tag names (see Label Set).
id2label = model.config.id2label

text = "नागपूर येथे मोठा कार्यक्रम झाला."
inputs = tokenizer(text, return_tensors="pt")

# Inference only, so gradients are not needed.
with torch.no_grad():
    outputs = model(**inputs)

# Highest-scoring tag ID at every subword position of the single input sentence.
predictions = outputs.logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Output is per subword. Only the first subword of each word was supervised
# during training, so tags printed for continuation pieces are not reliable.
for token, pred in zip(tokens, predictions):
    if token not in ["[CLS]", "[SEP]", "[PAD]"]:
        print(f"{token:<20} {id2label[pred.item()]}")
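
Since the model was trained with first-subword alignment, word-level tags are best read from the first subword of each word. The helper below is a minimal sketch of that step. Its name is hypothetical, and it assumes word indices as returned by `inputs.word_ids()` on a fast tokenizer together with predictions converted to a plain list (for example with `predictions.tolist()`). The values in the example are made up for illustration.

```python
def first_subword_tags(word_ids, pred_ids, id2label):
    """Return one (word_index, tag) pair per word, read from its first subword."""
    tags = []
    previous_word = None
    for word_id, pred in zip(word_ids, pred_ids):
        # Skip special tokens (None) and continuation subwords (same word index).
        if word_id is not None and word_id != previous_word:
            tags.append((word_id, id2label[pred]))
        previous_word = word_id
    return tags

# Hypothetical values standing in for inputs.word_ids() and predictions.tolist().
id2label = {7: "NOUN", 12: "PROPN", 15: "VERB"}
word_ids = [None, 0, 0, 1, 2, 2, None]
pred_ids = [7, 12, 7, 7, 15, 7, 7]
print(first_subword_tags(word_ids, pred_ids, id2label))
# [(0, 'PROPN'), (1, 'NOUN'), (2, 'VERB')]
```

The predictions at continuation positions (here the stray NOUN labels) are discarded, which keeps the output at one tag per word.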

Model Files

| File | Description |
|---|---|
| model.safetensors | Model weights (904 MB) |
| config.json | Model architecture config |
| tokenizer.json | Full tokenizer (WordPiece, 197K vocab) |
| tokenizer_config.json | Tokenizer settings |
| label_map.json | Label-to-ID mapping and evaluation metadata |
| training_args.bin | Training hyperparameters |
| training_curves.png | Loss and F1 curves over 10 epochs |
| confusion_matrix.png | Row-normalised confusion matrix (16 tags) |

Limitations

  • Trained exclusively on formal news text; performance may degrade on informal, social media, or code-mixed Marathi.
  • PROPN (F1: 59.52%) and INTJ (F1: 44.50%) are the weakest classes, due to the lack of capitalisation in Marathi and to data sparsity, respectively.
  • POSTP (3 test tokens) cannot be reliably classified; treat its predictions as unreliable.
  • Requires a GPU with ≥12 GB VRAM for fine-tuning.

Citation

If you use this model or dataset, please cite:

@article{ingle2026l3cubemahapos,
  title={L3Cube-MahaPOS: A Marathi Part-of-Speech Tagging Dataset and BERT Models},
  author={Hariom Ingle and Ronit Ghode and Ishwari Gondkar and Jidnyasa Harad and Raviraj Joshi},
  journal={arXiv preprint arXiv:2606.24825},
  year={2026}
}

Acknowledgements

This work was carried out under the mentorship of L3Cube Labs, Pune, as part of the L3Cube-MahaNLP project.
