MahaPOS-BERT: Marathi POS Tagging Model

Model Description

MahaPOS-BERT is a fine-tuned version of l3cube-pune/marathi-bert-v2 for Part-of-Speech (POS) tagging in Marathi. It is trained on the L3Cube-MahaPOS dataset — one of the first large-scale, manually annotated POS tagging datasets for Marathi — comprising 32,354 sentences drawn from Marathi news text.

This model is part of the L3Cube-MahaNLP project.
For more details, refer to our MahaPOS paper.

Model Details

Property	Value
Base Model	`l3cube-pune/marathi-bert-v2`
Model Type	BERT (BertForTokenClassification)
Task	Token Classification (POS Tagging)
Language	Marathi (`mr`)
Hidden Size	768
Attention Heads	12
Hidden Layers	12
Max Sequence Length	512
Vocab Size	197,285
Number of Labels	16
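
As a rough cross-check, the figures in the table above are enough to estimate the parameter count and the float32 checkpoint size. The sketch below is a back-of-the-envelope calculation, not an inspection of the actual checkpoint. It assumes the standard BERT-base feed-forward size of 3072, two token types, and no pooling layer, none of which are stated in this card.

```python
# Back-of-the-envelope parameter count for a BERT token-classification model.
# From the table: hidden size, layers, max sequence length, vocab size, labels.
HIDDEN, LAYERS, MAX_POS, VOCAB, LABELS = 768, 12, 512, 197_285, 16
# Assumptions (not stated in the card): BERT-base defaults.
FFN, TYPE_VOCAB = 3072, 2


def linear(n_in, n_out):
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # scale and shift vectors

# Word, position and token-type embeddings, followed by one LayerNorm.
embeddings = (VOCAB + MAX_POS + TYPE_VOCAB) * HIDDEN + layer_norm

per_layer = (
    4 * linear(HIDDEN, HIDDEN)  # query, key, value, attention output
    + layer_norm                # LayerNorm after attention
    + linear(HIDDEN, FFN)       # feed-forward expansion
    + linear(FFN, HIDDEN)       # feed-forward projection
    + layer_norm                # LayerNorm after feed-forward
)

classifier = linear(HIDDEN, LABELS)  # token-classification head

total_params = embeddings + LAYERS * per_layer + classifier
size_mib = total_params * 4 / 2**20  # assumes float32, 4 bytes per parameter

print(f"embeddings : {embeddings:,}")
print(f"encoder    : {LAYERS * per_layer:,}")
print(f"classifier : {classifier:,}")
print(f"total      : {total_params:,} (~{size_mib:.0f} MiB in float32)")
```

Under these assumptions the estimate comes to roughly 237M parameters and about 904 MiB, in line with the weights file size listed under Model Files. The large vocabulary means the embedding matrix accounts for well over half of the total.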

Label Set

The model uses a 16-tag set aligned with the Universal Dependencies (UD) v2 framework. All tags except POSTP are standard UD universal POS tags:

ID	Tag	Description
0	ADJ	Adjective
1	ADP	Adposition
2	ADV	Adverb
3	AUX	Auxiliary verb
4	CCONJ	Coordinating conjunction
5	DET	Determiner
6	INTJ	Interjection
7	NOUN	Common noun
8	NUM	Numeral
9	PART	Particle
10	POSTP	Postposition (low support — present in training, excluded from primary evaluation)
11	PRON	Pronoun
12	PROPN	Proper noun
13	PUNCT	Punctuation
14	SCONJ	Subordinating conjunction
15	VERB	Main verb

Note on POSTP: The POSTP tag has critically low corpus frequency (23 tokens total; 3 in the test set). It is retained in training but excluded from the primary macro-F1 evaluation metric. The X (foreign/unclassifiable) tag was removed entirely from training.

Performance

Metric	Value
Token-level Accuracy	88.67%
Macro-F1 (15 tags, primary)	81.67%
Macro-F1 (16 tags, incl. POSTP)	76.57%

Per-Tag F1 Scores (Test Set)

Tag	Precision	Recall	F1	Support
PUNCT	96.96	97.96	97.45	7,679
AUX	94.73	95.97	95.35	3,503
VERB	93.65	92.43	93.04	8,816
ADP	91.87	93.57	92.71	6,051
NOUN	91.30	91.92	91.61	21,910
NUM	87.19	89.93	88.54	2,482
CCONJ	87.56	86.88	87.22	1,555
PRON	84.65	84.34	84.49	2,216
DET	83.33	83.84	83.59	2,141
SCONJ	80.00	81.91	80.94	503
ADV	75.76	76.09	75.93	4,203
PART	77.74	74.15	75.91	650
ADJ	77.68	71.27	74.34	5,426
PROPN	56.96	62.32	59.52	1,510
INTJ	48.22	41.30	44.50	230
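
The headline macro-F1 figures can be recomputed from the per-tag table, which makes the effect of excluding POSTP concrete. The sketch below copies the F1 and support columns from the table. The POSTP F1 of 0.0 is an assumption, since the card does not report a score for that tag.

```python
# (tag, F1 in %, support) copied from the per-tag table above
rows = [
    ("PUNCT", 97.45, 7679), ("AUX", 95.35, 3503), ("VERB", 93.04, 8816),
    ("ADP", 92.71, 6051), ("NOUN", 91.61, 21910), ("NUM", 88.54, 2482),
    ("CCONJ", 87.22, 1555), ("PRON", 84.49, 2216), ("DET", 83.59, 2141),
    ("SCONJ", 80.94, 503), ("ADV", 75.93, 4203), ("PART", 75.91, 650),
    ("ADJ", 74.34, 5426), ("PROPN", 59.52, 1510), ("INTJ", 44.50, 230),
]
POSTP_F1 = 0.0      # assumption: not reported in the card
POSTP_SUPPORT = 3   # from the note on POSTP

f1_sum = sum(f1 for _, f1, _ in rows)

# Macro-F1 is the unweighted mean of per-tag F1 scores.
macro_15 = f1_sum / len(rows)
macro_16 = (f1_sum + POSTP_F1) / (len(rows) + 1)

# Supports should add up to the size of the test split.
test_tokens = sum(n for _, _, n in rows) + POSTP_SUPPORT

print(f"macro-F1, 15 tags: {macro_15:.2f}")
print(f"macro-F1, 16 tags: {macro_16:.2f}")
print(f"test tokens      : {test_tokens:,}")
```

Averaging the rounded per-tag scores agrees with the reported 15-tag macro-F1 up to rounding. The 16-tag value matches the reported 76.57% only under the assumed POSTP F1 of zero, which shows how a single tag with three test tokens moves the unweighted mean by about five points. The supports plus the three POSTP tokens add up to the 68,878 tokens of the test split.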

Training Details

Hyperparameter	Value
Learning Rate	5e-5
Batch Size	16
Max Epochs	10
Early Stopping Patience	3
Best Epoch	3
Warmup Ratio	0.10
Weight Decay	0.01
LR Schedule	Cosine with warmup
Hardware	NVIDIA Tesla T4 (15.6 GB)
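
To make the schedule concrete, the sketch below derives step counts from the hyperparameters above and evaluates a cosine-with-linear-warmup learning rate. It is an illustration under stated assumptions, not the training script: it assumes a single device, no gradient accumulation, a schedule spanning all 10 maximum epochs, and the common half-cosine decay to zero. The training-set size is taken from the dataset split table.

```python
import math

TRAIN_SENTENCES = 22_652          # from the dataset split table
BATCH_SIZE, MAX_EPOCHS = 16, 10   # from the hyperparameter table
PEAK_LR, WARMUP_RATIO = 5e-5, 0.10

# Assumption: one optimizer step per batch (no gradient accumulation).
steps_per_epoch = math.ceil(TRAIN_SENTENCES / BATCH_SIZE)
total_steps = steps_per_epoch * MAX_EPOCHS
warmup_steps = round(WARMUP_RATIO * total_steps)


def lr_at(step):
    """Learning rate at a given optimizer step (linear warmup, cosine decay)."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"steps per epoch: {steps_per_epoch}")
print(f"warmup steps   : {warmup_steps} of {total_steps}")
for epoch in (0, 1, 3, 6, 10):
    step = epoch * steps_per_epoch
    print(f"end of epoch {epoch:>2}: lr = {lr_at(step):.2e}")
```

With these numbers the warmup covers roughly the first epoch. Because early stopping can end training before the maximum epoch count, the learning rate may never reach the tail of the cosine curve in practice.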

Subword-to-token alignment uses the first subword strategy: only the first subword token of each word is used for prediction; continuations receive the ignore label (-100).
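
The first-subword strategy described above can be sketched as a small label-alignment function. This is a minimal illustration rather than the project's actual preprocessing code. The `word_ids` list mirrors what a fast tokenizer's `word_ids()` method returns (None for special tokens, otherwise the index of the source word), and the example segmentation is hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by the loss


def align_labels(word_ids, word_labels, ignore_index=IGNORE_INDEX):
    """Give each word's label to its first subword; ignore everything else."""
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            # Special tokens such as [CLS] and [SEP]
            aligned.append(ignore_index)
        elif wid != previous:
            # First subword of a new word carries the word's label
            aligned.append(word_labels[wid])
        else:
            # Continuation subword of the same word
            aligned.append(ignore_index)
        previous = wid
    return aligned


# Hypothetical segmentation: three words, the second split into two subwords
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [12, 7, 15]  # PROPN, NOUN, VERB in the label set above

print(align_labels(word_ids, word_labels))
# -> [-100, 12, 7, -100, 15, -100]
```

At inference time the same rule applies in reverse: only the prediction at each word's first subword should be read as that word's tag.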

Dataset: L3Cube-MahaPOS

Split	Sentences	Tokens	Avg. Length
Train	22,652	332,418	14.7
Validation	4,848	71,163	14.7
Test	4,854	68,878	14.2
Total	32,354	472,459	14.6

The dataset was manually annotated by a team of Marathi-proficient annotators from PICT, Pune. Raw text was sourced from Marathi news portals covering politics, sports, culture, technology, and local affairs.
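
The split table can be sanity-checked with a few lines of arithmetic. The sketch below only recomputes totals, split proportions, and average sentence lengths from the numbers in the table; it does not load the dataset.

```python
# (sentences, tokens) per split, copied from the table above
splits = {
    "Train": (22_652, 332_418),
    "Validation": (4_848, 71_163),
    "Test": (4_854, 68_878),
}

total_sentences = sum(s for s, _ in splits.values())
total_tokens = sum(t for _, t in splits.values())

for name, (sentences, tokens) in splits.items():
    share = sentences / total_sentences
    avg_len = tokens / sentences
    print(f"{name:<11} {share:6.1%} of sentences, {avg_len:.1f} tokens/sentence")

print(f"{'Total':<11} {total_sentences:,} sentences, {total_tokens:,} tokens, "
      f"{total_tokens / total_sentences:.1f} tokens/sentence")
```

The proportions come out close to a 70/15/15 split, and the recomputed totals and averages agree with the table.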

Usage

from transformers import pipeline

pipe = pipeline(
    "token-classification",
    model="l3cube-pune/marathi-pos-tagger",  # Hub repo ID; replace with a local path if needed
    aggregation_strategy="first"
)

text = "भारत हा एक सुंदर देश आहे."
result = pipe(text)
for token in result:
    # With an aggregation strategy set, the label is stored under "entity_group"
    print(f"{token['word']:<15} {token['entity_group']}")

Loading Manually

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_path = "path/to/marathi_pos_final"  # local path or HF repo

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForTokenClassification.from_pretrained(model_path)

id2label = model.config.id2label

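text = "नागपूर येथे मोठा कार्यक्रम झाला."
inputs = tokenizer(text, return_tensors="pt")  # word_ids() needs a fast tokenizer

with torch.no_grad():
    outputs = model(**inputs)

predictions = outputs.logits.argmax(dim=-1)[0]

# Print one tag per word, read from the word's first subword
# (continuation subwords were ignored during training).
# Word boundaries here come from the tokenizer's own pre-tokenization.
previous_word = None
for idx, word_id in enumerate(inputs.word_ids(0)):
    if word_id is not None and word_id != previous_word:
        span = inputs.word_to_chars(0, word_id)  # character span in `text`
        print(f"{text[span.start:span.end]:<20} {id2label[predictions[idx].item()]}")
    previous_word = word_id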

Model Files

File	Description
`model.safetensors`	Model weights (904 MB)
`config.json`	Model architecture config
`tokenizer.json`	Full tokenizer (WordPiece, 197K vocab)
`tokenizer_config.json`	Tokenizer settings
`label_map.json`	Label-to-ID mapping and evaluation metadata
`training_args.bin`	Training hyperparameters
`training_curves.png`	Loss and F1 curves over 10 epochs
`confusion_matrix.png`	Row-normalised confusion matrix (16 tags)

Limitations

Trained exclusively on formal news text; performance may degrade on informal, social media, or code-mixed Marathi.
PROPN (F1: 59.52%) and INTJ (F1: 44.50%) are the weakest classes, due to the lack of capitalisation in Marathi and to data sparsity, respectively.
POSTP (3 test tokens) cannot be reliably classified; treat its predictions as unreliable.
Requires a GPU with ≥12 GB VRAM for fine-tuning.

Citation

If you use this model or dataset, please cite:

@article{ingle2026l3cubemahapos,
  title={L3Cube-MahaPOS: A Marathi Part-of-Speech Tagging Dataset and BERT Models},
  author={Hariom Ingle and Ronit Ghode and Ishwari Gondkar and Jidnyasa Harad and Raviraj Joshi},
  journal={arXiv preprint arXiv:2606.24825},
  year={2026}
}

Acknowledgements

This work was carried out under the mentorship of L3Cube Labs, Pune, as part of the L3Cube-MahaNLP project.
