MahaPOS-BERT: Marathi POS Tagging Model
Model Description
MahaPOS-BERT is a fine-tuned version of l3cube-pune/marathi-bert-v2 for Part-of-Speech (POS) tagging in Marathi. It is trained on the L3Cube-MahaPOS dataset — one of the first large-scale, manually annotated POS tagging datasets for Marathi — comprising 32,354 sentences drawn from Marathi news text.
This model is part of the L3Cube-MahaNLP project.
For more details, refer to our MahaPOS paper.
Model Details
| Property | Value |
|---|---|
| Base Model | l3cube-pune/marathi-bert-v2 |
| Model Type | BERT (BertForTokenClassification) |
| Task | Token Classification (POS Tagging) |
| Language | Marathi (mr) |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Hidden Layers | 12 |
| Max Sequence Length | 512 |
| Vocab Size | 197,285 |
| Number of Labels | 16 |
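The figures in this table are enough to estimate the parameter count. The sketch below is a back-of-the-envelope calculation, not something taken from the released config: the feed-forward width (3072), the two token-type embeddings, the absence of a pooler layer, and float32 storage are assumptions based on the usual BERT-base layout.

```python
# Rough parameter count for a BERT token-classification model.
# hidden, layers, vocab, max_len and num_labels come from the table above;
# intermediate_size and type_vocab are assumed BERT-base defaults.
hidden, layers, vocab, max_len, num_labels = 768, 12, 197_285, 512, 16
intermediate_size, type_vocab = 3072, 2  # assumptions, not stated in this card


def linear(n_in, n_out):
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors

# Word, position and token-type embeddings, followed by one LayerNorm.
embeddings = (vocab + max_len + type_vocab) * hidden + layer_norm

encoder_layer = (
    4 * linear(hidden, hidden)           # query, key, value, attention output
    + layer_norm
    + linear(hidden, intermediate_size)  # feed-forward up-projection
    + linear(intermediate_size, hidden)  # feed-forward down-projection
    + layer_norm
)
classifier = linear(hidden, num_labels)  # token-classification head (no pooler assumed)

total_params = embeddings + layers * encoder_layer + classifier
size_mib = total_params * 4 / 2**20  # 4 bytes per float32 parameter

print(f"{total_params:,} parameters, about {size_mib:.0f} MiB in float32")
```

Under these assumptions roughly two thirds of the parameters sit in the word-embedding matrix, a consequence of the large vocabulary, and the estimated size is in line with the 904 MB weights file listed under Model Files if that figure is read as binary megabytes.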
Label Set
The model uses a 16-tag set aligned with the Universal Dependencies (UD) v2 framework:
| ID | Tag | Description |
|---|---|---|
| 0 | ADJ | Adjective |
| 1 | ADP | Adposition |
| 2 | ADV | Adverb |
| 3 | AUX | Auxiliary verb |
| 4 | CCONJ | Coordinating conjunction |
| 5 | DET | Determiner |
| 6 | INTJ | Interjection |
| 7 | NOUN | Common noun |
| 8 | NUM | Numeral |
| 9 | PART | Particle |
| 10 | POSTP | Postposition (low support — present in training, excluded from primary evaluation) |
| 11 | PRON | Pronoun |
| 12 | PROPN | Proper noun |
| 13 | PUNCT | Punctuation |
| 14 | SCONJ | Subordinating conjunction |
| 15 | VERB | Main verb |
Note on POSTP: The POSTP tag has critically low corpus frequency (23 tokens total; 3 in the test set). It is retained in training but excluded from the primary macro-F1 evaluation metric. The X (foreign/unclassifiable) tag was removed entirely from training.
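For downstream code it can help to have the tag inventory as plain Python data. The snippet below is a sketch that transcribes the table above; the names `TAGS` and `PRIMARY_TAGS` are illustrative, and the authoritative mapping is whatever `model.config.id2label` reports for the released checkpoint.

```python
# Tag inventory as listed in the Label Set table (IDs follow alphabetical order).
TAGS = [
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN",
    "NUM", "PART", "POSTP", "PRON", "PROPN", "PUNCT", "SCONJ", "VERB",
]

id2label = dict(enumerate(TAGS))
label2id = {tag: idx for idx, tag in id2label.items()}

# Tags counted in the primary (15-tag) macro-F1: everything except POSTP.
PRIMARY_TAGS = [tag for tag in TAGS if tag != "POSTP"]

print(label2id["POSTP"], len(PRIMARY_TAGS))  # prints: 10 15
```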
Performance
| Metric | Value |
|---|---|
| Token-level Accuracy | 88.67% |
| Macro-F1 (15 tags, primary) | 81.67% |
| Macro-F1 (16 tags, incl. POSTP) | 76.57% |
Per-Tag F1 Scores (Test Set)
| Tag | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| PUNCT | 96.96 | 97.96 | 97.45 | 7,679 |
| AUX | 94.73 | 95.97 | 95.35 | 3,503 |
| VERB | 93.65 | 92.43 | 93.04 | 8,816 |
| ADP | 91.87 | 93.57 | 92.71 | 6,051 |
| NOUN | 91.30 | 91.92 | 91.61 | 21,910 |
| NUM | 87.19 | 89.93 | 88.54 | 2,482 |
| CCONJ | 87.56 | 86.88 | 87.22 | 1,555 |
| PRON | 84.65 | 84.34 | 84.49 | 2,216 |
| DET | 83.33 | 83.84 | 83.59 | 2,141 |
| SCONJ | 80.00 | 81.91 | 80.94 | 503 |
| ADV | 75.76 | 76.09 | 75.93 | 4,203 |
| PART | 77.74 | 74.15 | 75.91 | 650 |
| ADJ | 77.68 | 71.27 | 74.34 | 5,426 |
| PROPN | 56.96 | 62.32 | 59.52 | 1,510 |
| INTJ | 48.22 | 41.30 | 44.50 | 230 |
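The two macro-F1 figures in the Performance table can be approximately reproduced from the per-tag scores above. The sketch below takes the unweighted mean over the 15 listed tags, then adds POSTP with an F1 of 0.0. That POSTP value is an assumption (the card does not report it), chosen because it is consistent with the reported 16-tag score.

```python
# Per-tag F1 scores (in %) copied from the table above.
f1_by_tag = {
    "PUNCT": 97.45, "AUX": 95.35, "VERB": 93.04, "ADP": 92.71, "NOUN": 91.61,
    "NUM": 88.54, "CCONJ": 87.22, "PRON": 84.49, "DET": 83.59, "SCONJ": 80.94,
    "ADV": 75.93, "PART": 75.91, "ADJ": 74.34, "PROPN": 59.52, "INTJ": 44.50,
}


def macro_f1(scores):
    """Unweighted mean of per-class F1 scores."""
    return sum(scores) / len(scores)


macro_15 = macro_f1(list(f1_by_tag.values()))

# Assumption: POSTP scores 0.0 F1 on its 3 test tokens (not reported in the card).
macro_16 = macro_f1(list(f1_by_tag.values()) + [0.0])

print(f"15 tags: {macro_15:.2f}  16 tags: {macro_16:.2f}")
```

Both values land within 0.01 of the reported 81.67% and 76.57%; the small gap comes from averaging already-rounded per-tag scores. Because macro-F1 weights every tag equally, a single near-empty class such as POSTP moves the average by about five points, which is why the 15-tag figure is treated as primary.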
Training Details
| Hyperparameter | Value |
|---|---|
| Learning Rate | 5e-5 |
| Batch Size | 16 |
| Max Epochs | 10 |
| Early Stopping Patience | 3 |
| Best Epoch | 3 |
| Warmup Ratio | 0.10 |
| Weight Decay | 0.01 |
| LR Schedule | Cosine with warmup |
| Hardware | NVIDIA Tesla T4 (15.6 GB) |
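To make the schedule concrete, the sketch below derives step counts from the training-set size and evaluates a cosine-with-warmup learning rate at a few points. It assumes one optimizer step per batch (no gradient accumulation), a schedule planned over all 10 epochs, and the common formulation of linear warmup followed by a half-cosine decay to zero; the exact implementation used for training is not stated in this card.

```python
import math

# Values taken from the tables in this card.
train_sentences, batch_size, max_epochs = 22_652, 16, 10
peak_lr, warmup_ratio = 5e-5, 0.10

# Assumptions: one optimizer step per batch, schedule spans all 10 epochs.
steps_per_epoch = math.ceil(train_sentences / batch_size)
total_steps = steps_per_epoch * max_epochs
warmup_steps = round(warmup_ratio * total_steps)


def lr_at(step):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

Under these assumptions the warmup lasts one full epoch (1,416 steps), so the best checkpoint at epoch 3 would have been reached while the learning rate was still close to its peak.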
Subword-to-token alignment uses the first subword strategy: only the first subword token of each word is used for prediction; continuations receive the ignore label (-100).
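The first-subword strategy can be sketched as a small helper that expands word-level labels to subword level. This is an illustration, not the authors' training code: `align_labels` is a hypothetical name, and `word_ids` is assumed to have the shape returned by a fast tokenizer's `BatchEncoding.word_ids()` (one entry per subword, `None` for special tokens).

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def align_labels(word_ids, word_labels):
    """Expand word-level label IDs to subword level (first-subword strategy).

    word_ids: one entry per subword; None marks special tokens such as
              [CLS]/[SEP], integers give the index of the source word.
    word_labels: one label ID per word.
    """
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(IGNORE_INDEX)          # special token or continuation piece
        else:
            aligned.append(word_labels[word_id])  # first piece of a new word
        previous = word_id
    return aligned


# Hypothetical split: word 0 -> 1 piece, word 1 -> 3 pieces, word 2 -> 2 pieces.
word_ids = [None, 0, 1, 1, 1, 2, 2, None]
word_labels = [12, 7, 15]  # PROPN, NOUN, VERB in this card's label table
print(align_labels(word_ids, word_labels))
# prints: [-100, 12, 7, -100, -100, 15, -100, -100]
```

Only positions carrying a real label contribute to the loss, so each word is counted once regardless of how many pieces the tokenizer splits it into.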
Dataset: L3Cube-MahaPOS
| Split | Sentences | Tokens | Avg. Length |
|---|---|---|---|
| Train | 22,652 | 332,418 | 14.7 |
| Validation | 4,848 | 71,163 | 14.7 |
| Test | 4,854 | 68,878 | 14.2 |
| Total | 32,354 | 472,459 | 14.6 |
The dataset was manually annotated by a team of Marathi-proficient annotators from PICT, Pune. Raw text was sourced from Marathi news portals covering politics, sports, culture, technology, and local affairs.
Usage
```python
from transformers import pipeline

pipe = pipeline(
    "token-classification",
    model="l3cube-pune/marathi-pos-tagger",
    aggregation_strategy="first",  # merge subwords; tag each word by its first subword
)

text = "भारत हा एक सुंदर देश आहे."
result = pipe(text)

# With an aggregation strategy set, each result stores its tag under "entity_group".
for token in result:
    print(f"{token['word']:<15} {token['entity_group']}")
```
Loading Manually
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_path = "l3cube-pune/marathi-pos-tagger"  # Hub repo ID, or a local checkpoint directory

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForTokenClassification.from_pretrained(model_path)
model.eval()
id2label = model.config.id2label

text = "नागपूर येथे मोठा कार्यक्रम झाला."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Highest-scoring label ID for every subword position.
predictions = outputs.logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Prints one line per subword. Continuation pieces were not supervised
# during training, so only the first subword of each word is meaningful.
for token, pred in zip(tokens, predictions):
    if token not in ["[CLS]", "[SEP]", "[PAD]"]:
        print(f"{token:<20} {id2label[pred.item()]}")
```
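Since only first-subword predictions were supervised, word-level output is usually what is wanted when loading the model manually. The helper below is a sketch of that post-processing step. `first_subword_tags` is a hypothetical name, and the sample values stand in for `inputs.word_ids()` and `outputs.logits.argmax(dim=-1)[0].tolist()` from the manual-loading code; they are made up for illustration and are not real model output.

```python
def first_subword_tags(word_ids, pred_ids, id2label):
    """Keep one predicted tag per word: the one on its first subword."""
    tags, previous = [], None
    for word_id, pred_id in zip(word_ids, pred_ids):
        # Skip special tokens (None) and continuation pieces of the same word.
        if word_id is not None and word_id != previous:
            tags.append(id2label[pred_id])
        previous = word_id
    return tags


# Hypothetical values, not real model output.
id2label = {7: "NOUN", 12: "PROPN", 13: "PUNCT", 15: "VERB"}
word_ids = [None, 0, 0, 1, 2, 2, 3, None]
pred_ids = [7, 12, 7, 7, 15, 15, 13, 7]
print(first_subword_tags(word_ids, pred_ids, id2label))
# prints: ['PROPN', 'NOUN', 'VERB', 'PUNCT']
```

Note that `word_ids()` is only available on fast tokenizers, and its notion of a word follows the tokenizer's pre-tokenization, so punctuation attached to a word is typically reported as its own unit.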
Model Files
| File | Description |
|---|---|
| model.safetensors | Model weights (904 MB) |
| config.json | Model architecture config |
| tokenizer.json | Full tokenizer (WordPiece, 197K vocab) |
| tokenizer_config.json | Tokenizer settings |
| label_map.json | Label-to-ID mapping and evaluation metadata |
| training_args.bin | Training hyperparameters |
| training_curves.png | Loss and F1 curves over 10 epochs |
| confusion_matrix.png | Row-normalised confusion matrix (16 tags) |
Limitations
- Trained exclusively on formal news text; performance may degrade on informal, social media, or code-mixed Marathi.
- PROPN (F1: 59.52%) and INTJ (F1: 44.50%) are the weakest classes due to lack of capitalisation in Marathi and data sparsity respectively.
- POSTP (3 test tokens) cannot be reliably classified; treat its predictions as unreliable.
- Requires a GPU with ≥12 GB VRAM for fine-tuning.
Citation
If you use this model or dataset, please cite:
```bibtex
@article{ingle2026l3cubemahapos,
  title={L3Cube-MahaPOS: A Marathi Part-of-Speech Tagging Dataset and BERT Models},
  author={Hariom Ingle and Ronit Ghode and Ishwari Gondkar and Jidnyasa Harad and Raviraj Joshi},
  journal={arXiv preprint arXiv:2606.24825},
  year={2026}
}
```
Acknowledgements
This work was carried out under the mentorship of L3Cube Labs, Pune. This work is part of the L3Cube-MahaNLP project.