xlm-roberta-base-fintuned-panx-ta-hi
This model is a fine-tuned version of xlm-roberta-base on the PAN-X dataset for Tamil (ta) and Hindi (hi). It is fine-tuned for Named Entity Recognition (NER) and achieves the following results on the evaluation set:
- Loss: 0.2480
- F1: 0.8347
Model Description
The model is based on XLM-RoBERTa, a multilingual transformer architecture, and is fine-tuned for NER in Tamil and Hindi. It recognizes three entity types: LOC (location), PER (person), and ORG (organization).
Labels follow the BIO scheme: a B- prefix marks the first token of an entity, and an I- prefix marks each subsequent token inside the same entity.
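To illustrate how B- and I- prefixes combine into whole entities, here is a minimal sketch that groups tagged tokens into spans. The function name `bio_to_spans` and the English sample tokens are hypothetical, and the sketch assumes the usual `O` tag for tokens outside any entity; it is not the decoding code used by the model or the pipeline.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open entity.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag, which is dropped) closes any open entity.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hand-written example, not model output.
tokens = ["Sachin", "Tendulkar", "visited", "Mumbai"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
print(bio_to_spans(tokens, tags))
# → [('PER', 'Sachin Tendulkar'), ('LOC', 'Mumbai')]
```

In practice the pipeline's `aggregation_strategy` option performs this grouping for you; the sketch only shows what the prefixes mean.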
Intended Uses & Limitations
Intended Uses:
- Named Entity Recognition (NER) tasks in Tamil and Hindi.
Limitations:
- Performance may degrade on languages or domains not included in the training data.
- Not intended for general text classification or other NLP tasks.
How to Use the Model
You can load and use the model for Named Entity Recognition as follows:
Installation
Ensure you have the transformers and torch libraries installed. Install them via pip if necessary:
pip install transformers torch
Code Example
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
# Load the tokenizer and model
model_name = "Lokeshwaran/xlm-roberta-base-fintuned-panx-ta-hi"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
# Create an NER pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
# Example texts in Tamil and Hindi, plus one Malayalam sentence
# (Malayalam is outside the fine-tuning data, so expect weaker results there)
example_texts = [
    "அப்துல் கலாம் சென்னை நகரத்தில் ஐஎஸ்ஆர்ஓ நிறுவனத்துக்கு சென்றார்.", # Tamil: Abdul Kalam went to the ISRO organization in Chennai city.
    "सचिन तेंदुलकर ने मुंबई में बीसीसीआई के कार्यालय का दौरा किया।", # Hindi: Sachin Tendulkar visited the BCCI office in Mumbai.
    "മഹാത്മാ ഗാന്ധി തിരുവനന്തപുരം നഗരത്തിലെ ഐഎസ്ആർഒ ഓഫീസ് സന്ദർശിച്ചു." # Malayalam: Mahatma Gandhi visited the ISRO office in Thiruvananthapuram city.
]
# Perform Named Entity Recognition
for text in example_texts:
    results = ner_pipeline(text)
    print(f"Input Text: {text}")
    for entity in results:
        print(f"Entity: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']:.2f}")
    print()
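Each aggregated pipeline result is a dictionary with keys such as `word`, `entity_group`, and `score`, as used in the loop above. As a rough sketch of one way to post-process that output, the helper below drops low-confidence entities and groups the rest by label. The name `group_entities`, the 0.5 threshold, and the sample scores are all hypothetical, chosen only for illustration; the sample list is hand-written, not real model output.

```python
from collections import defaultdict


def group_entities(results, min_score=0.5):
    """Group entity text by label, keeping only entities at or above min_score."""
    grouped = defaultdict(list)
    for entity in results:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# Hand-written stand-in for pipeline output (scores are made up).
sample_results = [
    {"entity_group": "PER", "word": "सचिन तेंदुलकर", "score": 0.99},
    {"entity_group": "LOC", "word": "मुंबई", "score": 0.97},
    {"entity_group": "ORG", "word": "बीसीसीआई", "score": 0.41},
]

for label, words in group_entities(sample_results).items():
    print(label, words)
```

With the sample above, the ORG entry falls below the threshold and is filtered out. A suitable threshold depends on your data and on how costly false positives are.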
Training and Evaluation Data
The model was fine-tuned on the PAN-X dataset, which is part of the XTREME benchmark, specifically for Tamil and Hindi.
Training Procedure
Hyperparameters
- Learning Rate: 5e-05
- Batch Size: 24 (both training and evaluation)
- Epochs: 3
- Optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-08
- Learning Rate Scheduler: Linear
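A linear scheduler lowers the learning rate in a straight line from its starting value to zero over the course of training. The sketch below shows that shape in plain Python, starting from the 5e-05 rate listed above. The function name `linear_lr`, the absence of warmup, and the total of 1000 steps are assumptions made for illustration; the card does not state the warmup setting or the number of optimizer steps.

```python
def linear_lr(step, total_steps, base_lr=5e-5, warmup_steps=0):
    """Learning rate at a given step under linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr during warmup (skipped when warmup_steps is 0).
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 1000  # hypothetical; depends on dataset size, batch size, and epochs
for step in (0, 250, 500, 750, 1000):
    print(step, f"{linear_lr(step, total_steps):.2e}")
```

Under these assumptions the rate is halved at the midpoint of training and reaches zero at the final step.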
Results
Epoch | Training Loss | Validation Loss | F1 |
---|---|---|---|
1.0 | 0.1886 | 0.2413 | 0.8096 |
2.0 | 0.1252 | 0.2415 | 0.8201 |
3.0 | 0.0752 | 0.2480 | 0.8347 |
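The card does not say how the F1 column was computed. NER is commonly scored at the entity level, where a prediction counts as correct only if both its boundaries and its type match the gold annotation exactly. The sketch below illustrates that convention, assuming spans are represented as `(start, end, type)` tuples; the function name `span_f1` and the sample spans are hypothetical, and this is not necessarily the metric code behind the numbers above.

```python
def span_f1(gold_spans, predicted_spans):
    """Entity-level F1: a predicted span is correct only on an exact match."""
    gold, predicted = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hand-written spans as (start_token, end_token, type); not real predictions.
gold = {(0, 2, "PER"), (3, 4, "LOC"), (6, 7, "ORG")}
pred = {(0, 2, "PER"), (3, 4, "ORG"), (6, 7, "ORG")}
print(f"{span_f1(gold, pred):.3f}")
# → 0.667
```

Here the second prediction has the right boundaries but the wrong type, so it counts as both a false positive and a missed gold entity.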
Framework Versions
- Transformers: 4.47.1
- PyTorch: 2.5.1+cu121
- Datasets: 3.2.0
- Tokenizers: 0.21.0