MT5 Sindhi Question Answering — SdQuAD
Model Description
This is the first publicly available Sindhi Question Answering model, fine-tuned on the SdQuAD dataset — the only Sindhi QA dataset in existence.
Sindhi is a low-resource South Asian language spoken by 30+ million people primarily in Sindh, Pakistan. This model addresses a critical gap in NLP resources for the Sindhi language.
Developed by: Ali Nawaz
University: Shaikh Ayaz University Shikarpur, Pakistan
Base model: google/mt5-base
Language: Sindhi (سنڌي) — Perso-Arabic script
Task: Question Answering (Generative)
How to Use
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

tokenizer = AutoTokenizer.from_pretrained('alinawazmahar/mt5-sindhi-qa-sdquad')
model = AutoModelForSeq2SeqLM.from_pretrained('alinawazmahar/mt5-sindhi-qa-sdquad')
model.eval()

def ask_sindhi(question):
    input_text = f'سنڌي سوال: {question}'
    inputs = tokenizer(
        input_text,
        return_tensors='pt',
        max_length=128,
        truncation=True
    )
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=64,
            num_beams=4,
            early_stopping=True
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example
print(ask_sindhi('انرشيا جو مطلب ڇا آهي؟'))
# Output: جسم جي حرڪت يا سڪون کي جاري رکڻ جي صلاحيت.
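To answer several questions at once, the same prompt format can be batched. The following is a minimal sketch, not part of the original card: `answer_batch` and `PROMPT_PREFIX` are hypothetical names, and `padding=True` is an assumption needed so that prompts of different lengths fit in one tensor. The generation settings mirror `ask_sindhi`.

```python
# Same prompt prefix that ask_sindhi uses in this card.
PROMPT_PREFIX = 'سنڌي سوال: '

def answer_batch(questions, tokenizer, model, max_input_len=128, max_answer_len=64):
    """Answer a list of questions with one generate() call (hypothetical helper)."""
    prompts = [PROMPT_PREFIX + q for q in questions]
    inputs = tokenizer(
        prompts,
        return_tensors='pt',
        padding=True,            # assumption: pad to the longest prompt in the batch
        truncation=True,
        max_length=max_input_len,
    )
    outputs = model.generate(
        **inputs,
        max_length=max_answer_len,
        num_beams=4,
        early_stopping=True,
    )
    # One decoded string per input question, in the same order.
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

With the `tokenizer` and `model` loaded as above, a call would look like `answer_batch(['انرشيا جو مطلب ڇا آهي؟', 'پاڪستان جو وڏو شهر ڪهڙو آهي؟'], tokenizer, model)`.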
Training Details
| Parameter | Value |
|---|---|
| Base model | google/mt5-base |
| Dataset | Aliwj/SdQuAD |
| Train samples | 9,596 |
| Validation samples | 1,199 |
| Test samples | 1,200 |
| Epochs | 10 |
| Batch size | 16 (effective) |
| Learning rate | 5e-4 |
| Optimizer | Adafactor |
| Hardware | Kaggle T4 GPU |
| Training time | ~10 hours |
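The figures in the table imply a rough step count and throughput. The sketch below is only back-of-the-envelope arithmetic on the reported numbers. It assumes the last partial batch of each epoch is kept and treats the ~10 hours as exact, so the per-step time is an estimate rather than a measurement.

```python
import math

# Values copied from the Training Details table.
train_samples = 9_596
val_samples = 1_199
test_samples = 1_200
effective_batch = 16
epochs = 10
training_hours = 10  # approximate, as reported

# Split proportions in percent.
total = train_samples + val_samples + test_samples
split = [round(100 * n / total, 1) for n in (train_samples, val_samples, test_samples)]

# Optimizer steps, assuming the final partial batch of each epoch is not dropped.
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * epochs
seconds_per_step = training_hours * 3600 / total_steps

print(total, split)                      # 11995 [80.0, 10.0, 10.0]
print(steps_per_epoch, total_steps)      # 600 6000
print(round(seconds_per_step, 1))        # 6.0
```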
Evaluation Results
Evaluated on the SdQuAD test set (1,200 samples):
| Metric | Score |
|---|---|
| F1 | 50.06 |
| Exact Match | 22.08 |
| ROUGE-1 | 8.18 |
| ROUGE-L | 8.18 |
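The card does not state how F1 and Exact Match were computed. For readers who want to score their own predictions, the sketch below shows the common SQuAD-style definitions with whitespace tokenization. The function names and the punctuation set are assumptions for illustration, so results may not match the table exactly.

```python
from collections import Counter

# Assumption: a small set of Latin and Perso-Arabic punctuation to ignore.
PUNCT = set('.,!?"\'؟،۔')

def normalize(text):
    """Drop punctuation and split on whitespace (Unicode-safe)."""
    cleaned = ''.join(ch for ch in text if ch not in PUNCT)
    return cleaned.split()

def exact_match(pred, gold):
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    """Harmonic mean of token-level precision and recall."""
    p, g = normalize(pred), normalize(gold)
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

Averaging `token_f1` and `exact_match` over all test pairs and multiplying by 100 gives scores on the same scale as the table.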
Sample Predictions
| Question | Predicted Answer | Correct? |
|---|---|---|
| انرشيا جو مطلب ڇا آهي؟ | جسم جي حرڪت يا سڪون کي جاري رکڻ جي صلاحيت. | ✅ |
| پاڪستان جو وڏو شهر ڪهڙو آهي؟ | پاڪستان جو وڏو شهر ڪراچي آهي. | ✅ |
| سيل جي ميمبرين ڪهڙن ٻن مکيه ماليڪيولن مان ٺهيل هوندي آهي؟ | سيل جي ميمبرين پروٽين ۽ پروٽين مان ٺهيل هوندي آهي. | ⚠️ Partial |
Limitations
- This is a generative QA model — it generates answers without reading a context paragraph. This means it relies on knowledge learned during training rather than extracting answers from provided text.
- May hallucinate answers for questions not well-represented in the training data.
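- Scores are lower than the extractive QA baseline reported in the SdQuAD paper (F1: 81.47). The comparison is not like-for-like: extractive models select an answer span from a provided context, whereas this model must generate the answer without one, which is a harder task.
- Context-aware extractive QA and improved F1 are planned for later releases (see Roadmap).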
Roadmap
- v1.0 — Generative QA baseline (F1: 50.06)
- v2.0 — Improved hyperparameters (target F1: 60+)
- v3.0 — Context-aware extractive QA (target F1: 80+)
- Gradio demo on HuggingFace Spaces
Citation
If you use this model in your research, please cite:
@misc{nawaz2026sindhiqa,
  author    = {Ali Nawaz},
  title     = {MT5 Sindhi Question Answering Model},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/alinawazmahar/mt5-sindhi-qa-sdquad}
}
Also cite the SdQuAD dataset:
@inproceedings{ali2026sdquad,
  title     = {SdQuAD: A Large Benchmark Question Answering Dataset for Low-resource Sindhi Language},
  author    = {Wazir Ali and others},
  booktitle = {RESOURCEFUL-2026, LREC},
  year      = {2026}
}
Contact
Ali Nawaz
Shaikh Ayaz University Shikarpur, Pakistan
LinkedIn: Ali Nawaz
This model is part of ongoing research in Sindhi NLP — a severely under-resourced language deserving more attention from the global NLP community.