MT5 Sindhi Question Answering — SdQuAD

Model Description

To the author's knowledge, this is the first publicly available Sindhi Question Answering model. It is fine-tuned on SdQuAD, which the author describes as the only existing Sindhi QA dataset.

Sindhi is a low-resource South Asian language spoken by 30+ million people primarily in Sindh, Pakistan. This model addresses a critical gap in NLP resources for the Sindhi language.

Developed by: Ali Nawaz
University: Shaikh Ayaz University Shikarpur, Pakistan
Base model: google/mt5-base
Language: Sindhi (سنڌي) — Perso-Arabic script
Task: Question Answering (Generative)

How to Use

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

tokenizer = AutoTokenizer.from_pretrained('alinawazmahar/mt5-sindhi-qa-sdquad')
model = AutoModelForSeq2SeqLM.from_pretrained('alinawazmahar/mt5-sindhi-qa-sdquad')
model.eval()

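def ask_sindhi(question):
    """Generate an answer to a Sindhi question (no context passage is used)."""
    # The prompt prefix means "Sindhi question:"
    input_text = f'سنڌي سوال: {question}'
    inputs = tokenizer(
        input_text,
        return_tensors='pt',
        max_length=128,   # questions longer than 128 tokens are truncated
        truncation=True
    )
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model.generate(
            **inputs,
            max_length=64,       # upper bound on answer length in tokens
            num_beams=4,         # beam search with 4 beams
            early_stopping=True
        )
    # Decode the best beam and drop special tokens such as padding and </s>
    return tokenizer.decode(outputs[0], skip_special_tokens=True)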

# Example
print(ask_sindhi('انرشيا جو مطلب ڇا آهي؟'))
# Output: جسم جي حرڪت يا سڪون کي جاري رکڻ جي صلاحيت.
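
To answer several questions in one call, the same pattern can be batched. The sketch below is a variant of `ask_sindhi`, not part of the released code: `PROMPT_PREFIX`, `build_prompt`, and `ask_sindhi_batch` are hypothetical names, and the tokenizer and model are passed in as arguments. It reuses the prompt prefix and generation settings shown above and adds padding so prompts of different lengths fit in one tensor.

```python
PROMPT_PREFIX = 'سنڌي سوال: '  # same prefix as in ask_sindhi above


def build_prompt(question):
    """Prepend the prompt prefix to one question (hypothetical helper)."""
    return f'{PROMPT_PREFIX}{question}'


def ask_sindhi_batch(questions, tokenizer, model):
    """Answer a list of questions with a single generate() call (hypothetical helper)."""
    prompts = [build_prompt(q) for q in questions]
    inputs = tokenizer(
        prompts,
        return_tensors='pt',
        max_length=128,
        truncation=True,
        padding=True,  # pad to the longest prompt in the batch
    )
    # generate() in transformers runs without gradient tracking
    outputs = model.generate(
        **inputs,
        max_length=64,
        num_beams=4,
        early_stopping=True,
    )
    # One decoded string per input question, in the same order
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

With the `tokenizer` and `model` loaded above, `ask_sindhi_batch(['انرشيا جو مطلب ڇا آهي؟'], tokenizer, model)` would return a one-element list of answers.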

Training Details

Parameter	Value
Base model	google/mt5-base
Dataset	Aliwj/SdQuAD
Train samples	9,596
Validation samples	1,199
Test samples	1,200
Epochs	10
Batch size	16 (effective)
Learning rate	5e-4
Optimizer	Adafactor
Hardware	Kaggle T4 GPU
Training time	~10 hours
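
The figures in the table imply a rough training budget, which the short calculation below works out. It is only arithmetic on the reported numbers. The split of the effective batch size into a per-device batch of 4 and 4 gradient-accumulation steps is an assumption for illustration (the card states only the product, 16), and the exact step count depends on how the last partial batch is handled.

```python
import math

# Values reported in the Training Details table
TRAIN_SAMPLES = 9_596
VAL_SAMPLES = 1_199
TEST_SAMPLES = 1_200
EFFECTIVE_BATCH = 16
EPOCHS = 10
TRAINING_HOURS = 10  # approximate

# Hypothetical decomposition of the effective batch size (not stated in the card)
PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 4
assert PER_DEVICE_BATCH * GRAD_ACCUM_STEPS == EFFECTIVE_BATCH

total_samples = TRAIN_SAMPLES + VAL_SAMPLES + TEST_SAMPLES
train_fraction = TRAIN_SAMPLES / total_samples

# Optimizer updates, counting the last partial batch of each epoch
steps_per_epoch = math.ceil(TRAIN_SAMPLES / EFFECTIVE_BATCH)
total_steps = steps_per_epoch * EPOCHS

# Rough wall-clock cost per update; includes any evaluation time in the ~10 hours
seconds_per_step = TRAINING_HOURS * 3600 / total_steps

print(f'total samples:    {total_samples}')        # 11995
print(f'train fraction:   {train_fraction:.2f}')   # 0.80
print(f'steps per epoch:  {steps_per_epoch}')      # 600
print(f'total steps:      {total_steps}')          # 6000
print(f'seconds per step: {seconds_per_step:.1f}') # 6.0
```

The split is therefore close to 80/10/10, and training amounts to roughly 6,000 optimizer updates at about 6 seconds each on the T4.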

Evaluation Results

Evaluated on the SdQuAD test set (1,200 samples):

Metric	Score
F1	50.06
Exact Match	22.08
ROUGE-1	8.18
ROUGE-L	8.18
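
The card does not state how F1 and Exact Match were computed. As a point of reference, the sketch below shows the common SQuAD-style definitions: Exact Match compares normalized strings, and F1 is the harmonic mean of token-level precision and recall. The normalization here (dropping Unicode punctuation, splitting on whitespace) is an assumption chosen to suit Perso-Arabic script and may differ from the author's evaluation script. The function names are hypothetical.

```python
import unicodedata
from collections import Counter


def normalize(text):
    """Drop Unicode punctuation (including ؟ and ،) and collapse whitespace."""
    kept = ''.join(
        ch for ch in text if not unicodedata.category(ch).startswith('P')
    )
    return ' '.join(kept.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference answer, for illustration only
prediction = 'پاڪستان جو وڏو شهر ڪراچي آهي.'
reference = 'ڪراچي'
print(exact_match(prediction, reference))         # 0.0
print(round(token_f1(prediction, reference), 4))  # 0.2857
```

A full-sentence prediction that contains a short reference answer scores zero on Exact Match but gets partial F1 credit (here 2/7), which is one way F1 can sit well above Exact Match for a generative model.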

Sample Predictions

Question	Predicted Answer	Correct?
انرشيا جو مطلب ڇا آهي؟	جسم جي حرڪت يا سڪون کي جاري رکڻ جي صلاحيت.	✅
پاڪستان جو وڏو شهر ڪهڙو آهي؟	پاڪستان جو وڏو شهر ڪراچي آهي.	✅
سيل جي ميمبرين ڪهڙن ٻن مکيه ماليڪيولن مان ٺهيل هوندي آهي؟	سيل جي ميمبرين پروٽين ۽ پروٽين مان ٺهيل هوندي آهي.	⚠️ Partial

Limitations

This is a generative QA model: it answers without reading a context paragraph, so it relies on knowledge learned during training rather than extracting answers from provided text.
It may hallucinate answers for questions that are not well represented in the training data.
Its F1 (50.06) is lower than the extractive QA baseline reported in the SdQuAD paper (F1: 81.47), because generating an answer without context is a harder task.
Context-aware extractive QA with improved F1 is planned for a later release (see Roadmap).

Roadmap

v1.0 — Generative QA baseline (F1: 50.06)
v2.0 — Improved hyperparameters (target F1: 60+)
v3.0 — Context-aware extractive QA (target F1: 80+)
Gradio demo on HuggingFace Spaces

Citation

If you use this model in your research, please cite:

@misc{nawaz2026sindhiqa,
  author = {Ali Nawaz},
  title = {MT5 Sindhi Question Answering Model},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/alinawazmahar/mt5-sindhi-qa-sdquad}
}

Also cite the SdQuAD dataset:

@inproceedings{ali2026sdquad,
  title = {SdQuAD: A Large Benchmark Question Answering Dataset for Low-resource Sindhi Language},
  author = {Wazir Ali and others},
  booktitle = {RESOURCEFUL-2026, LREC},
  year = {2026}
}

Contact

Ali Nawaz
Shaikh Ayaz University Shikarpur, Pakistan
LinkedIn: Ali Nawaz

This model is part of ongoing research in NLP for Sindhi, a severely under-resourced language that deserves more attention from the global NLP community.
