This model is a fine-tuned version of google/mt5-small on the msmaje/filtered_expanded_hausa_corpus dataset (590K rows of filtered, quality-scored Hausa text from 15 sources).
It serves as the generator component of HIKAF (the Hausa Inclusive Knowledge Access Framework), a Retrieval-Augmented Generation (RAG) system for inclusive knowledge access in the Hausa language, presented at RISE 2026 (Harnessing Digital Innovation and Emerging Technologies for Inclusive Growth and Economic Renaissance).
Given a Hausa context passage (retrieved from a knowledge base), the model generates a fluent, grounded Hausa continuation/response, completing the RAG pipeline.
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("msmaje/hikaf-hausa-mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("msmaje/hikaf-hausa-mt5-small")

# Task prefix ("continue the Hausa text:") followed by the retrieved passage
context = "ci gaba da rubutun Hausa: Wurin Shakatawa na Yankari babban wurin shakatawa ne..."
inputs = tokenizer(context, return_tensors="pt", max_length=128, truncation=True)

# Beam search decoding, capped at 100 new tokens
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    num_beams=4,
    early_stopping=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
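In the full pipeline the context string comes from the retriever instead of being typed by hand. The sketch below shows one way the retrieved passages might be joined and prefixed before tokenization. The helper name `build_generator_input`, the space separator, the word budget, and the second sample passage are illustrative assumptions, not part of the released code; only the prefix string is taken from the usage example.

```python
PREFIX = "ci gaba da rubutun Hausa:"  # task prefix copied from the usage example


def build_generator_input(passages, max_words=90):
    """Join retrieved passages into one prefixed context string.

    Hypothetical helper: the word budget is only a rough stand-in for the
    128-token input limit, because token counts depend on the tokenizer.
    """
    words = []
    for passage in passages:
        words.extend(passage.split())
    # Keep at most max_words words, then prepend the task prefix.
    return f"{PREFIX} {' '.join(words[:max_words])}".strip()


# Illustrative passages; in practice these would come from the retriever.
passages = [
    "Wurin Shakatawa na Yankari babban wurin shakatawa ne",
    "Yana cikin Jihar Bauchi",
]
context = build_generator_input(passages)
print(context)
```

The resulting string can be passed to the tokenizer in place of the hard-coded `context`. Truncating by words is a simplification; `truncation=True` in the tokenizer call still enforces the real token limit.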
| Parameter | Value |
|---|---|
| Base model | google/mt5-small |
| Dataset | msmaje/filtered_expanded_hausa_corpus |
| Training samples | 80,000 |
| Quality threshold | >= 50 |
| Epochs | 4 |
| Batch size (effective) | 32 |
| Learning rate | 0.0003 |
| Input max length | 128 tokens |
| Target max length | 128 tokens |
| Eval ROUGE-L | Not reported |
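To put the table in perspective, the number of optimizer steps implied by these values can be worked out directly. This is a back-of-the-envelope sketch that assumes all 80,000 samples are used for training on a single device; the split of the effective batch into a per-device batch of 8 with gradient accumulation is an assumption, since the card does not state it.

```python
import math

train_samples = 80_000   # "Training samples" row
effective_batch = 32     # "Batch size (effective)" row
epochs = 4               # "Epochs" row

# Assumed split of the effective batch; the card does not state it.
per_device_batch = 8
grad_accum_steps = effective_batch // per_device_batch

# One optimizer step consumes one effective batch.
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * epochs

print(grad_accum_steps, steps_per_epoch, total_steps)  # 4 2500 10000
```

Under these assumptions, training runs for 2,500 optimizer steps per epoch and 10,000 steps in total.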
```
User Query (Hausa)
      │
      ▼
LaBSE Encoder ──► FAISS Index (590K Hausa passages)
      │                       │
      │           Top-5 Retrieved Passages
      │                       │
      └───────────┬───────────┘
                  │
        Context Augmentation
                  │
                  ▼
hikaf-hausa-mt5-small (this model)
                  │
                  ▼
Hausa Response (grounded, verified)
```
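The retrieval step in the diagram can be illustrated with a brute-force nearest-neighbour search. In this minimal sketch, toy 3-dimensional vectors stand in for LaBSE embeddings and a plain Python scan stands in for the FAISS index; the function names `cosine` and `top_k` and all vectors are made up for illustration and do not reflect the actual HIKAF implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, passage_vecs, k=5):
    """Return indices of the k passages most similar to the query."""
    ranked = sorted(
        range(len(passage_vecs)),
        key=lambda i: cosine(query_vec, passage_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings (assumed); a real system would encode text with LaBSE.
passage_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [0.8, 0.6, 0.0]
print(top_k(query_vec, passage_vecs, k=2))  # [1, 0]
```

A linear scan like this is only practical for small collections; at 590K passages, an index structure such as the FAISS index named in the diagram is what makes the lookup fast.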
```bibtex
@inproceedings{maje2026hikaf,
  title     = {Harnessing Retrieval-Augmented Generation (RAG) Systems for Inclusive
               Knowledge Access in the Hausa Language},
  author    = {Maje, Musa},
  booktitle = {Proceedings of RISE 2026},
  year      = {2026},
}
```