AfriScience-MT collection (arXiv:2605.29741): MT models for African scientific text in Amharic, Hausa, Luganda, Northern Sotho, Yoruba and isiZulu.
This model is part of the AfriScience-MT project, focused on machine translation of scientific texts for African languages.
| Property | Value |
|---|---|
| Model Type | Seq2Seq Translation |
| Translation Direction | English → Luganda |
| Base Model | facebook/m2m100_418M |
| Domain | Scientific/Academic texts |
| Training | Full fine-tuning on AfriScience-MT dataset |
Performance on the AfriScience-MT validation and test sets:
| Split | BLEU | chrF | SSA-COMET |
|---|---|---|---|
| Validation | 22.96 | 50.48 | 64.53 |
| Test | 20.90 | 48.61 | 63.24 |
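Metrics explanation (higher is better for all three):
- BLEU: word n-gram precision against the reference translation, with a brevity penalty.
- chrF: character n-gram F-score, which is more tolerant of morphological variation than BLEU.
- SSA-COMET: a learned, COMET-style neural metric targeting Sub-Saharan African languages.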
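As a rough illustration of what chrF measures, the sketch below computes a simplified sentence-level character n-gram F-score in plain Python. It is not the scorer used to produce the numbers in the table: the function names `char_ngrams` and `simple_chrf` are hypothetical, whitespace is simply dropped, and precision and recall are averaged over n-gram orders 1 to 6 with β = 2, so values will differ from those of a standard evaluation toolkit.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of order n after removing all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF on a 0-100 scale (illustrative, not a reference scorer)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta with beta = 2 weights recall more heavily than precision.
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    print(simple_chrf("the cell divides", "the cells divide"))
```

Because the score is built from character overlaps, a hypothesis that differs from the reference only in a suffix still earns partial credit, which is one reason chrF is often reported alongside BLEU for morphologically rich languages.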
Basic usage with the `transformers` library:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "dsfsi/m2m100_418m-eng-lug"
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Set source language
tokenizer.src_lang = "en"

# Translate
text = "The mitochondria is the powerhouse of the cell."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=256)

# Generate with target language
forced_bos_token_id = tokenizer.get_lang_id("lg")
outputs = model.generate(**inputs, forced_bos_token_id=forced_bos_token_id, max_length=256, num_beams=5)
translation = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(translation)
```
To translate several sentences at once, reusing `model`, `tokenizer` and `forced_bos_token_id` from above:

```python
texts = [
    "Climate change affects agricultural productivity.",
    "The study analyzed genetic markers in the population.",
    "Renewable energy sources are essential for sustainable development.",
]

inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=256)
outputs = model.generate(**inputs, forced_bos_token_id=forced_bos_token_id, max_length=256, num_beams=5)
translations = tokenizer.batch_decode(outputs, skip_special_tokens=True)

for src, tgt in zip(texts, translations):
    print(f"{src}\n→ {tgt}\n")
```
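For inputs larger than a handful of sentences, passing everything to `generate` in one call can exhaust memory. The following is a minimal sketch of chunked translation, assuming a `model` and `tokenizer` loaded as shown earlier. The helper names `chunked` and `translate_in_batches` and the default `batch_size=8` are illustrative choices, not part of the AfriScience-MT codebase; the tokenizer and generation arguments mirror the usage example.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` containing at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_in_batches(texts: Sequence[str], model, tokenizer,
                         batch_size: int = 8,
                         src_lang: str = "en",
                         tgt_lang: str = "lg") -> List[str]:
    """Translate `texts` chunk by chunk, preserving the input order."""
    tokenizer.src_lang = src_lang
    forced_bos_token_id = tokenizer.get_lang_id(tgt_lang)
    translations: List[str] = []
    for batch in chunked(texts, batch_size):
        inputs = tokenizer(list(batch), return_tensors="pt", padding=True,
                           truncation=True, max_length=256)
        outputs = model.generate(**inputs,
                                 forced_bos_token_id=forced_bos_token_id,
                                 max_length=256, num_beams=5)
        translations.extend(
            tokenizer.batch_decode(outputs, skip_special_tokens=True)
        )
    return translations
```

A call such as `translate_in_batches(sentences, model, tokenizer, batch_size=4)` returns one translation per input sentence; a smaller `batch_size` trades throughput for lower peak memory.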
| Parameter | Value |
|---|---|
| Epochs | 10 |
| Batch Size | 8 |
| Learning Rate | 2e-05 |
To reproduce this model:
```bash
# Clone the AfriScience-MT repository
git clone https://github.com/afriscience-mt/afriscience-mt.git
cd afriscience-mt

# Install dependencies
pip install -r requirements.txt

# Run training
python -m afriscience_mt.scripts.run_seq2seq_training \
    --data_dir ./data \
    --source_lang eng \
    --target_lang lug \
    --model_name facebook/m2m100_418M \
    --model_type m2m100 \
    --output_dir ./output \
    --num_epochs 10 \
    --batch_size 16 \
    --learning_rate 2e-5
```
If you use this model, please cite the AfriScience-MT paper (arXiv:2605.29741):
```bibtex
@article{abdulmumin2026afriscience,
  title   = {AfriScience-MT: Towards Decolonizing Science in Africa through Text Translation},
  author  = {Abdulmumin, Idris and Gwadabe, Tajuddeen and Muhammad, Shamsuddeen Hassan and Adelani, David Ifeoluwa and Khalo, Nomonde and Ahmad, Ibrahim Said and Modupe, Abiodun and Mumm, Anina and Biyela, Sibusiso and Rabie, Michelle and Havemann, Johanna and Rei, Marek and Abbott, Jade and Marivate, Vukosi},
  journal = {arXiv preprint arXiv:2605.29741},
  year    = {2026},
  url     = {https://arxiv.org/abs/2605.29741}
}
```
This model is released under the Apache 2.0 License.