  • om
  • am
  • rw
  • rn
  • ha
  • ig
  • pcm
  • so
  • sw
  • ti
  • yo
  • multilingual


Model description

AfriTeVa base is a multilingual sequence-to-sequence model pretrained on 10 African languages:


Afaan Oromoo (orm), Amharic (amh), Gahuza (gah), Hausa (hau), Igbo (igb), Nigerian Pidgin (pcm), Somali (som), Swahili (swa), Tigrinya (tig), Yoruba (yor)

More information on the model and dataset:

The model

  • 229M-parameter encoder-decoder architecture (T5-like)
  • 12 layers, 12 attention heads, and a 512-token sequence length
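Because inputs are limited to 512 tokens, longer documents have to be truncated or split before they reach the model. The following is a minimal sketch of splitting in plain Python, assuming the token IDs are already available as a list of integers. `chunk_token_ids` and its `stride` overlap parameter are hypothetical names used for illustration; they are not part of the AfriTeVa code or of the transformers API.

```python
from typing import List

MAX_LEN = 512  # sequence length stated in the model card


def chunk_token_ids(ids: List[int], max_len: int = MAX_LEN, stride: int = 0) -> List[List[int]]:
    """Split token IDs into windows of at most max_len tokens.

    stride is the number of tokens shared between consecutive windows.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride
    chunks = []
    for start in range(0, len(ids), step):
        chunks.append(ids[start:start + max_len])
        if start + max_len >= len(ids):
            break  # this window already reaches the end of the input
    return chunks


# Hypothetical input: 1200 dummy token IDs
chunks = chunk_token_ids(list(range(1200)))
print([len(c) for c in chunks])  # [512, 512, 176]
```

If the tokenizer appends special tokens such as an end-of-sequence marker, `max_len` likely needs to be reduced accordingly so that each window still fits within the limit.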

The dataset

  • Multilingual: 10 African languages listed above
  • 143 million tokens (1 GB of text data)
  • Tokenizer Vocabulary Size: 70,000 tokens
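As a rough sanity check on the corpus figures above, the average amount of text per token can be worked out directly. This is only a back-of-the-envelope sketch: it assumes "1GB" means 10**9 bytes, and the card does not say how tokens were counted, so the result is indicative at best.

```python
TOKENS = 143_000_000       # token count from the model card
BYTES = 1_000_000_000      # assumption: "1GB" read as 10**9 bytes

bytes_per_token = BYTES / TOKENS
print(f"{bytes_per_token:.1f} bytes of text per token")  # 7.0 bytes of text per token
```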

Intended uses & limitations

afriteva_base is a pretrained model, primarily intended to be fine-tuned on multilingual sequence-to-sequence tasks.

>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("castorini/afriteva_base")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("castorini/afriteva_base")

>>> src_text = "Ó hùn ọ́ láti di ara wa bí?"
>>> tgt_text = "Would you like to be?"

>>> model_inputs = tokenizer(src_text, return_tensors="pt")
>>> # text_target replaces the deprecated as_target_tokenizer() context manager
>>> labels = tokenizer(text_target=tgt_text, return_tensors="pt").input_ids

>>> outputs = model(**model_inputs, labels=labels)  # forward pass; outputs.loss holds the loss
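When fine-tuning on batches rather than a single pair, target sequences are usually padded to a common length, and the padded positions should not contribute to the loss. PyTorch's cross-entropy loss skips label positions set to -100 by default, which is the convention T5-style models in transformers rely on. A minimal sketch in plain Python follows; `pad_and_mask_labels` is a hypothetical helper, and `pad_id=0` is an assumption (read the real value from `tokenizer.pad_token_id`).

```python
from typing import List

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def pad_and_mask_labels(batch: List[List[int]], pad_id: int = 0) -> List[List[int]]:
    """Pad label sequences to the longest one, writing IGNORE_INDEX in padded slots.

    Any pad_id already present in the input sequences is also replaced.
    """
    max_len = max((len(seq) for seq in batch), default=0)
    masked = []
    for seq in batch:
        row = [IGNORE_INDEX if tok == pad_id else tok for tok in seq]
        row += [IGNORE_INDEX] * (max_len - len(row))
        masked.append(row)
    return masked


# Hypothetical label IDs for two target sentences of different lengths
labels = pad_and_mask_labels([[71, 12, 1], [9, 1]])
print(labels)  # [[71, 12, 1], [9, 1, -100]]
```

The result can be wrapped with `torch.tensor(labels)` and passed as `labels=` to the model. In practice, `DataCollatorForSeq2Seq` from transformers handles this padding, with `label_pad_token_id` defaulting to -100.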

Training Procedure

For information on training procedures, please refer to the AfriTeVa paper or repository.

BibTeX entry and citation info

Coming soon.
