Edit model card

afriteva_base

Model desription

AfriTeVa base is a multilingual sequence to sequence model pretrained on 10 African languages

Languages

Afaan Oromoo(orm), Amharic(amh), Gahuza(gah), Hausa(hau), Igbo(igb), Nigerian Pidgin(pcm), Somali(som), Swahili(swa), Tigrinya(tig), Yoruba(yor)

More information on the model, dataset:

The model

  • 229M parameters encoder-decoder architecture (T5-like)
  • 12 layers, 12 attention heads and 512 token sequence length

The dataset

  • Multilingual: 10 African languages listed above
  • 143 Million Tokens (1GB of text data)
  • Tokenizer Vocabulary Size: 70,000 tokens

Intended uses & limitations

afriteva_base is pre-trained model and primarily aimed at being fine-tuned on multilingual sequence-to-sequence tasks.

>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("castorini/afriteva_base")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("castorini/afriteva_base")

>>> src_text = "Ó hùn ọ́ láti di ara wa bí?"
>>> tgt_text =  "Would you like to be?"

>>> model_inputs = tokenizer(src_text, return_tensors="pt")
>>> with tokenizer.as_target_tokenizer():
        labels = tokenizer(tgt_text, return_tensors="pt").input_ids

>>> model(**model_inputs, labels=labels) # forward pass

Training Procedure

For information on training procedures, please refer to the AfriTeVa paper or repository

BibTex entry and Citation info

coming soon ...

Downloads last month
627
Hosted inference API
This model can be loaded on the Inference API on-demand.