---
language:
- om
- am
- rw
- rn
- ha
- ig
- pcm
- so
- sw
- ti
- yo
- multilingual
---

# afriteva_base

## Model description

AfriTeVa base is a multilingual sequence-to-sequence model pretrained on 10 African languages.

## Languages

Afaan Oromoo (orm), Amharic (amh), Gahuza (gah), Hausa (hau), Igbo (igb), Nigerian Pidgin (pcm), Somali (som), Swahili (swa), Tigrinya (tig), Yoruba (yor)

### More information on the model, dataset:

### The model

- 229M parameters encoder-decoder architecture (T5-like)
- 12 layers, 12 attention heads and 512 token sequence length

### The dataset

- Multilingual: 10 African languages listed above
- 143 Million Tokens (1GB of text data)
- Tokenizer Vocabulary Size: 70,000 tokens

## Intended uses & limitations

`afriteva_base` is a pre-trained model, primarily intended to be fine-tuned on multilingual sequence-to-sequence tasks.

```python
>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

>>> # Load the pre-trained tokenizer and model from the Hugging Face Hub
>>> tokenizer = AutoTokenizer.from_pretrained("castorini/afriteva_base")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("castorini/afriteva_base")

>>> src_text = "Ó hùn ọ́ láti di ara wa bí?"
>>> tgt_text = "Would you like to be?"

>>> # Tokenize the source text and, separately, the target text used as labels
>>> model_inputs = tokenizer(src_text, return_tensors="pt")
>>> with tokenizer.as_target_tokenizer():
...     labels = tokenizer(tgt_text, return_tensors="pt").input_ids

>>> model(**model_inputs, labels=labels) # forward pass
```

## Training Procedure

For information on training procedures, please refer to the AfriTeVa [paper](#) or [repository](https://github.com/castorini/afriteva).

## BibTeX entry and citation info

Coming soon ...