---
language:
- om
- am
- rw
- rn
- ha
- ig
- pcm
- so
- sw
- ti
- yo
- multilingual
- T5
---
# afriteva_base
## Model description
AfriTeVa base is a sequence-to-sequence model pretrained on 10 African languages.
## Languages
Afaan Oromoo (orm), Amharic (amh), Gahuza (gah), Hausa (hau), Igbo (igb), Nigerian Pidgin (pcm), Somali (som), Swahili (swa), Tigrinya (tig), Yoruba (yor)
### More information on the model and dataset
### The model
- 229M-parameter encoder-decoder architecture (T5-like)
- 12 layers, 12 attention heads, and a 512-token sequence length
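
Because the architecture is T5-like, the checkpoint can presumably be loaded with the generic sequence-to-sequence classes in Hugging Face `transformers`. The following is a minimal sketch under that assumption, not the authors' documented usage: `MODEL_ID` is a hypothetical Hub id to be replaced with the id of this repository, `load_afriteva` and `generate` are illustrative helper names, and the code assumes `transformers`, `torch`, and `sentencepiece` are installed. Inputs are truncated to the 512-token sequence length listed above.

```python
# Hypothetical Hub id (an assumption); replace with the id of this repository.
MODEL_ID = "castorini/afriteva_base"
# Sequence length stated in this model card.
MAX_LENGTH = 512


def load_afriteva(model_id: str = MODEL_ID):
    """Return (tokenizer, model) for a T5-like seq2seq checkpoint."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model


def generate(text: str, tokenizer, model, max_new_tokens: int = 32) -> str:
    """Tokenize `text`, run generation, and decode the first output sequence."""
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,  # cut inputs longer than the stated sequence length
        max_length=MAX_LENGTH,
    )
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `tokenizer, model = load_afriteva()` and then `generate("Habari ya asubuhi", tokenizer, model)` would exercise the sketch end to end. Since this is a pretrained checkpoint, its raw generations are unlikely to be useful until the model is fine-tuned on a downstream task.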
### The dataset
- Multilingual: 10 African languages listed above
- 143 million tokens (1 GB of text data)
- Tokenizer vocabulary size: 70,000 tokens
## Training Procedure
For information on training procedures, please refer to the AfriTeVa [paper](#) or [repository](https://github.com/castorini/afriteva).
## BibTeX entry and citation info
Coming soon.