---
language: 
- om
- am
- rw
- rn
- ha
- ig
- pcm
- so
- sw
- ti
- yo
- multilingual
tags:
- T5

---
# afriteva_small

## Model description

AfriTeVa small is a sequence-to-sequence model pretrained on 10 African languages.

## Languages

Afaan Oromoo (orm), Amharic (amh), Gahuza (gah), Hausa (hau), Igbo (igb), Nigerian Pidgin (pcm), Somali (som), Swahili (swa), Tigrinya (tig), Yoruba (yor)

## More information on the model and dataset

### The model

- A 64M-parameter encoder-decoder architecture (T5-like)
- 6 layers, 8 attention heads, and a 512-token sequence length
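
The architecture numbers above can be plugged into a `transformers` `T5Config` to instantiate a comparable model from scratch. This is only a sketch: the card does not state `d_model` or `d_ff`, so the values below are placeholder assumptions and the resulting parameter count will not necessarily match 64M.

```python
from transformers import T5Config, T5ForConditionalGeneration

# Hypothetical config built from the card's numbers; d_model and d_ff are
# assumptions, not values stated in this card.
config = T5Config(
    vocab_size=70_000,   # tokenizer vocabulary size from the card
    num_layers=6,        # encoder layers
    num_decoder_layers=6,
    num_heads=8,
    d_model=512,         # assumed
    d_ff=2048,           # assumed
)

# Randomly initialized model with this shape (not the pretrained weights).
model = T5ForConditionalGeneration(config)
print(f"{model.num_parameters() / 1e6:.0f}M parameters")
```

To use the actual pretrained weights, load `castorini/afriteva_small` with `from_pretrained` instead, as shown in the usage example.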

### The dataset

- Multilingual: the 10 African languages listed above
- 143 million tokens (1 GB of text data)
- Tokenizer vocabulary size: 70,000 tokens

## Intended uses & limitations 

`afriteva_small` is a pretrained model primarily intended to be fine-tuned on multilingual sequence-to-sequence tasks.

```python
>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("castorini/afriteva_small")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("castorini/afriteva_small")

>>> src_text = "Ó hùn ọ́ láti di ara wa bí?"
>>> tgt_text = "Would you like to be?"

>>> model_inputs = tokenizer(src_text, return_tensors="pt")
>>> with tokenizer.as_target_tokenizer():
...     labels = tokenizer(tgt_text, return_tensors="pt").input_ids

>>> model(**model_inputs, labels=labels) # forward pass
```
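
For inference after fine-tuning, generation follows the usual `transformers` pattern. A minimal sketch (note that the pretrained, not-yet-fine-tuned checkpoint will not produce a meaningful translation, so no specific output is shown):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("castorini/afriteva_small")
model = AutoModelForSeq2SeqLM.from_pretrained("castorini/afriteva_small")

src_text = "Ó hùn ọ́ láti di ara wa bí?"
model_inputs = tokenizer(src_text, return_tensors="pt")

# Greedy decoding; swap in beam search or sampling as needed.
generated = model.generate(**model_inputs, max_length=40)
output = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
print(output)
```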

## Training Procedure

For information on training procedures, please refer to the AfriTeVa [paper](#) or [repository](https://github.com/castorini/afriteva).

## BibTeX entry and citation info

coming soon ...