ToluClassics committed
Commit 6760505 • 1 Parent(s): c8540f5
Update README.md
README.md
CHANGED
@@ -39,6 +39,21 @@ Afaan Oromoo(orm), Amharic(amh), Gahuza(gah), Hausa(hau), Igbo(igb), Nigerian Pi
 - 143 Million Tokens (1GB of text data)
 - Tokenizer Vocabulary Size: 70,000 tokens
 
+## Intended uses & limitations
+`afriteva_small` is a pre-trained model, primarily intended to be fine-tuned on multilingual sequence-to-sequence tasks.
+
+```python
+>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+>>> tokenizer = AutoTokenizer.from_pretrained("castorini/afriteva_small")
+>>> model = AutoModelForSeq2SeqLM.from_pretrained("castorini/afriteva_small")
+>>> src_text = "Ó hùn ọ́ láti di ara wa bí?"
+>>> tgt_text = "Would you like to be?"
+>>> model_inputs = tokenizer(src_text, return_tensors="pt")
+>>> with tokenizer.as_target_tokenizer():
+...     labels = tokenizer(tgt_text, return_tensors="pt").input_ids
+>>> model(**model_inputs, labels=labels) # forward pass
+```
+
 ## Training Procedure
 
 For information on training procedures, please refer to the AfriTeVa [paper](#) or [repository](https://github.com/castorini/afriteva)
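The README example in this commit builds labels with `tokenizer.as_target_tokenizer()`, which recent `transformers` releases deprecate in favor of the `text_target` argument. The following is a minimal sketch of the same forward pass under that assumption. The helper names `encode_pair`, `training_loss`, and `demo` are hypothetical and not part of the AfriTeVa repository, and the sketch assumes a `transformers` version that accepts `text_target`, with PyTorch installed.

```python
def encode_pair(tokenizer, src_text, tgt_text):
    """Tokenize source and target together; `text_target` fills the `labels` field."""
    return tokenizer(src_text, text_target=tgt_text, return_tensors="pt")


def training_loss(model, batch):
    """Run a forward pass with labels and return the loss from the model output."""
    return model(**batch).loss


def demo():
    """Hypothetical end-to-end run; downloads the checkpoint from the Hub."""
    # Imported here so the helpers above stay usable without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("castorini/afriteva_small")
    model = AutoModelForSeq2SeqLM.from_pretrained("castorini/afriteva_small")

    batch = encode_pair(
        tokenizer,
        "Ó hùn ọ́ láti di ara wa bí?",
        "Would you like to be?",
    )
    loss = training_loss(model, batch)
    loss.backward()  # gradients for one fine-tuning step (optimizer not shown)
    return float(loss)
```

Calling `demo()` should return a single loss value for the sentence pair. A real fine-tuning loop would add batching, padding, and an optimizer step, none of which are specified in this README.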