Commit ae18166 (parent: 1987119): Update README.md

README.md CHANGED
@@ -9,7 +9,9 @@ tags:
 - PyTorch
 - Safetensors
 widget:
-- text: 'translate spanish to nahuatl: 
 ---
 
 # mt5-large-spanish-nahuatl
@@ -25,15 +27,22 @@ This model is an MT5 Transformer ([mt5-large](https://huggingface.co/google/mt5-
 from transformers import AutoModelForSeq2SeqLM
 from transformers import AutoTokenizer
 
-model = AutoModelForSeq2SeqLM.from_pretrained('
-tokenizer = AutoTokenizer.from_pretrained('
 
 model.eval()
-
-
-
-
-
 ```
 
 ## Approach
@@ -63,14 +72,11 @@ Also, an additional 30,000 samples were collected from the web to enhance the data.
 ### Model and training
 The method uses a single training stage with mT5, chosen because it can handle different vocabularies and prefixes.
 
-### Training
-
-
-### Training-stage 2 (learning Nahuatl)
-We use the pre-trained Spanish-English model to learn Spanish-Nahuatl. Since the number of Nahuatl pairs is limited, we also add 20,000 samples from the English-Spanish Anki dataset. This two-task training avoids overfitting and makes the model more robust.
 
 ### Training setup
-
 
 
 ## Evaluation results
@@ -86,9 +92,6 @@ The results are reported using CHRF++ and BLEU:
 | True | Zero-shot | 5.24 | 25.7 |
 
 ## References
-- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits
-of transfer learning with a unified Text-to-Text transformer.
-
 - Ximena Gutierrez-Vasques, Gerardo Sierra, and Hernandez Isaac. 2016. Axolotl: a web accessible parallel corpus for Spanish-Nahuatl. In International Conference on Language Resources and Evaluation (LREC).
 
 - https://github.com/christos-c/bible-corpus
 - PyTorch
 - Safetensors
 widget:
+- text: 'translate spanish to nahuatl: Quiero agua.'
+- text: 'or'
+- text: 'translate nahuatl to spanish: Nimitstlazohkamate.'
 ---
 
 # mt5-large-spanish-nahuatl
 from transformers import AutoModelForSeq2SeqLM
 from transformers import AutoTokenizer
 
+model = AutoModelForSeq2SeqLM.from_pretrained('luisarmando/mt5-large-es-nah').to("cuda")
+tokenizer = AutoTokenizer.from_pretrained('luisarmando/mt5-large-es-nah')
 
 model.eval()
+
+# Translate Spanish to Nahuatl
+input_ids = tokenizer("translate spanish to nahuatl: conejo", return_tensors="pt").input_ids
+outputs = model.generate(input_ids.to("cuda"))
+tokenizer.batch_decode(outputs, skip_special_tokens=True)
+# outputs = tochtli
+
+# Translate Nahuatl to Spanish
+input_ids = tokenizer("translate nahuatl to spanish: xochitl", return_tensors="pt").input_ids
+outputs = model.generate(input_ids.to("cuda"))
+tokenizer.batch_decode(outputs, skip_special_tokens=True)
+# outputs = flor
 ```
 
 ## Approach
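As an aside, the snippet added above hard-codes CUDA tensors. A device-agnostic sketch of the same usage, with the task prefixes factored out, might look like the following (the wrapper names are illustrative and not part of the card):

```python
# Illustrative wrapper around the card's usage snippet.
# Model id and prefixes come from the card; the helpers are assumptions.

PREFIXES = {
    "es2nah": "translate spanish to nahuatl: ",
    "nah2es": "translate nahuatl to spanish: ",
}

def make_prompt(text: str, direction: str) -> str:
    """Prepend the task prefix the mT5 checkpoint was trained with."""
    return PREFIXES[direction] + text

def translate(text: str, direction: str = "es2nah") -> str:
    """Load the checkpoint lazily and translate one string on CPU or GPU."""
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained("luisarmando/mt5-large-es-nah")
    model = AutoModelForSeq2SeqLM.from_pretrained("luisarmando/mt5-large-es-nah").to(device)
    model.eval()
    input_ids = tokenizer(make_prompt(text, direction), return_tensors="pt").input_ids
    outputs = model.generate(input_ids.to(device), max_new_tokens=64)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

print(make_prompt("conejo", "es2nah"))  # translate spanish to nahuatl: conejo
```

This keeps the prompt construction testable without loading the 1.2B-parameter checkpoint.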
 ### Model and training
 The method uses a single training stage with mT5, chosen because it can handle different vocabularies and prefixes.
 
+### Training
+The model is trained until convergence, adding the prefixes "translate spanish to nahuatl: " and "translate nahuatl to spanish: " to the input text.
 
 ### Training setup
+The model uses the same dataset for 77,500 steps with batch size 4 and a learning rate of 1e-4.
 
 
 ## Evaluation results
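The bidirectional prefixing described in the Training section amounts to emitting one example per direction for every sentence pair. A minimal sketch in plain Python (the pair data below is invented for illustration):

```python
def build_bidirectional_examples(pairs):
    """Turn (spanish, nahuatl) pairs into prefixed seq2seq examples,
    one per translation direction, as described in the card."""
    examples = []
    for es, nah in pairs:
        examples.append({"input": "translate spanish to nahuatl: " + es, "target": nah})
        examples.append({"input": "translate nahuatl to spanish: " + nah, "target": es})
    return examples

# Toy pairs borrowed from the card's own examples.
pairs = [("conejo", "tochtli"), ("flor", "xochitl")]
for ex in build_bidirectional_examples(pairs):
    print(ex["input"], "->", ex["target"])
```

Training on both directions at once doubles the effective data and lets one checkpoint serve both tasks via the prefix.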
 | True | Zero-shot | 5.24 | 25.7 |
 
 ## References
 - Ximena Gutierrez-Vasques, Gerardo Sierra, and Hernandez Isaac. 2016. Axolotl: a web accessible parallel corpus for Spanish-Nahuatl. In International Conference on Language Resources and Evaluation (LREC).
 
 - https://github.com/christos-c/bible-corpus