milmor committed
Commit a4946fb
1 Parent(s): aed0a76

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -61,10 +61,10 @@ Also, we collected 3,000 extra samples from the web to increase the data.
  We employ two training stages using a multilingual T5-small. The advantage of this model is that it can handle different vocabularies and prefixes. T5-small is pre-trained on different tasks and languages (French, Romanian, English, German).

  ### Training-stage 1 (learning Spanish)
- In training stage 1, we first introduce Spanish to the model. The goal is to learn a new language rich in data (Spanish) and not lose the previous knowledge. We use the English-Spanish [Anki](https://www.manythings.org/anki/) dataset, which consists of 118.964 text pairs. Next, we train the model till convergence, adding the prefix "Translate Spanish to English: "
+ In training stage 1, we first introduce Spanish to the model. The goal is to learn a new, data-rich language (Spanish) without losing the previously acquired knowledge. We use the English-Spanish [Anki](https://www.manythings.org/anki/) dataset, which consists of 118,964 text pairs. The model is trained until convergence, adding the prefix "Translate Spanish to English: ".

  ### Training-stage 2 (learning Nahuatl)
- We use the pre-trained Spanish-English model to learn Spanish-Nahuatl. Since the amount of Nahuatl pairs is limited, we also add 20,000 samples from the English-Spanish Anki dataset to our dataset. This two-task training avoids overfitting and makes the model more robust.
+ We use the pre-trained Spanish-English model to learn Spanish-Nahuatl. Since the number of Nahuatl pairs is limited, we also add 20,000 samples from the English-Spanish Anki dataset. This two-task training avoids overfitting and makes the model more robust.

  ### Training setup
  We train the models on the same datasets for 660k steps using batch size = 16 and a learning rate of 2e-5.
@@ -79,7 +79,7 @@ We evaluate the model on the same 505 validation Nahuatl sentences for a fair comparison.
  | True | 1.31 | 6.18 | 28.21 |


- The English-Spanish pretraining improves BLEU and Chrf and leads to faster convergence. Is it possible to reproduce the evaluation on the [eval.ipynb](https://github.com/milmor/spanish-nahuatl-translation/blob/main/eval.ipynb) notebook.
+ The English-Spanish pretraining improves BLEU and chrF and leads to faster convergence. The evaluation can be reproduced in the [eval.ipynb](https://github.com/milmor/spanish-nahuatl-translation/blob/main/eval.ipynb) notebook.

  ## References
  - Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits
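
The README text in the diff above relies on T5's prefix mechanism to select a task. The snippet below is a minimal sketch of prefix-conditioned generation with a T5-small checkpoint through the Hugging Face `transformers` API; the vanilla `t5-small` model id is only a stand-in, since the project's fine-tuned checkpoint is not named in this commit.

```python
# Minimal sketch: prefix-conditioned generation with a T5-small checkpoint.
# "t5-small" is a placeholder; the fine-tuned Spanish-Nahuatl checkpoint
# used by this project is not part of this commit.
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "t5-small"  # assumption: replace with the fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# The task is selected with a text prefix, as described in the README.
text = "Translate Spanish to English: muchas gracias por tu ayuda."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```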
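For Training-stage 1, here is a sketch of turning English-Spanish Anki pairs into prefixed source/target examples. The `spa.txt` filename and the tab-separated English/Spanish column order are assumptions about the Anki download, not details stated in the diff.

```python
# Sketch: build "Translate Spanish to English: " training pairs from an
# Anki-style tab-separated file (English<TAB>Spanish per line).
# The path and column order are assumptions about the downloaded file.
def load_anki_pairs(path="spa.txt"):
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            if len(cols) < 2:
                continue
            english, spanish = cols[0], cols[1]
            pairs.append(
                {"source": "Translate Spanish to English: " + spanish,
                 "target": english}
            )
    return pairs
```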
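For Training-stage 2, the diff describes mixing the Spanish-Nahuatl pairs with 20,000 English-Spanish Anki samples. The sketch below shows that mixing step; the "Translate Spanish to Nahuatl: " prefix and all variable names are illustrative assumptions rather than details from this commit.

```python
import random

# Sketch: combine the two tasks for stage 2. `nahuatl_pairs` are
# (Spanish, Nahuatl) tuples; `anki_pairs` come from load_anki_pairs() above.
# The "Translate Spanish to Nahuatl: " prefix is an assumed counterpart to
# the English prefix quoted in the README.
def build_stage2_dataset(nahuatl_pairs, anki_pairs, n_anki=20_000, seed=0):
    random.seed(seed)
    mixed = [
        {"source": "Translate Spanish to Nahuatl: " + es, "target": nah}
        for es, nah in nahuatl_pairs
    ]
    mixed += random.sample(anki_pairs, k=min(n_anki, len(anki_pairs)))
    random.shuffle(mixed)
    return mixed
```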
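The training setup quoted in the diff (660k steps, batch size 16, learning rate 2e-5) can be expressed with `Seq2SeqTrainingArguments` as below; this is only one way to realize those hyperparameters, and the original repository may use a different training loop.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch: the hyperparameters quoted in the README. Only the three
# documented values are taken from the commit; everything else is left
# at library defaults, and the output_dir is an assumed path.
training_args = Seq2SeqTrainingArguments(
    output_dir="t5-small-spanish-nahuatl",
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    max_steps=660_000,
)
```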
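The BLEU and chrF columns in the results table correspond to corpus-level scores of the kind `sacrebleu` computes; below is a sketch of scoring predictions against the 505 validation references. The list names are placeholders, and the linked eval.ipynb notebook remains the authoritative evaluation.

```python
import sacrebleu

# Sketch: corpus-level BLEU and chrF, as reported in the results table.
# `hypotheses` and `references` are placeholder lists of 505 strings each.
def score(hypotheses, references):
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    chrf = sacrebleu.corpus_chrf(hypotheses, [references])
    return {"bleu": bleu.score, "chrf": chrf.score}
```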