luisarmando committed on
Commit ae18166
1 Parent(s): 1987119

Update README.md

Files changed (1)
  1. README.md +20 -17
README.md CHANGED
@@ -9,7 +9,9 @@ tags:
- PyTorch
- Safetensors
widget:
- - text: 'translate spanish to nahuatl: México lindo y querido.'
+ - text: 'translate spanish to nahuatl: Quiero agua.'
+ - text: 'or'
+ - text: 'translate nahuatl to spanish: Nimitstlazohkamate.'
---

  # mt5-large-spanish-nahuatl
@@ -25,15 +27,22 @@ This model is an MT5 Transformer ([mt5-large](https://huggingface.co/google/mt5-
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer

- model = AutoModelForSeq2SeqLM.from_pretrained('hackathon-pln-es/t5-small-spanish-nahuatl')
- tokenizer = AutoTokenizer.from_pretrained('hackathon-pln-es/t5-small-spanish-nahuatl')
+ model = AutoModelForSeq2SeqLM.from_pretrained('luisarmando/mt5-large-es-nah').to("cuda")
+ tokenizer = AutoTokenizer.from_pretrained('luisarmando/mt5-large-es-nah')

model.eval()
- sentence = 'muchas flores son blancas'
- input_ids = tokenizer('translate Spanish to Nahuatl: ' + sentence, return_tensors='pt').input_ids
- outputs = model.generate(input_ids)
- # outputs = miak xochitl istak
- outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
+
+ # Translate Spanish to Nahuatl
+ input_ids = tokenizer("translate spanish to nahuatl: conejo", return_tensors="pt").input_ids
+ outputs = model.generate(input_ids.to("cuda"))
+ tokenizer.batch_decode(outputs, skip_special_tokens=True)
+ # outputs = tochtli
+
+ # Translate Nahuatl to Spanish
+ input_ids = tokenizer("translate nahuatl to spanish: xochitl", return_tensors="pt").input_ids
+ outputs = model.generate(input_ids.to("cuda"))
+ tokenizer.batch_decode(outputs, skip_special_tokens=True)
+ # outputs = flor
```

  ## Approach
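
The updated usage snippet in the hunk above drives `generate` by hand and moves the input ids to `"cuda"`, which assumes the model sits on the GPU as well. A shorter route, not part of the model card, is the `text2text-generation` pipeline, which bundles tokenization, device placement, and decoding. The sketch below is a minimal illustration of that option; the checkpoint id `luisarmando/mt5-large-es-nah` is taken from the hunk, while the `device` and `max_new_tokens` values are arbitrary choices.

```python
from transformers import pipeline

# Minimal sketch (not from the model card): the text2text-generation pipeline
# wraps tokenization, generation, and decoding for seq2seq checkpoints.
# device=0 selects the first GPU; drop it (or use device=-1) to run on CPU.
translator = pipeline(
    "text2text-generation",
    model="luisarmando/mt5-large-es-nah",
    device=0,
)

print(translator("translate spanish to nahuatl: conejo", max_new_tokens=32))
print(translator("translate nahuatl to spanish: xochitl", max_new_tokens=32))
```

Each call returns a list of dicts with a `generated_text` field; per the example outputs in the hunk above, these should read `tochtli` and `flor`.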
@@ -63,14 +72,11 @@ Also, additional 30,000 samples were collected from the web to enhance the data.
### Model and training
The method uses a single training stage with mt5-large, which was chosen because it can handle different vocabularies and prefixes.

- ### Training-stage 1 (learning Spanish)
- In training stage 1, we first introduce Spanish to the model. The goal is to learn a new language rich in data (Spanish) and not lose the previous knowledge. We use the English-Spanish [Anki](https://www.manythings.org/anki/) dataset, which consists of 118,964 text pairs. The model is trained until convergence, adding the prefix "Translate Spanish to English: ".
-
- ### Training-stage 2 (learning Nahuatl)
- We use the pre-trained Spanish-English model to learn Spanish-Nahuatl. Since the amount of Nahuatl pairs is limited, we also add 20,000 samples from the English-Spanish Anki dataset. This two-task training avoids overfitting and makes the model more robust.
+ ### Training
+ The model is trained until convergence, prepending the prefixes "translate spanish to nahuatl: " and "translate nahuatl to spanish: " to the source text.

### Training setup
- We train the models on the same datasets for 660k steps using batch size = 16 and a learning rate of 2e-5.
+ The model is trained on the same dataset for 77,500 steps with a batch size of 4 and a learning rate of 1e-4.


  ## Evaluation results
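
The new Training and Training setup lines in the hunk above describe bidirectional fine-tuning with the two lowercase prefixes, a batch size of 4, a learning rate of 1e-4, and 77,500 steps. The sketch below shows one way such a run could be wired up with the `transformers` Seq2SeqTrainer; it is not the author's training script, and the toy pairs, column names, and output directory are placeholders.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

base_checkpoint = "google/mt5-large"  # base model named in the card
tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(base_checkpoint)

# Toy parallel data (placeholders): each Spanish-Nahuatl pair yields two
# prefixed examples, one per translation direction, as the Training line describes.
pairs = [{"es": "conejo", "nah": "tochtli"}, {"es": "flor", "nah": "xochitl"}]
examples = []
for p in pairs:
    examples.append({"source": "translate spanish to nahuatl: " + p["es"], "target": p["nah"]})
    examples.append({"source": "translate nahuatl to spanish: " + p["nah"], "target": p["es"]})

def tokenize(batch):
    # text_target tokenizes the labels with the same tokenizer (transformers >= 4.21)
    return tokenizer(batch["source"], text_target=batch["target"], truncation=True, max_length=128)

train_ds = Dataset.from_list(examples).map(tokenize, batched=True, remove_columns=["source", "target"])

args = Seq2SeqTrainingArguments(
    output_dir="mt5-large-es-nah",     # placeholder output path
    per_device_train_batch_size=4,     # batch size from the card
    learning_rate=1e-4,                # learning rate from the card
    max_steps=77_500,                  # training steps from the card
    logging_steps=500,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```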
@@ -86,9 +92,6 @@ The results are reported using CHRF++ and BLEU:
| True | Zero-shot | 5.24 | 25.7 |

## References
- - Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits
- of transfer learning with a unified Text-to-Text transformer.
-
- Ximena Gutierrez-Vasques, Gerardo Sierra, and Hernandez Isaac. 2016. Axolotl: a web accessible parallel corpus for Spanish-Nahuatl. In International Conference on Language Resources and Evaluation (LREC).

  - https://github.com/christos-c/bible-corpus
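
The hunk above keeps the evaluation table reported with CHRF++ and BLEU. For reference, a minimal sketch of computing both metrics with the `sacrebleu` package is shown below; the hypothesis and reference sentences are placeholders, not the card's evaluation data.

```python
import sacrebleu

# Placeholder system outputs and references; the card's evaluation data is not shown here.
hypotheses = ["miak xochitl istak", "tochtli"]
references = [["miak xochitl istak", "tochtli"]]  # one reference stream, parallel to hypotheses

# chrF++ is chrF with word n-grams enabled (word_order=2)
chrf = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)
bleu = sacrebleu.corpus_bleu(hypotheses, references)

print(f"CHRF++: {chrf.score:.2f}")
print(f"BLEU:   {bleu.score:.2f}")
```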
 