luisarmando committed on
Commit ae18166
1 Parent(s): 1987119

Update README.md

Files changed (1)
  1. README.md +20 -17
README.md CHANGED
@@ -9,7 +9,9 @@ tags:
- PyTorch
- Safetensors
widget:
- - text: 'translate spanish to nahuatl: México lindo y querido.'
+ - text: 'translate spanish to nahuatl: Quiero agua.'
+ - text: 'or'
+ - text: 'translate nahuatl to spanish: Nimitstlazohkamate.'
---

  # mt5-large-spanish-nahuatl
@@ -25,15 +27,22 @@ This model is an MT5 Transformer ([mt5-large](https://huggingface.co/google/mt5-
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer

- model = AutoModelForSeq2SeqLM.from_pretrained('hackathon-pln-es/t5-small-spanish-nahuatl')
- tokenizer = AutoTokenizer.from_pretrained('hackathon-pln-es/t5-small-spanish-nahuatl')
+ model = AutoModelForSeq2SeqLM.from_pretrained('luisarmando/mt5-large-es-nah').to("cuda")
+ tokenizer = AutoTokenizer.from_pretrained('luisarmando/mt5-large-es-nah')

model.eval()
- sentence = 'muchas flores son blancas'
- input_ids = tokenizer('translate Spanish to Nahuatl: ' + sentence, return_tensors='pt').input_ids
- outputs = model.generate(input_ids)
- # outputs = miak xochitl istak
- outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
+
+ # Translate Spanish to Nahuatl
+ input_ids = tokenizer("translate spanish to nahuatl: conejo", return_tensors="pt").input_ids
+ outputs = model.generate(input_ids.to("cuda"))
+ tokenizer.batch_decode(outputs, skip_special_tokens=True)
+ # outputs = tochtli
+
+ # Translate Nahuatl to Spanish
+ input_ids = tokenizer("translate nahuatl to spanish: xochitl", return_tensors="pt").input_ids
+ outputs = model.generate(input_ids.to("cuda"))
+ tokenizer.batch_decode(outputs, skip_special_tokens=True)
+ # outputs = flor
```

  ## Approach
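
The updated usage snippet in the hunk above drives `generate` by hand and moves the input ids to `"cuda"`, which assumes the model sits on the GPU as well. A shorter route, not part of the model card, is the `text2text-generation` pipeline, which bundles tokenization, device placement, and decoding. The sketch below is a minimal illustration of that option; the checkpoint id `luisarmando/mt5-large-es-nah` is taken from the hunk, while the `device` and `max_new_tokens` values are arbitrary choices.

```python
from transformers import pipeline

# Minimal sketch (not from the model card): the text2text-generation pipeline
# wraps tokenization, generation, and decoding for seq2seq checkpoints.
# device=0 selects the first GPU; drop it (or use device=-1) to run on CPU.
translator = pipeline(
    "text2text-generation",
    model="luisarmando/mt5-large-es-nah",
    device=0,
)

print(translator("translate spanish to nahuatl: conejo", max_new_tokens=32))
print(translator("translate nahuatl to spanish: xochitl", max_new_tokens=32))
```

Each call returns a list of dicts with a `generated_text` field; per the example outputs in the hunk above, these should read `tochtli` and `flor`.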
@@ -63,14 +72,11 @@ Also, additional 30,000 samples were collected from the web to enhance the data.
### Model and training
The method uses a single training stage with mt5-large, which was chosen because it can handle different vocabularies and prefixes.

- ### Training-stage 1 (learning Spanish)
- In training stage 1, we first introduce Spanish to the model. The goal is to learn a new language rich in data (Spanish) and not lose the previous knowledge. We use the English-Spanish [Anki](https://www.manythings.org/anki/) dataset, which consists of 118,964 text pairs. The model is trained until convergence, adding the prefix "Translate Spanish to English: ".
-
- ### Training-stage 2 (learning Nahuatl)
- We use the pre-trained Spanish-English model to learn Spanish-Nahuatl. Since the amount of Nahuatl pairs is limited, we also add 20,000 samples from the English-Spanish Anki dataset. This two-task training avoids overfitting and makes the model more robust.
+ ### Training
+ The model is trained until convergence, prepending the prefixes "translate spanish to nahuatl: " and "translate nahuatl to spanish: " to the source text.

### Training setup
- We train the models on the same datasets for 660k steps using batch size = 16 and a learning rate of 2e-5.
+ The model is trained on the same dataset for 77,500 steps with a batch size of 4 and a learning rate of 1e-4.


  ## Evaluation results
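
The new Training and Training setup lines in the hunk above describe bidirectional fine-tuning with the two lowercase prefixes, a batch size of 4, a learning rate of 1e-4, and 77,500 steps. The sketch below shows one way such a run could be wired up with the `transformers` Seq2SeqTrainer; it is not the author's training script, and the toy pairs, column names, and output directory are placeholders.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

base_checkpoint = "google/mt5-large"  # base model named in the card
tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(base_checkpoint)

# Toy parallel data (placeholders): each Spanish-Nahuatl pair yields two
# prefixed examples, one per translation direction, as the Training line describes.
pairs = [{"es": "conejo", "nah": "tochtli"}, {"es": "flor", "nah": "xochitl"}]
examples = []
for p in pairs:
    examples.append({"source": "translate spanish to nahuatl: " + p["es"], "target": p["nah"]})
    examples.append({"source": "translate nahuatl to spanish: " + p["nah"], "target": p["es"]})

def tokenize(batch):
    # text_target tokenizes the labels with the same tokenizer (transformers >= 4.21)
    return tokenizer(batch["source"], text_target=batch["target"], truncation=True, max_length=128)

train_ds = Dataset.from_list(examples).map(tokenize, batched=True, remove_columns=["source", "target"])

args = Seq2SeqTrainingArguments(
    output_dir="mt5-large-es-nah",     # placeholder output path
    per_device_train_batch_size=4,     # batch size from the card
    learning_rate=1e-4,                # learning rate from the card
    max_steps=77_500,                  # training steps from the card
    logging_steps=500,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```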
@@ -86,9 +92,6 @@ The results are reported using CHRF++ and BLEU:
| True | Zero-shot | 5.24 | 25.7 |

## References
- - Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits
- of transfer learning with a unified Text-to-Text transformer.
-
- Ximena Gutierrez-Vasques, Gerardo Sierra, and Hernandez Isaac. 2016. Axolotl: a web accessible parallel corpus for Spanish-Nahuatl. In International Conference on Language Resources and Evaluation (LREC).

  - https://github.com/christos-c/bible-corpus
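
The hunk above keeps the evaluation table reported with CHRF++ and BLEU. For reference, a minimal sketch of computing both metrics with the `sacrebleu` package is shown below; the hypothesis and reference sentences are placeholders, not the card's evaluation data.

```python
import sacrebleu

# Placeholder system outputs and references; the card's evaluation data is not shown here.
hypotheses = ["miak xochitl istak", "tochtli"]
references = [["miak xochitl istak", "tochtli"]]  # one reference stream, parallel to hypotheses

# chrF++ is chrF with word n-grams enabled (word_order=2)
chrf = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)
bleu = sacrebleu.corpus_bleu(hypotheses, references)

print(f"CHRF++: {chrf.score:.2f}")
print(f"BLEU:   {bleu.score:.2f}")
```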
 