luisarmando committed on
Commit 1987119
1 Parent(s): 12811b8

Update README.md

Files changed (1)
  1. README.md +14 -12
README.md CHANGED
@@ -12,8 +12,8 @@ widget:
  - text: 'translate spanish to nahuatl: México lindo y querido.'
  ---

- # t5-small-spanish-nahuatl
- Nahuatl is the most widely spoken indigenous language in Mexico. However, training a neural network for the neural machine translation task is challenging due to the lack of structured data. The most popular datasets, such as the Axolot and bible-corpus, only consist of ~16,000 and ~7,000 samples, respectively. Moreover, there are multiple variants of Nahuatl, which makes this task even more difficult. For example, it is possible to find a single word from the Axolot dataset written in more than three different ways. Therefore, we leverage the T5 text-to-text prefix training strategy to compensate for the lack of data. We first train the multilingual model to learn Spanish and then adapt it to Nahuatl. The resulting T5 Transformer successfully translates short sentences. Finally, we report Chrf and BLEU results.


  ## Model description
@@ -38,7 +38,8 @@ outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

  ## Approach
  ### Dataset
- Since the Axolotl corpus contains misalignments, we select the best samples (12,207). We also use the [bible-corpus](https://github.com/christos-c/bible-corpus) (7,821).
 

  | Axolotl best aligned books |
  |:-----------------------------------------------------:|
@@ -57,10 +58,10 @@ Since the Axolotl corpus contains misalignments, we select the best samples (12,
  | Una tortillita nomás - Se taxkaltsin saj |
  | Vida económica de Tenochtitlan |

- Also, we collected 3,000 extra samples from the web to increase the data.

  ### Model and training
- We employ two training stages using a multilingual T5-small. The advantage of this model is that it can handle different vocabularies and prefixes. T5-small is pre-trained on different tasks and languages (French, Romanian, English, German).

  ### Training-stage 1 (learning Spanish)
  In training stage 1, we first introduce Spanish to the model. The goal is to learn a new language rich in data (Spanish) and not lose the previous knowledge. We use the English-Spanish [Anki](https://www.manythings.org/anki/) dataset, which consists of 118,964 text pairs. The model is trained until convergence, adding the prefix "Translate Spanish to English: ".
@@ -73,15 +74,16 @@ We train the models on the same datasets for 660k steps using batch size = 16 an


  ## Evaluation results
- We evaluate the models on the same 505 validation Nahuatl sentences for a fair comparison. Finally, we report the results using chrf and sacrebleu hugging face metrics:
-
- | English-Spanish pretraining | Validation loss | BLEU | Chrf |
- |:----------------------------:|:---------------:|:-----|-------:|
- | False | 1.34 | 6.17 | 26.96 |
- | True | 1.31 | 6.18 | 28.21 |


- The English-Spanish pretraining improves BLEU and Chrf and leads to faster convergence. The evaluation is available on the [eval.ipynb](https://github.com/milmor/spanish-nahuatl-translation/blob/main/eval.ipynb) notebook.
 
 
 
  ## References
  - Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits
 
  - text: 'translate spanish to nahuatl: México lindo y querido.'
  ---

+ # mt5-large-spanish-nahuatl
+ Nahuatl is the most widely spoken indigenous language in Mexico, yet training a neural network for machine translation presents significant challenges due to insufficient structured data. Popular datasets, such as Axolotl and the bible-corpus, contain only approximately 16,000 and 7,000 samples, respectively. Complicating matters further, Nahuatl has multiple dialects, and a single word in the Axolotl dataset can appear in over three different forms. Evaluations of the model's performance using the CHRF++ and BLEU metrics are reported at the end.


  ## Model description
 

  ## Approach
  ### Dataset
+ Since the Axolotl corpus contains misalignments, only the best-aligned samples were selected (12,207).
+ These were combined with the [bible-corpus](https://github.com/christos-c/bible-corpus) (7,821 samples).

  | Axolotl best aligned books |
  |:-----------------------------------------------------:|
 
  | Una tortillita nomás - Se taxkaltsin saj |
  | Vida económica de Tenochtitlan |

+ Additionally, 30,000 samples were collected from the web to augment the data.

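To make the data preparation concrete, below is a minimal sketch of how the selected Axolotl samples, the bible-corpus verses, and the web-collected pairs could be merged into one list of Spanish-Nahuatl training pairs. The file names and tab-separated layout are illustrative assumptions, not the repository's actual preprocessing script.

```python
import csv

def load_pairs(path):
    """Load tab-separated (Spanish, Nahuatl) sentence pairs, skipping malformed rows."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) == 2 and row[0].strip() and row[1].strip():
                pairs.append((row[0].strip(), row[1].strip()))
    return pairs

# Hypothetical file names; the real corpus files may be organized differently.
corpus = (
    load_pairs("axolotl_best_aligned.tsv")   # ~12,207 curated Axolotl samples
    + load_pairs("bible_corpus_es_nah.tsv")  # ~7,821 bible-corpus verses
    + load_pairs("web_collected.tsv")        # ~30,000 pairs collected from the web
)
print(f"{len(corpus):,} Spanish-Nahuatl pairs")
```
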
  ### Model and training
+ The method uses a single training stage with mT5. This model was chosen because it can handle different vocabularies and prefixes.

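As a quick illustration of that point, the snippet below loads an mT5 tokenizer with the Transformers auto classes and tokenizes a prefixed input: the shared SentencePiece vocabulary can segment both Spanish and Nahuatl text (both written in Latin script), and the task is selected simply by prepending a plain-text prefix. The `google/mt5-large` checkpoint name is assumed from the model name and is only a placeholder.

```python
from transformers import AutoTokenizer

# "google/mt5-large" is assumed from the model name; swap in the checkpoint actually used.
tokenizer = AutoTokenizer.from_pretrained("google/mt5-large")

# One model can serve several translation directions because the task is encoded
# as a plain-text prefix on the source sentence, not as a separate head.
example = "translate spanish to nahuatl: México lindo y querido."
print(tokenizer.tokenize(example))                     # SentencePiece subword pieces
print(len(tokenizer(example).input_ids), "input ids")  # includes the final </s>
```
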
  ### Training-stage 1 (learning Spanish)
  In training stage 1, we first introduce Spanish to the model. The goal is to learn a new language rich in data (Spanish) and not lose the previous knowledge. We use the English-Spanish [Anki](https://www.manythings.org/anki/) dataset, which consists of 118,964 text pairs. The model is trained until convergence, adding the prefix "Translate Spanish to English: ".
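A minimal sketch of how the stage-1 examples could be built, assuming the Anki download keeps its usual tab-separated layout (English first, then Spanish, then an attribution column); the file name, field order, and length limit are illustrative assumptions rather than the exact script used.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mt5-large")  # assumed base checkpoint
# The card's stage-1 prefix is "Translate Spanish to English: ", so Spanish is the source here.
PREFIX = "Translate Spanish to English: "

def make_example(spanish, english, max_len=128):
    """Turn one Anki pair into prefixed encoder inputs and decoder labels."""
    enc = tokenizer(PREFIX + spanish, truncation=True, max_length=max_len)
    enc["labels"] = tokenizer(text_target=english, truncation=True,
                              max_length=max_len)["input_ids"]
    return enc

# spa.txt from manythings.org is tab-separated: English, Spanish, attribution.
with open("spa.txt", encoding="utf-8") as f:
    pairs = [line.rstrip("\n").split("\t")[:2] for line in f if "\t" in line]

examples = [make_example(spanish=es, english=en) for en, es in pairs]
print(len(examples), "stage-1 training examples")
```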
 


  ## Evaluation results
+ The models are evaluated on two different datasets:
+ 1. First, on test sentences similar to those used for validation.
+ 2. Then, on zero-shot sentences obtained from the AmericasNLP 2021 test set.

+ The results are reported using CHRF++ and BLEU:

+ | Nahuatl-Spanish Bidirectional Training | Set | BLEU | CHRF++ |
+ |:--------------------------------------:|:---------:|:-----:|:------:|
+ | True | Test | 18.01 | 54.15 |
+ | True | Zero-shot | 5.24 | 25.7 |
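As a reference, here is a small sketch of how BLEU and CHRF++ scores of this kind can be computed with the Hugging Face `evaluate` wrappers around sacreBLEU and chrF; the sentences below are placeholders, not items from the actual test or zero-shot sets.

```python
import evaluate

bleu = evaluate.load("sacrebleu")
chrf = evaluate.load("chrf")

# Placeholder hypothesis/reference pair; the real evaluation runs over the
# held-out test set and the AmericasNLP 2021 zero-shot sentences.
predictions = ["nimitstlasojtla"]
references = [["nimitstlasohtla"]]

print(round(bleu.compute(predictions=predictions, references=references)["score"], 2))
# word_order=2 yields chrF++ (character n-grams plus word bigrams), as in the table above.
print(round(chrf.compute(predictions=predictions, references=references,
                         word_order=2)["score"], 2))
```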
 
  ## References
  - Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits