Nahuatl is the most widely spoken indigenous language in Mexico, yet training a neural machine translation model for it is difficult due to the lack of structured data. The most popular resources, the Axolotl parallel corpus and the bible-corpus, contain only ~16,000 and ~7,000 aligned samples respectively. Moreover, Nahuatl has multiple variants, which makes the task even harder: a single word from the Axolotl corpus can appear written in more than three different ways. In this work, we therefore leverage the T5 text-to-text prefix training strategy to compensate for the lack of data. We first teach the multilingual model Spanish using English, and then transition to Spanish-Nahuatl. The resulting model successfully translates short sentences from Spanish to Nahuatl. We report ChrF and BLEU results.
The model can be loaded and used for translation as follows:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained('hackathon-pln-es/t5-small-spanish-nahuatl')
tokenizer = AutoTokenizer.from_pretrained('hackathon-pln-es/t5-small-spanish-nahuatl')
model.eval()

sentence = 'muchas flores son blancas'
input_ids = tokenizer('translate Spanish to Nahuatl: ' + sentence, return_tensors='pt').input_ids
outputs = model.generate(input_ids)
outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
# outputs == ['miak xochitl istak']
```
Since the Axolotl corpus contains misalignments, we keep only the best-aligned samples (12,207 pairs), drawn from the books listed below; a filtering sketch follows the table. We also use the bible-corpus (7,821 samples).
|Axolotl best aligned books|
|---|
|Anales de Tlatelolco|
|Documentos nauas de la Ciudad de México del siglo XVI|
|Historia de México narrada en náhuatl y español|
|La tinta negra y roja (antología de poesía náhuatl)|
|Memorial Breve (Libro las ocho relaciones)|
|Método auto-didáctico náhuatl-español|
|Quinta Relación (Libro las ocho relaciones)|
|Recetario Nahua de Milpa Alta D.F|
|Testimonios de la antigua palabra|
|Trece Poetas del Mundo Azteca|
|Una tortillita nomás - Se taxkaltsin saj|
|Vida económica de Tenochtitlan|
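As an illustration, the following is a minimal sketch of how this selection could be applied, assuming the Axolotl pairs are available as a tab-separated file with the source book's title in a third column; the file name and column layout are assumptions, not the official distribution format.

```python
import csv

# Hypothetical layout: one aligned pair per row as "spanish<TAB>nahuatl<TAB>book".
# The file name and column order are assumptions for illustration only.
BEST_ALIGNED_BOOKS = {
    'Anales de Tlatelolco',
    'Documentos nauas de la Ciudad de México del siglo XVI',
    'Historia de México narrada en náhuatl y español',
    'La tinta negra y roja (antología de poesía náhuatl)',
    'Memorial Breve (Libro las ocho relaciones)',
    'Método auto-didáctico náhuatl-español',
    'Quinta Relación (Libro las ocho relaciones)',
    'Recetario Nahua de Milpa Alta D.F',
    'Testimonios de la antigua palabra',
    'Trece Poetas del Mundo Azteca',
    'Una tortillita nomás - Se taxkaltsin saj',
    'Vida económica de Tenochtitlan',
}

def load_axolotl_pairs(path='axolotl.tsv'):
    """Keep only the Spanish-Nahuatl pairs that come from the best aligned books."""
    pairs = []
    with open(path, encoding='utf-8') as f:
        for spanish, nahuatl, book in csv.reader(f, delimiter='\t'):
            if book in BEST_ALIGNED_BOOKS:
                pairs.append({'es': spanish.strip(), 'nah': nahuatl.strip()})
    return pairs
```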
Also, to increase the amount of data, we collected 3,000 extra samples from the web.
We employ two training stages using a multilingual T5-small. We use this model because it can handle different vocabularies and prefixes, and it is pre-trained on multiple tasks and languages (French, Romanian, English, German).
In training stage 1 we first introduce Spanish to the model. The goal is to teach the model a data-rich language (Spanish) without losing the knowledge it already acquired during pre-training. We use the English-Spanish Anki dataset, which consists of 118,964 text pairs, and train the model until convergence, adding the prefix "Translate Spanish to English: ".
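A minimal sketch of this first stage is shown below. It assumes the base checkpoint is `t5-small` and that the Anki dump is a tab-separated file (`spa.txt`) with one English-Spanish pair per line; both are assumptions for illustration, not necessarily the exact setup used.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Assumed base checkpoint and Anki file layout ("english<TAB>spanish<TAB>...").
BASE_CHECKPOINT = 't5-small'
PREFIX = 'Translate Spanish to English: '

tokenizer = AutoTokenizer.from_pretrained(BASE_CHECKPOINT)
model = AutoModelForSeq2SeqLM.from_pretrained(BASE_CHECKPOINT)

def build_stage1_examples(path='spa.txt'):
    """Turn the Anki pairs into prefixed text-to-text examples."""
    examples = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            english, spanish = line.rstrip('\n').split('\t')[:2]
            examples.append({'input': PREFIX + spanish, 'target': english})
    return examples

def tokenize(example, max_length=128):
    """Tokenize one prefixed example for seq2seq training."""
    model_inputs = tokenizer(example['input'], truncation=True, max_length=max_length)
    labels = tokenizer(text_target=example['target'], truncation=True, max_length=max_length)
    model_inputs['labels'] = labels['input_ids']
    return model_inputs
```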
We then use the pre-trained Spanish-English model to learn Spanish-Nahuatl. Since the number of Nahuatl pairs is limited, we also add 20,000 samples from the English-Spanish Anki dataset to our training set. This two-task training avoids overfitting and makes the model more robust.
We train the models on the same datasets for 660k steps with a batch size of 16 and a learning rate of 2e-5.
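The sketch below puts the second stage together, reusing the `load_axolotl_pairs`, `build_stage1_examples`, and `tokenize` helpers from the sketches above and the hyperparameters reported here (batch size 16, learning rate 2e-5, 660k steps); the checkpoint path, output directory, and shuffling are illustrative, not the authors' exact code.

```python
import random
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

# Start from the stage-1 (Spanish-English) checkpoint; the local path is illustrative.
tokenizer = AutoTokenizer.from_pretrained('./stage1-checkpoint')
model = AutoModelForSeq2SeqLM.from_pretrained('./stage1-checkpoint')

# Mix the Spanish-Nahuatl pairs with 20,000 English-Spanish Anki pairs (two-task training).
nahuatl = [{'input': 'translate Spanish to Nahuatl: ' + p['es'], 'target': p['nah']}
           for p in load_axolotl_pairs()]
spanish = random.sample(build_stage1_examples(), 20_000)
train_examples = nahuatl + spanish
random.shuffle(train_examples)
train_dataset = [tokenize(e) for e in train_examples]

args = Seq2SeqTrainingArguments(
    output_dir='t5-small-spanish-nahuatl',
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    max_steps=660_000,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```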
For a fair comparison, both models are evaluated on the same 505 validation Nahuatl sentences. We report results using the Hugging Face `chrf` and `sacrebleu` metrics:
|English-Spanish pretraining|Validation loss|BLEU|ChrF|
|---|---|---|---|
English-Spanish pretraining improves BLEU and ChrF and leads to faster convergence. The evaluation can be reproduced with the eval.ipynb notebook; a minimal sketch is shown below.
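For reference, this is a minimal sketch of the evaluation using the `evaluate` library; the validation lists shown are placeholders for the 505 validation pairs used in eval.ipynb.

```python
import evaluate
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

chrf = evaluate.load('chrf')
sacrebleu = evaluate.load('sacrebleu')

model = AutoModelForSeq2SeqLM.from_pretrained('hackathon-pln-es/t5-small-spanish-nahuatl')
tokenizer = AutoTokenizer.from_pretrained('hackathon-pln-es/t5-small-spanish-nahuatl')
model.eval()

# Placeholders: replace with the 505 Spanish-Nahuatl validation pairs from eval.ipynb.
val_spanish = ['muchas flores son blancas']
val_nahuatl = ['miak xochitl istak']

def translate(sentence):
    """Translate one Spanish sentence to Nahuatl with the prefixed T5 model."""
    input_ids = tokenizer('translate Spanish to Nahuatl: ' + sentence,
                          return_tensors='pt').input_ids
    with torch.no_grad():
        output_ids = model.generate(input_ids, max_new_tokens=64)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]

predictions = [translate(s) for s in val_spanish]
references = [[r] for r in val_nahuatl]

print(chrf.compute(predictions=predictions, references=references))
print(sacrebleu.compute(predictions=predictions, references=references))
```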
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.
Ximena Gutierrez-Vasques, Gerardo Sierra, and Isaac Hernandez. 2016. Axolotl: a Web Accessible Parallel Corpus for Spanish-Nahuatl. In International Conference on Language Resources and Evaluation (LREC).