---
language:
- es
- nah
- multilingual
license: mpl-2.0
tags:
- translation
- PyTorch
- Safetensors
widget:
- text: 'translate spanish to nahuatl: México lindo y querido.'
---

# mt5-large-spanish-nahuatl

Nahuatl is the most widely spoken indigenous language in Mexico, yet training a neural network for machine translation presents significant challenges due to insufficient structured data. Popular parallel datasets, such as Axolotl and the Bible corpus, contain only approximately 16,000 and 7,000 samples, respectively. Complicating matters further, Nahuatl has multiple dialects, and a single word in the Axolotl dataset can appear in more than three different written forms. We conclude with an evaluation of the model's performance using the CHRF++ and BLEU metrics.

## Model description

This model is an mT5 Transformer ([mt5-large](https://huggingface.co/google/mt5-large)) fine-tuned on Spanish and Nahuatl sentences collected from a variety of online sources. The dataset is normalized using the 'inali' normalization from [py-elotl](https://github.com/ElotlMX/py-elotl).

## Usage

```python
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained('hackathon-pln-es/t5-small-spanish-nahuatl')
tokenizer = AutoTokenizer.from_pretrained('hackathon-pln-es/t5-small-spanish-nahuatl')
model.eval()

sentence = 'muchas flores son blancas'
input_ids = tokenizer('translate Spanish to Nahuatl: ' + sentence, return_tensors='pt').input_ids
outputs = model.generate(input_ids)
# outputs = miak xochitl istak
outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
```

## Approach

### Dataset

Since the Axolotl corpus contains misalignments, only the best-aligned samples were selected (12,207 sentence pairs). These were combined with the [bible-corpus](https://github.com/christos-c/bible-corpus) (7,821 pairs).

| Axolotl best aligned books                             |
|:------------------------------------------------------:|
| Anales de Tlatelolco                                   |
| Diario                                                 |
| Documentos nauas de la Ciudad de México del siglo XVI  |
| Historia de México narrada en náhuatl y español        |
| La tinta negra y roja (antología de poesía náhuatl)    |
| Memorial Breve (Libro las ocho relaciones)             |
| Método auto-didáctico náhuatl-español                  |
| Nican Mopohua                                          |
| Quinta Relación (Libro las ocho relaciones)            |
| Recetario Nahua de Milpa Alta D.F                      |
| Testimonios de la antigua palabra                      |
| Trece Poetas del Mundo Azteca                          |
| Una tortillita nomás - Se taxkaltsin saj               |
| Vida económica de Tenochtitlan                         |

In addition, 30,000 more samples were collected from the web to augment the data.

### Model and training

The model is trained in two stages, using mT5 ([mt5-large](https://huggingface.co/google/mt5-large)) as the base model. mT5 was chosen because it can handle different vocabularies and task prefixes.

### Training-stage 1 (learning Spanish)

In training stage 1, we first introduce Spanish to the model. The goal is for the model to learn a data-rich new language (Spanish) without losing its previously acquired knowledge. We use the English-Spanish [Anki](https://www.manythings.org/anki/) dataset, which consists of 118,964 text pairs. The model is trained to convergence, adding the prefix "Translate Spanish to English: ".

### Training-stage 2 (learning Nahuatl)

We use the pre-trained Spanish-English model to learn Spanish-Nahuatl. Since the number of available Nahuatl pairs is limited, we also add 20,000 samples from the English-Spanish Anki dataset. This two-task training avoids overfitting and makes the model more robust.

### Training setup

We train the models on the same datasets for 660k steps, using a batch size of 16 and a learning rate of 2e-5.
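
As a rough illustration of this setup, the sketch below wires the translation prefix, the batch size of 16, and the 2e-5 learning rate into a `Seq2SeqTrainer` fine-tuning loop with recent versions of Hugging Face Transformers. The single in-memory example pair, column names, sequence lengths, and output directory are placeholders, not the actual training data or configuration.

```python
# Minimal fine-tuning sketch (illustrative only). The single example pair,
# column names, and sequence lengths are placeholders; batch size, learning
# rate, and total steps follow the values reported above.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = 'google/mt5-large'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Placeholder parallel data; the real corpus mixes Spanish-Nahuatl pairs with
# English-Spanish Anki pairs, each prepended with its own task prefix.
pairs = Dataset.from_dict({
    'es': ['muchas flores son blancas'],
    'nah': ['miak xochitl istak'],
})

def preprocess(batch):
    inputs = ['translate Spanish to Nahuatl: ' + s for s in batch['es']]
    model_inputs = tokenizer(inputs, max_length=128, truncation=True)
    labels = tokenizer(text_target=batch['nah'], max_length=128, truncation=True)
    model_inputs['labels'] = labels['input_ids']
    return model_inputs

tokenized = pairs.map(preprocess, batched=True, remove_columns=['es', 'nah'])

args = Seq2SeqTrainingArguments(
    output_dir='mt5-large-spanish-nahuatl',  # placeholder output path
    per_device_train_batch_size=16,          # batch size from the card
    learning_rate=2e-5,                      # learning rate from the card
    max_steps=660_000,                       # total steps from the card
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```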
## Evaluation results

The models are evaluated on two different test sets:

1. A test set of sentences similar to those used for validation.
2. Zero-shot sentences obtained from the AmericasNLP 2021 test set.

The results are reported using CHRF++ and BLEU:

| Nahuatl-Spanish Bidirectional Training | Set       |  BLEU | CHRF++ |
|:--------------------------------------:|:---------:|------:|-------:|
| True                                   | Test      | 18.01 |  54.15 |
| True                                   | Zero-shot |  5.24 |   25.7 |

## References

- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text Transformer.
- Ximena Gutierrez-Vasques, Gerardo Sierra, and Isaac Hernandez. 2016. Axolotl: a web accessible parallel corpus for Spanish-Nahuatl. In International Conference on Language Resources and Evaluation (LREC).
- https://github.com/christos-c/bible-corpus
- https://github.com/ElotlMX/py-elotl
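
For reference, CHRF++ and BLEU scores like those in the evaluation table can be computed with the sacrebleu library, roughly as in the sketch below; the hypothesis and reference sentences are placeholders rather than the actual test data.

```python
# Sketch of scoring with sacrebleu; the sentences below are placeholders,
# not the actual test or zero-shot sets used for the reported results.
from sacrebleu.metrics import BLEU, CHRF

hypotheses = ['miak xochitl istak']    # model outputs (placeholder)
references = [['miak xochitl istak']]  # one list of gold translations (placeholder)

bleu = BLEU()
chrf = CHRF(word_order=2)  # word_order=2 corresponds to chrF++

print(bleu.corpus_score(hypotheses, references))
print(chrf.corpus_score(hypotheses, references))
```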