|
--- |
|
language: |
|
- es |
|
- nah |
|
- multilingual |
|
license: mpl-2.0 |
|
tags: |
|
- translation |
|
- PyTorch |
|
widget: |
|
- text: 'translate nahuatl to spanish: Nimitstlasohkamate' |
|
- text: 'translate spanish to nahuatl: agua' |
|
--- |
|
|
|
# mt5-large-spanish-nahuatl |
|
Nahuatl is a polysynthetic macro-language from Mexico comprising 30 variants and spoken by over 1.6 million native speakers, making it the most widely spoken native language in the country.
|
|
|
However, training a neural network for machine translation presents significant challenges due to the scarcity of structured data.

Popular datasets, such as the Axolotl and Bible corpora, contain only approximately 16,000 and 7,000 samples, respectively.

The linguistic diversity of Nahuatl complicates matters further: a single word in the Axolotl dataset can appear in more than three different written forms.
|
|
|
## Inference API use |
|
You can translate in either direction (Spanish to Nahuatl or vice versa) by prefixing the input accordingly:
|
|
|
``` |
|
translate spanish to nahuatl: agua |
|
# atl |
|
``` |
|
or |
|
``` |
|
translate nahuatl to spanish: Nimitstlasohkamate
|
# gracias |
|
``` |
|
|
|
## Model description |
|
This model is an MT5 Transformer ([mt5-large](https://huggingface.co/google/mt5-large)) fine-tuned on Spanish and Nahuatl sentences collected from diverse places online. The dataset is normalized using 'inali' normalization from [py-elotl](https://github.com/ElotlMX/py-elotl). |
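To give a rough idea of what INALI-style normalization does, here is a simplified sketch of a few common orthographic substitutions. Note this is an approximation for illustration only: the actual rules live in py-elotl and handle many more cases and contexts than the list below.

```python
# A few simplified INALI-style orthographic substitutions (approximate;
# the real normalizer in py-elotl is much more complete).
RULES = [
    ("tz", "ts"),   # tz -> ts
    ("qu", "k"),    # qu -> k
    ("hu", "w"),    # hu -> w
    ("z", "s"),     # z  -> s
    ("ce", "se"),   # c before e -> s
    ("ci", "si"),   # c before i -> s
    ("c", "k"),     # remaining c -> k
]

def normalize_inali_sketch(text: str) -> str:
    """Apply the simplified substitutions in order (lowercase input assumed)."""
    for old, new in RULES:
        text = text.replace(old, new)
    return text

print(normalize_inali_sketch("tlazohcamati"))  # -> tlasohkamati
```

Normalizing the corpus this way collapses several spellings of the same word into one form, which reduces the effective vocabulary the model must learn.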
|
|
|
## Usage |
|
```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForSeq2SeqLM.from_pretrained('luisarmando/mt5-large-es-nah').to(device)
tokenizer = AutoTokenizer.from_pretrained('luisarmando/mt5-large-es-nah')

model.eval()

# Translate Spanish to Nahuatl
input_ids = tokenizer("translate spanish to nahuatl: conejo", return_tensors="pt").input_ids
outputs = model.generate(input_ids.to(device))
tokenizer.batch_decode(outputs, skip_special_tokens=True)
# outputs = tochtli

# Translate Nahuatl to Spanish
input_ids = tokenizer("translate nahuatl to spanish: xochitl", return_tensors="pt").input_ids
outputs = model.generate(input_ids.to(device))
tokenizer.batch_decode(outputs, skip_special_tokens=True)
# outputs = flor
```
|
|
|
## Approach |
|
### Dataset |
|
Since the Axolotl corpus contains misalignments, only the best-aligned samples were selected (12,207).

These were combined with the [bible-corpus](https://github.com/christos-c/bible-corpus) (7,821 samples).
|
|
|
| Axolotl best aligned books | |
|
|:-----------------------------------------------------:| |
|
| Anales de Tlatelolco | |
|
| Diario | |
|
| Documentos nauas de la Ciudad de México del siglo XVI | |
|
| Historia de México narrada en náhuatl y español | |
|
| La tinta negra y roja (antología de poesía náhuatl) | |
|
| Memorial Breve (Libro las ocho relaciones) | |
|
| Método auto-didáctico náhuatl-español | |
|
| Nican Mopohua | |
|
| Quinta Relación (Libro las ocho relaciones) | |
|
| Recetario Nahua de Milpa Alta D.F | |
|
| Testimonios de la antigua palabra | |
|
| Trece Poetas del Mundo Azteca | |
|
| Una tortillita nomás - Se taxkaltsin saj | |
|
| Vida económica de Tenochtitlan | |
|
|
|
In addition, 30,000 more samples were collected from the web to augment the data.
|
|
|
### Model and training |
|
The method uses a single training stage built on mT5, chosen because it can handle multilingual vocabularies and task prefixes.
|
|
|
### Training |
|
The model is trained bidirectionally until convergence, prepending the prefixes "translate spanish to nahuatl: " and "translate nahuatl to spanish: " to the source sentences.
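The bidirectional pair construction described above can be sketched as follows. The helper name and input format are illustrative, not the actual training code:

```python
def build_bidirectional_pairs(es_sentences, nah_sentences):
    """Create source/target pairs in both directions, with task prefixes.

    Each aligned (Spanish, Nahuatl) sentence pair yields two training
    examples, one per translation direction.
    """
    pairs = []
    for es, nah in zip(es_sentences, nah_sentences):
        pairs.append(("translate spanish to nahuatl: " + es, nah))
        pairs.append(("translate nahuatl to spanish: " + nah, es))
    return pairs

examples = build_bidirectional_pairs(["agua"], ["atl"])
# [('translate spanish to nahuatl: agua', 'atl'),
#  ('translate nahuatl to spanish: atl', 'agua')]
```

Doubling the data this way lets a single model serve both translation directions, with the prefix telling it which one to perform.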
|
This is an evolution and improvement of the [previous model](https://huggingface.co/hackathon-pln-es/t5-small-spanish-nahuatl) I collaborated on. |
|
|
|
### Training setup |
|
The model is trained on this dataset for 77,500 steps with a batch size of 4 and a learning rate of 1e-4.
|
|
|
## Evaluation results |
|
The model is evaluated on 2 different sets:

1. A held-out test set drawn from the same sources as the training data.

2. Zero-shot sentences taken from the AmericasNLP 2021 test set.
|
|
|
The results are reported using CHRF++ and BLEU: |
|
|
|
| Nahuatl-Spanish Bidirectional Training | Set | BLEU | CHRF++ | |
|
|:----------------------------:|:---------------:|:-----|-------:| |
|
| True | Test | 18.01 | 54.15 | |
|
| True | Zero-shot | 5.24 | 25.7 | |
|
|
|
## References |
|
- https://github.com/christos-c/bible-corpus |
|
|
|
- https://github.com/ElotlMX/py-elotl |