---
language:
  - es
  - nah
  - multilingual
license: mpl-2.0
tags:
  - translation
  - PyTorch
  - Safetensors
widget:
  - text: 'translate nahuatl to spanish: Nimitstlazohkamate'
---

# mt5-large-spanish-nahuatl

Nahuatl is the most widely spoken indigenous language in Mexico, yet training a neural network for machine translation is challenging because structured data is scarce. Popular datasets, such as the Axolotl and Bible corpora, contain only approximately 16,000 and 7,000 samples, respectively. Complicating matters further, Nahuatl has multiple dialects, and a single word in the Axolotl dataset can appear in more than three different written forms. The model's performance is evaluated at the end of this card using the CHRF++ and BLEU metrics.

## Model description

This model is an mT5 transformer (mt5-large) fine-tuned on Spanish and Nahuatl sentences collected from diverse online sources. The dataset was normalized with the 'inali' normalization scheme from py-elotl.
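
Normalization can be reproduced with py-elotl. The snippet below is a minimal sketch, assuming py-elotl's `Normalizer` interface and the `inali` scheme; the input word is only illustrative.

```python
# Minimal sketch: normalizing Nahuatl text to the "inali" orthography with py-elotl.
# Assumes `pip install elotl` and the Normalizer API exposed by the package.
import elotl.nahuatl.orthography

normalizer = elotl.nahuatl.orthography.Normalizer("inali")  # "inali" normalization scheme
print(normalizer.normalize("xochitl"))  # prints the word in the normalized orthography
```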

## Inference API use

You can translate in either direction (Spanish to Nahuatl or Nahuatl to Spanish) by prefixing the input accordingly:

`translate spanish to nahuatl: Quiero agua`

or

`translate nahuatl to spanish: Nimitstlazohkamate`

## Usage

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the fine-tuned model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained('luisarmando/mt5-large-es-nah')
tokenizer = AutoTokenizer.from_pretrained('luisarmando/mt5-large-es-nah')

# Move the model to the GPU if available and switch to inference mode
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

# Translate Spanish to Nahuatl
input_ids = tokenizer("translate spanish to nahuatl: conejo", return_tensors="pt").input_ids
outputs = model.generate(input_ids.to(device))
tokenizer.batch_decode(outputs, skip_special_tokens=True)
# output: tochtli

# Translate Nahuatl to Spanish
input_ids = tokenizer("translate nahuatl to spanish: xochitl", return_tensors="pt").input_ids
outputs = model.generate(input_ids.to(device))
tokenizer.batch_decode(outputs, skip_special_tokens=True)
# output: flor
```

## Approach

### Dataset

Since the Axolotl corpus contains misalignments, only the best-aligned samples were selected (12,207). These were combined with the bible-corpus (7,821 samples).

**Axolotl best-aligned books:**

- Anales de Tlatelolco
- Diario
- Documentos nauas de la Ciudad de México del siglo XVI
- Historia de México narrada en náhuatl y español
- La tinta negra y roja (antología de poesía náhuatl)
- Memorial Breve (Libro las ocho relaciones)
- Método auto-didáctico náhuatl-español
- Nican Mopohua
- Quinta Relación (Libro las ocho relaciones)
- Recetario Nahua de Milpa Alta D.F
- Testimonios de la antigua palabra
- Trece Poetas del Mundo Azteca
- Una tortillita nomás - Se taxkaltsin saj
- Vida económica de Tenochtitlan

In addition, around 30,000 samples collected from the web were added to enrich the data.

### Model and training

The method uses a single training stage built on mT5 (mt5-large). This model was chosen because it can handle multilingual vocabularies and task prefixes.

### Training

The model is trained bidirectionally until convergence, prepending the prefixes "translate spanish to nahuatl: " and "translate nahuatl to spanish: " to the source text. This is intended as an improvement over previous models.
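
As a rough illustration of this setup, the sketch below builds one training example per direction from each parallel pair; the pairs and field names are illustrative, not the actual preprocessing code.

```python
# Minimal sketch: turning parallel Spanish-Nahuatl pairs into bidirectional
# prefixed training examples (pairs borrowed from the usage examples above).
pairs = [
    ("conejo", "tochtli"),
    ("flor", "xochitl"),
]

examples = []
for es, nah in pairs:
    examples.append({"input": f"translate spanish to nahuatl: {es}", "target": nah})
    examples.append({"input": f"translate nahuatl to spanish: {nah}", "target": es})
```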

### Training setup

The model is trained on the same dataset for 77,500 steps with a batch size of 4 and a learning rate of 1e-4.
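
A minimal sketch of this configuration with the Hugging Face `Seq2SeqTrainer` is shown below; `tokenized_train` stands in for a hypothetical pre-tokenized version of the prefixed examples and is not defined here.

```python
# Minimal sketch of the reported setup: batch size 4, learning rate 1e-4, 77,500 steps.
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("google/mt5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-large")

training_args = Seq2SeqTrainingArguments(
    output_dir="mt5-large-es-nah",
    per_device_train_batch_size=4,   # batch size = 4
    learning_rate=1e-4,              # learning rate = 1e-4
    max_steps=77500,                 # total training steps
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    train_dataset=tokenized_train,   # hypothetical pre-tokenized dataset
)
trainer.train()
```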

## Evaluation results

The model is evaluated on two different datasets:

  1. First, on held-out test sentences similar to the validation ones.
  2. Then, on zero-shot sentences obtained from the test set of AmericasNLP 2021.

The results are reported using CHRF++ and BLEU:

| Nahuatl-Spanish Bidirectional Training | Set       | BLEU  | CHRF++ |
|----------------------------------------|-----------|-------|--------|
| True                                   | Test      | 18.01 | 54.15  |
| True                                   | Zero-shot | 5.24  | 25.7   |
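
Both metrics can be computed with sacrebleu. The snippet below is a minimal sketch assuming sacrebleu's corpus-level API; the hypothesis and reference strings are only illustrative.

```python
# Minimal sketch: scoring translations with BLEU and chrF++ using sacrebleu.
import sacrebleu

hypotheses = ["flor"]     # model outputs (illustrative)
references = [["flor"]]   # one list of reference translations per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)  # word_order=2 -> chrF++
print(bleu.score, chrf.score)
```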

## References