|
--- |
|
language: |
|
- es |
|
- nah |
|
- multilingual |
|
license: mpl-2.0 |
|
tags: |
|
- translation |
|
- PyTorch |
|
widget: |
|
- text: 'translate nahuatl to spanish: Nimitstlasohkamate' |
|
- text: 'translate spanish to nahuatl: agua' |
|
--- |
|
|
|
# mt5-large-spanish-nahuatl |
|
Nahuatl is a polysynthetic macro-language from Mexico comprising 30 variants and spoken by over 1.6 million native speakers, making it the most widely spoken native language in the country.
|
|
|
However, training a neural network for machine translation presents significant challenges due to the scarcity of structured data.

Popular datasets, such as the Axolotl and Bible corpora, contain only approximately 16,000 and 7,000 samples, respectively.

The linguistic diversity of Nahuatl complicates matters further: a single word in the Axolotl dataset can appear in more than three different written forms.
|
|
|
## Inference API use |
|
You can translate in either direction (Spanish to Nahuatl or vice versa) by prefixing the input accordingly:
|
|
|
``` |
|
translate spanish to nahuatl: agua |
|
# atl |
|
``` |
|
or |
|
``` |
|
translate nahuatl to spanish: Nimitstlasohkamate
|
# gracias |
|
``` |
|
|
|
## Model description |
|
This model is an MT5 Transformer ([mt5-large](https://huggingface.co/google/mt5-large)) fine-tuned on Spanish and Nahuatl sentences collected from diverse places online. The dataset is normalized using 'inali' normalization from [py-elotl](https://github.com/ElotlMX/py-elotl). |
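To give a rough idea of what INALI-style normalization does, here is a simplified sketch of a few common orthographic substitutions. Note this is an approximation for illustration only: the actual rules live in py-elotl and handle many more cases and contexts than the list below.

```python
# A few simplified INALI-style orthographic substitutions (approximate;
# the real normalizer in py-elotl is much more complete).
RULES = [
    ("tz", "ts"),   # tz -> ts
    ("qu", "k"),    # qu -> k
    ("hu", "w"),    # hu -> w
    ("z", "s"),     # z  -> s
    ("ce", "se"),   # c before e -> s
    ("ci", "si"),   # c before i -> s
    ("c", "k"),     # remaining c -> k
]

def normalize_inali_sketch(text: str) -> str:
    """Apply the simplified substitutions in order (lowercase input assumed)."""
    for old, new in RULES:
        text = text.replace(old, new)
    return text

print(normalize_inali_sketch("tlazohcamati"))  # -> tlasohkamati
```

Normalizing the corpus this way collapses several spellings of the same word into one form, which reduces the effective vocabulary the model must learn.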
|
|
|
## Usage |
|
```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForSeq2SeqLM.from_pretrained('luisarmando/mt5-large-es-nah').to(device)
tokenizer = AutoTokenizer.from_pretrained('luisarmando/mt5-large-es-nah')

model.eval()

# Translate Spanish to Nahuatl
input_ids = tokenizer("translate spanish to nahuatl: conejo", return_tensors="pt").input_ids
outputs = model.generate(input_ids.to(device))
tokenizer.batch_decode(outputs, skip_special_tokens=True)
# outputs = tochtli

# Translate Nahuatl to Spanish
input_ids = tokenizer("translate nahuatl to spanish: xochitl", return_tensors="pt").input_ids
outputs = model.generate(input_ids.to(device))
tokenizer.batch_decode(outputs, skip_special_tokens=True)
# outputs = flor
```
|
|
|
## Approach |
|
### Dataset |
|
Since the Axolotl corpus contains misalignments, only the best-aligned samples were selected (12,207).

These were combined with the [bible-corpus](https://github.com/christos-c/bible-corpus) (7,821 samples).
|
|
|
| Axolotl best aligned books | |
|
|:-----------------------------------------------------:| |
|
| Anales de Tlatelolco | |
|
| Diario | |
|
| Documentos nauas de la Ciudad de México del siglo XVI | |
|
| Historia de México narrada en náhuatl y español | |
|
| La tinta negra y roja (antología de poesía náhuatl) | |
|
| Memorial Breve (Libro las ocho relaciones) | |
|
| Método auto-didáctico náhuatl-español | |
|
| Nican Mopohua | |
|
| Quinta Relación (Libro las ocho relaciones) | |
|
| Recetario Nahua de Milpa Alta D.F | |
|
| Testimonios de la antigua palabra | |
|
| Trece Poetas del Mundo Azteca | |
|
| Una tortillita nomás - Se taxkaltsin saj | |
|
| Vida económica de Tenochtitlan | |
|
|
|
In addition, 30,000 more samples were collected from the web to augment the data.
|
|
|
### Model and training |
|
The method uses a single training stage built on mT5, chosen because it can handle multilingual vocabularies and task prefixes.
|
|
|
### Training |
|
The model is trained bidirectionally until convergence, prepending the prefixes "translate spanish to nahuatl: " and "translate nahuatl to spanish: " to the source sentences.
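The bidirectional pair construction described above can be sketched as follows. The helper name and input format are illustrative, not the actual training code:

```python
def build_bidirectional_pairs(es_sentences, nah_sentences):
    """Create source/target pairs in both directions, with task prefixes.

    Each aligned (Spanish, Nahuatl) sentence pair yields two training
    examples, one per translation direction.
    """
    pairs = []
    for es, nah in zip(es_sentences, nah_sentences):
        pairs.append(("translate spanish to nahuatl: " + es, nah))
        pairs.append(("translate nahuatl to spanish: " + nah, es))
    return pairs

examples = build_bidirectional_pairs(["agua"], ["atl"])
# [('translate spanish to nahuatl: agua', 'atl'),
#  ('translate nahuatl to spanish: atl', 'agua')]
```

Doubling the data this way lets a single model serve both translation directions, with the prefix telling it which one to perform.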
|
This is an evolution and improvement of the [previous model](https://huggingface.co/hackathon-pln-es/t5-small-spanish-nahuatl) I collaborated on. |
|
|
|
### Training setup |
|
The model is trained on this dataset for 77,500 steps with a batch size of 4 and a learning rate of 1e-4.
|
|
|
## Evaluation results |
|
The model is evaluated on 2 different sets:

1. A held-out test set drawn from the same sources as the training data.

2. Zero-shot sentences taken from the AmericasNLP 2021 test set.
|
|
|
The results are reported using CHRF++ and BLEU: |
|
|
|
| Nahuatl-Spanish Bidirectional Training | Set | BLEU | CHRF++ | |
|
|:----------------------------:|:---------------:|:-----|-------:| |
|
| True | Test | 18.01 | 54.15 | |
|
| True | Zero-shot | 5.24 | 25.7 | |
|
|
|
## References |
|
- https://github.com/christos-c/bible-corpus |
|
|
|
- https://github.com/ElotlMX/py-elotl |