IndT5: A Text-to-Text Transformer for 10 Indigenous Languages
In this work, we introduce IndT5, the first Transformer language model for Indigenous languages. To train IndT5, we build IndCorpu, a new corpus for 10 Indigenous languages and Spanish.
IndT5
We train an Indigenous language model adopting the unified and flexible text-to-text transfer Transformer (T5) approach. T5 treats every text-based language task as a “text-to-text" problem, taking text format as input and producing new text format as output. T5 is essentially an encoder-decoder Transformer, with the encoder and decoder similar in configuration and size to a BERTBase but with some architectural modifications. Modifications include applying a normalization layer before a sub-block and adding a pre-norm (i.e., initial input to the sub-block output).
IndCourpus
We build IndCorpus, a collection of 10 Indigeous languages and Spanish comprising 1.17GB of text, from both Wikipedia and the Bible.
Data size and number of sentences in monolingual dataset (collected from Wikipedia and Bible)
Target Language | Wiki Size (MB) | Wiki #Sentences | Bible Size (MB) | Bible #Sentences |
---|---|---|---|---|
Hñähñu | - | - | 1.4 | 7.5K |
Wixarika | - | - | 1.3 | 7.5K |
Nahuatl | 5.8 | 61.1K | 1.5 | 7.5K |
Guarani | 3.7 | 28.2K | 1.3 | 7.5K |
Bribri | - | - | 1.5 | 7.5K |
Rarámuri | - | - | 1.9 | 7.5K |
Quechua | 5.9 | 97.3K | 4.9 | 31.1K |
Aymara | 1.7 | 32.9K | 5 | 30.7K |
Shipibo-Konibo | - | - | 1 | 7.9K |
Asháninka | - | - | 1.4 | 7.8K |
Spanish | 1.13K | 5M | - | - |
Total | 1.15K | 5.22M | 19.8 | 125.3K |
Github
More details about our model can be found here: https://github.com/UBC-NLP/IndT5
BibTex
@inproceedings{nagoudi-etal-2021-indt5,
title = "{I}nd{T}5: A Text-to-Text Transformer for 10 Indigenous Languages",
author = "Nagoudi, El Moatez Billah and
Chen, Wei-Rui and
Abdul-Mageed, Muhammad and
Cavusoglu, Hasan",
booktitle = "Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas",
month = jun,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.americasnlp-1.30",
doi = "10.18653/v1/2021.americasnlp-1.30",
pages = "265--271"
}