	# IndT5: A Text-to-Text Transformer for 10 Indigenous Languages

	<img src="https://huggingface.co/UBC-NLP/IndT5/raw/main/IND_langs_large7.png" alt="drawing" width="45%" height="45%" align="right"/>
	In this work, we introduce IndT5, the first Transformer language model for Indigenous languages. To train IndT5, we build IndCorpus, a new corpus for 10 Indigenous languages and Spanish.



	# IndT5

	We train an Indigenous language model using the unified and flexible
	text-to-text transfer Transformer (T5) approach. T5 treats every
	text-based language task as a “text-to-text” problem, taking text
	as input and producing new text as output. T5 is essentially an
	encoder-decoder Transformer, with the encoder and decoder each similar in
	configuration and size to BERT<sub>Base</sub> but with some
	architectural modifications. These include applying layer
	normalization before each sub-block (pre-norm) and then adding the
	sub-block's initial input to its output (a residual connection).
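
The pre-norm arrangement described above can be sketched in a few lines of plain Python. This is only an illustration, not the IndT5 implementation: the function names are hypothetical, the normalization is a simplified rescale-only variant with no learned weights, and `toy_sublayer` stands in for a real attention or feed-forward layer.

```python
import math
from typing import Callable, List

Vector = List[float]


def rms_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Rescale x by its root-mean-square (no mean subtraction, no bias)."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in x]


def pre_norm_sub_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Normalize the input, apply the sublayer, then add the original input back."""
    y = sublayer(rms_norm(x))
    return [xi + yi for xi, yi in zip(x, y)]


def toy_sublayer(h: Vector) -> Vector:
    """Hypothetical stand-in for attention or feed-forward: halve every value."""
    return [0.5 * v for v in h]


out = pre_norm_sub_block([3.0, 4.0], toy_sublayer)
print(out)
```

Because the original input is added back unchanged, a sublayer that outputs zeros leaves the input untouched, which is the property the residual connection is meant to provide.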

	# IndCorpus

	We build IndCorpus, a collection of 10 Indigenous languages and Spanish comprising 1.17GB of text, from both Wikipedia and the Bible.

	### Data size and number of sentences in monolingual dataset (collected from Wikipedia and Bible)
	| Target Language | Wiki Size (MB) | Wiki #Sentences | Bible Size (MB) | Bible #Sentences |
	|-----------------|----------------|-----------------|-----------------|------------------|
	| Hñähñu | - | - | 1.4 | 7.5K |
	| Wixarika | - | - | 1.3 | 7.5K |
	| Nahuatl | 5.8 | 61.1K | 1.5 | 7.5K |
	| Guarani | 3.7 | 28.2K | 1.3 | 7.5K |
	| Bribri | - | - | 1.5 | 7.5K |
	| Rarámuri | - | - | 1.9 | 7.5K |
	| Quechua | 5.9 | 97.3K | 4.9 | 31.1K |
	| Aymara | 1.7 | 32.9K | 5 | 30.7K |
	| Shipibo-Konibo | - | - | 1 | 7.9K |
	| Asháninka | - | - | 1.4 | 7.8K |
	| Spanish | 1.13K | 5M | - | - |
	| Total | 1.15K | 5.22M | 19.8 | 125.3K |
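
As a quick sanity check, the 1.17GB figure quoted for IndCorpus can be reproduced from the table's Total row. The sketch below assumes that "K" in the size columns means 1,000 MB and that 1 GB is 1,000 MB; the variable and function names are illustrative only.

```python
# Totals copied from the "Total" row of the table.
WIKI_TOTAL_MB = 1.15 * 1000   # "1.15K" MB, assuming K means 1,000
BIBLE_TOTAL_MB = 19.8

# Wikipedia sentence counts per language, in thousands (5M for Spanish).
WIKI_SENTENCES_K = [61.1, 28.2, 97.3, 32.9, 5000.0]


def corpus_size_gb(wiki_mb: float, bible_mb: float) -> float:
    """Combine both sources and convert MB to GB (assuming 1 GB = 1,000 MB)."""
    return (wiki_mb + bible_mb) / 1000


total_gb = corpus_size_gb(WIKI_TOTAL_MB, BIBLE_TOTAL_MB)
wiki_sentences_m = sum(WIKI_SENTENCES_K) / 1000

print(f"Corpus size: {total_gb:.2f} GB")
print(f"Wikipedia sentences: {wiki_sentences_m:.2f}M")
```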

	# GitHub
	More details about our model can be found at https://github.com/UBC-NLP/IndT5




	# BibTeX

	```bibtex
	@inproceedings{nagoudi-etal-2021-indt5,
	title = "{I}nd{T}5: A Text-to-Text Transformer for 10 Indigenous Languages",
	author = "Nagoudi, El Moatez Billah and Chen, Wei-Rui and Abdul-Mageed, Muhammad and Cavusoglu, Hasan",
	booktitle = "Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas",
	month = jun,
	year = "2021",
	address = "Online",
	publisher = "Association for Computational Linguistics",
	url = "https://aclanthology.org/2021.americasnlp-1.30",
	doi = "10.18653/v1/2021.americasnlp-1.30",
	pages = "265--271"
	}
	```