 # IndT5: A Text-to-Text Transformer for 10 Indigenous Languages
  
<img src="https://huggingface.co/UBC-NLP/IndT5/raw/main/IND_langs_large7.png" alt="drawing" width="45%" height="45%" align="right"/>
In this work, we introduce IndT5, the first Transformer language model for Indigenous languages. To train IndT5, we build IndCorpus, a new corpus covering 10 Indigenous languages and Spanish.

&nbsp;

# IndT5

We train an Indigenous language model adopting the unified and flexible
text-to-text transfer Transformer (T5) approach. T5 treats every
text-based language task as a "text-to-text" problem: it takes text as
input and produces new text as output. T5 is essentially an
encoder-decoder Transformer, with the encoder and decoder similar in
configuration and size to BERT<sub>Base</sub>, but with some
architectural modifications. These include applying layer normalization
before each sub-block (pre-norm) and adding a residual skip connection
from each sub-block's input to its output.
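Here is a minimal PyTorch sketch of this sub-block structure (our illustration, not the authors' code; note that T5 in fact uses a simplified RMS-style layer normalization without bias, which we approximate below with `nn.LayerNorm`):

```python
import torch.nn as nn

class PreNormSubBlock(nn.Module):
    """Illustrative T5-style sub-block: normalize the input first
    (pre-norm), then add a residual connection from the sub-block's
    input to its output."""
    def __init__(self, dim: int, inner: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)  # T5 proper uses RMSNorm without bias
        self.inner = inner             # e.g., self-attention or feed-forward

    def forward(self, x):
        # Residual: the original input is added back to the output
        return x + self.inner(self.norm(x))
```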

# IndCorpus

We build IndCorpus, a 1.17 GB collection of text in 10 Indigenous languages and Spanish, drawn from Wikipedia and the Bible.

### Data size and number of sentences in the monolingual dataset (collected from Wikipedia and the Bible)

| **Target Language** | **Wiki Size (MB)** | **Wiki #Sentences** | **Bible Size (MB)** | **Bible #Sentences** |
|---------------------|--------------------|---------------------|---------------------|----------------------|
| Hñähñu              | -                  | -                   | 1.4                 | 7.5K                 |
| Wixarika            | -                  | -                   | 1.3                 | 7.5K                 |
| Nahuatl             | 5.8                | 61.1K               | 1.5                 | 7.5K                 |
| Guarani             | 3.7                | 28.2K               | 1.3                 | 7.5K                 |
| Bribri              | -                  | -                   | 1.5                 | 7.5K                 |
| Rarámuri            | -                  | -                   | 1.9                 | 7.5K                 |
| Quechua             | 5.9                | 97.3K               | 4.9                 | 31.1K                |
| Aymara              | 1.7                | 32.9K               | 5.0                 | 30.7K                |
| Shipibo-Konibo      | -                  | -                   | 1.0                 | 7.9K                 |
| Asháninka           | -                  | -                   | 1.4                 | 7.8K                 |
| Spanish             | 1.13K              | 5M                  | -                   | -                    |
| **Total**           | **1.15K**          | **5.22M**           | **19.8**            | **125.3K**           |
# GitHub
More details about our model can be found here: https://github.com/UBC-NLP/IndT5
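
As a quick start, the checkpoint can be loaded with the Hugging Face `transformers` library. This is a minimal sketch, assuming the `UBC-NLP/IndT5` checkpoint works with the standard T5 classes; the task prefix and input below are illustrative only, since the released model is pretrained and is typically fine-tuned (e.g., for machine translation) before use:

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/IndT5")
model = T5ForConditionalGeneration.from_pretrained("UBC-NLP/IndT5")

# T5 casts every task as text-to-text: text in, text out.
# The prefix here is hypothetical; define your own during fine-tuning.
inputs = tokenizer("translate Spanish to Quechua: muchas gracias",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```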




# BibTeX

```bibtex
@inproceedings{nagoudi-etal-2021-indt5,
    title = "{I}nd{T}5: A Text-to-Text Transformer for 10 Indigenous Languages",
    author = "Nagoudi, El Moatez Billah  and Chen, Wei-Rui  and Abdul-Mageed, Muhammad  and Cavusoglu, Hasan",
    booktitle = "Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.americasnlp-1.30",
    doi = "10.18653/v1/2021.americasnlp-1.30",
    pages = "265--271"
}
```