# IndT5: A Text-to-Text Transformer for 10 Indigenous Languages

<img src="https://huggingface.co/UBC-NLP/IndT5/raw/main/IND_langs_large7.png" alt="drawing" width="45%" height="45%" align="right"/>

In this work, we introduce IndT5, the first Transformer language model for Indigenous languages. To train IndT5, we build IndCorpus, a new corpus for 10 Indigenous languages and Spanish.
# IndT5

We train an Indigenous language model adopting the unified and flexible text-to-text transfer Transformer (T5) approach. T5 treats every text-based language task as a “text-to-text” problem, taking text as input and producing new text as output. T5 is essentially an encoder-decoder Transformer, with the encoder and decoder similar in configuration and size to BERT<sub>Base</sub>, but with some architectural modifications. Modifications include applying a normalization layer before a sub-block and adding a pre-norm residual connection (i.e., adding the sub-block's initial input to its output).
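The pre-norm pattern described above can be sketched as follows. This is an illustrative NumPy toy, not the actual T5 implementation: the input is layer-normalized *before* the sub-layer, and the original (un-normalized) input is added back as the residual.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the last (feature) dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_block(x, sublayer):
    # Pre-norm: normalize the input before the sub-layer,
    # then add the original input back as a residual connection.
    return x + sublayer(layer_norm(x))

# Toy sub-layer: a fixed linear projection standing in for
# self-attention or the feed-forward network.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8)) * 0.02
out = pre_norm_block(rng.normal(size=(4, 8)), lambda h: h @ W)
```

Because the residual bypasses the normalization, the un-normalized signal always has a direct path through the block, which is what makes pre-norm Transformers easier to train at depth.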
# IndCorpus

We build IndCorpus, a collection of 10 Indigenous languages and Spanish comprising 1.17 GB of text, from both Wikipedia and the Bible.

| **Language** | **Language Code** | **Main Location** | **Number of Speakers** |
|------------------|-------------------|-------------------|------------------------|
| Aymara | aym | Bolivia | 1,677,100 |
| Asháninka | cni | Peru | 35,200 |
| Bribri | bzd | Costa Rica | 7,000 |
| Guarani | gn | Paraguay | 6,652,790 |
| Hñähñu | oto | Mexico | 88,500 |
| Nahuatl | nah | Mexico | 410,000 |
| Quechua | quy | Peru | 7,384,920 |
| Rarámuri | tar | Mexico | 9,230 |
| Shipibo-Konibo | shp | Peru | 22,500 |
| Wixarika | hch | Mexico | 52,500 |
### Data size and number of sentences in the monolingual dataset (collected from Wikipedia and the Bible)

| **Target Language** | **Wiki Size (MB)** | **Wiki #Sentences** | **Bible Size (MB)** | **Bible #Sentences** |
|---------------------|--------------------|---------------------|---------------------|----------------------|
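A minimal sketch of loading the checkpoint with Hugging Face `transformers`, assuming the model is hosted at `UBC-NLP/IndT5` (as the image link above suggests); the input text and generation settings are illustrative placeholders:

```python
# Illustrative only: assumes the UBC-NLP/IndT5 checkpoint is public on the
# Hugging Face Hub and that `transformers` and `sentencepiece` are installed.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/IndT5")
model = T5ForConditionalGeneration.from_pretrained("UBC-NLP/IndT5")

text = "..."  # input text in one of the supported languages
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```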