m-nagoudi committed on
Commit 58e2f85
1 Parent(s): fa4c2a0

Update README.md

Files changed (1)
  1. README.md +11 -17
README.md CHANGED
@@ -1,32 +1,26 @@
-IndT5: A Text-to-Text Transformer for 10 Indigenous Languages
+# IndT5: A Text-to-Text Transformer for 10 Indigenous Languages
+&nbsp;
 <img src="https://huggingface.co/UBC-NLP/IndT5/raw/main/IND_langs_large7.png" alt="drawing" width="45%" height="45%" align="right"/>
-In this work, we introduce IndT5, the first Transformer language model for Indigenous languages. To train IndT5, we build IndCorpus, a new corpus of 10 Indigenous languages and Spanish. We also present the application of IndT5 to machine translation, investigating different approaches to translating between Spanish and the Indigenous languages as part of our contribution to the AmericasNLP 2021 Shared Task on Open Machine Translation.
+In this work, we introduce IndT5, the first Transformer language model for Indigenous languages. To train IndT5, we build IndCorpus, a new corpus of 10 Indigenous languages and Spanish.
+
 &nbsp;
+
 # IndT5
+
 We train an Indigenous language model adopting the unified and flexible
-text-to-text transfer Transformer (T5) approach . T5 treats every
+text-to-text transfer Transformer (T5) approach. T5 treats every
 text-based language task as a “text-to-text” problem, taking text as
 input and producing new text as output. T5 is essentially an
-encoder-decoder Transformer , with the encoder and decoder similar in
+encoder-decoder Transformer, with the encoder and decoder similar in
 configuration and size to BERT<sub>Base</sub>, but with some
 architectural modifications. These include applying layer
 normalization before each sub-block and adding a pre-norm residual
 connection (i.e., adding the sub-block’s initial input to its output).
+
 # IndCorpus
+
 We build IndCorpus, a collection of 10 Indigenous languages and Spanish comprising 1.17 GB of text drawn from Wikipedia and the Bible.
-### Demographic information of the 10 Indigenous languages
-| **Language**   | **Language Code** | **Main Location** | **Number of Speakers** |
-|----------------|-------------------|-------------------|------------------------|
-| Aymara         | aym               | Bolivia           | 1,677,100              |
-| Asháninka      | cni               | Peru              | 35,200                 |
-| Bribri         | bzd               | Costa Rica        | 7,000                  |
-| Guarani        | gn                | Paraguay          | 6,652,790              |
-| Hñähñu         | oto               | Mexico            | 88,500                 |
-| Nahuatl        | nah               | Mexico            | 410,000                |
-| Quechua        | quy               | Peru              | 7,384,920              |
-| Rarámuri       | tar               | Mexico            | 9,230                  |
-| Shipibo-Konibo | shp               | Peru              | 22,500                 |
-| Wixarika       | hch               | Mexico            | 52,500                 |
+
 ### Data size and number of sentences in the monolingual dataset (collected from Wikipedia and the Bible)
 | **Target Language** | **Wiki Size (MB)** | **Wiki #Sentences** | **Bible Size (MB)** | **Bible #Sentences** |
 |---------------------|--------------------|---------------------|---------------------|----------------------|
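
The pre-norm modification described in the README's architecture notes is easy to misread, so here is a minimal sketch, assuming a generic PyTorch-style formulation. The class and parameter names are illustrative only, and T5 itself uses a simplified (RMS-style) layer normalization rather than `nn.LayerNorm`:

```python
import torch
import torch.nn as nn

class PreNormSubBlock(nn.Module):
    """Illustrative pre-norm residual wrapper: normalize the input first,
    run the wrapped sub-layer, then add the original input back."""

    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)  # T5 uses a simplified RMS-style norm
        self.sublayer = sublayer           # e.g. self-attention or feed-forward

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-norm: normalization is applied *before* the sub-layer, and the
        # sub-block's initial input is added to its output (the residual).
        return x + self.sublayer(self.norm(x))

# Example: wrap a feed-forward layer and run a dummy batch.
block = PreNormSubBlock(512, nn.Linear(512, 512))
out = block(torch.randn(2, 8, 512))  # shape: (batch, seq_len, d_model)
```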
 
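
For completeness, a short usage sketch for the checkpoint this README documents, assuming it loads through the standard T5 classes in Hugging Face `transformers`. The Spanish-to-Quechua task prefix and example sentence are assumptions for illustration, not taken from the model card:

```python
# Hypothetical usage sketch for UBC-NLP/IndT5; assumes the standard
# T5 seq2seq interface. The "translate Spanish to Quechua:" prefix is
# an illustrative assumption, not a documented prompt format.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/IndT5")
model = T5ForConditionalGeneration.from_pretrained("UBC-NLP/IndT5")

# Encode a Spanish source sentence and generate a candidate translation.
inputs = tokenizer("translate Spanish to Quechua: muchas gracias",
                   return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```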