---
thumbnail: https://huggingface.co/front/thumbnails/google.png
license: apache-2.0
---
MuRIL: Multilingual Representations for Indian Languages
===
MuRIL is a BERT model pre-trained on 17 Indian languages and their transliterated counterparts. We have released the pre-trained model (with the MLM layer intact, enabling masked word predictions) in this repository. We have also released the encoder on [TFHub](https://tfhub.dev/google/MuRIL/1), together with an additional pre-processing module that converts raw text into the input format expected by the encoder. You can find more details on MuRIL in this [paper](http://arxiv.org/abs/2103.10730).

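As an illustrative sketch (not part of the official release), masked word prediction can be run with the Hugging Face `transformers` library. The model identifier and the example sentence below are placeholders; substitute the id of the checkpoint in this repository:

```python
# Minimal masked-word-prediction sketch using the `transformers` fill-mask
# pipeline. The model id is an assumption; substitute this repository's id.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="google/muril-base-cased")

# Hindi: "The capital of India is [MASK]."
for prediction in fill_mask("भारत की राजधानी [MASK] है।"):
    print(prediction["token_str"], round(prediction["score"], 3))
```
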
## Overview

This model uses a BERT base architecture [1] pre-trained from scratch using the
Wikipedia [2], Common Crawl [3], PMINDIA [4] and Dakshina [5] corpora for 17 [6]
Indian languages.

We use a training paradigm similar to multilingual BERT, with a few
modifications:

*   We include translation and transliteration segment pairs in training as
    well.
*   We use an exponent value of 0.3 rather than 0.7 for upsampling, which has
    been shown to enhance low-resource performance. [7]

See the Training section for more details.

## Training

The MuRIL model is pre-trained on monolingual segments as well as parallel
segments, as detailed below:

*   Monolingual Data: We make use of publicly available corpora from Wikipedia
    and Common Crawl for 17 Indian languages.
*   Parallel Data: We have two types of parallel data:
    *   Translated Data: We obtain translations of the above monolingual
        corpora using the Google NMT pipeline. We feed translated segment pairs
        as input. We also make use of the publicly available PMINDIA corpus.
    *   Transliterated Data: We obtain transliterations of Wikipedia using the
        IndicTrans [8] library. We feed transliterated segment pairs as input.
        We also make use of the publicly available Dakshina dataset.

We use an exponent value of 0.3 to calculate duplication multiplier values for
upsampling of lower-resource languages, and set dupe factors accordingly (the
sampling scheme is sketched below). Note that we limit transliterated pairs to
Wikipedia only.

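To make the upsampling concrete, the sketch below computes exponent-smoothed sampling rates and the implied duplication multipliers, in the spirit of [7]. The corpus sizes are invented for illustration, and the exact dupe-factor rounding used for MuRIL is not specified here:

```python
# Illustrative sketch of exponent-smoothed language sampling (alpha = 0.3),
# following [7]. Corpus sizes are made-up example numbers, not MuRIL's.
corpus_sizes = {"en": 1_000_000, "hi": 200_000, "as": 5_000}
alpha = 0.3

total = sum(corpus_sizes.values())
data_share = {lang: n / total for lang, n in corpus_sizes.items()}

# Raise each share to the power alpha and renormalise; a smaller alpha flattens
# the distribution, i.e. upsamples low-resource languages more aggressively.
smoothed = {lang: p ** alpha for lang, p in data_share.items()}
norm = sum(smoothed.values())
sampling_rate = {lang: q / norm for lang, q in smoothed.items()}

# A duplication multiplier ("dupe factor") can then be read off as the ratio of
# the smoothed sampling rate to the raw data share.
dupe_factor = {lang: sampling_rate[lang] / data_share[lang] for lang in corpus_sizes}
print(sampling_rate)
print(dupe_factor)
```
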
The model was trained using a self-supervised masked language modeling task. We
do whole-word masking with a maximum of 80 predictions per sequence. The model
was trained for 1000K steps, with a batch size of 4096 and a maximum sequence
length of 512.

### Trainable parameters

All parameters in the module are trainable, and fine-tuning all parameters is
the recommended practice.

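As an illustration of full fine-tuning (every parameter left trainable), the sketch below runs one gradient step with the `transformers` and `torch` APIs. The model id, label count, example sentence and learning rate are placeholders, not prescribed settings:

```python
# Minimal full-fine-tuning sketch: no layers are frozen, every parameter
# receives gradients. The model id and hyper-parameters are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "google/muril-base-cased"  # placeholder; substitute this repository's id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)

# Confirm that all parameters are trainable, as recommended above.
assert all(p.requires_grad for p in model.parameters())

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
batch = tokenizer(["यह एक उदाहरण वाक्य है।"], return_tensors="pt")  # "This is an example sentence."
labels = torch.tensor([1])

outputs = model(**batch, labels=labels)  # classification loss over num_labels classes
outputs.loss.backward()
optimizer.step()
```
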
## Uses & Limitations

This model is intended to be used for a variety of downstream NLP tasks for
Indian languages. It is also trained on transliterated data, since
transliteration is commonly observed in the Indian context. The model is not
expected to perform well on languages other than the 17 Indian languages used
in pre-training.

## Evaluation

We provide the results of fine-tuning this model on a set of downstream tasks.<br/>
We choose these tasks from the XTREME benchmark, with evaluation done on Indian language test sets.<br/>
We also transliterate the test sets and evaluate on the same.<br/>
We use the same fine-tuning setting as [9], except for TyDiQA, where we use additional SQuAD v1.1 English training data, similar to [10].<br/>
For Tatoeba, we do not fine-tune the model; we use the `pooled_output` of the last layer as the sentence embedding (see the sketch below).<br/>
All results are computed in a zero-shot setting, with English as the high-resource training language.

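For the Tatoeba-style sentence embeddings described above, the sketch below extracts the pooled output via the `transformers` API, where `pooler_output` corresponds to the `[CLS]` vector passed through BERT's pooling layer. The model id and sentences are placeholders:

```python
# Minimal sketch: sentence embeddings from the pooled output, mirroring the
# Tatoeba setup above. The model id is an assumption; substitute as needed.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "google/muril-base-cased"  # placeholder; substitute this repository's id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

sentences = ["नमस्ते दुनिया", "வணக்கம் உலகம்"]  # "Hello world" in Hindi and Tamil
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# `pooler_output` is the analogue of the TF `pooled_output` used here: the
# final-layer [CLS] representation passed through the tanh pooling head.
embeddings = outputs.pooler_output  # shape: (batch_size, hidden_size)
print(embeddings.shape)
```
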
*   Shown below are results on datasets from the XTREME benchmark (in %)
    <br/>

PANX (F1) | ml | ta | te | en | bn | hi | mr | ur | Average
:-------- | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ------:
mBERT | 54.77 | 51.24 | 50.16 | 84.40 | 68.59 | 65.13 | 58.44 | 31.36 | 58.01
MuRIL | 75.74 | 71.86 | 64.99 | 84.43 | 85.97 | 78.09 | 74.63 | 85.07 | 77.60

<br/>

UDPOS (F1) | en | hi | mr | ta | te | ur | Average
:--------- | ----: | ----: | ----: | ----: | ----: | ----: | ------:
mBERT | 95.35 | 66.09 | 71.27 | 59.58 | 76.98 | 57.85 | 71.19
MuRIL | 95.55 | 64.47 | 82.95 | 62.57 | 85.63 | 58.93 | 75.02

<br/>

XNLI (Accuracy) | en | hi | ur | Average
:-------------- | ----: | ----: | ----: | ------:
mBERT | 81.72 | 60.52 | 58.20 | 66.81
MuRIL | 83.85 | 70.66 | 67.70 | 74.07

<br/>

Tatoeba (Accuracy) | ml | ta | te | bn | hi | mr | ur | Average
:----------------- | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ------:
mBERT | 20.23 | 12.38 | 14.96 | 12.80 | 27.80 | 18.00 | 22.70 | 18.41
MuRIL | 26.35 | 36.81 | 17.52 | 20.20 | 31.50 | 26.60 | 17.10 | 25.15

<br/>

XQUAD (F1/EM) | en | hi | Average
:------------ | ----------: | ----------: | ----------:
mBERT | 83.85/72.86 | 58.46/43.53 | 71.15/58.19
MuRIL | 84.31/72.94 | 73.93/58.32 | 79.12/65.63

<br/>

MLQA (F1/EM) | en | hi | Average
:----------- | ----------: | ----------: | ----------:
mBERT | 80.39/67.30 | 50.28/35.18 | 65.34/51.24
MuRIL | 80.28/67.37 | 67.34/50.22 | 73.81/58.80

<br/>

TyDiQA (F1/EM) | en | bn | te | Average
:------------- | ----------: | ----------: | ----------: | ----------:
mBERT | 75.21/65.00 | 60.62/45.13 | 53.55/44.54 | 63.13/51.66
MuRIL | 74.10/64.55 | 78.03/66.37 | 73.95/46.94 | 75.36/59.28

*   Shown below are results on the transliterated versions of the above
    test sets.

PANX (F1) | ml_tr | ta_tr | te_tr | bn_tr | hi_tr | mr_tr | ur_tr | Average
:-------- | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ------:
mBERT | 7.53 | 1.04 | 8.24 | 41.77 | 25.46 | 8.34 | 7.30 | 14.24
MuRIL | 63.39 | 7.00 | 53.62 | 72.94 | 69.75 | 68.77 | 68.41 | 57.70

<br/>

UDPOS (F1) | hi_tr | mr_tr | ta_tr | te_tr | ur_tr | Average
:--------- | ----: | ----: | ----: | ----: | ----: | ------:
mBERT | 25.00 | 33.67 | 24.02 | 36.21 | 22.07 | 28.20
MuRIL | 63.09 | 67.19 | 58.40 | 65.30 | 56.49 | 62.09

<br/>

XNLI (Accuracy) | hi_tr | ur_tr | Average
:-------------- | ----: | ----: | ------:
mBERT | 39.60 | 38.86 | 39.23
MuRIL | 68.24 | 61.16 | 64.70

<br/>

Tatoeba (Accuracy) | ml_tr | ta_tr | te_tr | bn_tr | hi_tr | mr_tr | ur_tr | Average
:----------------- | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ------:
mBERT | 2.18 | 1.95 | 5.13 | 1.80 | 3.00 | 2.40 | 2.30 | 2.68
MuRIL | 10.33 | 11.07 | 11.54 | 8.10 | 14.90 | 7.20 | 13.70 | 10.98

<br/>

## References

\[1]: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova.
[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805).
arXiv preprint arXiv:1810.04805, 2018.

\[2]: [Wikipedia](https://www.tensorflow.org/datasets/catalog/wikipedia)

\[3]: [Common Crawl](http://commoncrawl.org/the-data/)

\[4]: [PMINDIA](http://lotus.kuee.kyoto-u.ac.jp/WAT/indic-multilingual/index.html)

\[5]: [Dakshina](https://github.com/google-research-datasets/dakshina)

\[6]: Assamese (as), Bengali (bn), English (en), Gujarati (gu), Hindi (hi),
Kannada (kn), Kashmiri (ks), Malayalam (ml), Marathi (mr), Nepali (ne), Oriya
(or), Punjabi (pa), Sanskrit (sa), Sindhi (sd), Tamil (ta), Telugu (te) and Urdu
(ur).

\[7]: Conneau, Alexis, et al.
[Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/pdf/1911.02116.pdf).
arXiv preprint arXiv:1911.02116, 2019.

\[8]: [IndicTrans](https://github.com/libindic/indic-trans)

\[9]: Hu, J., Ruder, S., Siddhant, A., Neubig, G., Firat, O., & Johnson, M. (2020).
[XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization](https://arxiv.org/pdf/2003.11080.pdf).
arXiv preprint arXiv:2003.11080.

\[10]: Fang, Y., Wang, S., Gan, Z., Sun, S., & Liu, J. (2020).
[FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding](https://arxiv.org/pdf/2009.05166.pdf).
arXiv preprint arXiv:2009.05166.

## Citation

If you find MuRIL useful in your applications, please cite the following paper:

```bibtex
@misc{khanuja2021muril,
      title={MuRIL: Multilingual Representations for Indian Languages},
      author={Simran Khanuja and Diksha Bansal and Sarvesh Mehtani and Savya Khosla and Atreyee Dey and Balaji Gopalan and Dilip Kumar Margam and Pooja Aggarwal and Rajiv Teja Nagipogu and Shachi Dave and Shruti Gupta and Subhash Chandra Bose Gali and Vish Subramanian and Partha Talukdar},
      year={2021},
      eprint={2103.10730},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## Contact

Please mail your queries/feedback to muril-contact@google.com.