Commit 1efdafb (parent: 845b031), committed by murthyrudra: Created Readme
README.md (added)

---
language:
- bn
- gu
- hi
- mr
- ne
- or
- pa
- sa
- ur

library_name: transformers
pipeline_tag: fill-mask
---

# IA-Original

IA-Original is a multilingual RoBERTa model pre-trained exclusively on 11 Indian languages from the Indo-Aryan language family. It is pre-trained on monolingual corpora of these languages and subsequently evaluated on a diverse set of tasks.

The 11 languages covered by IA-Original are Bhojpuri, Bengali, Gujarati, Hindi, Magahi, Marathi, Nepali, Oriya, Punjabi, Sanskrit, and Urdu.

The code can be found [here](https://github.com/IBM/NL-FM-Toolkit). For more information, check out our [paper](https://aclanthology.org/2021.emnlp-main.675/).

## Pretraining Corpus

We pre-trained IA-Original on publicly available monolingual corpora. The combined corpus has the following distribution across languages:

| **Language**  | **# Sentences** | **# Tokens (Total)** | **# Tokens (Unique)** |
| :------------ | --------------: | -------------------: | --------------------: |
| Hindi (hi)    |        1,552.89 |            20,098.73 |                 25.01 |
| Bengali (bn)  |          353.44 |             4,021.30 |                   6.5 |
| Sanskrit (sa) |          165.35 |             1,381.04 |                 11.13 |
| Urdu (ur)     |          153.27 |             2,465.48 |                  4.61 |
| Marathi (mr)  |          132.93 |             1,752.43 |                  4.92 |
| Gujarati (gu) |          131.22 |             1,565.08 |                  4.73 |
| Nepali (ne)   |           84.21 |             1,139.54 |                  3.43 |
| Punjabi (pa)  |           68.02 |               945.68 |                  2.00 |
| Oriya (or)    |           17.88 |               274.99 |                  1.10 |
| Bhojpuri (bh) |           10.25 |               134.37 |                  1.13 |
| Magahi (mag)  |            0.36 |                 3.47 |                  0.15 |
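
To make the imbalance in the table easier to see, the following minimal sketch converts the sentence and total-token counts into per-language shares. The numbers are copied from the table; the dictionary layout and the `shares` helper are illustrative assumptions, not part of the released code. Because only ratios are computed, the result does not depend on the unit the table uses.

```python
# (sentences, total tokens) per language, copied from the table above.
# Keys are the language codes shown in the table's first column.
CORPUS = {
    "hi": (1552.89, 20098.73),
    "bn": (353.44, 4021.30),
    "sa": (165.35, 1381.04),
    "ur": (153.27, 2465.48),
    "mr": (132.93, 1752.43),
    "gu": (131.22, 1565.08),
    "ne": (84.21, 1139.54),
    "pa": (68.02, 945.68),
    "or": (17.88, 274.99),
    "bh": (10.25, 134.37),
    "mag": (0.36, 3.47),
}


def shares(column):
    """Return each language's fraction of the chosen column (0 or 1)."""
    total = sum(row[column] for row in CORPUS.values())
    return {lang: row[column] / total for lang, row in CORPUS.items()}


sentence_share = shares(0)
token_share = shares(1)

# Print languages from largest to smallest token share.
for lang in sorted(CORPUS, key=token_share.get, reverse=True):
    print(f"{lang:>3}  sentences {sentence_share[lang]:7.2%}  "
          f"tokens {token_share[lang]:7.2%}")
```

On these figures Hindi accounts for roughly 58% of the sentences and just under 60% of the tokens, while Magahi contributes well under 0.1% of either.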

## Evaluation Results

IA-Original is evaluated on IndicGLUE and some additional tasks. For more details about the tasks, refer to the [paper](https://aclanthology.org/2021.emnlp-main.675/).

## Downloads

The model can be downloaded from [Hugging Face](https://huggingface.co/ibm/ia-multilingual-original-script-roberta).
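
Since the model card declares `library_name: transformers` and `pipeline_tag: fill-mask`, loading the model could look like the sketch below. This is a hedged example rather than official usage: the repository ID is taken from the link above, while `fill_placeholder`, `top_predictions`, and the `[MASK]` placeholder convention are hypothetical helpers introduced here. It assumes the `transformers` package (with a backend such as PyTorch) is installed and that the repository ships a compatible tokenizer.

```python
# Repository ID taken from the download link in this README.
MODEL_ID = "ibm/ia-multilingual-original-script-roberta"


def fill_placeholder(template, mask_token, placeholder="[MASK]"):
    """Swap a readable placeholder for the tokenizer's actual mask token."""
    if template.count(placeholder) != 1:
        raise ValueError("template must contain exactly one placeholder")
    return template.replace(placeholder, mask_token)


def top_predictions(template, top_k=5):
    """Return (token, score) pairs for the masked position in `template`."""
    # Imported lazily so fill_placeholder works without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=MODEL_ID)
    # Ask the tokenizer for its mask token instead of hard-coding it.
    text = fill_placeholder(template, fill_mask.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill_mask(text, top_k=top_k)]
```

A call such as `top_predictions("भारत की राजधानी [MASK] है।")` would download the weights on first use and return the highest-scoring candidate tokens for the masked word.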

## Citing

If you use any of these resources, please cite the following paper:

```
@inproceedings{dhamecha-etal-2021-role,
    title = "Role of {L}anguage {R}elatedness in {M}ultilingual {F}ine-tuning of {L}anguage {M}odels: {A} {C}ase {S}tudy in {I}ndo-{A}ryan {L}anguages",
    author = "Dhamecha, Tejas and
      Murthy, Rudra and
      Bharadwaj, Samarth and
      Sankaranarayanan, Karthik and
      Bhattacharyya, Pushpak",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.675",
    doi = "10.18653/v1/2021.emnlp-main.675",
    pages = "8584--8595",
}
```

## Contributors

- Tejas Dhamecha
- Rudra Murthy
- Samarth Bharadwaj
- Karthik Sankaranarayanan
- Pushpak Bhattacharyya

## Contact

- Rudra Murthy ([rmurthyv@in.ibm.com](mailto:rmurthyv@in.ibm.com))