murthyrudra committed on
Commit
1efdafb
1 Parent(s): 845b031

Created Readme

  1. README.md +94 -0
README.md ADDED
@@ -0,0 +1,94 @@
+ ---
+ language:
+ - bn
+ - gu
+ - hi
+ - mr
+ - ne
+ - or
+ - pa
+ - sa
+ - ur
+
+ library_name: transformers
+ pipeline_tag: fill-mask
+ ---
+
+ # IA-Original
+
+ IA-Original is a multilingual RoBERTa model pre-trained exclusively on 11 Indian languages from the Indo-Aryan language family. It is pre-trained on monolingual corpora of these languages and then evaluated on a diverse set of tasks.
+
+ The 11 languages covered by IA-Original are Bhojpuri, Bengali, Gujarati, Hindi, Magahi, Marathi, Nepali, Oriya, Punjabi, Sanskrit, and Urdu.
+
+ The code can be found [here](https://github.com/IBM/NL-FM-Toolkit). For more information, check out our [paper](https://aclanthology.org/2021.emnlp-main.675/).
+
+ ## Pretraining Corpus
+
+ We pre-trained IA-Original on publicly available monolingual corpora. The corpus has the following distribution across languages:
+
+ | **Language**  | **# Sentences** | **# Tokens (Total)** | **# Tokens (Unique)** |
+ | :------------ | --------------: | -------------------: | --------------------: |
+ | Hindi (hi)    |        1,552.89 |            20,098.73 |                 25.01 |
+ | Bengali (bn)  |          353.44 |             4,021.30 |                   6.5 |
+ | Sanskrit (sa) |          165.35 |             1,381.04 |                 11.13 |
+ | Urdu (ur)     |          153.27 |             2,465.48 |                  4.61 |
+ | Marathi (mr)  |          132.93 |             1,752.43 |                  4.92 |
+ | Gujarati (gu) |          131.22 |             1,565.08 |                  4.73 |
+ | Nepali (ne)   |           84.21 |             1,139.54 |                  3.43 |
+ | Punjabi (pa)  |           68.02 |               945.68 |                  2.00 |
+ | Oriya (or)    |           17.88 |               274.99 |                  1.10 |
+ | Bhojpuri (bh) |           10.25 |               134.37 |                  1.13 |
+ | Magahi (mag)  |            0.36 |                 3.47 |                  0.15 |
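
To make the imbalance in the table easier to see, the sentence counts can be turned into per-language shares. The snippet below is only an illustrative sketch: the names `SENTENCES` and `language_shares` are made up for this example and are not part of the NL-FM-Toolkit, and the numbers are copied from the table in whatever unit the table uses.

```python
# Sentence counts copied from the "# Sentences" column of the table above
# (same unit as the table; the unit itself is not restated here).
SENTENCES = {
    "hi": 1552.89,
    "bn": 353.44,
    "sa": 165.35,
    "ur": 153.27,
    "mr": 132.93,
    "gu": 131.22,
    "ne": 84.21,
    "pa": 68.02,
    "or": 17.88,
    "bh": 10.25,
    "mag": 0.36,
}


def language_shares(counts):
    """Return each language's fraction of the total count (hypothetical helper)."""
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


shares = language_shares(SENTENCES)

# Print the languages from largest to smallest share of sentences.
for lang, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang:>3}  {share:6.2%}")
```

Run this way, Hindi accounts for more than half of all sentences, while Magahi contributes only a tiny fraction.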
+
+ ## Evaluation Results
+
+ IA-Original is evaluated on IndicGLUE and some additional tasks. For more details about the tasks, refer to the [paper](https://aclanthology.org/2021.emnlp-main.675/).
+
+ ## Downloads
+
+ The model can be downloaded from [Hugging Face](https://huggingface.co/ibm/ia-multilingual-original-script-roberta).
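
Since the model card declares `library_name: transformers` and `pipeline_tag: fill-mask`, one plausible way to try the model is through the `transformers` fill-mask pipeline. This is a minimal sketch, not an official usage guide: the model id is taken from the link above, while the helper name `top_predictions`, the `[MASK]` placeholder convention, and the sample sentence are assumptions made for illustration. The tokenizer's own mask token is looked up at run time instead of being hard-coded.

```python
# Model id taken from the Hugging Face link above.
MODEL_ID = "ibm/ia-multilingual-original-script-roberta"


def top_predictions(text_with_placeholder, top_k=5):
    """Fill one "[MASK]" placeholder and return (token, score) pairs.

    Hypothetical helper for illustration; "[MASK]" is just a placeholder
    convention used here and is swapped for the tokenizer's real mask token.
    """
    # Imported lazily so that defining this helper does not require
    # transformers to be installed or the model to be downloaded.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=MODEL_ID)
    text = text_with_placeholder.replace("[MASK]", fill.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill(text, top_k=top_k)]


# Example call (downloads the model on first use):
#   for token, score in top_predictions("मैं [MASK] जा रहा हूँ।"):
#       print(token, round(score, 4))
```

The first call downloads the model weights, so it needs network access and the `transformers` package with a supported backend installed.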
+
+ ## Citing
+
+ If you use any of these resources, please cite the following paper:
+
+ ```
+ @inproceedings{dhamecha-etal-2021-role,
+     title = "Role of {L}anguage {R}elatedness in {M}ultilingual {F}ine-tuning of {L}anguage {M}odels: {A} {C}ase {S}tudy in {I}ndo-{A}ryan {L}anguages",
+     author = "Dhamecha, Tejas and
+       Murthy, Rudra and
+       Bharadwaj, Samarth and
+       Sankaranarayanan, Karthik and
+       Bhattacharyya, Pushpak",
+     booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
+     month = nov,
+     year = "2021",
+     address = "Online and Punta Cana, Dominican Republic",
+     publisher = "Association for Computational Linguistics",
+     url = "https://aclanthology.org/2021.emnlp-main.675",
+     doi = "10.18653/v1/2021.emnlp-main.675",
+     pages = "8584--8595",
+ }
+ ```
+
+ ## Contributors
+
+ - Tejas Dhamecha
+ - Rudra Murthy
+ - Samarth Bharadwaj
+ - Karthik Sankaranarayanan
+ - Pushpak Bhattacharyya
+
+ ## Contact
+
+ - Rudra Murthy ([rmurthyv@in.ibm.com](mailto:rmurthyv@in.ibm.com))