vaishnavi committed on
Commit
5f753cc
1 Parent(s): 7467ee0

initial commit

README.md ADDED
@@ -0,0 +1,118 @@
+ ---
+ language: en
+ license: mit
+ datasets:
+ - AI4Bharat IndicNLP Corpora
+ ---
+
+ # IndicBERT
+
+ IndicBERT is a multilingual ALBERT model pretrained exclusively on 12 major Indian languages. It is pretrained on our novel monolingual corpus of around 9 billion tokens and subsequently evaluated on a set of diverse tasks. IndicBERT has far fewer parameters than other multilingual models (mBERT, XLM-R, etc.) while achieving performance on par with or better than these models.
+
+ The 12 languages covered by IndicBERT are: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu.
+
+ The code can be found [here](https://github.com/divkakwani/indic-bert). For more information, check out our [project page](https://indicnlp.ai4bharat.org/) or our [paper](https://indicnlp.ai4bharat.org/papers/arxiv2020_indicnlp_corpus.pdf).
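+
+ Since the model is also hosted on the Hugging Face Hub (see the Downloads section below), a minimal usage sketch with the `transformers` library might look as follows. The model ID comes from the Downloads section; the Hindi example sentence and the `sentencepiece` dependency are our assumptions, not part of the original card.
+
+ ```python
+ # Minimal sketch: load IndicBERT and extract contextual embeddings.
+ # Assumes the `transformers` and `sentencepiece` packages are installed.
+ from transformers import AutoModel, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")
+ model = AutoModel.from_pretrained("ai4bharat/indic-bert")
+
+ # Example Hindi sentence (an arbitrary placeholder input).
+ inputs = tokenizer("यह एक उदाहरण वाक्य है।", return_tensors="pt")
+ outputs = model(**inputs)
+ print(outputs.last_hidden_state.shape)  # (1, seq_len, 768), per the config below
+ ```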
+
+
+
+ ## Pretraining Corpus
+
+ We pre-trained IndicBERT on AI4Bharat's monolingual corpus. The corpus has the following distribution of languages:
+
+ | Language          | as    | bn   | en    | gu   | hi    | kn   | ml   | mr   | or   | pa   | ta   | te   | all  |
+ | ----------------- | ----- | ---- | ----- | ---- | ----- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
+ | **No. of Tokens** | 36.9M | 815M | 1.34B | 724M | 1.84B | 712M | 767M | 560M | 104M | 814M | 549M | 671M | 8.9B |
+
+
+
+ ## Evaluation Results
+
+ IndicBERT is evaluated on IndicGLUE and some additional tasks. The results are summarized below. For more details about the tasks, refer to our [official repo](https://github.com/divkakwani/indic-bert).
+
+ #### IndicGLUE
+
+ Task | mBERT | XLM-R | IndicBERT
+ -----| ----- | ----- | ------
+ News Article Headline Prediction | 89.58 | 95.52 | **95.87**
+ Wikipedia Section Title Prediction | **73.66** | 66.33 | 73.31
+ Cloze-style multiple-choice QA | 39.16 | 27.98 | **41.87**
+ Article Genre Classification | 90.63 | 97.03 | **97.34**
+ Named Entity Recognition (F1-score) | **73.24** | 65.93 | 64.47
+ Cross-Lingual Sentence Retrieval Task | 21.46 | 13.74 | **27.12**
+ Average | 64.62 | 61.09 | **66.66**
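+
+ Scores like these are obtained by fine-tuning the pretrained encoder with a task-specific head. A generic, hedged sketch with `transformers` follows; this is our illustration, not the authors' actual pipeline (see the official repo for that):
+
+ ```python
+ # Generic sketch: attach a classification head to IndicBERT for fine-tuning.
+ # `num_labels=3` is a placeholder; the value depends on the task.
+ from transformers import AlbertForSequenceClassification
+
+ model = AlbertForSequenceClassification.from_pretrained(
+     "ai4bharat/indic-bert", num_labels=3
+ )
+ # Fine-tune with your framework of choice (e.g. transformers' Trainer).
+ ```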
+
+ #### Additional Tasks
+
+
+ Task | Task Type | mBERT | XLM-R | IndicBERT
+ -----| ----- | ----- | ------ | -----
+ BBC News Classification | Genre Classification | 60.55 | **75.52** | 74.60
+ IITP Product Reviews | Sentiment Analysis | 74.57 | **78.97** | 71.32
+ IITP Movie Reviews | Sentiment Analysis | 56.77 | **61.61** | 59.03
+ Soham News Article | Genre Classification | 80.23 | **87.60** | 78.45
+ Midas Discourse | Discourse Analysis | 71.20 | **79.94** | 78.44
+ iNLTK Headlines Classification | Genre Classification | 87.95 | 93.38 | **94.52**
+ ACTSA Sentiment Analysis | Sentiment Analysis | 48.53 | 59.33 | **61.18**
+ Winograd NLI | Natural Language Inference | 56.34 | 55.87 | **56.34**
+ Choice of Plausible Alternative (COPA) | Natural Language Inference | 54.92 | 51.13 | **58.33**
+ Amrita Exact Paraphrase | Paraphrase Detection | **93.81** | 93.02 | 93.75
+ Amrita Rough Paraphrase | Paraphrase Detection | 83.38 | 82.20 | **84.33**
+ Average | | 69.84 | **74.42** | 73.66
+
+
+ \* Note: all models have been restricted to a max_seq_length of 128.
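+
+ To evaluate under the same constraint, inputs can be truncated at tokenization time. A small sketch, reusing the tokenizer from the usage example above (the input text is a placeholder):
+
+ ```python
+ # Enforce the max_seq_length of 128 used in the evaluations above.
+ inputs = tokenizer(
+     "placeholder text",
+     max_length=128,
+     truncation=True,
+     return_tensors="pt",
+ )
+ ```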
+
+
+
+ ## Downloads
+
+ The model can be downloaded [here](https://storage.googleapis.com/ai4bharat-public-indic-nlp-corpora/models/indic-bert-v1.tar.gz). Both TensorFlow checkpoints and PyTorch binaries are included in the archive. Alternatively, you can also download it from [Hugging Face](https://huggingface.co/ai4bharat/indic-bert).
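+
+ Fetching and unpacking the archive from the URL above could look like the following sketch (the local file and directory names are our own choices):
+
+ ```python
+ # Download and extract the released archive using only the standard library.
+ import tarfile
+ import urllib.request
+
+ URL = "https://storage.googleapis.com/ai4bharat-public-indic-nlp-corpora/models/indic-bert-v1.tar.gz"
+ urllib.request.urlretrieve(URL, "indic-bert-v1.tar.gz")
+ with tarfile.open("indic-bert-v1.tar.gz") as tar:
+     tar.extractall("indic-bert-v1")  # TF checkpoints and PyTorch binaries
+ ```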
+
+
+
+ ## Citing
+
+ If you are using any of the resources, please cite the following article:
+
+ ```
+ @inproceedings{kakwani2020indicnlpsuite,
+     title={{IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages}},
+     author={Divyanshu Kakwani and Anoop Kunchukuttan and Satish Golla and Gokul N.C. and Avik Bhattacharyya and Mitesh M. Khapra and Pratyush Kumar},
+     year={2020},
+     booktitle={Findings of EMNLP},
+ }
+ ```
+
+ We would like to hear from you if:
+
+ - You are using our resources. Please let us know how you are putting these resources to use.
+ - You have any feedback on these resources.
+
+
+
+ ## License
+
+ The IndicBERT code and models are released under the MIT License.
+
+ ## Contributors
+
+ - Divyanshu Kakwani
+ - Anoop Kunchukuttan
+ - Gokul NC
+ - Satish Golla
+ - Avik Bhattacharyya
+ - Mitesh Khapra
+ - Pratyush Kumar
+
+ This work is the outcome of a volunteer effort as part of the [AI4Bharat initiative](https://ai4bharat.org).
+
+
+
+ ## Contact
+
+ - Anoop Kunchukuttan ([anoop.kunchukuttan@gmail.com](mailto:anoop.kunchukuttan@gmail.com))
+ - Mitesh Khapra ([miteshk@cse.iitm.ac.in](mailto:miteshk@cse.iitm.ac.in))
+ - Pratyush Kumar ([pratyush@cse.iitm.ac.in](mailto:pratyush@cse.iitm.ac.in))
config.json ADDED
@@ -0,0 +1,21 @@
+ {
+   "model_type": "albert",
+   "attention_probs_dropout_prob": 0,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0,
+   "embedding_size": 128,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "max_position_embeddings": 512,
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "num_hidden_groups": 1,
+   "net_structure_type": 0,
+   "gap_size": 0,
+   "num_memory_blocks": 0,
+   "inner_group_num": 1,
+   "down_scale_factor": 1,
+   "type_vocab_size": 2,
+   "vocab_size": 200000
+ }
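
A note on the configuration above: it is a standard ALBERT setup in which the large vocabulary (vocab_size 200000) is factorized through a small embedding_size of 128 before being projected to hidden_size 768, which is what keeps the parameter count low. A minimal sketch of rebuilding the architecture from this config (randomly initialized weights; use from_pretrained for the trained ones):

```python
# Sketch: instantiate the ALBERT architecture from the shipped config.
from transformers import AlbertConfig, AlbertModel

config = AlbertConfig.from_pretrained("ai4bharat/indic-bert")
model = AlbertModel(config)  # random weights; load trained ones via from_pretrained
print(config.vocab_size, config.embedding_size, config.hidden_size)  # 200000 128 768
```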
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:94c747585b0126ba2886423f69844e03cb7a4f198f6e15c42afedbccbf80a138
+ size 134982446
spiece.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3a1173c2b6e144a02c001e289a05b5dbefddf247c50d4dcf42633158b2968fcb
+ size 5646064
spiece.vocab ADDED
The diff for this file is too large to render.
tf_model.ckpt.data-00000-of-00001 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:51a52801f044b66e11ce318b2d20463b1ef0a63723cba2228677306bfa65aa8e
+ size 400182536
tf_model.ckpt.index ADDED
Binary file (1.87 kB).
tf_model.ckpt.meta ADDED
Binary file (2.2 MB).