initial commit
- README.md +118 -0
- config.json +21 -0
- pytorch_model.bin +3 -0
- spiece.model +3 -0
- spiece.vocab +0 -0
- tf_model.ckpt.data-00000-of-00001 +3 -0
- tf_model.ckpt.index +0 -0
- tf_model.ckpt.meta +0 -0
README.md
ADDED
@@ -0,0 +1,118 @@
---
language: en
license: mit
datasets:
- AI4Bharat IndicNLP Corpora
---

# IndicBERT

IndicBERT is a multilingual ALBERT model pretrained exclusively on 12 major Indian languages. It is pretrained on our novel monolingual corpus of around 9 billion tokens and subsequently evaluated on a set of diverse tasks. IndicBERT has far fewer parameters than other multilingual models (mBERT, XLM-R, etc.) while achieving performance on par with or better than these models.

The 12 languages covered by IndicBERT are: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu.

The code can be found [here](https://github.com/divkakwani/indic-bert). For more information, check out our [project page](https://indicnlp.ai4bharat.org/) or our [paper](https://indicnlp.ai4bharat.org/papers/arxiv2020_indicnlp_corpus.pdf).

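For a quick start, the tokenizer and encoder can be loaded through the `transformers` library. This is a minimal sketch, assuming a recent `transformers` release with `sentencepiece` installed; the Hindi sentence is just a placeholder:

```
# Minimal usage sketch; assumes transformers + sentencepiece are installed.
import torch
from transformers import AlbertModel, AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("ai4bharat/indic-bert")
model = AlbertModel.from_pretrained("ai4bharat/indic-bert")

# Encode one sentence; any of the 12 supported languages works the same way.
inputs = tokenizer("यह एक उदाहरण वाक्य है", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768), per hidden_size in config.json
```
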
## Pretraining Corpus

We pre-trained indic-bert on AI4Bharat's monolingual corpus. The corpus has the following distribution of languages:

| Language          | as     | bn     | en     | gu     | hi     | kn     |         |
| ----------------- | ------ | ------ | ------ | ------ | ------ | ------ | ------- |
| **No. of Tokens** | 36.9M  | 815M   | 1.34B  | 724M   | 1.84B  | 712M   |         |
| **Language**      | **ml** | **mr** | **or** | **pa** | **ta** | **te** | **all** |
| **No. of Tokens** | 767M   | 560M   | 104M   | 814M   | 549M   | 671M   | 8.9B    |

## Evaluation Results

IndicBERT is evaluated on IndicGLUE and some additional tasks. The results are summarized below. For more details about the tasks, refer to our [official repo](https://github.com/divkakwani/indic-bert).

#### IndicGLUE

Task | mBERT | XLM-R | IndicBERT
-----| ----- | ----- | ------
News Article Headline Prediction | 89.58 | 95.52 | **95.87**
Wikipedia Section Title Prediction | **73.66** | 66.33 | 73.31
Cloze-style multiple-choice QA | 39.16 | 27.98 | **41.87**
Article Genre Classification | 90.63 | 97.03 | **97.34**
Named Entity Recognition (F1-score) | **73.24** | 65.93 | 64.47
Cross-Lingual Sentence Retrieval Task | 21.46 | 13.74 | **27.12**
Average | 64.62 | 61.09 | **66.66**

#### Additional Tasks

Task | Task Type | mBERT | XLM-R | IndicBERT
-----| ----- | ----- | ------ | -----
BBC News Classification | Genre Classification | 60.55 | **75.52** | 74.60
IITP Product Reviews | Sentiment Analysis | 74.57 | **78.97** | 71.32
IITP Movie Reviews | Sentiment Analysis | 56.77 | **61.61** | 59.03
Soham News Article | Genre Classification | 80.23 | **87.6** | 78.45
Midas Discourse | Discourse Analysis | 71.20 | **79.94** | 78.44
iNLTK Headlines Classification | Genre Classification | 87.95 | 93.38 | **94.52**
ACTSA Sentiment Analysis | Sentiment Analysis | 48.53 | 59.33 | **61.18**
Winograd NLI | Natural Language Inference | 56.34 | 55.87 | **56.34**
Choice of Plausible Alternatives (COPA) | Natural Language Inference | 54.92 | 51.13 | **58.33**
Amrita Exact Paraphrase | Paraphrase Detection | **93.81** | 93.02 | 93.75
Amrita Rough Paraphrase | Paraphrase Detection | 83.38 | 82.20 | **84.33**
Average | | 69.84 | **74.42** | 73.66

\* Note: all models have been restricted to a max_seq_length of 128.

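Most of these tasks are single-sentence or document classification, so evaluation reduces to fine-tuning a classification head on top of the encoder. Below is a hedged sketch of that setup; `num_labels`, the texts, and the labels are placeholders, and `max_length=128` mirrors the restriction noted above:

```
# Fine-tuning sketch for a classification task; all data here is placeholder.
import torch
from transformers import AlbertForSequenceClassification, AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("ai4bharat/indic-bert")
model = AlbertForSequenceClassification.from_pretrained(
    "ai4bharat/indic-bert", num_labels=3  # placeholder label count
)

texts = ["पहला उदाहरण वाक्य", "दूसरा उदाहरण वाक्य"]  # placeholder training texts
labels = torch.tensor([0, 1])

# Mirror the evaluation setup: pad/truncate to max_seq_length of 128.
batch = tokenizer(texts, padding=True, truncation=True, max_length=128,
                  return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()  # an optimizer step would follow in a real training loop
```
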
## Downloads

The model can be downloaded [here](https://storage.googleapis.com/ai4bharat-public-indic-nlp-corpora/models/indic-bert-v1.tar.gz). Both TF checkpoints and PyTorch binaries are included in the archive. Alternatively, you can also download it from [Hugging Face](https://huggingface.co/ai4bharat/indic-bert).

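If you work from the extracted archive instead of the Hub, `transformers` can convert the TF checkpoint to PyTorch at load time. A sketch, assuming the archive was unpacked to a hypothetical local directory `./indic-bert-v1` and that TensorFlow is installed for the conversion:

```
from transformers import AlbertConfig, AlbertModel

# "./indic-bert-v1" is a hypothetical path to the unpacked archive.
config = AlbertConfig.from_pretrained("./indic-bert-v1/config.json")

# from_tf=True loads the TensorFlow checkpoint (tf_model.ckpt.*) and
# converts it to PyTorch weights; this requires TensorFlow installed.
model = AlbertModel.from_pretrained("./indic-bert-v1/tf_model.ckpt.index",
                                    from_tf=True, config=config)
```
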
## Citing

If you are using any of the resources, please cite the following article:

```
@inproceedings{kakwani2020indicnlpsuite,
    title={{IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages}},
    author={Divyanshu Kakwani and Anoop Kunchukuttan and Satish Golla and Gokul N.C. and Avik Bhattacharyya and Mitesh M. Khapra and Pratyush Kumar},
    year={2020},
    booktitle={Findings of EMNLP}
}
```

We would like to hear from you if:

- You are using our resources. Please let us know how you are putting these resources to use.
- You have any feedback on these resources.

## License

The IndicBERT code (and models) are released under the MIT License.

## Contributors

- Divyanshu Kakwani
- Anoop Kunchukuttan
- Gokul NC
- Satish Golla
- Avik Bhattacharyya
- Mitesh Khapra
- Pratyush Kumar

This work is the outcome of a volunteer effort as part of the [AI4Bharat initiative](https://ai4bharat.org).

## Contact

- Anoop Kunchukuttan ([anoop.kunchukuttan@gmail.com](mailto:anoop.kunchukuttan@gmail.com))
- Mitesh Khapra ([miteshk@cse.iitm.ac.in](mailto:miteshk@cse.iitm.ac.in))
- Pratyush Kumar ([pratyush@cse.iitm.ac.in](mailto:pratyush@cse.iitm.ac.in))
config.json
ADDED
@@ -0,0 +1,21 @@
{
  "model_type": "albert",
  "attention_probs_dropout_prob": 0,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0,
  "embedding_size": 128,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_hidden_groups": 1,
  "net_structure_type": 0,
  "gap_size": 0,
  "num_memory_blocks": 0,
  "inner_group_num": 1,
  "down_scale_factor": 1,
  "type_vocab_size": 2,
  "vocab_size": 200000
}
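The configuration above describes an ALBERT-base-style encoder (12 layers, hidden size 768, a single cross-layer parameter group) with a 200k-piece vocabulary. A small sketch, assuming `transformers` is installed, of inspecting it programmatically:

```
from transformers import AlbertConfig

config = AlbertConfig.from_pretrained("ai4bharat/indic-bert")
print(config.num_hidden_layers)  # 12
print(config.hidden_size)        # 768
print(config.vocab_size)         # 200000, shared across the 12 languages
```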
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:94c747585b0126ba2886423f69844e03cb7a4f198f6e15c42afedbccbf80a138
size 134982446
spiece.model
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3a1173c2b6e144a02c001e289a05b5dbefddf247c50d4dcf42633158b2968fcb
size 5646064
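`spiece.model` is the SentencePiece model backing the tokenizer. A small sketch, assuming the `sentencepiece` package and a local copy of the file, of inspecting it directly:

```
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("spiece.model")  # path to a local copy of this file
print(sp.GetPieceSize())                          # vocabulary size
print(sp.EncodeAsPieces("यह एक उदाहरण वाक्य है"))  # subword segmentation
```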
spiece.vocab
ADDED
The diff for this file is too large to render.
tf_model.ckpt.data-00000-of-00001
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:51a52801f044b66e11ce318b2d20463b1ef0a63723cba2228677306bfa65aa8e
size 400182536
tf_model.ckpt.index
ADDED
Binary file (1.87 kB).
tf_model.ckpt.meta
ADDED
Binary file (2.2 MB).