---
language: en
license: mit
datasets:
- AI4Bharat IndicNLP Corpora
---

# IndicBERT

IndicBERT is a multilingual ALBERT model pretrained exclusively on 12 major Indian languages. It is pretrained on our novel monolingual corpus of around 9 billion tokens and subsequently evaluated on a set of diverse tasks. IndicBERT has far fewer parameters than other multilingual models (mBERT, XLM-R, etc.) while achieving performance on par with or better than these models.

The 12 languages covered by IndicBERT are: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu.

The code can be found [here](https://github.com/divkakwani/indic-bert). For more information, check out our [project page](https://indicnlp.ai4bharat.org/) or our [paper](https://indicnlp.ai4bharat.org/papers/arxiv2020_indicnlp_corpus.pdf).
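
The model can also be loaded directly from the Hugging Face Hub. Below is a minimal sketch (assuming the `transformers`, `torch`, and `sentencepiece` packages are installed; the example sentence is purely illustrative) that extracts contextual embeddings:

```python
# Minimal sketch: load IndicBERT (an ALBERT model) from the Hub and
# extract contextual embeddings for a sentence.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")
model = AutoModel.from_pretrained("ai4bharat/indic-bert")

# Encode an illustrative Hindi sentence and run a forward pass.
inputs = tokenizer("भारत मेरा देश है", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```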

## Pretraining Corpus

We pretrained IndicBERT on AI4Bharat's monolingual corpus. The corpus has the following distribution of languages:

| Language          | as     | bn     | en     | gu     | hi     | kn     |         |
| ----------------- | ------ | ------ | ------ | ------ | ------ | ------ | ------- |
| **No. of Tokens** | 36.9M  | 815M   | 1.34B  | 724M   | 1.84B  | 712M   |         |
| **Language**      | **ml** | **mr** | **or** | **pa** | **ta** | **te** | **all** |
| **No. of Tokens** | 767M   | 560M   | 104M   | 814M   | 549M   | 671M   | 8.9B    |

## Evaluation Results

IndicBERT is evaluated on IndicGLUE and some additional tasks. The results are summarized below. For more details about the tasks, refer to our [official repo](https://github.com/divkakwani/indic-bert).

#### IndicGLUE

Task | mBERT | XLM-R | IndicBERT
-----| ----- | ----- | ------
News Article Headline Prediction | 89.58 | 95.52 | **95.87**
Wikipedia Section Title Prediction | **73.66** | 66.33 | 73.31
Cloze-style multiple-choice QA | 39.16 | 27.98 | **41.87**
Article Genre Classification | 90.63 | 97.03 | **97.34**
Named Entity Recognition (F1-score) | **73.24** | 65.93 | 64.47
Cross-Lingual Sentence Retrieval Task | 21.46 | 13.74 | **27.12**
Average | 64.62 | 61.09 | **66.66**

#### Additional Tasks

Task | Task Type | mBERT | XLM-R | IndicBERT
-----| ----- | ----- | ------ | -----
BBC News Classification | Genre Classification | 60.55 | **75.52** | 74.60
IIT Product Reviews | Sentiment Analysis | 74.57 | **78.97** | 71.32
IITP Movie Reviews | Sentiment Analysis | 56.77 | **61.61** | 59.03
Soham News Article | Genre Classification | 80.23 | **87.6** | 78.45
Midas Discourse | Discourse Analysis | 71.20 | **79.94** | 78.44
iNLTK Headlines Classification | Genre Classification | 87.95 | 93.38 | **94.52**
ACTSA Sentiment Analysis | Sentiment Analysis | 48.53 | 59.33 | **61.18**
Winograd NLI | Natural Language Inference | 56.34 | 55.87 | **56.34**
Choice of Plausible Alternative (COPA) | Natural Language Inference | 54.92 | 51.13 | **58.33**
Amrita Exact Paraphrase | Paraphrase Detection | **93.81** | 93.02 | 93.75
Amrita Rough Paraphrase | Paraphrase Detection | 83.38 | 82.20 | **84.33**
Average | | 69.84 | **74.42** | 73.66

\* Note: all models have been restricted to a max_seq_length of 128.
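
Most of the tasks above are sentence- or document-level classification problems. The following is a hedged sketch, not the authors' evaluation code, of how one might set IndicBERT up for such a task with `transformers`, keeping the max_seq_length of 128 noted above; the label count and input text are placeholders:

```python
# Hedged sketch: IndicBERT with a sequence-classification head, using the
# max_seq_length of 128 from the evaluation note. The number of labels and
# the example headline are placeholders, not taken from the paper.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")
model = AutoModelForSequenceClassification.from_pretrained(
    "ai4bharat/indic-bert", num_labels=3  # placeholder: e.g. three news genres
)

batch = tokenizer(
    ["உதாரண தலைப்பு"],  # placeholder Tamil headline
    truncation=True,
    max_length=128,      # matches the evaluation setting above
    padding="max_length",
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**batch).logits
print(logits.argmax(dim=-1))  # predicted class per input
```

The classification head is newly initialized, so the model would still need fine-tuning on task data before its predictions are meaningful.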

## Downloads

The model can be downloaded [here](https://storage.googleapis.com/ai4bharat-public-indic-nlp-corpora/models/indic-bert-v1.tar.gz). Both TensorFlow checkpoints and PyTorch binaries are included in the archive. Alternatively, you can download it from [Hugging Face](https://huggingface.co/ai4bharat/indic-bert).
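
For scripted downloads, a minimal sketch using only the Python standard library is shown below; the URL is the one above, and the local paths are illustrative:

```python
# Sketch: download and unpack the released archive. Local file and directory
# names are illustrative, not prescribed by the release.
import tarfile
import urllib.request

url = (
    "https://storage.googleapis.com/ai4bharat-public-indic-nlp-corpora"
    "/models/indic-bert-v1.tar.gz"
)
archive_path, _ = urllib.request.urlretrieve(url, "indic-bert-v1.tar.gz")

with tarfile.open(archive_path, "r:gz") as tar:
    # The archive contains both TF checkpoints and PyTorch binaries.
    tar.extractall("indic-bert-v1")
```

Loading via `from_pretrained("ai4bharat/indic-bert")`, as shown earlier, downloads and caches the weights automatically.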

## Citing

If you are using any of the resources, please cite the following article:

```
@inproceedings{kakwani2020indicnlpsuite,
    title={{IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages}},
    author={Divyanshu Kakwani and Anoop Kunchukuttan and Satish Golla and Gokul N.C. and Avik Bhattacharyya and Mitesh M. Khapra and Pratyush Kumar},
    year={2020},
    booktitle={Findings of EMNLP},
}
```

We would like to hear from you if:

- You are using our resources. Please let us know how you are putting these resources to use.
- You have any feedback on these resources.

## License

The IndicBERT code and models are released under the MIT License.

## Contributors

- Divyanshu Kakwani
- Anoop Kunchukuttan
- Gokul NC
- Satish Golla
- Avik Bhattacharyya
- Mitesh Khapra
- Pratyush Kumar

This work is the outcome of a volunteer effort as part of the [AI4Bharat initiative](https://ai4bharat.org).

## Contact

- Anoop Kunchukuttan ([anoop.kunchukuttan@gmail.com](mailto:anoop.kunchukuttan@gmail.com))
- Mitesh Khapra ([miteshk@cse.iitm.ac.in](mailto:miteshk@cse.iitm.ac.in))
- Pratyush Kumar ([pratyush@cse.iitm.ac.in](mailto:pratyush@cse.iitm.ac.in))