IndicBERT

IndicBERT is a multilingual ALBERT model pretrained exclusively on 12 major Indian languages. It is pre-trained on our novel monolingual corpus of around 9 billion tokens and subsequently evaluated on a set of diverse tasks. IndicBERT has much fewer parameters than other multilingual models (mBERT, XLM-R etc.) while it also achieves a performance on-par or better than these models.

The 12 languages covered by IndicBERT are: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu.

The code can be found here. For more information, checkout our project page or our paper.

Pretraining Corpus

We pre-trained indic-bert on AI4Bharat's monolingual corpus. The corpus has the following distribution of languages:

Language	as	bn	en	gu	hi	kn
No. of Tokens	36.9M	815M	1.34B	724M	1.84B	712M
Language	ml	mr	or	pa	ta	te	all
No. of Tokens	767M	560M	104M	814M	549M	671M	8.9B

Evaluation Results

IndicBERT is evaluated on IndicGLUE and some additional tasks. The results are summarized below. For more details about the tasks, refer our official repo

IndicGLUE

Task	mBERT	XLM-R	IndicBERT
News Article Headline Prediction	89.58	95.52	95.87
Wikipedia Section Title Prediction	73.66	66.33	73.31
Cloze-style multiple-choice QA	39.16	27.98	41.87
Article Genre Classification	90.63	97.03	97.34
Named Entity Recognition (F1-score)	73.24	65.93	64.47
Cross-Lingual Sentence Retrieval Task	21.46	13.74	27.12
Average	64.62	61.09	66.66

Additional Tasks

Task	Task Type	mBERT	XLM-R	IndicBERT
BBC News Classification	Genre Classification	60.55	75.52	74.60
IIT Product Reviews	Sentiment Analysis	74.57	78.97	71.32
IITP Movie Reviews	Sentiment Analaysis	56.77	61.61	59.03
Soham News Article	Genre Classification	80.23	87.6	78.45
Midas Discourse	Discourse Analysis	71.20	79.94	78.44
iNLTK Headlines Classification	Genre Classification	87.95	93.38	94.52
ACTSA Sentiment Analysis	Sentiment Analysis	48.53	59.33	61.18
Winograd NLI	Natural Language Inference	56.34	55.87	56.34
Choice of Plausible Alternative (COPA)	Natural Language Inference	54.92	51.13	58.33
Amrita Exact Paraphrase	Paraphrase Detection	93.81	93.02	93.75
Amrita Rough Paraphrase	Paraphrase Detection	83.38	82.20	84.33
Average		69.84	74.42	73.66

* Note: all models have been restricted to a max_seq_length of 128.

Downloads

The model can be downloaded here. Both tf checkpoints and pytorch binaries are included in the archive. Alternatively, you can also download it from Huggingface.

Citing

If you are using any of the resources, please cite the following article:

@inproceedings{kakwani2020indicnlpsuite,
    title={{IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages}},
    author={Divyanshu Kakwani and Anoop Kunchukuttan and Satish Golla and Gokul N.C. and Avik Bhattacharyya and Mitesh M. Khapra and Pratyush Kumar},
    year={2020},
    booktitle={Findings of EMNLP},
}

We would like to hear from you if:

You are using our resources. Please let us know how you are putting these resources to use.
You have any feedback on these resources.

License

The IndicBERT code (and models) are released under the MIT License.

Contributors

Divyanshu Kakwani
Anoop Kunchukuttan
Gokul NC
Satish Golla
Avik Bhattacharyya
Mitesh Khapra
Pratyush Kumar

This work is the outcome of a volunteer effort as part of AI4Bharat initiative.

Contact

Anoop Kunchukuttan (anoop.kunchukuttan@gmail.com)
Mitesh Khapra (miteshk@cse.iitm.ac.in)
Pratyush Kumar (pratyush@cse.iitm.ac.in)