### As of May 2023, we recommend using the [IndicBERT](https://github.com/AI4Bharat/IndicBERT) repository
[IndicBERT](https://github.com/AI4Bharat/IndicBERT) is the new and improved implementation of BERT that supports fine-tuning with HuggingFace.
All the download links for IndicCorpv2, IndicXTREME and various IndicBERTv2 models are available [here](https://github.com/AI4Bharat/IndicBERT).
Indic-bert is a multilingual ALBERT model that exclusively covers 12 major Indian languages. It is pre-trained on our novel corpus of around 9 billion tokens and evaluated on a set of diverse tasks. Indic-bert has around 10x fewer parameters than other popular publicly available multilingual models, while achieving performance on par with or better than these models.
We also introduce IndicGLUE - a set of standard evaluation tasks that can be used to measure the NLU performance of monolingual and multilingual models on Indian languages. Along with IndicGLUE, we also compile a list of additional evaluation tasks. This repository contains code for running all these evaluation tasks on Indic-bert and other BERT-like models.
### Table of Contents
* [Introduction](#introduction)
* [Setting up the Code](#setting-up-the-code)
* [Running Experiments](#running-experiments)
* [Pretraining Corpus](#pretraining-corpus)
* [IndicGLUE](#iglue)
* [Additional Evaluation Tasks](#additional-evaluation-tasks)
* [Evaluation Results](#evaluation-results)
* [Downloads](#downloads)
* [Citing](#citing)
* [License](#license)
* [Contributors](#contributors)
* [Contact](#contact)
### Introduction
The Indic BERT model is based on the ALBERT model, a recent derivative of BERT. It is pre-trained on 12 Indian languages: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu.
The easiest way to use Indic BERT is through the HuggingFace transformers library. It can be loaded as follows:
```python
# pip3 install transformers
# pip3 install sentencepiece
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert')
model = AutoModel.from_pretrained('ai4bharat/indic-bert')
```
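As a quick sanity check, you can tokenize a sentence and inspect the encoder output. This is a minimal sketch that reuses the `tokenizer` and `model` loaded above; the example sentence is arbitrary:
```python
import torch

# Tokenize an example Hindi sentence and run it through the encoder.
inputs = tokenizer("मुझे हिंदी पसंद है", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Last-layer contextual embeddings: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```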
Note: To preserve accents (vowel matras / diacritics) during tokenization (see issue [#26](../../issues/26) for more details), use this:
```python
tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert', keep_accents=True)
```
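To see what `keep_accents` changes, a small illustrative comparison of the two tokenizers on the same input can help (the example text is arbitrary):
```python
from transformers import AutoTokenizer

default_tok = AutoTokenizer.from_pretrained('ai4bharat/indic-bert')
accent_tok = AutoTokenizer.from_pretrained('ai4bharat/indic-bert', keep_accents=True)

text = "मुझे हिंदी पसंद है"
print(default_tok.tokenize(text))  # matras may be stripped by the default normalization
print(accent_tok.tokenize(text))   # matras are preserved
```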
### Setting up the Code
The code can be run on GPU, TPU, or on Google's Colab platform. If you want to run it on Colab, you can simply use our fine-tuning notebook [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ai4bharat/indic-bert/blob/master/notebooks/finetuning.ipynb). To run it on your own VM, start by running the following commands:
```bash
git clone https://github.com/AI4Bharat/indic-bert
cd indic-bert
sudo pip3 install -r requirements.txt
```
By default, the installation will use GPU. For TPU support, first update your `.bashrc` with the following variables:
```bash
export PYTHONPATH="${PYTHONPATH}:/usr/share/tpu/models"
```
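Before launching fine-tuning, it can be useful to confirm that the expected accelerator is visible. A minimal check, assuming PyTorch is installed via `requirements.txt`:
```python
import torch

# Reports whether a CUDA GPU is visible to PyTorch; TPU runs instead rely on
# the environment variables configured in .bashrc above.
print("CUDA available:", torch.cuda.is_available())
```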