---
language:
- as
- bn
- brx
- doi
- en
- gom
- gu
- hi
- kn
- ks
- kas
- mai
- ml
- mr
- mni
- mnb
- ne
- or
- pa
- sa
- sat
- sd
- snd
- ta
- te
- ur
language_details: >-
  asm_Beng, ben_Beng, brx_Deva, doi_Deva, eng_Latn, gom_Deva, guj_Gujr,
  hin_Deva, kan_Knda, kas_Arab, kas_Deva, mai_Deva, mal_Mlym, mar_Deva,
  mni_Beng, mni_Mtei, npi_Deva, ory_Orya, pan_Guru, san_Deva, sat_Olck,
  snd_Arab, snd_Deva, tam_Taml, tel_Telu, urd_Arab
tags:
- indicbert2
- ai4bharat
- multilingual
license: mit
metrics:
- accuracy
pipeline_tag: fill-mask
---
# IndicBERT

A multilingual language model trained on IndicCorp v2 and evaluated on the IndicXTREME benchmark. The model has 278M parameters and is available in 23 Indic languages and English. The models are trained with various objectives and datasets. The list of models is as follows:

- IndicBERT-MLM [[Model](https://huggingface.co/ai4bharat/IndicBERTv2-MLM-only)] - A vanilla BERT-style model trained on IndicCorp v2 with the MLM objective
- +Samanantar [[Model](https://huggingface.co/ai4bharat/IndicBERTv2-MLM-Sam-TLM)] - TLM as an additional objective with the Samanantar parallel corpus [[Paper](https://aclanthology.org/2022.tacl-1.9)] | [[Dataset](https://huggingface.co/datasets/ai4bharat/samanantar)]
- +Back-Translation [[Model](https://huggingface.co/ai4bharat/IndicBERTv2-MLM-Back-TLM)] - TLM as an additional objective, with the Indic parts of the IndicCorp v2 dataset translated into English using the IndicTrans model [[Model](https://github.com/AI4Bharat/indicTrans#download-model)]
- IndicBERT-SS [[Model](https://huggingface.co/ai4bharat/IndicBERTv2-SS)] - To encourage better lexical sharing among languages, we convert the scripts from Indic languages to Devanagari and train a BERT-style model with the MLM objective
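
Since these are masked language models, the checkpoints above can be queried directly through the transformers `fill-mask` pipeline. A minimal sketch, assuming the IndicBERT-MLM checkpoint from the list above (the example sentence is illustrative):

```python
from transformers import pipeline

# Load the MLM-only checkpoint; the other checkpoints above can be swapped in.
fill_mask = pipeline("fill-mask", model="ai4bharat/IndicBERTv2-MLM-only")

# Query the model's own mask token rather than hard-coding one.
mask = fill_mask.tokenizer.mask_token
print(fill_mask(f"New Delhi is the {mask} of India."))
```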

## Run Fine-tuning

Fine-tuning scripts are based on the transformers library. Create a new conda environment and set it up as follows:
```shell
conda create -n finetuning python=3.9
conda activate finetuning
pip install -r requirements.txt
```

All the tasks follow the same structure; please check the individual files for detailed hyper-parameter choices. The following command runs the fine-tuning for a task:
```shell
# Example values: TASK_NAME=ner, MODEL_NAME=ai4bharat/IndicBERTv2-MLM-only
python IndicBERT/fine-tuning/$TASK_NAME/$TASK_NAME.py \
  --model_name_or_path=$MODEL_NAME \
  --do_train
```
Arguments:
- MODEL_NAME: name of the model to fine-tune; can be a local path or a model from the [HuggingFace Model Hub](https://huggingface.co/models)
- TASK_NAME: one of `ner`, `paraphrase`, `qa`, `sentiment`, `xcopa`, `xnli`, `flores`
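
A minimal Trainer-based sketch of what such a fine-tuning run looks like; the dataset (`xnli`), label count, and hyper-parameters below are illustrative placeholders, not the values used in the actual task files:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "ai4bharat/IndicBERTv2-MLM-only"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# NLI-style classification: entailment / neutral / contradiction.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Stand-in dataset; the real scripts load the task-specific data.
dataset = load_dataset("xnli", "hi")

def preprocess(batch):
    return tokenizer(batch["premise"], batch["hypothesis"], truncation=True)

encoded = dataset.map(preprocess, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="outputs"),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```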

> For the MASSIVE task, please follow the instructions provided in the [official repository](https://github.com/alexa/massive).