---
language:
- as
- bn
- brx
- doi
- en
- gom
- gu
- hi
- kn
- ks
- kas
- mai
- ml
- mr
- mni
- mnb
- ne
- or
- pa
- sa
- sat
- sd
- snd
- ta
- te
- ur
language_details: >-
  asm_Beng, ben_Beng, brx_Deva, doi_Deva, eng_Latn, gom_Deva, guj_Gujr,
  hin_Deva, kan_Knda, kas_Arab, kas_Deva, mai_Deva, mal_Mlym, mar_Deva,
  mni_Beng, mni_Mtei, npi_Deva, ory_Orya, pan_Guru, san_Deva, sat_Olck,
  snd_Arab, snd_Deva, tam_Taml, tel_Telu, urd_Arab
tags:
- indicbert2
- ai4bharat
- multilingual
license: mit
metrics:
- accuracy
pipeline_tag: fill-mask
---
# IndicBERT
IndicBERT is a multilingual language model trained on IndicCorp v2 and evaluated on the IndicXTREME benchmark. The model has 278M parameters and covers 23 Indic languages and English. The models are trained with various objectives and datasets; the list of models is as follows (a minimal usage example follows the list):

- IndicBERT-MLM [[Model](https://huggingface.co/ai4bharat/IndicBERTv2-MLM-only)] - a vanilla BERT-style model trained on IndicCorp v2 with the MLM objective
- +Samanantar [[Model](https://huggingface.co/ai4bharat/IndicBERTv2-MLM-Sam-TLM)] - adds TLM as an additional objective, using the Samanantar parallel corpus [[Paper](https://aclanthology.org/2022.tacl-1.9)] | [[Dataset](https://huggingface.co/datasets/ai4bharat/samanantar)]
- +Back-Translation [[Model](https://huggingface.co/ai4bharat/IndicBERTv2-MLM-Back-TLM)] - adds TLM as an additional objective, translating the Indic portions of the IndicCorp v2 dataset into English with the IndicTrans model [[Model](https://github.com/AI4Bharat/indicTrans#download-model)]
- IndicBERT-SS [[Model](https://huggingface.co/ai4bharat/IndicBERTv2-SS)] - to encourage better lexical sharing among languages, we convert the scripts of the Indic languages to Devanagari and train a BERT-style model with the MLM objective

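All of the checkpoints above are masked language models, so they can be queried directly with the `transformers` fill-mask pipeline. The snippet below is a minimal sketch, not part of the repository's scripts; the example sentence is purely illustrative, and the mask token is read from the model's tokenizer rather than hardcoded.

```python
# Minimal fill-mask sketch for the MLM-only checkpoint listed above.
# Assumes `transformers` is installed and the checkpoint can be fetched from the Hub.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="ai4bharat/IndicBERTv2-MLM-only")

# Use the tokenizer's own mask token instead of assuming a literal "[MASK]".
mask = fill_mask.tokenizer.mask_token

# Illustrative prompt; any of the supported languages can be used.
for pred in fill_mask(f"New Delhi is the {mask} of India."):
    print(pred["token_str"], round(pred["score"], 3))
```
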
## Run Fine-tuning
The fine-tuning scripts are based on the transformers library. Create a new conda environment and set it up as follows:
```shell
conda create -n finetuning python=3.9
conda activate finetuning
pip install -r requirements.txt
```

All the tasks follow the same structure; please check the individual files for detailed hyper-parameter choices (a rough, illustrative sketch of what such a script looks like is given at the end of this section). The following command runs fine-tuning for a task:
```shell
python IndicBERT/fine-tuning/$TASK_NAME/$TASK_NAME.py \
    --model_name_or_path=$MODEL_NAME \
    --do_train
```
Arguments:
- MODEL_NAME: the model to fine-tune; can be a local path or a model from the [HuggingFace Model Hub](https://huggingface.co/models)
- TASK_NAME: one of [`ner, paraphrase, qa, sentiment, xcopa, xnli, flores`]

> For the MASSIVE task, please follow the instructions provided in the [official repository](https://github.com/alexa/massive)

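Since the fine-tuning scripts are based on the transformers library, the sketch below shows what a minimal Trainer-based run could look like for a two-label sentiment-style task. Everything in it (the toy in-memory dataset, the label count, and the hyper-parameters) is illustrative and not taken from the repository's scripts.

```python
# Illustrative sketch only: fine-tunes the MLM-only checkpoint on a toy
# two-label classification set with the standard transformers Trainer.
# The dataset, labels, and hyper-parameters are placeholders, not the
# settings used by the official IndicBERT/fine-tuning scripts.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "ai4bharat/IndicBERTv2-MLM-only"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tiny in-memory dataset standing in for a real sentiment corpus.
train_data = Dataset.from_dict({
    "text": ["यह फिल्म शानदार थी", "This movie was terrible"],
    "label": [1, 0],
})

def tokenize(batch):
    # Tokenize the raw text; unused columns are dropped by the Trainer.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_data = train_data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="indicbert-sentiment-demo",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    logging_steps=1,
)

trainer = Trainer(model=model, args=args, train_dataset=train_data)
trainer.train()
```

See the individual `$TASK_NAME.py` files in `IndicBERT/fine-tuning/` for the actual data loading, evaluation, and hyper-parameter choices.
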
## Citation

```
@inproceedings{doddapaneni-etal-2023-towards,
    title = "Towards Leaving No {I}ndic Language Behind: Building Monolingual Corpora, Benchmark and Models for {I}ndic Languages",
    author = "Doddapaneni, Sumanth and
      Aralikatte, Rahul and
      Ramesh, Gowtham and
      Goyal, Shreya and
      Khapra, Mitesh M. and
      Kunchukuttan, Anoop and
      Kumar, Pratyush",
    editor = "Rogers, Anna and
      Boyd-Graber, Jordan and
      Okazaki, Naoaki",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.693",
    doi = "10.18653/v1/2023.acl-long.693",
    pages = "12402--12426",
    abstract = "Building Natural Language Understanding (NLU) capabilities for Indic languages, which have a collective speaker base of more than one billion speakers is absolutely crucial. In this work, we aim to improve the NLU capabilities of Indic languages by making contributions along 3 important axes (i) monolingual corpora (ii) NLU testsets (iii) multilingual LLMs focusing on Indic languages. Specifically, we curate the largest monolingual corpora, IndicCorp, with 20.9B tokens covering 24 languages from 4 language families - a 2.3x increase over prior work, while supporting 12 additional languages. Next, we create a human-supervised benchmark, IndicXTREME, consisting of nine diverse NLU tasks covering 20 languages. Across languages and tasks, IndicXTREME contains a total of 105 evaluation sets, of which 52 are new contributions to the literature. To the best of our knowledge, this is the first effort towards creating a standard benchmark for Indic languages that aims to test the multilingual zero-shot capabilities of pretrained language models. Finally, we train IndicBERT v2, a state-of-the-art model supporting all the languages. Averaged across languages and tasks, the model achieves an absolute improvement of 2 points over a strong baseline. The data and models are available at \url{https://github.com/AI4Bharat/IndicBERT}.",
}
```