---
language:
- as
- bn
- gu
- hi
- kn
- ml
- mr
- or
- pa
- ta
- te
license: mit
datasets:
- Samanantar
tags:
- ner
- Pytorch
- transformer
- multilingual
- nlp
- indicnlp
---

# IndicNER

IndicNER is a model trained to identify named entities in sentences written in Indian languages. It is fine-tuned on millions of sentences in the 11 Indian languages listed above, and benchmarked on a human-annotated test set as well as several other publicly available Indian NER datasets.

The 11 languages covered by IndicNER are: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu.

The link to our GitHub repository containing all our code can be found [here](https://github.com/AI4Bharat/indicner). The link to our paper can be found here.

## Training Corpus

Our model was trained on a [dataset](https://huggingface.co/datasets/ai4bharat/naamapadam) which we mined from the existing [Samanantar Corpus](https://huggingface.co/datasets/ai4bharat/samanantar). We used a bert-base-multilingual-uncased model as the starting point and then fine-tuned it on the NER dataset mentioned above.

## Evaluation Results

Benchmarking on our test set.

Language | bn | hi | kn | ml | mr | gu | ta | te | as | or | pa
-----| ----- | ----- | ------ | -----| ----- | ----- | ------ | -----| ----- | ----- | ------
F1 score | 79.75 | 82.33 | 80.01 | 80.73 | 80.51 | 73.82 | 80.98 | 80.88 | 62.50 | 27.05 | 74.88

The first 5 languages (bn, hi, kn, ml, mr) have large human-annotated test sets of around 500-1000 sentences each. The next 3 (gu, ta, te) have smaller human-annotated test sets of only around 50 sentences each. The final 3 (as, or, pa) have mined, projected test sets that were not supervised by humans.

## Downloads

Download from this same Hugging Face repo.

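The snippet below is a minimal sketch of how the downloaded model might be used through the `transformers` library; it is not taken from the IndicNER codebase. The repository identifier `ai4bharat/IndicNER` is an assumption (replace it with the identifier of the repo hosting this card), and `load_ner_pipeline` and `format_entities` are hypothetical helper names introduced only for illustration.

```python
from typing import Any, Dict, List, Tuple

# Assumption: the Hub identifier of this repository. Adjust if it differs.
MODEL_ID = "ai4bharat/IndicNER"


def load_ner_pipeline(model_id: str = MODEL_ID):
    """Build a token-classification pipeline for the given model.

    The import is local so that the rest of this module can be used
    without `transformers` installed. Calling this function downloads
    the model weights on first use.
    """
    from transformers import pipeline

    # aggregation_strategy="simple" merges word pieces that belong to the
    # same entity into a single span with an `entity_group` label.
    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",
    )


def format_entities(entities: List[Dict[str, Any]]) -> List[Tuple[str, str]]:
    """Reduce aggregated pipeline output to (text, label) pairs."""
    return [(entity["word"], entity["entity_group"]) for entity in entities]
```

With the sketch above, `format_entities(load_ner_pipeline()(sentence))` would return the detected entity strings paired with their labels for a sentence in any of the 11 supported languages. The label names depend on the model's configuration and are not assumed here.
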
## Citing

If you are using IndicNER, please cite the following article:

```
@misc{mhaske2022indicner,
    title={Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages},
    author={Arnav Mhaske and Harshit Kedia and Rudramurthy. V and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Khapra},
    year={2022},
    eprint={to be published soon},
}
```

We would like to hear from you if:

- You are using our resources. Please let us know how you are putting these resources to use.
- You have any feedback on these resources.

## License

The IndicNER code (and models) are released under the MIT License.

## Contributors

- Arnav Mhaske ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in))
- Harshit Kedia ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in))
- Anoop Kunchukuttan ([AI4Bharat](https://ai4bharat.org), [Microsoft](https://www.microsoft.com/en-in/))
- Rudra Murthy ([AI4Bharat](https://ai4bharat.org), [IBM](https://www.ibm.com))
- Pratyush Kumar ([AI4Bharat](https://ai4bharat.org), [Microsoft](https://www.microsoft.com/en-in/), [IITM](https://www.iitm.ac.in))
- Mitesh M. Khapra ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in))

This work is the outcome of a volunteer effort as part of the [AI4Bharat initiative](https://ai4bharat.org).

## Contact

- Anoop Kunchukuttan ([anoop.kunchukuttan@gmail.com](mailto:anoop.kunchukuttan@gmail.com))
- Mitesh Khapra ([miteshk@cse.iitm.ac.in](mailto:miteshk@cse.iitm.ac.in))
- Pratyush Kumar ([pratyush@cse.iitm.ac.in](mailto:pratyush@cse.iitm.ac.in))