language:
- as
- bn
- gu
- hi
- kn
- ml
- mr
- or
- pa
- ta
- te
license: mit
datasets:
- Samanantar
tags:
- ner
- Pytorch
- transformer
- multilingual
- nlp
- indicnlp
IndicNER
IndicNER is a model trained to identify named entities in sentences written in Indian languages. It is fine-tuned on millions of sentences across the 11 Indian languages listed above, and is benchmarked on a human-annotated test set and on several other publicly available Indian NER datasets. The 11 languages covered by IndicNER are: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu. The link to our GitHub repository containing all our code can be found here. The link to our paper can be found here.
Training Corpus
Our model was trained on an NER dataset that we mined from the existing Samanantar corpus. We used bert-base-multilingual-uncased as the starting point and fine-tuned it on this dataset.
Evaluation Results
Benchmarking on our testset.
Language | bn | hi | kn | ml | mr | gu | ta | te | as | or | pa |
---|---|---|---|---|---|---|---|---|---|---|---|
F1 score | 79.75 | 82.33 | 80.01 | 80.73 | 80.51 | 73.82 | 80.98 | 80.88 | 62.50 | 27.05 | 74.88 |
The first five languages (bn, hi, kn, ml, mr) have large human-annotated test sets of around 500-1000 sentences. The next three (gu, ta, te) have smaller human-annotated test sets of only around 50 sentences. The final three (as, or, pa) have mined, projected test sets that were not supervised by humans.
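Because the three groups of test sets differ so much in size and quality, it can help to summarise the table per group rather than as one overall number. The sketch below takes the F1 scores from the table above and computes an unweighted mean per group. These averages are derived here for illustration and are not figures reported by the authors; the group names are hypothetical labels.

```python
from statistics import mean

# F1 scores copied from the table above.
F1 = {
    "bn": 79.75, "hi": 82.33, "kn": 80.01, "ml": 80.73, "mr": 80.51,
    "gu": 73.82, "ta": 80.98, "te": 80.88,
    "as": 62.50, "or": 27.05, "pa": 74.88,
}

# Grouping follows the description of the test sets; the names are illustrative.
GROUPS = {
    "large human-annotated": ["bn", "hi", "kn", "ml", "mr"],
    "small human-annotated": ["gu", "ta", "te"],
    "mined, not human-supervised": ["as", "or", "pa"],
}


def group_means(f1, groups):
    """Return the unweighted mean F1 per group, rounded to two decimals."""
    return {
        name: round(mean(f1[lang] for lang in langs), 2)
        for name, langs in groups.items()
    }


for name, value in group_means(F1, GROUPS).items():
    print(f"{name}: {value}")
```

The mean is unweighted because the exact number of sentences per language is not given, so a sentence-weighted average cannot be computed from this card alone.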
Downloads
Download the model from this same Hugging Face repository.
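A minimal sketch of loading the model with the Hugging Face `transformers` library follows. The repository id `ai4bharat/IndicNER` is an assumption, since this card does not state it; replace it with the id of the repository you are viewing. The helper names `load_indicner`, `format_entities`, and `demo` are hypothetical, and the sample sentence is only an illustration.

```python
# Assumed Hub repository id; it is not stated in this card.
MODEL_ID = "ai4bharat/IndicNER"


def load_indicner(model_id=MODEL_ID):
    """Build a token-classification pipeline for the given checkpoint."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import (
        AutoModelForTokenClassification,
        AutoTokenizer,
        pipeline,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(model_id)
    # aggregation_strategy="simple" groups adjacent sub-word tokens
    # that share a label into a single entity span.
    return pipeline(
        "token-classification",
        model=model,
        tokenizer=tokenizer,
        aggregation_strategy="simple",
    )


def format_entities(entities):
    """Convert pipeline output dicts into (text, label, score) tuples."""
    return [
        (e["word"], e["entity_group"], round(float(e["score"]), 3))
        for e in entities
    ]


def demo():
    """Download the model and tag one illustrative Hindi sentence."""
    ner = load_indicner()
    sentence = "सचिन तेंदुलकर मुंबई में रहते हैं।"
    for row in format_entities(ner(sentence)):
        print(row)
```

Calling `demo()` downloads the checkpoint on first use, so it needs network access and the `transformers` and `torch` packages.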
Citing
If you are using IndicNER, please cite the following article:
@misc{mhaske2022indicner,
title={Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages},
author={Arnav Mhaske and Harshit Kedia and Rudramurthy. V and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Khapra},
year={2022},
eprint={to be published soon},
}
We would like to hear from you if:
- You are using our resources. Please let us know how you are putting these resources to use.
- You have any feedback on these resources.
License
The IndicNER code (and models) are released under the MIT License.
Contributors
- Arnav Mhaske (AI4Bharat, IITM)
- Harshit Kedia (AI4Bharat, IITM)
- Anoop Kunchukuttan (AI4Bharat, Microsoft)
- Rudra Murthy (AI4Bharat, IBM)
- Pratyush Kumar (AI4Bharat, Microsoft, IITM)
- Mitesh M. Khapra (AI4Bharat, IITM)
This work is the outcome of a volunteer effort as part of the AI4Bharat initiative.
Contact
- Anoop Kunchukuttan (anoop.kunchukuttan@gmail.com)
- Mitesh Khapra (miteshk@cse.iitm.ac.in)
- Pratyush Kumar (pratyush@cse.iitm.ac.in)