metadata

language:
  - as
  - bn
  - gu
  - hi
  - kn
  - ml
  - mr
  - or
  - pa
  - ta
  - te
license: mit
datasets:
  - Samanantar
tags:
  - ner
  - Pytorch
  - transformer
  - multilingual
  - nlp
  - indicnlp

IndicNER

IndicNER is a model trained to identify named entities in sentences written in Indian languages. It is fine-tuned on millions of sentences across the 11 Indian languages listed above, and is benchmarked on a human-annotated test set as well as several other publicly available Indian NER datasets. The 11 languages covered by IndicNER are: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu. The link to our GitHub repository containing all our code can be found here. The link to our paper can be found here.

Training Corpus

Our model was trained on an NER dataset that we mined from the existing Samanantar Corpus. We used a bert-base-multilingual-uncased model as the starting point and fine-tuned it on this mined dataset.
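Fine-tuning a word-piece model such as bert-base-multilingual-uncased for NER usually means spreading word-level tags over sub-word tokens. The sketch below shows one common convention: label the first sub-token of each word and mask the rest with -100. It is an illustration only and not necessarily the procedure used to train IndicNER. The `word_ids` list imitates what a Hugging Face fast tokenizer returns from `word_ids()`, and the label ids in the example are hypothetical.

```python
from typing import List, Optional

# -100 is the default ignore_index of PyTorch's cross-entropy loss,
# so positions carrying it do not contribute to the training loss.
IGNORE_INDEX = -100


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Map word-level label ids onto sub-word token positions.

    word_ids[i] is the index of the word that token i belongs to,
    or None for special tokens such as [CLS] and [SEP].
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            aligned.append(IGNORE_INDEX)        # special token
        elif wid != previous:
            aligned.append(word_labels[wid])    # first sub-token of a word
        else:
            aligned.append(IGNORE_INDEX)        # continuation sub-token
        previous = wid
    return aligned


# Hypothetical sentence of 3 words; the second word is split into 2 sub-tokens.
example_word_ids = [None, 0, 1, 1, 2, None]
example_word_labels = [1, 0, 3]  # hypothetical label ids
print(align_labels(example_word_ids, example_word_labels))
```

Masking continuation sub-tokens keeps the loss at one prediction per word, which matches how word-level NER annotations are normally scored.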

Evaluation Results

Benchmarking on our test set:

Language	bn	hi	kn	ml	mr	gu	ta	te	as	or	pa
F1 score	79.75	82.33	80.01	80.73	80.51	73.82	80.98	80.88	62.50	27.05	74.88

The first 5 languages (bn, hi, kn, ml, mr) have large human-annotated test sets of around 500-1000 sentences each. The next 3 (gu, ta, te) have smaller human-annotated test sets of only around 50 sentences each. The final 3 languages (as, or, pa) have mined, projected test sets that were not supervised by humans.
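For readers who want to compute a comparable number on their own data, here is a minimal sketch of entity-level micro F1 over BIO-tagged sentences. It assumes exact-match scoring of (span, type) pairs, which is a common convention for NER; the card does not state which F1 variant produced the table above, and the PER/LOC tags in the example are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence."""
    spans = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" flushes the last span
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((start, i, label))
        if tag.startswith("B-") or tag.startswith("I-"):
            # A stray I- tag is treated leniently as the start of a new span.
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exactly matching (span, type) pairs."""
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(gold, pred):
        gold_spans, pred_spans = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
        tp += len(gold_spans & pred_spans)
        fp += len(pred_spans - gold_spans)
        fn += len(gold_spans - pred_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one of two gold entities is recovered.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
print(round(100 * entity_f1(gold, pred), 2))
```

With test sets of only around 50 sentences, a handful of missed entities can move this score by several points, so the smaller test sets should be read with that variance in mind.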

Downloads

The model can be downloaded from this same Hugging Face repository.
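A minimal sketch of loading the model and tagging a pre-tokenized sentence with the `transformers` library is shown below. The repository id `ai4bharat/IndicNER` is an assumption (the card only says "this same repo"), the helper names are hypothetical, and the code assumes the checkpoint exposes a token-classification head, a fast tokenizer, and an `id2label` mapping in its config.

```python
from typing import Dict, List, Optional


def first_subtoken_tags(word_ids: List[Optional[int]],
                        pred_ids: List[int],
                        id2label: Dict[int, str]) -> List[str]:
    """Keep one predicted tag per word: the tag of its first sub-token."""
    tags, seen = [], set()
    for wid, pid in zip(word_ids, pred_ids):
        if wid is not None and wid not in seen:
            seen.add(wid)
            tags.append(id2label[pid])
    return tags


def load_indicner(repo_id: str = "ai4bharat/IndicNER"):
    """Download tokenizer and model; repo_id is an assumed identifier."""
    from transformers import AutoModelForTokenClassification, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForTokenClassification.from_pretrained(repo_id)
    model.eval()
    return tokenizer, model


def tag_words(words: List[str], tokenizer, model) -> List[str]:
    """Return one NER tag per input word (fewer if the input is truncated)."""
    import torch
    enc = tokenizer(words, is_split_into_words=True,
                    return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits[0]          # shape: (tokens, labels)
    pred_ids = logits.argmax(dim=-1).tolist()
    # word_ids() is only available on fast tokenizers.
    return first_subtoken_tags(enc.word_ids(batch_index=0),
                               pred_ids, model.config.id2label)


# Usage (requires network access plus the transformers and torch packages):
# tokenizer, model = load_indicner()
# print(tag_words("लगातार हमलावर हो रहे शिवपाल".split(), tokenizer, model))
```

Taking the first sub-token's prediction is one of several reasonable ways to reduce sub-word predictions to word-level tags; majority voting over a word's sub-tokens is another.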

Citing

If you are using IndicNER, please cite the following article:

@misc{mhaske2022indicner,
      title={Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages},
      author={Arnav Mhaske and Harshit Kedia and Rudramurthy. V and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Khapra},
      year={2022},
      eprint={to be published soon},
}

We would like to hear from you if:

You are using our resources. Please let us know how you are putting these resources to use.
You have any feedback on these resources.

License

The IndicNER code (and models) are released under the MIT License.

Contributors

Arnav Mhaske (AI4Bharat, IITM)
Harshit Kedia (AI4Bharat, IITM)
Anoop Kunchukuttan (AI4Bharat, Microsoft)
Rudra Murthy (AI4Bharat, IBM)
Pratyush Kumar (AI4Bharat, Microsoft, IITM)
Mitesh M. Khapra (AI4Bharat, IITM)

This work is the outcome of a volunteer effort as part of the AI4Bharat initiative.

Contact

Anoop Kunchukuttan (anoop.kunchukuttan@gmail.com)
Mitesh Khapra (miteshk@cse.iitm.ac.in)
Pratyush Kumar (pratyush@cse.iitm.ac.in)