	---
	language:
	- as
	- bn
	- gu
	- hi
	- kn
	- ml
	- mr
	- or
	- pa
	- ta
	- te
	license: mit
	datasets:
	- Samanantar
	tags:
	- ner
	- Pytorch
	- transformer
	- multilingual
	- nlp
	- indicnlp
	---

	# IndicNER
	IndicNER is a model trained to identify named entities in sentences written in Indian languages. It is fine-tuned on millions of sentences in the 11 Indian languages listed above, and benchmarked on a human-annotated test set as well as several other publicly available Indian NER datasets.
	The 11 languages covered by IndicNER are: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu.
	All our code is available in our [GitHub repository](https://github.com/AI4Bharat/indicner). The link to our paper can be found here.

	## Training Corpus
	Our model was trained on the [Naamapadam dataset](https://huggingface.co/datasets/ai4bharat/naamapadam), which we mined from the existing [Samanantar Corpus](https://huggingface.co/datasets/ai4bharat/samanantar). We used a `bert-base-multilingual-uncased` model as the starting point and fine-tuned it on this NER dataset.
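
NER datasets are commonly annotated with one tag per token. The sketch below shows how such per-token tags could be grouped into entity spans. It assumes a BIO scheme with `PER` and `LOC` style labels; this README does not state the tag set, so treat the labels, the helper name `bio_to_spans`, and the example sentence as illustrative assumptions rather than the project's own code.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []

    def flush():
        # Close the span that is currently open, if any.
        nonlocal current_type, current_tokens
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span.
            flush()
            if tag.startswith("I-"):
                # Treat a stray I- tag as the start of a new span.
                current_type, current_tokens = tag[2:], [token]
    flush()
    return spans


# Illustrative Hindi sentence with hand-written tags (assumed label names).
tokens = ["सचिन", "तेंदुलकर", "मुंबई", "में", "रहते", "हैं"]
tags = ["B-PER", "I-PER", "B-LOC", "O", "O", "O"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

The helper works on any language because it only inspects the tags, not the token text.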

	## Evaluation Results
	F1 scores on our test set:

	Language | bn | hi | kn | ml | mr | gu | ta | te | as | or | pa
	----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | -----
	F1 score | 79.75 | 82.33 | 80.01 | 80.73 | 80.51 | 73.82 | 80.98 | 80.88 | 62.50 | 27.05 | 74.88

	The first 5 languages (bn, hi, kn, ml, mr) have large human-annotated test sets consisting of around 500-1000 sentences. The next 3 (gu, ta, te) have smaller human-annotated test sets with only around 50 sentences. The final 3 languages (as, or, pa) have mined, projected test sets that were not supervised by humans.
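
To compare the three test-set groups at a glance, the scores in the table can be averaged per group. The snippet below is a small sketch of that arithmetic; the unweighted means are derived here from the table and are not figures reported by the authors, and the group names are our own labels.

```python
# F1 scores copied from the table above (language code -> F1).
f1_scores = {
    "bn": 79.75, "hi": 82.33, "kn": 80.01, "ml": 80.73, "mr": 80.51,
    "gu": 73.82, "ta": 80.98, "te": 80.88,
    "as": 62.50, "or": 27.05, "pa": 74.88,
}

# Test-set groups as described in the paragraph above (names are ours).
groups = {
    "large_human_annotated": ["bn", "hi", "kn", "ml", "mr"],
    "small_human_annotated": ["gu", "ta", "te"],
    "mined_projected": ["as", "or", "pa"],
}


def group_mean_f1(scores, members):
    """Unweighted mean F1 over the given language codes."""
    return sum(scores[lang] for lang in members) / len(members)


group_means = {
    name: round(group_mean_f1(f1_scores, langs), 2)
    for name, langs in groups.items()
}

for name, mean in group_means.items():
    print(f"{name}: {mean:.2f}")
```

Because the test sets differ greatly in size and in how they were built, these means are only a rough summary; the scores on the small and mined test sets are the least reliable.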


	## Downloads
	The model can be downloaded from this same Hugging Face repository.
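
One possible way to fetch and run the model is through the `transformers` library. The sketch below is not an official usage guide: the repository id `ai4bharat/IndicNER` is an assumption based on the links in this README, and the function name `load_indicner` is ours. The library is imported inside the function so that defining it needs neither `transformers` nor network access.

```python
# Assumed Hub repository id; adjust it if the model lives elsewhere.
DEFAULT_REPO_ID = "ai4bharat/IndicNER"


def load_indicner(repo_id=DEFAULT_REPO_ID):
    """Download the model and wrap it in a token-classification pipeline."""
    # Imported lazily: requires `pip install transformers torch`.
    from transformers import (
        AutoModelForTokenClassification,
        AutoTokenizer,
        pipeline,
    )

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForTokenClassification.from_pretrained(repo_id)
    # aggregation_strategy="simple" merges adjacent word pieces that
    # were assigned the same entity type into a single entity.
    return pipeline(
        "token-classification",
        model=model,
        tokenizer=tokenizer,
        aggregation_strategy="simple",
    )
```

Calling `ner = load_indicner()` downloads the weights on first use; `ner(sentence)` then returns a list of dictionaries with keys such as `entity_group`, `word`, `score`, `start` and `end`.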





	<!-- citing information -->
	## Citing

	If you are using IndicNER, please cite the following article:
	```
	@misc{mhaske2022indicner,
	title={Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages},
	author={Arnav Mhaske and Harshit Kedia and Rudramurthy. V and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Khapra},
	year={2022},
	eprint={to be published soon},
	}
	```
	We would like to hear from you if:

	- You are using our resources. Please let us know how you are putting these resources to use.
	- You have any feedback on these resources.


	<!-- License -->
	## License

	The IndicNER code (and models) are released under the MIT License.



	<!-- Contributors -->
	## Contributors
	- Arnav Mhaske <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>
	- Harshit Kedia <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>
	- Anoop Kunchukuttan <sub> ([AI4Bharat](https://ai4bharat.org), [Microsoft](https://www.microsoft.com/en-in/)) </sub>
	- Rudra Murthy <sub> ([AI4Bharat](https://ai4bharat.org), [IBM](https://www.ibm.com))</sub>
	- Pratyush Kumar <sub> ([AI4Bharat](https://ai4bharat.org), [Microsoft](https://www.microsoft.com/en-in/), [IITM](https://www.iitm.ac.in)) </sub>
	- Mitesh M. Khapra <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>

	This work is the outcome of a volunteer effort as part of the [AI4Bharat initiative](https://ai4bharat.org).


	<!-- Contact -->
	## Contact
	- Anoop Kunchukuttan ([anoop.kunchukuttan@gmail.com](mailto:anoop.kunchukuttan@gmail.com))
	- Mitesh Khapra ([miteshk@cse.iitm.ac.in](mailto:miteshk@cse.iitm.ac.in))
	- Pratyush Kumar ([pratyush@cse.iitm.ac.in](mailto:pratyush@cse.iitm.ac.in))