Updated Model Card

README.md CHANGED

@@ -24,25 +24,28 @@ tags:

---

# IndicNER

## Training Corpus

## Evaluation Results

## Downloads

-## Network and training details
-<!-- network and training details and link to the paper -->

<!-- citing information -->
@@ -70,7 +73,7 @@ The IndicNER code (and models) are released under the MIT License.

- Arnav Mhaske <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>
- Harshit Kedia <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>
- Anoop Kunchukuttan <sub> ([AI4Bharat](https://ai4bharat.org), [Microsoft](https://www.microsoft.com/en-in/)) </sub>
-- Rudra Murthy <sub> ([AI4Bharat](https://ai4bharat.org), [IBM]())</sub>
- Pratyush Kumar <sub> ([AI4Bharat](https://ai4bharat.org), [Microsoft](https://www.microsoft.com/en-in/), [IITM](https://www.iitm.ac.in)) </sub>
- Mitesh M. Khapra <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>


---

# IndicNER

IndicNER is a model trained to identify named entities in sentences in Indian languages. It is fine-tuned on millions of sentences covering the 11 Indian languages listed below, and it is benchmarked on a human-annotated test set as well as multiple other publicly available Indian NER datasets.

The 11 languages covered by IndicNER are: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu.

The link to our GitHub repository containing all of our code can be found [here](https://github.com/AI4Bharat/indicner). The link to our paper can be found here.

## Training Corpus

Our model was trained on a [dataset](https://huggingface.co/datasets/ai4bharat/IndicNER) that we mined from the existing [Samanantar Corpus](https://huggingface.co/datasets/ai4bharat/samanantar). We used a bert-base-multilingual-uncased model as the starting point and then fine-tuned it on this NER dataset.
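
As a rough illustration of this setup, the sketch below fine-tunes `bert-base-multilingual-uncased` for token classification with the Hugging Face `transformers` Trainer. The label set, dataset column names (`tokens`, `ner_tags`), splits, and hyperparameters are assumptions for the example, not the exact recipe behind IndicNER (see the GitHub repository for that).

```python
# Sketch only: fine-tune mBERT for NER on the IndicNER dataset.
# The label list, column names, splits, and hyperparameters below are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]  # assumed BIO tag set
base_model = "bert-base-multilingual-uncased"

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForTokenClassification.from_pretrained(base_model, num_labels=len(labels))

dataset = load_dataset("ai4bharat/IndicNER")  # assumed "tokens" / "ner_tags" columns and a "train" split

def tokenize_and_align(example):
    # Tokenize pre-split words and copy each word's tag to all of its sub-word tokens;
    # special tokens get -100 so they are ignored by the loss.
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    enc["labels"] = [
        example["ner_tags"][w] if w is not None else -100 for w in enc.word_ids()
    ]
    return enc

tokenized = dataset.map(tokenize_and_align)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="indicner-mbert", num_train_epochs=3),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```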

## Evaluation Results

Benchmarking on our test set:

| Language | bn | hi | kn | ml | mr | gu | ta | te | as | or | pa |
|----------|----|----|----|----|----|----|----|----|----|----|----|
| F1 score | 79.75 | 82.33 | 80.01 | 80.73 | 80.51 | 73.82 | 80.98 | 80.88 | 62.50 | 27.05 | 74.88 |

The first 5 languages (bn, hi, kn, ml, mr) have large human-annotated test sets of around 500-1000 sentences each. The next 3 (gu, ta, te) have smaller human-annotated test sets of only around 50 sentences. The final 3 (as, or, pa) have mined, projected test sets that were not verified by human annotators.
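
The scores above are entity-level F1. For reference, entity-level F1 over BIO tag sequences is commonly computed with the `seqeval` package, as sketched below on toy sequences; this is a common convention and not necessarily the exact evaluation script used for IndicNER.

```python
# Sketch: entity-level F1 over BIO-tagged sequences using seqeval.
# The gold/predicted sequences below are toy examples, not IndicNER outputs.
from seqeval.metrics import classification_report, f1_score

gold = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG", "O"]]
pred = [["B-PER", "I-PER", "O", "O"], ["O", "B-ORG", "O"]]

print(f"Entity-level F1: {100 * f1_score(gold, pred):.2f}")
print(classification_report(gold, pred))
```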

## Downloads

Download the model from this same Hugging Face repository.
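
For example, with the `transformers` library (the repository id `ai4bharat/IndicNER` is assumed from this card's namespace, and the example sentence is arbitrary):

```python
# Sketch: load the model from the Hugging Face Hub and tag a sentence.
# The repo id is assumed from this model card's namespace.
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

repo_id = "ai4bharat/IndicNER"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id)

ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",  # merge sub-word pieces into whole entities
)
print(ner("मोहनदास करमचंद गांधी का जन्म पोरबंदर, गुजरात में हुआ था।"))  # a Hindi example
```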
<!-- citing information -->

- Arnav Mhaske <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>
- Harshit Kedia <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>
- Anoop Kunchukuttan <sub> ([AI4Bharat](https://ai4bharat.org), [Microsoft](https://www.microsoft.com/en-in/)) </sub>
- Rudra Murthy <sub> ([AI4Bharat](https://ai4bharat.org), [IBM](https://www.ibm.com)) </sub>
- Pratyush Kumar <sub> ([AI4Bharat](https://ai4bharat.org), [Microsoft](https://www.microsoft.com/en-in/), [IITM](https://www.iitm.ac.in)) </sub>
- Mitesh M. Khapra <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>