Arnav14 committed on
Commit 88986de
1 Parent(s): c5f9b8f

Updated Model Card

Files changed (1):
  1. README.md (+11 -8)
README.md CHANGED
@@ -24,25 +24,28 @@ tags:
 ---

 # IndicNER
-
-

 ## Training Corpus
-

 ## Evaluation Results


 ## Downloads


- ## Network and training details
- <!-- network and training details and link to the paper -->
-
-

 <!-- citing information -->
@@ -70,7 +73,7 @@ The IndicNER code (and models) are released under the MIT License.
 - Arnav Mhaske <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>
 - Harshit Kedia <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>
 - Anoop Kunchukuttan <sub> ([AI4Bharat](https://ai4bharat.org), [Microsoft](https://www.microsoft.com/en-in/)) </sub>
- - Rudra Murthy <sub> ([AI4Bharat](https://ai4bharat.org), [IBM]())</sub>
 - Pratyush Kumar <sub> ([AI4Bharat](https://ai4bharat.org), [Microsoft](https://www.microsoft.com/en-in/), [IITM](https://www.iitm.ac.in)) </sub>
 - Mitesh M. Khapra <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>
 
 
 ---

 # IndicNER
+ IndicNER is a model for identifying named entities in sentences in Indian languages. It is fine-tuned on millions of sentences in the 11 Indian languages listed below, and benchmarked on a human-annotated test set as well as several other publicly available Indian NER datasets.
+ The 11 languages covered by IndicNER are: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu.
+ All our code is available in our [GitHub repository](https://github.com/AI4Bharat/indicner). The link to our paper can be found here.

 ## Training Corpus
+ Our model was trained on a [dataset](https://huggingface.co/datasets/ai4bharat/IndicNER) that we mined from the existing [Samanantar Corpus](https://huggingface.co/datasets/ai4bharat/samanantar). We used bert-base-multilingual-uncased as the starting point and fine-tuned it on this NER dataset.

 ## Evaluation Results
+ Benchmark results on our test sets:

+ Language | bn | hi | kn | ml | mr | gu | ta | te | as | or | pa
+ ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | -----
+ F1 score | 79.75 | 82.33 | 80.01 | 80.73 | 80.51 | 73.82 | 80.98 | 80.88 | 62.50 | 27.05 | 74.88

+ The first five languages (bn, hi, kn, ml, mr) have large human-annotated test sets of roughly 500-1000 sentences each. The next three (gu, ta, te) have smaller human-annotated test sets of only around 50 sentences each. The final three (as, or, pa) have test sets mined via annotation projection, without human supervision.
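The F1 scores above are entity-level: a prediction counts as correct only when both the entity's span and its type exactly match the gold annotation. A minimal sketch of this metric, assuming BIO-style tags (this is an illustration, not the authors' evaluation script):

```python
def entity_spans(tags):
    """Extract (start, end, type) entity spans from one BIO tag sequence."""
    found, start, etype = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel flushes the last span
        starts_new = tag.startswith("B-") or tag == "O" or (
            tag.startswith("I-") and tag[2:] != etype
        )
        if starts_new:
            if start is not None:
                found.append((start, i, etype))
            start, etype = (i, tag[2:]) if tag != "O" else (None, None)
    return set(found)

def entity_f1(gold, pred):
    """Micro-averaged entity-level F1 over parallel lists of tag sequences."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        gs, ps = entity_spans(g), entity_spans(p)
        tp += len(gs & ps)   # spans matching in both boundaries and type
        fp += len(ps - gs)
        fn += len(gs - ps)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

For example, missing one of two gold entities while predicting nothing spurious gives precision 1.0, recall 0.5, and F1 of 2/3.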


 ## Downloads
+ The model can be downloaded directly from this Hugging Face repository.
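The model can be loaded with the Hugging Face `transformers` library. A minimal sketch follows; the repository id `ai4bharat/IndicNER` and the `B-LOC` label name are assumptions (check the model page for the exact id and label set), so the download/inference calls are left as comments and only the pure subword-merging helper runs:

```python
# Hypothetical usage via the transformers "ner" pipeline; the model id is
# an assumption -- verify it on the model page before use.
#
# from transformers import pipeline
# ner = pipeline("ner", model="ai4bharat/IndicNER")
# raw = ner("लखनऊ उत्तर प्रदेश की राजधानी है")  # per-subword predictions

def merge_subwords(raw):
    """Merge WordPiece continuation pieces ("##...") returned by the
    pipeline back into whole words, keeping the first piece's label."""
    merged = []
    for piece in raw:
        if piece["word"].startswith("##") and merged:
            merged[-1]["word"] += piece["word"][2:]
        else:
            merged.append({"word": piece["word"], "entity": piece["entity"]})
    return merged

# Output shape assumed from the transformers token-classification pipeline:
sample = [{"word": "लख", "entity": "B-LOC"}, {"word": "##नऊ", "entity": "B-LOC"}]
print(merge_subwords(sample))  # [{'word': 'लखनऊ', 'entity': 'B-LOC'}]
```

Note that newer versions of `transformers` can do this grouping for you via the pipeline's `aggregation_strategy` argument.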


  <!-- citing information -->
 
 - Arnav Mhaske <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>
 - Harshit Kedia <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>
 - Anoop Kunchukuttan <sub> ([AI4Bharat](https://ai4bharat.org), [Microsoft](https://www.microsoft.com/en-in/)) </sub>
+ - Rudra Murthy <sub> ([AI4Bharat](https://ai4bharat.org), [IBM](https://www.ibm.com)) </sub>
 - Pratyush Kumar <sub> ([AI4Bharat](https://ai4bharat.org), [Microsoft](https://www.microsoft.com/en-in/), [IITM](https://www.iitm.ac.in)) </sub>
 - Mitesh M. Khapra <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>