schackartk committed
Commit: d149041
Parent: 0e5aaef

Add model card for NER model

Files changed (1)
  1. named_entity_recognition_modelcard.md +155 -0
named_entity_recognition_modelcard.md ADDED
@@ -0,0 +1,155 @@

---
license: mit
---

# Model Card for named_entity_recognition.pt

This is a fine-tuned model checkpoint for the named entity recognition (NER) task used in the biodata resource inventory performed by the [Global Biodata Coalition](https://globalbiodata.org/) in collaboration with the [Chan Zuckerberg Initiative](https://chanzuckerberg.com/).

# Model Details

## Model Description

This model has been fine-tuned to detect resource names in scientific articles (title and abstract). This is done using token classification, which assigns predicted token labels following the [BIO scheme](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)). These labels are post-processed to determine the predicted "common names" (often an acronym) and "full names" of a resource present in an article.

- **Developed by:** Ana-Maria Istrate and Kenneth E. Schackart III
- **Shared by:** Kenneth E. Schackart III
- **Model type:** RoBERTa (BERT; Transformer)
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** https://huggingface.co/allenai/dsp_roberta_base_dapt_biomed_tapt_rct_500

## Model Sources

- **Repository:** https://github.com/globalbiodata/inventory_2022/tree/inventory_2022_dev
- **Paper:** TBA
- **Demo:** TBA

# Uses

This model can be used to find predicted biodata resource names in an article's title and abstract.

## Direct Use

Direct use of the model has not been designed or assessed.

## Out-of-Scope Use

The model should not be used for anything other than the use described in [Uses](named_entity_recognition_modelcard.md#uses).

# Bias, Risks, and Limitations

Biases may have been introduced at several stages of the development and training of this model. First, the model was trained on biomedical corpora as described in [Gururangan S., *et al.,* 2020](http://arxiv.org/abs/2004.10964). Second, the model was fine-tuned on scientific articles that were manually annotated by two curators; biases in the manual annotation may have affected model fine-tuning. Additionally, the manually annotated data were procured using a specific search query to Europe PMC, so generalizability may be limited when the model is applied to articles from other sources.

## Recommendations

The model should only be used for identifying resource names in articles from Europe PMC retrieved using the [query](https://github.com/globalbiodata/inventory_2022/blob/inventory_2022_dev/config/query.txt) present in the GitHub repository. Additionally, only articles predicted or known to describe a biodata resource should be used.

## How to Get Started with the Model

Follow the directions in the [GitHub repository](https://github.com/globalbiodata/inventory_2022/tree/inventory_2022_dev).
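
As a hedged illustration only (not the repository's official pipeline), the sketch below shows how a fine-tuned Hugging Face token-classification checkpoint could be loaded and applied to a title-plus-abstract string. The checkpoint path is a placeholder; if the released artifact is a raw PyTorch state dict (`named_entity_recognition.pt`), it may instead need to be loaded with `torch.load()` and `load_state_dict()` onto the base model.

```python
# Illustrative inference sketch; paths are placeholders and the repository's
# own scripts should be preferred.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

checkpoint = "path/to/ner_checkpoint_dir"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint)
model.eval()

text = "MGnify: the microbiome analysis resource. <abstract text here>"
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits

predictions = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, label_id in zip(tokens, predictions):
    print(token, model.config.id2label[label_id])
```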

# Training Details

## Training Data

The model was trained on the training split from the [labeled training data](https://github.com/globalbiodata/inventory_2022/blob/inventory_2022_dev/data/manual_ner_extraction.csv).

*Note*: The data can be split into consistent training, validation, and testing splits using the procedures detailed in the GitHub repository.
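
For illustration only, a reproducible split could look like the sketch below; the `id` column name, proportions, and seed are assumptions rather than the repository's configuration, so follow the GitHub procedures for the actual splits.

```python
# Illustrative reproducible splitting; not the repository's procedure.
# The "id" column name, proportions, and seed are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("manual_ner_extraction.csv")  # local copy of the labeled data

# Split on unique article IDs so one article never spans two splits.
ids = df["id"].drop_duplicates()
train_ids, rest_ids = train_test_split(ids, test_size=0.3, random_state=241)
val_ids, test_ids = train_test_split(rest_ids, test_size=0.5, random_state=241)

train = df[df["id"].isin(train_ids)]
val = df[df["id"].isin(val_ids)]
test = df[df["id"].isin(test_ids)]
```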

## Training Procedure

The model was trained for 10 epochs, and *F*1-score, precision, recall, and loss were computed after each epoch. The model checkpoint with the highest *F*1-score on the validation set was saved (regardless of epoch number).
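
The selection rule can be summarized with the runnable toy sketch below; the model, training step, and validation metric are stand-ins (the real metric comes from evaluating on the validation split), so this is not the repository's training script.

```python
# Toy sketch of the checkpoint-selection rule: keep the weights from the epoch
# with the highest validation F1. Model, training, and evaluation are stand-ins.
import copy
import random

import torch
from torch import nn

model = nn.Linear(4, 5)  # stand-in for the fine-tuned NER model


def validation_f1(epoch: int) -> float:
    """Stand-in for computing F1 on the validation split after an epoch."""
    return random.Random(epoch).uniform(0.5, 0.75)


best_f1, best_state = float("-inf"), None
for epoch in range(1, 11):  # 10 epochs, as described above
    # ...one epoch of training would happen here...
    f1 = validation_f1(epoch)
    if f1 > best_f1:  # keep the best checkpoint, regardless of epoch number
        best_f1, best_state = f1, copy.deepcopy(model.state_dict())

torch.save(best_state, "named_entity_recognition.pt")
print(f"Best validation F1: {best_f1:.3f}")
```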

### Preprocessing

To generate the input to the model, the article title and abstract were concatenated into a single contiguous string, separated by one whitespace character. All XML tags were removed using a regular expression.
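
A minimal sketch of this preprocessing is shown below, assuming the title and abstract are already available as strings; the regular expression is an illustrative XML-tag stripper and may not be the exact pattern used in the repository.

```python
# Minimal preprocessing sketch: join title and abstract with a single space,
# then strip XML tags. The regex is illustrative, not the repository's exact one.
import re

XML_TAG = re.compile(r"<[^>]+>")


def preprocess(title: str, abstract: str) -> str:
    """Concatenate title and abstract and remove XML tags."""
    return XML_TAG.sub("", f"{title} {abstract}").strip()


print(preprocess(
    "MGnify: the microbiome analysis resource in 2020.",
    "<h4>Abstract</h4> An example abstract describing a biodata resource.",
))
```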

### Speeds, Sizes, Times

The model checkpoint is 496 MB. Speed has not been benchmarked.

# Evaluation

## Testing Data, Factors & Metrics

### Testing Data

The model was evaluated using the test split of the [labeled data](https://github.com/globalbiodata/inventory_2022/blob/inventory_2022_dev/data/manual_ner_extraction.csv).

### Metrics

The model was evaluated using *F*1-score, precision, and recall. Precision was prioritized during fine-tuning and model selection.
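
For reference, the reported metrics relate to true-positive, false-positive, and false-negative counts as in the runnable sketch below; the counts are made up, and whether the project scores at the token or entity level is not stated here.

```python
# Relationship between the reported metrics; the counts below are made up.
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


print(precision_recall_f1(tp=100, fp=45, fn=34))
```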

## Results

- *F*1-score: 0.717
- Precision: 0.689
- Recall: 0.748

### Summary


# Model Examination

The model works satisfactorily for identifying resource names from articles describing biodata resources in the literature.

## Model Architecture and Objective

The base model architecture is as described in [Gururangan S., *et al.,* 2020](http://arxiv.org/abs/2004.10964). Token classification is performed using a linear classification layer initialized using [transformers.AutoModelForTokenClassification()](https://huggingface.co/docs/transformers/model_doc/auto).
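
As an illustration of this initialization, the sketch below adds a token-classification head on top of the base checkpoint; the BIO label set shown is an assumption inferred from the two entity types described above, not a confirmed configuration.

```python
# Hedged sketch: initialize a token-classification head on the base model.
# The BIO label set below is assumed from the model description, not confirmed.
from transformers import AutoModelForTokenClassification

labels = ["O", "B-COM", "I-COM", "B-FUL", "I-FUL"]
model = AutoModelForTokenClassification.from_pretrained(
    "allenai/dsp_roberta_base_dapt_biomed_tapt_rct_500",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
```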

## Compute Infrastructure

The model was fine-tuned on Google Colaboratory.

### Hardware

The model was fine-tuned using GPU acceleration provided by Google Colaboratory.

### Software

Training software was written in Python.

# Citation

TBA

**BibTeX:**

TBA

**APA:**

TBA

# Model Card Authors

This model card was written by Kenneth E. Schackart III.

# Model Card Contact

Ken Schackart: <schackartk1@gmail.com>