Commit d149041 · Parent(s): 0e5aaef

Add model card for NER model
named_entity_recognition_modelcard.md (ADDED)

@@ -0,0 +1,155 @@
---
license: mit
---

# Model Card for named_entity_recognition.pt

This is a fine-tuned model checkpoint for the named entity recognition (NER) task used in the biodata resource inventory performed by the
[Global Biodata Coalition](https://globalbiodata.org/) in collaboration with the [Chan Zuckerberg Initiative](https://chanzuckerberg.com/).

# Model Details

## Model Description

This model has been fine-tuned to detect resource names in scientific articles (title and abstract). This is done using token classification, which assigns predicted
token labels following the [BIO scheme](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)). These are post-processed to determine the
predicted "common name" (often an acronym) and "full name" of a resource present in an article.
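
As a rough illustration of the post-processing step, the sketch below groups BIO-labeled tokens into entity spans. The label strings (`B-COM`/`I-COM` for common names, `B-FUL`/`I-FUL` for full names) and the function name are assumptions for illustration; the actual post-processing logic lives in the GitHub repository.

```python
# Minimal sketch of BIO post-processing (illustrative only).
# Assumes labels such as B-COM/I-COM (common name) and B-FUL/I-FUL (full name);
# the exact label strings and grouping rules are defined in the inventory_2022 repository.
def group_bio_spans(tokens, labels):
    """Group (token, BIO label) pairs into (entity_type, text) spans."""
    spans, current_type, current_tokens = [], None, []
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["The", "Protein", "Data", "Bank", "(", "PDB", ")", "stores", "structures"]
labels = ["O", "B-FUL", "I-FUL", "I-FUL", "O", "B-COM", "O", "O", "O"]
print(group_bio_spans(tokens, labels))
# [('FUL', 'Protein Data Bank'), ('COM', 'PDB')]
```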

- **Developed by:** Ana-Maria Istrate and Kenneth E. Schackart III
- **Shared by:** Kenneth E. Schackart III
- **Model type:** RoBERTa (BERT; Transformer)
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** https://huggingface.co/allenai/dsp_roberta_base_dapt_biomed_tapt_rct_500

## Model Sources

- **Repository:** https://github.com/globalbiodata/inventory_2022/tree/inventory_2022_dev
- **Paper [optional]:** TBA
- **Demo [optional]:** TBA

# Uses

This model can be used to find predicted biodata resource names in an article's title and abstract.

## Direct Use

Direct use of the model has not been assessed or designed.

## Out-of-Scope Use

The model should not be used for anything other than the use described in [uses](named_entity_recognition_modelcard.md#uses).

# Bias, Risks, and Limitations

Biases may have been introduced at several stages of the development and training of this model. First, the model was trained on biomedical corpora
as described in [Gururangan S., *et al.,* 2020](http://arxiv.org/abs/2004.10964). Second, the model was fine-tuned on scientific articles that were
manually annotated by two curators. Biases in the manual annotation may have affected model fine-tuning. Additionally, the manually annotated data were
procured using a specific search query to Europe PMC, so generalizability may be limited when applying the model to articles from other sources.

## Recommendations

The model should only be used for identifying resource names in articles from Europe PMC retrieved using the
[query](https://github.com/globalbiodata/inventory_2022/blob/inventory_2022_dev/config/query.txt) present in the GitHub repository.
Additionally, only articles predicted or known to describe a biodata resource should be used.

## How to Get Started with the Model

Follow the directions in the [GitHub repository](https://github.com/globalbiodata/inventory_2022/tree/inventory_2022_dev).
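
For a quick local check, a minimal sketch along the lines below can load the checkpoint with the Hugging Face `transformers` library and run it on a concatenated title and abstract. The checkpoint directory is a hypothetical placeholder, and this assumes the checkpoint is available in the standard `transformers` saved format; the repository's own scripts remain the supported entry point.

```python
# Minimal inference sketch (illustrative only); follow the repository for the supported workflow.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

checkpoint_dir = "checkpoints/ner"  # hypothetical path to the fine-tuned checkpoint assets
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
model = AutoModelForTokenClassification.from_pretrained(checkpoint_dir)
model.eval()

text = "The Protein Data Bank (PDB) is an archive of macromolecular structures. ..."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits

pred_ids = logits.argmax(dim=-1).squeeze(0).tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"].squeeze(0))
for token, label_id in zip(tokens, pred_ids):
    label = model.config.id2label[label_id]
    if label != "O":  # print only tokens predicted as part of a resource name
        print(token, label)
```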

# Training Details

## Training Data

The model was trained on the training split of the [labeled training data](https://github.com/globalbiodata/inventory_2022/blob/inventory_2022_dev/data/manual_ner_extraction.csv).

*Note*: The data can be split into consistent training, validation, and testing splits using the procedures detailed in the GitHub repository.

## Training Procedure

The model was trained for 10 epochs, and *F*1-score, precision, recall, and loss were computed after each epoch. The model checkpoint with the highest *F*1-score on the validation
set was saved (regardless of epoch number).

### Preprocessing

To generate the input to the model, the article title and abstract were concatenated into a contiguous string, separated by a single whitespace character. All
XML tags were removed using a regular expression.
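
A minimal sketch of that preprocessing step, with a generic tag-stripping regular expression standing in for the one used in the repository:

```python
# Preprocessing sketch (illustrative): concatenate title and abstract, then strip XML tags.
# The regex here is a generic stand-in for the one used in the inventory_2022 repository.
import re

def preprocess(title: str, abstract: str) -> str:
    """Return the model input string for one article."""
    text = f"{title} {abstract}"          # join with a single whitespace character
    return re.sub(r"<[^>]+>", "", text)   # remove XML-style tags

print(preprocess("A <i>new</i> database", "We present FooDB, a <b>resource</b> for food data."))
# A new database We present FooDB, a resource for food data.
```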

### Speeds, Sizes, Times

The model checkpoint is 496 MB. Speed has not been benchmarked.

# Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

## Testing Data, Factors & Metrics

### Testing Data

<!-- This should link to a Data Card if possible. -->

The model was evaluated using the test split of the [labeled data](https://github.com/globalbiodata/inventory_2022/blob/inventory_2022_dev/data/manual_ner_extraction.csv).

### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

The model was evaluated using *F*1-score, precision, and recall. Precision was prioritized during fine-tuning and model selection.
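
As a rough illustration of how token-level precision, recall, and *F*1-score can be computed from BIO labels (the repository's evaluation script may aggregate differently, for example at the entity level), consider:

```python
# Token-level metric sketch (illustrative only); the repository's evaluation may differ.
from sklearn.metrics import precision_recall_fscore_support

true_labels = ["O", "B-FUL", "I-FUL", "O", "B-COM", "O"]
pred_labels = ["O", "B-FUL", "O",     "O", "B-COM", "O"]

precision, recall, f1, _ = precision_recall_fscore_support(
    true_labels,
    pred_labels,
    labels=["B-FUL", "I-FUL", "B-COM", "I-COM"],  # score entity tags only, not "O"
    average="micro",
    zero_division=0,
)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=1.000 recall=0.667 f1=0.800
```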

## Results

- *F*1-score: 0.717
- Precision: 0.689
- Recall: 0.748

### Summary

# Model Examination

The model works satisfactorily for identifying resource names from articles describing biodata resources in the literature.

## Model Architecture and Objective

The base model architecture is as described in [Gururangan S., *et al.,* 2020](http://arxiv.org/abs/2004.10964). Token classification is performed using
a linear sequence classification layer initialized using [transformers.AutoModelForTokenClassification()](https://huggingface.co/docs/transformers/model_doc/auto).
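
A sketch of how such a head can be attached to the base model at the start of fine-tuning; the label list shown here is an assumption based on the BIO scheme described above, and the repository defines the actual labels:

```python
# Sketch: attach a linear token-classification head to the base biomedical RoBERTa model.
# The label list is an assumption for illustration; see the inventory_2022 repository.
from transformers import AutoModelForTokenClassification

labels = ["O", "B-COM", "I-COM", "B-FUL", "I-FUL"]
model = AutoModelForTokenClassification.from_pretrained(
    "allenai/dsp_roberta_base_dapt_biomed_tapt_rct_500",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
print(model.classifier)  # Linear(in_features=768, out_features=5, bias=True)
```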

## Compute Infrastructure

The model was fine-tuned on Google Colaboratory.

### Hardware

The model was fine-tuned using GPU acceleration provided by Google Colaboratory.

### Software

Training software was written in Python.

# Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

TBA

**BibTeX:**

TBA

**APA:**

TBA

# Model Card Authors

This model card was written by Kenneth E. Schackart III.

# Model Card Contact

Ken Schackart: <schackartk1@gmail.com>