---
license: mit
---
# Model Card for article_classifier.pt
This is a fine-tuned model checkpoint for the article classification task used in the biodata resource inventory performed by the
[Global Biodata Coalition](https://globalbiodata.org/) in collaboration with the [Chan Zuckerberg Initiative](https://chanzuckerberg.com/).
# Model Details
## Model Description
This model has been fine-tuned to classify scientific articles (title and abstract) as either describing a biodata resource or not.
- **Developed by:** Ana-Maria Istrate and Kenneth E. Schackart III
- **Shared by:** Kenneth E. Schackart III
- **Model type:** RoBERTa (BERT; Transformer)
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** https://huggingface.co/allenai/dsp_roberta_base_dapt_biomed_tapt_rct_500
## Model Sources
- **Repository:** https://github.com/globalbiodata/inventory_2022/tree/inventory_2022_dev
- **Paper [optional]:** TBA
- **Demo [optional]:** TBA
# Uses
This model can be used to classify scientific articles as describing biodata resources or not.
## Direct Use
The model was not designed for direct use, and direct use has not been assessed.
## Out-of-Scope Use
The model should not be used for anything other than the use described in [Uses](article_classification_modelcard.md#uses).
# Bias, Risks, and Limitations
Biases may have been introduced at several stages of the development and training of this model. First, the model was trained on biomedical corpora
as described in [Gururangan S., *et al.,* 2020](http://arxiv.org/abs/2004.10964). Second, the model was fine-tuned on scientific articles that were
manually classified by two curators, so biases in the manual classification may have affected model fine-tuning. Additionally, the manually classified data were
obtained using a specific search query to Europe PMC, so generalizability may be limited when classifying articles from other sources.
## Recommendations
The model should only be used for classifying articles from Europe PMC using the
[query](https://github.com/globalbiodata/inventory_2022/blob/inventory_2022_dev/config/query.txt) present in the GitHub repository.
## How to Get Started with the Model
Follow the directions in the [GitHub repository](https://github.com/globalbiodata/inventory_2022/tree/inventory_2022_dev).
# Training Details
## Training Data
The model was trained on the training split from the [labeled training data](https://github.com/globalbiodata/inventory_2022/blob/inventory_2022_dev/data/manual_classifications.csv).
*Note*: The data can be split into consistent training, validation, and testing splits using the procedures detailed in the GitHub repository.
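
As an illustration of what a consistent split can look like, the following is a minimal sketch that shuffles rows with a fixed seed before cutting them into three parts. It is not the procedure from the GitHub repository: the function name `split_dataset`, the 70/15/15 fractions, and the seed value are all assumptions chosen for the example.

```python
import random


def split_dataset(rows, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle rows with a fixed seed and cut them into train/val/test lists.

    The fractions and the seed are hypothetical defaults, not values taken
    from the inventory_2022 repository.
    """
    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)  # same seed gives the same order
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains becomes the test set
    return train, val, test


# Placeholder article identifiers standing in for the labeled data
rows = [f"article_{i}" for i in range(20)]
train, val, test = split_dataset(rows)
again = split_dataset(rows)  # a second call reproduces the same split
```

Because the random generator is seeded, repeated calls on the same input return identical splits, which is the property the note above relies on.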
## Training Procedure
The model was trained for 10 epochs, and *F*1-score, precision, recall, and loss were computed after each epoch. The model checkpoint with the highest precision on the validation
set was saved (regardless of epoch number).
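
The checkpoint selection rule described above can be sketched as follows. This is an illustrative sketch rather than the repository's training code: the function `select_best_epoch`, the metric dictionaries, and the numbers in `history` are made up for the example, and keeping the earliest epoch on a tie is an assumption.

```python
def select_best_epoch(val_metrics):
    """Return (epoch, metrics) for the epoch with the highest validation precision.

    `val_metrics` is a list with one metrics dict per epoch, in training order.
    On a tie the earliest epoch is kept (an assumption for this sketch).
    """
    best_epoch, best = None, None
    for epoch, metrics in enumerate(val_metrics, start=1):
        if best is None or metrics["precision"] > best["precision"]:
            best_epoch, best = epoch, metrics
    return best_epoch, best


# Made-up validation metrics for three epochs (not the model's real history)
history = [
    {"precision": 0.90, "recall": 0.80},
    {"precision": 0.94, "recall": 0.76},
    {"precision": 0.92, "recall": 0.79},
]
best_epoch, best_metrics = select_best_epoch(history)
```

In a real training loop the model weights would be saved whenever a new best precision is seen, so the final checkpoint can come from any epoch.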
### Preprocessing
To generate the input to the model, each article's title and abstract were concatenated into a single contiguous string, separated by one whitespace character. All
XML tags were removed using a regular expression.
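
The preprocessing described above can be sketched in a few lines. The regular expression and the function name `build_model_input` are assumptions made for illustration; the exact pattern used in the repository may differ.

```python
import re

# Hypothetical pattern: anything between "<" and the next ">" is treated as a tag
XML_TAG = re.compile(r"<[^>]+>")


def build_model_input(title, abstract):
    """Join title and abstract with one space, then strip XML tags."""
    text = f"{title} {abstract}"
    return XML_TAG.sub("", text)


example = build_model_input(
    "The <i>Foo</i> database",
    "<p>A resource for bar data.</p>",
)
print(example)  # The Foo database A resource for bar data.
```

Note that a pattern this simple would also delete literal text such as `a < b > c`, so it is only suitable when angle brackets appear exclusively in markup.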
### Speeds, Sizes, Times
The model checkpoint is 499 MB. Speed has not been benchmarked.
# Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
## Testing Data, Factors & Metrics
### Testing Data
<!-- This should link to a Data Card if possible. -->
The model was evaluated using the test split of the [labeled data](https://github.com/globalbiodata/inventory_2022/blob/inventory_2022_dev/data/manual_classifications.csv).
### Metrics
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
The model was evaluated using *F*1-score, precision, and recall. Precision was prioritized during fine-tuning and model selection.
## Results
- *F*1-score: 0.849
- Precision: 0.939
- Recall: 0.775
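
As a quick consistency check, the *F*1-score is the harmonic mean of precision and recall, so it can be recomputed from the two values above. The snippet below is a small sketch of that arithmetic; since the reported precision and recall are themselves rounded, agreement is only expected to about three decimal places.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(0.939, 0.775)
print(round(f1, 3))  # 0.849
```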
### Summary
# Model Examination
The model works satisfactorily for identifying articles describing biodata resources from the literature.
## Model Architecture and Objective
The base model architecture is as described in [Gururangan S., *et al.,* 2020](http://arxiv.org/abs/2004.10964). Classification is performed using
a linear sequence classification layer initialized using [transformers.AutoModelForSequenceClassification()](https://huggingface.co/docs/transformers/model_doc/auto).
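
For orientation, the following is a minimal sketch of how a sequence classification head can be attached to the base model with the `transformers` library. It assumes a binary task (`NUM_LABELS = 2`) and that the base model identifier listed under Model Details is reachable on the Hugging Face Hub; the helper name `load_classifier` is hypothetical. It does not load the fine-tuned `article_classifier.pt` checkpoint, whose loading procedure is documented in the GitHub repository.

```python
BASE_MODEL = "allenai/dsp_roberta_base_dapt_biomed_tapt_rct_500"
NUM_LABELS = 2  # assumption: biodata resource vs. not a biodata resource


def load_classifier(model_name=BASE_MODEL, num_labels=NUM_LABELS):
    """Return (tokenizer, model) with a freshly initialized classification head.

    The import is done lazily so this module can be imported without
    `transformers` installed; calling the function downloads the weights.
    """
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels
    )
    return tokenizer, model
```

Calling `load_classifier()` yields a model whose classification layer is randomly initialized, so it must be fine-tuned (or have trained weights loaded) before its predictions are meaningful.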
## Compute Infrastructure
The model was fine-tuned on Google Colaboratory.
### Hardware
The model was fine-tuned using GPU acceleration provided by Google Colaboratory.
### Software
Training software was written in Python.
# Citation
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
TBA
**BibTeX:**
TBA
**APA:**
TBA
# Model Card Authors
This model card was written by Kenneth E. Schackart III.
# Model Card Contact
Ken Schackart: <schackartk1@gmail.com>