---
license: mit
---

# Model Card for article_classifier.pt

This is a fine-tuned model checkpoint for the article classification task used in the biodata resource inventory performed by the [Global Biodata Coalition](https://globalbiodata.org/) in collaboration with the [Chan Zuckerberg Initiative](https://chanzuckerberg.com/).

# Model Details

## Model Description

This model has been fine-tuned to classify scientific articles (title and abstract) as either describing a biodata resource or not.

- **Developed by:** Ana-Maria Istrate and Kenneth E. Schackart III
- **Shared by:** Kenneth E. Schackart III
- **Model type:** RoBERTa (a BERT-derived Transformer)
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** https://huggingface.co/allenai/dsp_roberta_base_dapt_biomed_tapt_rct_500

## Model Sources

- **Repository:** https://github.com/globalbiodata/inventory_2022/tree/inventory_2022_dev
- **Paper:** TBA
- **Demo:** TBA

# Uses

This model can be used to classify scientific articles as describing biodata resources or not.

## Direct Use

The model has not been designed or assessed for direct use.

## Out-of-Scope Use

The model should not be used for anything other than the use described in [Uses](article_classification_modelcard.md#uses).

# Bias, Risks, and Limitations

Biases may have been introduced at several stages of the development and training of this model. First, the model was trained on biomedical corpora as described in [Gururangan S., *et al.,* 2020](http://arxiv.org/abs/2004.10964). Second, the model was fine-tuned on scientific articles that were manually classified by two curators; biases in the manual classification may have affected model fine-tuning. Additionally, the manually classified data were procured using a specific search query to Europe PMC, so generalizability may be limited when classifying articles from other sources.

## Recommendations

The model should only be used for classifying articles from Europe PMC retrieved using the [query](https://github.com/globalbiodata/inventory_2022/blob/inventory_2022_dev/config/query.txt) present in the GitHub repository.

## How to Get Started with the Model

Follow the directions in the [GitHub repository](https://github.com/globalbiodata/inventory_2022/tree/inventory_2022_dev).

# Training Details

## Training Data

The model was trained on the training split of the [labeled training data](https://github.com/globalbiodata/inventory_2022/blob/inventory_2022_dev/data/manual_classifications.csv).

*Note*: The data can be split into consistent training, validation, and testing splits using the procedures detailed in the GitHub repository.

## Training Procedure

The model was trained for 10 epochs, and *F*1-score, precision, recall, and loss were computed after each epoch. The model checkpoint with the highest precision on the validation set was saved, regardless of epoch number.
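The selection rule above can be sketched as follows. This is a simplified illustration only: the per-epoch precision values are invented for the example, and the actual training loop is defined in the GitHub repository.

```python
# Track the checkpoint with the best validation precision across epochs.
# Metric values here are placeholders, not real training results.
epoch_precision = {1: 0.81, 2: 0.88, 3: 0.86, 4: 0.90, 5: 0.89}

best_epoch, best_precision = None, float("-inf")
for epoch, precision in epoch_precision.items():
    if precision > best_precision:  # save the checkpoint only on improvement
        best_epoch, best_precision = epoch, precision

# best_epoch now points at the checkpoint that would be kept,
# regardless of whether it was the final epoch.
```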

### Preprocessing

To generate the model input, each article title and abstract were concatenated into a single contiguous string, separated by one whitespace character. All XML tags were removed using a regular expression.
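A minimal sketch of this preprocessing step follows. The exact regular expression used is defined in the repository; the tag-stripping pattern below is an assumption.

```python
import re

def preprocess(title: str, abstract: str) -> str:
    """Concatenate title and abstract, then strip XML tags."""
    text = f"{title} {abstract}"         # join with one whitespace character
    text = re.sub(r"<[^>]+>", "", text)  # assumed tag-stripping pattern
    return text

example = preprocess("A <i>new</i> database.", "<p>We present a resource.</p>")
print(example)  # "A new database. We present a resource."
```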

### Speeds, Sizes, Times

The model checkpoint is 499 MB. Speed has not been benchmarked.

# Evaluation

## Testing Data, Factors & Metrics

### Testing Data

The model was evaluated using the test split of the [labeled data](https://github.com/globalbiodata/inventory_2022/blob/inventory_2022_dev/data/manual_classifications.csv).

### Metrics

The model was evaluated using *F*1-score, precision, and recall. Precision was prioritized during fine-tuning and model selection.

## Results

- *F*1-score: 0.849
- Precision: 0.939
- Recall: 0.775
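As a quick consistency check, the reported *F*1-score follows from the reported precision and recall:

```python
# F1 is the harmonic mean of precision and recall.
precision, recall = 0.939, 0.775
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.849
```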

# Model Examination

The model performs satisfactorily at identifying articles that describe biodata resources in the literature.

## Model Architecture and Objective

The base model architecture is described in [Gururangan S., *et al.,* 2020](http://arxiv.org/abs/2004.10964). Classification is performed using a linear sequence-classification layer initialized using [transformers.AutoModelForSequenceClassification](https://huggingface.co/docs/transformers/model_doc/auto).
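As an illustration, the classification layer can be sketched as a linear map from the pooled encoder output to two class logits (biodata resource vs. not). This is a simplified stand-in, not the library's exact implementation; RoBERTa's actual classification head also includes a dense layer, a tanh activation, and dropout.

```python
import torch
import torch.nn as nn

# Simplified stand-in for the sequence-classification head: a single
# linear layer mapping the 768-dim pooled RoBERTa output to 2 logits.
hidden_size, num_labels = 768, 2
head = nn.Linear(hidden_size, num_labels)

pooled = torch.zeros(1, hidden_size)  # placeholder for an encoder output
logits = head(pooled)
print(logits.shape)  # torch.Size([1, 2])
```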

## Compute Infrastructure

The model was fine-tuned on Google Colaboratory.

### Hardware

The model was fine-tuned using GPU acceleration provided by Google Colaboratory.

### Software

The training software was written in Python.

# Citation

TBA

**BibTeX:**

TBA

**APA:**

TBA

# Model Card Authors

This model card was written by Kenneth E. Schackart III.

# Model Card Contact

Ken Schackart: <schackartk1@gmail.com>