metadata

license: mit

Model Card for article_classifier.pt

This is a fine-tuned model checkpoint for the article classification task used in the biodata resource inventory performed by the Global Biodata Coalition in collaboration with the Chan Zuckerberg Initiative.

Model Details

Model Description

This model has been fine-tuned to classify scientific articles (title and abstract) as either describing a biodata resource or not.

Developed by: Ana-Maria Istrate and Kenneth E. Schackart III
Shared by: Kenneth E. Schackart III
Model type: RoBERTa (BERT; Transformer)
Language(s) (NLP): English
License: MIT
Finetuned from model: https://huggingface.co/allenai/dsp_roberta_base_dapt_biomed_tapt_rct_500

Model Sources

Repository: https://github.com/globalbiodata/inventory_2022/tree/inventory_2022_dev
Paper: TBA
Demo: TBA

Uses

This model can be used to classify scientific articles as describing biodata resources or not.

Direct Use

The model has not been designed or assessed for direct use.

Out-of-Scope Use

The model should not be used for anything other than the use described under Uses.

Bias, Risks, and Limitations

Biases may have been introduced at several stages of the development and training of this model. First, the model was trained on biomedical corpora as described in Gururangan S., et al., 2020. Second, the model was fine-tuned on scientific articles that were manually classified by two curators, so biases in the manual classification may have affected fine-tuning. Additionally, the manually classified data were obtained using a specific search query to Europe PMC, so generalizability may be limited when classifying articles from other sources.

Recommendations

The model should only be used for classifying articles from Europe PMC using the query present in the GitHub repository.

How to Get Started with the Model

Follow the directions in the GitHub repository.

Training Details

Training Data

The model was trained on the training split from the labeled training data.

Note: The data can be split into consistent training, validation, and test splits using the procedures detailed in the GitHub repository.

Training Procedure

The model was trained for 10 epochs, and F1-score, precision, recall, and loss were computed after each epoch. The model checkpoint with the highest precision on the validation set was saved (regardless of epoch number).
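The selection rule described above can be sketched as follows. This is a minimal illustration, not the repository's training code: the `EpochMetrics` container and `select_best_epoch` function are hypothetical names, and breaking ties in favor of the earliest epoch is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EpochMetrics:
    """Validation metrics recorded after one epoch (hypothetical container)."""
    epoch: int
    precision: float
    recall: float
    f1: float
    loss: float


def select_best_epoch(history: List[EpochMetrics]) -> EpochMetrics:
    """Return the epoch with the highest validation precision.

    max() keeps the first maximum it sees, so ties go to the earliest
    epoch. That tie-breaking rule is an assumption of this sketch.
    """
    if not history:
        raise ValueError("history must contain at least one epoch")
    return max(history, key=lambda m: m.precision)
```

In a training loop, the checkpoint saved would be the one belonging to the epoch this function returns, whichever of the 10 epochs that is.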

Preprocessing

To generate the input to the model, each article's title and abstract were concatenated into a single string, separated by one whitespace character. All XML tags were removed using a regular expression.
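A minimal sketch of this preprocessing step is shown below. The exact regular expression used in the repository is not given here, so `XML_TAG_PATTERN` is an assumption, as are the function name `build_model_input` and the choice to strip tags after concatenation rather than before.

```python
import re

# Assumption: a simple pattern matching anything shaped like an XML tag,
# e.g. <i>, </sup>, <h4 class="x">. The repository's pattern may differ.
XML_TAG_PATTERN = re.compile(r"<[^>]+>")


def build_model_input(title: str, abstract: str) -> str:
    """Join title and abstract with one space, then remove XML tags."""
    text = f"{title} {abstract}"
    return XML_TAG_PATTERN.sub("", text)
```

For example, `build_model_input("A <i>new</i> database", "We present <b>X</b>.")` yields a single string with the markup removed and the title and abstract separated by one space.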

Speeds, Sizes, Times

The model checkpoint is 499 MB. Speed has not been benchmarked.

Evaluation

Testing Data, Factors & Metrics

Testing Data

The model was evaluated using the test split of the labeled data.

Metrics

The model was evaluated using F1-score, precision, and recall. Precision was prioritized during fine-tuning and model selection.

Results

F1-score: 0.821
Precision: 0.975
Recall: 0.709
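As a consistency check, the reported F1-score can be reproduced from the reported precision and recall, since F1 is their harmonic mean. The helper name `f1_from_precision_recall` is introduced here for illustration only.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the results reported above.
f1 = f1_from_precision_recall(precision=0.975, recall=0.709)
print(round(f1, 3))  # 0.821
```

The gap between precision and recall reflects the choice to prioritize precision during model selection.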

Summary

Model Examination

The model identifies articles describing biodata resources from the literature with high precision (0.975) and moderate recall (0.709) on the test split.

Model Architecture and Objective

The base model architecture is as described in Gururangan S., et al., 2020. Classification is performed using a linear sequence classification layer initialized using transformers.AutoModelForSequenceClassification().
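A hedged sketch of how such a classification head is typically attached with the `transformers` library is shown below. The base model identifier comes from the "Finetuned from model" link above. The value `num_labels=2` is an assumption based on the binary task, and `build_classifier` is a hypothetical helper, not a function from the repository. Calling it downloads the base weights, so it needs network access and the `transformers` package.

```python
# Identifier taken from the "Finetuned from model" URL in this card.
BASE_MODEL_NAME = "allenai/dsp_roberta_base_dapt_biomed_tapt_rct_500"

# Assumption: two classes (describes a biodata resource, or does not).
NUM_LABELS = 2


def build_classifier(model_name: str = BASE_MODEL_NAME,
                     num_labels: int = NUM_LABELS):
    """Load the tokenizer and base model with a new classification head.

    The import is done inside the function so that this module can be
    imported without transformers installed.
    """
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # from_pretrained adds a randomly initialized sequence classification
    # head on top of the pretrained encoder; fine-tuning then trains it.
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels
    )
    return tokenizer, model
```

Loading the fine-tuned `article_classifier.pt` checkpoint itself should follow the instructions in the GitHub repository, since its serialization format is not described here.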

Compute Infrastructure

The model was fine-tuned on Google Colaboratory.

Hardware

The model was fine-tuned using GPU acceleration provided by Google Colaboratory.

Software

Training software was written in Python.

Citation

TBA

BibTeX:

TBA

APA:

TBA

Model Card Authors

This model card was written by Kenneth E. Schackart III.

Model Card Contact

Ken Schackart: schackartk1@gmail.com