---
license: mit
---
# Model Card for article_classifier.pt
This is a fine-tuned model checkpoint for the article classification task used in the biodata resource inventory performed by the
[Global Biodata Coalition](https://globalbiodata.org/) in collaboration with [Chan Zuckerberg Initiative](https://chanzuckerberg.com/).
# Model Details
## Model Description
This model has been fine-tuned to classify scientific articles (title and abstract) as either describing a biodata resource or not.
- **Developed by:** Ana-Maria Istrate and Kenneth E. Schackart III
- **Shared by:** Kenneth E. Schackart III
- **Model type:** RoBERTa (Transformer-based language model)
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** https://huggingface.co/allenai/dsp_roberta_base_dapt_biomed_tapt_rct_500
## Model Sources
- **Repository:** https://github.com/globalbiodata/inventory_2022/tree/inventory_2022_dev
- **Paper:** TBA
- **Demo:** TBA
# Uses
This model can be used to classify scientific articles as describing biodata resources or not.
## Direct Use
Direct use of the model has not been assessed or designed.
## Out-of-Scope Use
The model should not be used for any purpose other than that described in [Uses](#uses).
# Bias, Risks, and Limitations
Biases may have been introduced at several stages of the development and training of this model. First, the base model was pretrained on biomedical corpora
as described in [Gururangan S., *et al.,* 2020](http://arxiv.org/abs/2004.10964). Second, the model was fine-tuned on scientific articles that were
manually classified by two curators; biases in the manual classification may have affected fine-tuning. Additionally, the manually classified data were
procured using a specific search query to Europe PMC, so generalizability may be limited when classifying articles from other sources.
## Recommendations
The model should only be used for classifying articles from Europe PMC using the
[query](https://github.com/globalbiodata/inventory_2022/blob/inventory_2022_dev/config/query.txt) present in the GitHub repository.
## How to Get Started with the Model
Follow the directions in the [GitHub repository](https://github.com/globalbiodata/inventory_2022/tree/inventory_2022_dev).
# Training Details
## Training Data
The model was trained on the training split from the [labeled training data](https://github.com/globalbiodata/inventory_2022/blob/inventory_2022_dev/data/manual_classifications.csv).
*Note*: The data can be split into consistent training, validation, testing splits using the procedures detailed in the GitHub repository.
## Training Procedure
The model was trained for 10 epochs, and *F*1-score, precision, recall, and loss were computed after each epoch. The checkpoint with the highest precision on the validation
set was saved, regardless of epoch number.
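The checkpoint-selection rule above can be sketched as follows. This is an illustrative helper, not the repository's actual training loop; the function name and the epoch-to-precision mapping are assumptions for the example.

```python
def select_best_checkpoint(epoch_precision: dict) -> tuple:
    """Return (epoch, precision) for the epoch with the highest
    validation precision, regardless of epoch number.

    `epoch_precision` maps epoch number -> precision on the
    validation set. (Illustrative only; the real training loop
    lives in the GitHub repository.)
    """
    best_epoch = max(epoch_precision, key=epoch_precision.get)
    return best_epoch, epoch_precision[best_epoch]

# Made-up validation precisions for three epochs:
metrics = {1: 0.90, 2: 0.96, 3: 0.94}
print(select_best_checkpoint(metrics))  # (2, 0.96)
```

Selecting on precision (rather than *F*1 or loss) reflects the stated priority of minimizing false positives when flagging candidate biodata resources.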
### Preprocessing
To generate the model input, each article's title and abstract were concatenated into a single string, separated by one whitespace character. All
XML tags were removed using a regular expression.
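A minimal sketch of this preprocessing step is shown below. The function name and the tag-stripping pattern are assumptions for illustration; the repository may use a different regular expression.

```python
import re

def preprocess(title: str, abstract: str) -> str:
    """Concatenate title and abstract with a single whitespace
    character, then strip XML tags.

    The tag pattern below is a hypothetical stand-in for the
    regular expression used in the repository.
    """
    text = f"{title} {abstract}"
    return re.sub(r"<[^>]+>", "", text)

example = preprocess(
    "A <i>new</i> database",
    "<p>We present a biodata resource.</p>",
)
print(example)  # A new database We present a biodata resource.
```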
### Speeds, Sizes, Times
The model checkpoint is 499 MB. Speed has not been benchmarked.
# Evaluation
## Testing Data, Factors & Metrics
### Testing Data
The model was evaluated using the test split of the [labeled data](https://github.com/globalbiodata/inventory_2022/blob/inventory_2022_dev/data/manual_classifications.csv).
### Metrics
The model was evaluated using *F*1-score, precision, and recall. Precision was prioritized during fine-tuning and model selection.
## Results
- *F*1-score: 0.821
- Precision: 0.975
- Recall: 0.709
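As a consistency check, the reported *F*1-score is the harmonic mean of the reported precision and recall:

```python
# F1 = 2 * P * R / (P + R), using the reported test-set values.
precision = 0.975
recall = 0.709
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.821
```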
# Model Examination
The model works satisfactorily for identifying articles describing biodata resources from the literature.
## Model Architecture and Objective
The base model architecture is as described in [Gururangan S., *et al.,* 2020](http://arxiv.org/abs/2004.10964). Classification is performed using
a linear sequence classification layer initialized using [transformers.AutoModelForSequenceClassification()](https://huggingface.co/docs/transformers/model_doc/auto).
## Compute Infrastructure
The model was fine-tuned on Google Colaboratory.
### Hardware
The model was fine-tuned using GPU acceleration provided by Google Colaboratory.
### Software
Training software was written in Python.
# Citation
TBA
**BibTeX:**
TBA
**APA:**
TBA
# Model Card Authors
This model card was written by Kenneth E. Schackart III.
# Model Card Contact
Ken Schackart: <schackartk1@gmail.com>