---
license: mit
---

# Model Card for article_classifier.pt

This is a fine-tuned model checkpoint for the article classification task used in the biodata resource inventory performed by the [Global Biodata Coalition](https://globalbiodata.org/) in collaboration with the [Chan Zuckerberg Initiative](https://chanzuckerberg.com/).

# Model Details

## Model Description

This model has been fine-tuned to classify scientific articles (title and abstract) as either describing a biodata resource or not.

- **Developed by:** Ana-Maria Istrate and Kenneth E. Schackart III
- **Shared by:** Kenneth E. Schackart III
- **Model type:** RoBERTa (BERT; Transformer)
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** https://huggingface.co/allenai/dsp_roberta_base_dapt_biomed_tapt_rct_500

## Model Sources

- **Repository:** https://github.com/globalbiodata/inventory_2022/tree/inventory_2022_dev
- **Paper:** TBA
- **Demo:** TBA

# Uses

This model can be used to classify scientific articles as describing biodata resources or not.

## Direct Use

Direct use of the model has not been assessed or designed.

## Out-of-Scope Use

The model should not be used for anything other than the use described in [Uses](article_classification_modelcard.md#uses).

# Bias, Risks, and Limitations

Biases may have been introduced at several stages of the development and training of this model. First, the base model was trained on biomedical corpora as described in [Gururangan S., *et al.,* 2020](http://arxiv.org/abs/2004.10964). Second, the model was fine-tuned on scientific articles that were manually classified by two curators, so biases in the manual classification may have affected fine-tuning. Additionally, the manually classified data were obtained using a specific search query to Europe PMC, so generalizability may be limited when classifying articles from other sources.

## Recommendations

The model should only be used for classifying articles retrieved from Europe PMC using the [query](https://github.com/globalbiodata/inventory_2022/blob/inventory_2022_dev/config/query.txt) present in the GitHub repository.

## How to Get Started with the Model

Follow the directions in the [GitHub repository](https://github.com/globalbiodata/inventory_2022/tree/inventory_2022_dev). A minimal, illustrative inference sketch is also included in the appendix at the end of this card.

# Training Details

## Training Data

The model was trained on the training split of the [labeled training data](https://github.com/globalbiodata/inventory_2022/blob/inventory_2022_dev/data/manual_classifications.csv).

*Note*: The data can be split into consistent training, validation, and testing splits using the procedures detailed in the GitHub repository.

## Training Procedure

The model was trained for 10 epochs, and *F*1-score, precision, recall, and loss were computed after each epoch. The model checkpoint with the highest precision on the validation set was saved, regardless of epoch number.

### Preprocessing

To generate the model input, each article's title and abstract were concatenated into a single contiguous string, separated by one whitespace character. All XML tags were removed using a regular expression.

### Speeds, Sizes, Times

The model checkpoint is 499 MB. Speed has not been benchmarked.

# Evaluation

## Testing Data, Factors & Metrics

### Testing Data

The model was evaluated using the test split of the [labeled data](https://github.com/globalbiodata/inventory_2022/blob/inventory_2022_dev/data/manual_classifications.csv).

### Metrics

The model was evaluated using *F*1-score, precision, and recall.
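For reference, the sketch below shows one way these metrics can be computed with scikit-learn; the label arrays are hypothetical placeholders, not the inventory's actual predictions.

```python
# Illustrative only: computing the reported metrics with scikit-learn.
# The label arrays below are hypothetical placeholders, not inventory data.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 1]  # 1 = describes a biodata resource, 0 = does not
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]  # hypothetical model predictions

print(f"Precision: {precision_score(y_true, y_pred):.3f}")
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")
print(f"F1-score:  {f1_score(y_true, y_pred):.3f}")
```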
Precision was prioritized during fine-tuning and model selection.

## Results

- *F*1-score: 0.821
- Precision: 0.975
- Recall: 0.709

### Summary

# Model Examination

The model works satisfactorily for identifying articles describing biodata resources from the literature.

## Model Architecture and Objective

The base model architecture is as described in [Gururangan S., *et al.,* 2020](http://arxiv.org/abs/2004.10964). Classification is performed using a linear sequence classification layer initialized using [transformers.AutoModelForSequenceClassification()](https://huggingface.co/docs/transformers/model_doc/auto).

## Compute Infrastructure

The model was fine-tuned on Google Colaboratory.

### Hardware

The model was fine-tuned using GPU acceleration provided by Google Colaboratory.

### Software

Training software was written in Python.

# Citation

TBA

**BibTeX:**

TBA

**APA:**

TBA

# Model Card Authors

This model card was written by Kenneth E. Schackart III.

# Model Card Contact

Ken Schackart:
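# Appendix: Example Inference Sketch

The following is a minimal, unofficial sketch of how the checkpoint could be loaded and used to classify a single article. It assumes the checkpoint is a PyTorch state dict for a sequence classification model built on the base model named above; the article text, label mapping, and loading details are assumptions, and the GitHub repository remains the authoritative reference.

```python
# Illustrative only: loading the fine-tuned checkpoint and classifying one article.
# The checkpoint format and label mapping are assumptions; consult the GitHub
# repository for the exact loading and prediction procedure.
import re

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

BASE_MODEL = "allenai/dsp_roberta_base_dapt_biomed_tapt_rct_500"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(BASE_MODEL, num_labels=2)
model.load_state_dict(torch.load("article_classifier.pt", map_location="cpu"))
model.eval()

# Preprocessing as described above: concatenate title and abstract with a single
# space and strip XML tags with a regular expression.
title = "ExampleDB: a database of example biological data"        # hypothetical title
abstract = "<p>ExampleDB is a curated repository of records.</p>"  # hypothetical abstract
text = re.sub(r"<[^>]+>", "", f"{title} {abstract}")

inputs = tokenizer(text, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted = logits.argmax(dim=-1).item()  # assumed mapping: 1 = biodata resource, 0 = not
print(predicted)
```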