metadata
license: cc-by-nc-nd-4.0
metrics:
  - accuracy
tags:
  - generated_from_trainer
  - Text Generation
  - Primary Sequence Prediction
model-index:
  - name: protgpt2-finetuned-sarscov2-rbd
    results: []

Model Card for protgpt2-finetuned-sarscov2-rbd

This model is a fine-tuned version of nferruz/ProtGPT2 on sequences from the NCBI Virus Data Portal.

It achieves the following results on the evaluation set:

  • Loss: 1.1674
  • Accuracy: 0.8883
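The loss above can be turned into a perplexity, which is often easier to interpret. The sketch below assumes that the reported loss is the mean per-token cross-entropy in nats, as is usual for causal language models trained with the Transformers Trainer; the model card does not state this explicitly.

```python
import math

# Evaluation loss reported above; assumed to be the mean per-token
# cross-entropy measured in nats (not confirmed by the model card).
eval_loss = 1.1674

# Perplexity is the exponential of the mean cross-entropy.
perplexity = math.exp(eval_loss)
print(f"perplexity = {perplexity:.2f}")
```

Under the same assumption, the accuracy figure would be the fraction of tokens for which the most likely next-token prediction matches the evaluation sequence.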

Model description

This model is a fine-tuned checkpoint of ProtGPT2, which was originally trained on the UniRef50 (version 2021_04) database. For a detailed overview of the original model configuration and architecture, please see the linked model card, or refer to the ProtGPT2 publication.

The model was fine-tuned on sequences of the SARS-CoV-2 spike (surface glycoprotein) receptor-binding domain (RBD).

A repository with the training scripts, the train and test data partitions, and the evaluation code is available on GitHub at https://github.com/rahuldhodapkar/PredictSARSVariants.

Intended uses & limitations

This model is intended to generate synthetic SARS-CoV-2 surface glycoprotein (a.k.a. spike protein) sequences for the purpose of identifying meaningful variants for characterization, either experimentally or through other in silico tools. These variants may be used to guide vaccine development to protect against point mutants that have not yet been observed but are likely to emerge.
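Generating candidate sequences could look roughly like the sketch below. It uses the Transformers text-generation pipeline with the sampling settings suggested for the base ProtGPT2 model, which may not be the best choice for this checkpoint. The repository id in the commented example, the helper names, and the assumption that outputs carry an end-of-text token and FASTA-style line breaks are illustrative and not confirmed by this model card.

```python
from typing import List


def clean_generated(text: str) -> str:
    """Remove the end-of-text token and any line breaks from one output."""
    return text.replace("<|endoftext|>", "").replace("\n", "").strip()


def generate_candidates(model_id: str, n: int = 5, max_length: int = 100) -> List[str]:
    """Sample n sequences from a ProtGPT2-style checkpoint (hypothetical helper)."""
    # Imported lazily so clean_generated can be used without transformers.
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_id)
    outputs = generator(
        "<|endoftext|>",          # start-of-sequence prompt used by ProtGPT2
        max_length=max_length,    # counted in tokens, not amino acids
        do_sample=True,
        top_k=950,
        repetition_penalty=1.2,
        num_return_sequences=n,
        eos_token_id=0,
    )
    return [clean_generated(o["generated_text"]) for o in outputs]


# Example call (downloads model weights, so it is left commented out).
# The repository id is an assumption; substitute the actual location.
# for seq in generate_candidates("rahuldhodapkar/protgpt2-finetuned-sarscov2-rbd"):
#     print(seq)
```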

As this model is based on the original ProtGPT2 model, it is subject to many of the same limitations as the base model. Any biases present in the UniRef50 dataset will also be present in the model, which may include nonuniform skew of peptides sampled across different taxonomic clades. These limitations should be considered when interpreting the output of this model.

Training and evaluation data

SARS-CoV-2 spike protein sequences were obtained from the NIH SARS-CoV-2 Data Hub, accessible at:

https://www.ncbi.nlm.nih.gov/labs/virus/vssi/

Note that the reference sequence for the surface glycoprotein can be found at:

https://www.ncbi.nlm.nih.gov/protein/1791269090
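One way to use the reference sequence is to list the point substitutions in a generated candidate. The sketch below is illustrative only: the function name and the `start` offset are hypothetical, the sequences in the example are made-up toy strings rather than real spike fragments, and both inputs are assumed to be already aligned and of equal length.

```python
from typing import List


def point_mutations(reference: str, candidate: str, start: int = 1) -> List[str]:
    """List substitutions as strings such as 'G4E' (reference, position, variant).

    `start` is the position number assigned to the first residue, so a
    fragment can be numbered relative to the full-length protein.
    """
    if len(reference) != len(candidate):
        raise ValueError("sequences must have equal length; align them first")
    return [
        f"{ref}{pos}{alt}"
        for pos, (ref, alt) in enumerate(zip(reference, candidate), start=start)
        if ref != alt
    ]


# Toy example with made-up sequences.
print(point_mutations("NGVGYQ", "NGVEYQ"))  # ['G4E']
```

Insertions and deletions are not handled here; a proper alignment step would be needed for candidates whose length differs from the reference.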

Because the base ProtGPT2 model was pretrained on UniRef50 (version 2021_04), its pretraining data cannot include sequences generated after that release. Evaluations will therefore be conducted using SARS-CoV-2 sequences generated in or after May 2021.
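A date-based partition of this kind can be sketched as follows. The record layout (a dict with `accession`, `collected`, and `sequence` keys), the accession strings, and the use of collection date rather than release date are assumptions made for illustration; the actual partitioning code lives in the GitHub repository.

```python
from datetime import date
from typing import Dict, List, Tuple

# First day of the evaluation window described above.
CUTOFF = date(2021, 5, 1)


def split_by_date(
    records: List[Dict], cutoff: date = CUTOFF
) -> Tuple[List[Dict], List[Dict]]:
    """Return (records dated before the cutoff, records dated on or after it)."""
    before = [r for r in records if r["collected"] < cutoff]
    on_or_after = [r for r in records if r["collected"] >= cutoff]
    return before, on_or_after


# Toy records with placeholder accessions and sequences.
records = [
    {"accession": "toy-1", "collected": date(2020, 11, 3), "sequence": "NGVGYQ"},
    {"accession": "toy-2", "collected": date(2021, 5, 1), "sequence": "NGVEYQ"},
    {"accession": "toy-3", "collected": date(2022, 1, 15), "sequence": "NGVKYQ"},
]
before, on_or_after = split_by_date(records)
print(len(before), len(on_or_after))  # 1 2
```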

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 16
  • eval_batch_size: 16
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 3.0
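For readers who want to reproduce a similar run, the values above map onto Transformers `TrainingArguments` roughly as sketched below. This is not the author's training script: the mapping assumes a single device (so that the reported batch sizes equal the per-device values), and `build_training_args` and its `output_dir` argument are hypothetical.

```python
# Hyperparameters listed above, expressed as TrainingArguments keyword arguments.
TRAINING_KWARGS = {
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 16,  # assumes training on a single device
    "per_device_eval_batch_size": 16,
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 3.0,
}


def build_training_args(output_dir: str):
    """Create a TrainingArguments object from the values above (hypothetical helper)."""
    # Imported lazily so the dictionary can be inspected without transformers.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```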

Framework versions

  • Transformers 4.26.0.dev0
  • PyTorch 1.11.0
  • Datasets 2.8.0
  • Tokenizers 0.13.2