---
license: cc-by-nc-nd-4.0
metrics:
- accuracy
tags:
- generated_from_trainer
- Text Generation
- Primary Sequence Prediction
model-index:
- name: protgpt2-finetuned-sarscov2-rbd
  results: []
---

# Model Card for `protgpt2-finetuned-sarscov2-rbd`

This model is a fine-tuned version of [nferruz/ProtGPT2](https://huggingface.co/nferruz/ProtGPT2) on sequences from the NCBI Virus Data Portal.

It achieves the following results on the evaluation set:
- Loss: 1.1674
- Accuracy: 0.8883

## Model description

This model is a fine-tuned checkpoint of [ProtGPT2](https://huggingface.co/nferruz/ProtGPT2), which was originally trained on the UniRef50 (version 2021_04) database. For a detailed overview of the original model configuration and architecture, see the linked model card or the ProtGPT2 publication.

The model was fine-tuned on data from the SARS-CoV-2 Spike (surface glycoprotein) receptor binding domain (RBD).

A repository with the training scripts, the train and test data partitions, and the evaluation code is available on GitHub at https://github.com/rahuldhodapkar/PredictSARSVariants.

## Intended uses & limitations

This model is intended to generate synthetic SARS-CoV-2 surface glycoprotein (a.k.a. spike protein) sequences in order to identify meaningful variants for characterization, either experimentally or through other *in silico* tools. These variants may be used to guide vaccine development that protects against previously unseen point mutants likely to arise in the future.

Because this model is based on the original ProtGPT2 model, it is subject to many of the same limitations as the base model. Any biases present in the UniRef50 dataset will also be present in this model, which may include nonuniform skew of peptides sampled across different taxonomic clades. These limitations should be considered when interpreting the output of this model.

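
As a minimal sketch of the intended use described above, the checkpoint could be sampled with the `transformers` text-generation pipeline. Several details here are assumptions rather than facts from this card: `MODEL_ID` is a placeholder (the card gives the model name but not its Hub namespace), the sampling settings are borrowed from the example on the base ProtGPT2 model card and may not be optimal for this fine-tuned checkpoint, and `clean_sequence` and `generate_sequences` are hypothetical helper names.

```python
from typing import List

# Placeholder Hub id (assumption): prepend the owning namespace before use.
MODEL_ID = "protgpt2-finetuned-sarscov2-rbd"

# Sampling settings (assumption): taken from the base ProtGPT2 usage example.
GENERATION_KWARGS = dict(
    max_length=100,
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=10,
    eos_token_id=0,
)


def clean_sequence(text: str) -> str:
    """Drop the end-of-text token and any whitespace or line breaks."""
    return "".join(text.replace("<|endoftext|>", "").split())


def generate_sequences(model_id: str = MODEL_ID) -> List[str]:
    """Sample sequences from the model and return them as plain strings."""
    # Imported lazily so clean_sequence works without transformers installed.
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_id)
    outputs = generator("<|endoftext|>", **GENERATION_KWARGS)
    return [clean_sequence(o["generated_text"]) for o in outputs]
```

Calling `generate_sequences()` downloads the checkpoint and returns a list of candidate sequences; these are raw samples and still need filtering and downstream characterization before they are treated as plausible variants.
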
## Training and evaluation data

SARS-CoV-2 spike protein sequences were obtained from the NIH SARS-CoV-2 Data Hub, accessible at https://www.ncbi.nlm.nih.gov/labs/virus/vssi/

The reference sequence for the surface glycoprotein can be found at https://www.ncbi.nlm.nih.gov/protein/1791269090

Because the loaded ProtGPT2 model was pretrained on the UniRef50 (version 2021_04) dataset, its pretraining data cannot include sequences generated after that release. Evaluations will be conducted using SARS-CoV-2 sequences generated on or after May 2021.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3.0

### Framework versions

- Transformers 4.26.0.dev0
- PyTorch 1.11.0
- Datasets 2.8.0
- Tokenizers 0.13.2
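
For readers who want to reproduce a similar run, the values under "Training hyperparameters" can be sketched as keyword arguments keyed by `transformers.TrainingArguments` field names, together with the shape of the linear learning-rate schedule. This is an illustration, not the author's training script (which lives in the linked GitHub repository): it assumes a single device, so the reported batch sizes are used as per-device values, and it assumes zero warmup steps, which the card does not state. `TRAINING_KWARGS` and `linear_lr` are hypothetical names.

```python
# Hyperparameters from this card, keyed by TrainingArguments field names.
TRAINING_KWARGS = {
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 16,  # assumes a single device
    "per_device_eval_batch_size": 16,   # assumes a single device
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 3.0,
}


def linear_lr(step: int, total_steps: int,
              base_lr: float = 1e-5, warmup_steps: int = 0) -> float:
    """Learning rate at `step` under linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup.
        return base_lr * (step / max(1, warmup_steps))
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * (remaining / max(1, total_steps - warmup_steps))
```

With these assumptions the learning rate starts at 1e-05 and falls linearly to zero at the final optimizer step; `TRAINING_KWARGS` could be passed as `TrainingArguments(output_dir=..., **TRAINING_KWARGS)` with an output directory of the reader's choosing.
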