---
license: cc-by-nc-nd-4.0
metrics:
- accuracy
tags:
- generated_from_trainer
- Text Generation
- Primary Sequence Prediction
model-index:
- name: protgpt2-finetuned-sarscov2-rbd
  results: []
---

# Model Card for `protgpt2-finetuned-sarscov2-rbd`

This model is a fine-tuned version of [nferruz/ProtGPT2](https://huggingface.co/nferruz/ProtGPT2) on sequences from the NCBI Virus Data Portal.

It achieves the following results on the evaluation set:
- Loss: 1.1674
- Accuracy: 0.8883
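
For reference, if the reported loss is the mean per-token cross-entropy (natural
log), it corresponds to a perplexity of roughly exp(1.1674) ≈ 3.21. A minimal
check in Python:

```python
import math

# Assuming the eval loss is mean per-token cross-entropy in nats,
# perplexity is simply its exponential.
eval_loss = 1.1674
perplexity = math.exp(eval_loss)
print(f"perplexity ≈ {perplexity:.2f}")  # ≈ 3.21
```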

## Model description

This model is a fine-tuned checkpoint of
[ProtGPT2](https://huggingface.co/nferruz/ProtGPT2), which was originally
trained on the UniRef50 (version 2021_04) database. For a detailed overview
of the original model configuration and architecture, please see the linked
model card, or refer to the ProtGPT2 publication.

The model was fine-tuned on data from the SARS-CoV-2 Spike (surface glycoprotein)
receptor binding domain (RBD).

A repository with the training scripts, train and test data partitions, and
evaluation code is available on GitHub at
https://github.com/rahuldhodapkar/PredictSARSVariants.

## Intended uses & limitations

This model is intended to generate synthetic SARS-CoV-2 surface glycoprotein
(spike protein) sequences for the purpose of identifying meaningful variants
to characterize either experimentally or with other *in silico* tools. These
variants may be used to guide vaccine development against probable but
not-yet-observed point mutants.
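
A minimal sketch of this use case, using the standard `transformers`
text-generation pipeline with sampling settings similar to those suggested in
the base ProtGPT2 model card. The repo id below is illustrative; substitute the
actual model id or a local checkpoint path.

```python
from transformers import pipeline

# Hypothetical repo id; replace with the actual model id or a local path.
model_id = "rahuldhodapkar/protgpt2-finetuned-sarscov2-rbd"

generator = pipeline("text-generation", model=model_id)

# ProtGPT2-style models are typically prompted with "<|endoftext|>" and
# sampled rather than decoded greedily; max_length is in BPE tokens and
# is illustrative here.
sequences = generator(
    "<|endoftext|>",
    max_length=100,
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=5,
    eos_token_id=0,
)

for s in sequences:
    print(s["generated_text"])
```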

As this model is based on the original ProtGPT2 model, it is subject to many
of the same limitations as the base model. Any biases present in the UniRef50
dataset will also be present in this model, including a potentially nonuniform
sampling of peptides across taxonomic clades. These limitations should be
considered when interpreting the output of this model.

## Training and evaluation data

SARS-CoV-2 spike protein sequences were obtained from the NIH SARS-CoV-2 Data Hub,
accessible at

    https://www.ncbi.nlm.nih.gov/labs/virus/vssi/

Note that the reference sequence for the surface glycoprotein can be found at:

    https://www.ncbi.nlm.nih.gov/protein/1791269090

As the base ProtGPT2 model was pretrained on the UniRef50 (version 2021_04)
dataset, its training data cannot have included sequences generated after that
release. Evaluations are therefore conducted using SARS-CoV-2 sequences
generated on or after May 2021.
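
A hedged sketch of how downloaded sequences might be prepared for fine-tuning,
assuming a FASTA export from the Data Hub and the ProtGPT2 convention of
examples that start with `<|endoftext|>` and wrap the sequence at 60 residues
per line. File names are placeholders; the authoritative download and
partitioning scripts are in the linked GitHub repository.

```python
import textwrap

# Placeholder file names, for illustration only.
fasta_in = "spike_rbd_sequences.fasta"
train_out = "train.txt"

def read_fasta(path):
    """Yield amino-acid sequences from a FASTA file, ignoring headers."""
    seq = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if seq:
                    yield "".join(seq)
                seq = []
            elif line:
                seq.append(line)
    if seq:
        yield "".join(seq)

with open(train_out, "w") as out:
    for seq in read_fasta(fasta_in):
        # ProtGPT2 examples conventionally begin with <|endoftext|> followed
        # by the sequence in a FASTA-like, 60-character-per-line body.
        out.write("<|endoftext|>\n")
        out.write("\n".join(textwrap.wrap(seq, 60)) + "\n")
```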

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3.0
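
A sketch of how these hyperparameters might be expressed as `TrainingArguments`
for the Hugging Face `Trainer`. This mapping is illustrative only; the actual
training scripts are in the linked GitHub repository.

```python
from transformers import TrainingArguments

# Illustrative mapping of the hyperparameters listed above.
training_args = TrainingArguments(
    output_dir="protgpt2-finetuned-sarscov2-rbd",
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=3.0,
)
```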

### Framework versions

- Transformers 4.26.0.dev0
- Pytorch 1.11.0
- Datasets 2.8.0
- Tokenizers 0.13.2