
DistilProtBert

DistilProtBert is a distilled version of the ProtBert-UniRef100 model. It was pretrained with a masked language modeling (MLM) objective in addition to cross-entropy and cosine teacher-student losses. The model only works with uppercase amino acid letters.

For more details, see our paper, "DistilProtBert: A distilled protein language model used to distinguish between real proteins and their randomly shuffled counterparts".

Git repository.

Model details

| Model | # of parameters | # of hidden layers | Pretraining dataset | # of proteins | Pretraining hardware |
| --- | --- | --- | --- | --- | --- |
| ProtBert | 420M | 30 | UniRef100 | 216M | 512 × 16GB TPUs |
| DistilProtBert | 230M | 15 | UniRef50 | 43M | 5 × V100 32GB GPUs |

Intended uses & limitations

The model can be used for protein feature extraction or fine-tuned on downstream tasks.

How to use

The model is used in the same way as ProtBert, with ProtBert's tokenizer.
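As a minimal sketch of feature extraction, the snippet below formats sequences the way ProtBert's tokenizer expects (uppercase, rare residues U, Z, O and B mapped to X, one space between residues) and mean-pools the last hidden layer. The Hub ID `yarongef/DistilProtBert` and the helper names `prepare_sequence` and `embed` are assumptions made for illustration; they are not stated in this card.

```python
import re


def prepare_sequence(seq: str) -> str:
    """Format a raw protein sequence for ProtBert's tokenizer."""
    seq = seq.strip().upper()            # the model only handles capital letters
    seq = re.sub(r"[UZOB]", "X", seq)    # map rare/ambiguous residues to X (ProtBert convention)
    return " ".join(seq)                 # one token per residue, space separated


def embed(sequences, model_id="yarongef/DistilProtBert"):
    """Return one mean-pooled embedding per sequence.

    model_id is an assumed Hub ID; replace it with the checkpoint you use.
    """
    # Imported lazily so prepare_sequence can be used without these packages.
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_id, do_lower_case=False)
    model = BertModel.from_pretrained(model_id).eval()

    batch = tokenizer([prepare_sequence(s) for s in sequences],
                      padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state      # (batch, tokens, hidden)

    # Average over all non-padding tokens (special tokens included).
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```

Calling `embed(["MKTAYIAKQR", "dlipTSSKLVV"])` should return a tensor with one row per input sequence. Mean pooling is only one possible choice; per-residue features can be taken directly from `last_hidden_state`.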

Training data

DistilProtBert was pretrained on UniRef50, a dataset of ~43 million protein sequences (only sequences between 20 and 512 amino acids long were used).
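The length filter can be sketched as follows. This is an illustration, not the authors' preprocessing script: the helper name `filter_by_length` is hypothetical, and treating both bounds as inclusive is an assumption, since the card does not say.

```python
def filter_by_length(records, min_len=20, max_len=512):
    """Yield (identifier, sequence) pairs whose length is within the bounds.

    Both bounds are treated as inclusive, which is an assumption.
    """
    for identifier, seq in records:
        if min_len <= len(seq) <= max_len:
            yield identifier, seq


# Toy records: one too short, one kept, one too long.
records = [("short", "M" * 19), ("ok", "M" * 20), ("long", "M" * 513)]
kept = [identifier for identifier, _ in filter_by_length(records)]
```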

Pretraining procedure

Preprocessing was done using ProtBert's tokenizer. The masking procedure for each sequence followed the original BERT (as described for ProtBert).
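For readers unfamiliar with the original BERT masking scheme, here is a minimal sketch: 15% of positions are selected, and of those 80% become `[MASK]`, 10% become a random token, and 10% are left unchanged. The residue list and the function name `mask_tokens` are illustrative assumptions and do not reproduce the tokenizer's exact vocabulary or the authors' training code.

```python
import random

# Illustrative residue alphabet, not the tokenizer's exact vocabulary.
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWYX")


def mask_tokens(tokens, rng, mask_prob=0.15):
    """BERT-style masking. Returns (corrupted tokens, labels).

    labels[i] holds the original token at selected positions and None elsewhere.
    """
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                                  # not selected: no loss at this position
        labels[i] = tok                               # the model must predict the original residue
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"                   # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(AMINO_ACIDS)    # 10%: replace with a random residue
        # remaining 10%: keep the original token
    return corrupted, labels


rng = random.Random(0)                                # fixed seed for reproducibility
tokens = list("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
corrupted, labels = mask_tokens(tokens, rng)
```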

The model was pretrained on a single DGX cluster for 3 epochs in total. Training used the AdamW optimizer with a learning rate of 5e-5, a local batch size of 16, and mixed precision.
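The introduction names three training signals: the MLM objective, a cross-entropy teacher-student loss, and a cosine teacher-student loss. One common way to combine them, in the style of DistilBERT, is a weighted sum. The formulation below is a sketch under that assumption; the weights $\alpha$ and any softmax temperature are not given in this card.

```latex
\begin{aligned}
\mathcal{L} &= \alpha_{\mathrm{ce}}\,\mathcal{L}_{\mathrm{ce}}
             + \alpha_{\mathrm{mlm}}\,\mathcal{L}_{\mathrm{mlm}}
             + \alpha_{\mathrm{cos}}\,\mathcal{L}_{\mathrm{cos}} \\
\mathcal{L}_{\mathrm{ce}}  &= -\sum_{i} t_i \log s_i \\
\mathcal{L}_{\mathrm{mlm}} &= -\log s_{y} \\
\mathcal{L}_{\mathrm{cos}} &= 1 - \cos\!\left(h^{T}, h^{S}\right)
\end{aligned}
```

Here $t_i$ and $s_i$ are the teacher's and student's predicted probabilities for token $i$ at a masked position, $y$ is the true residue at that position, and $h^{T}$ and $h^{S}$ are the teacher's and student's hidden-state vectors.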

Evaluation results

When fine-tuned on downstream tasks, this model achieves the following results:

| Task/Dataset | Secondary structure (3-states) | Membrane |
| --- | --- | --- |
| CASP12 | 72 | |
| TS115 | 81 | |
| CB513 | 79 | |
| DeepLoc | | 86 |

Distinguishing between real proteins and their k-let shuffled versions:

Singlet (dataset)

| Model | AUC |
| --- | --- |
| LSTM | 0.71 |
| ProtBert | 0.93 |
| DistilProtBert | 0.92 |

Doublet (dataset)

| Model | AUC |
| --- | --- |
| LSTM | 0.68 |
| ProtBert | 0.92 |
| DistilProtBert | 0.91 |

Triplet (dataset)

| Model | AUC |
| --- | --- |
| LSTM | 0.61 |
| ProtBert | 0.92 |
| DistilProtBert | 0.87 |
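To make the singlet case concrete, the sketch below builds a shuffled decoy by randomly permuting the residues of a sequence, which preserves its amino acid composition but not residue order. It assumes that "singlet" shuffling means exactly this; the function name `singlet_shuffle` is hypothetical. Doublet and triplet shuffling must also preserve the counts of overlapping residue pairs and triples, which requires a dedicated algorithm that is not shown here.

```python
import random
from collections import Counter


def singlet_shuffle(seq, rng):
    """Randomly permute residues, preserving amino acid composition only."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


rng = random.Random(42)                      # fixed seed for reproducibility
real = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
decoy = singlet_shuffle(real, rng)
assert Counter(decoy) == Counter(real)       # same residues, possibly in a new order
```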

Citation

If you use this model, please cite our paper:

@article{geffen2022distilprotbert,
    author = {Geffen, Yaron and Ofran, Yanay and Unger, Ron},
    title = {DistilProtBert: A distilled protein language model used to distinguish between real proteins and their randomly shuffled counterparts},
    year = {2022},
    doi = {10.1093/bioinformatics/btac474},
    URL = {https://doi.org/10.1093/bioinformatics/btac474},
    journal = {Bioinformatics}
}