---
license: mit
language: protein
tags:
- protein language model
datasets:
- Uniref50
---

# DistilProtBert

A distilled version of the [ProtBert](https://huggingface.co/Rostlab/prot_bert) model.
In addition to the cross-entropy and cosine teacher-student losses, DistilProtBert was pretrained with a masked language modeling (MLM) objective. It only works with uppercase amino acid letters.

# Model description

DistilProtBert was pretrained on millions of protein sequences.

A few important differences between DistilProtBert and the original ProtBert are:
1. Size of the model
2. Size of the pretraining dataset
3. Hardware used for pretraining

## Intended uses & limitations

The model can be used for protein feature extraction or fine-tuned on downstream tasks.

### How to use

The model can be used in the same way as ProtBert.
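
Because usage mirrors ProtBert, a feature-extraction call might look like the minimal sketch below. It assumes the `transformers` and `torch` packages are installed and follows ProtBert's input convention (uppercase residues separated by spaces, with the rare residues U, Z, O and B mapped to X). The helper names `prepare_sequence` and `embed`, the `model_id` argument, and the mean-pooling step are illustrative choices, not part of this model card.

```python
import re


def prepare_sequence(seq: str) -> str:
    """Format a raw protein string the way ProtBert-style tokenizers expect.

    Uppercases the sequence, maps rare residues (U, Z, O, B) to X,
    and separates residues with single spaces.
    """
    seq = re.sub(r"[UZOB]", "X", seq.strip().upper())
    return " ".join(seq)


def embed(sequences, model_id):
    """Return one mean-pooled embedding per sequence (hypothetical helper).

    `model_id` is the Hub repository ID or local path of the checkpoint;
    it is left as a required argument because this card does not state it.
    """
    # Imported here so that prepare_sequence stays usable without torch.
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_id, do_lower_case=False)
    model = BertModel.from_pretrained(model_id).eval()

    batch = tokenizer(
        [prepare_sequence(s) for s in sequences],
        return_tensors="pt",
        padding=True,
    )
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, tokens, hidden)

    # Average over non-padding positions (special tokens are included).
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```

Calling `embed(["MKTAYIAKQR"], model_id=...)` with this model's repository ID should return a tensor with one row per input sequence. Mean pooling is only one way to obtain a per-protein vector; per-residue tasks would use `last_hidden_state` directly.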

## Training data

DistilProtBert was pretrained on [UniRef50](https://www.uniprot.org/downloads), a dataset of ~43 million protein sequences (only sequences between 20 and 512 amino acids long were used).
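
The length restriction can be sketched as a simple filter. The snippet below is illustrative only: the function name `filter_by_length` is hypothetical, and treating both bounds as inclusive is an assumption, since the card does not say whether lengths of exactly 20 or 512 were kept.

```python
from typing import Iterable, Iterator

MIN_LEN = 20   # shortest sequence kept (assumed inclusive)
MAX_LEN = 512  # longest sequence kept (assumed inclusive)


def filter_by_length(
    sequences: Iterable[str],
    min_len: int = MIN_LEN,
    max_len: int = MAX_LEN,
) -> Iterator[str]:
    """Yield only the sequences whose residue count lies in [min_len, max_len]."""
    for seq in sequences:
        if min_len <= len(seq) <= max_len:
            yield seq
```

Writing it as a generator keeps memory use flat when streaming a large FASTA-derived collection.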

# Pretraining procedure

Preprocessing was done using ProtBert's tokenizer.
The masking procedure for each sequence followed the original BERT (as described in [ProtBert](https://huggingface.co/Rostlab/prot_bert)).

The model was pretrained on a single DGX cluster for 3 epochs in total. The local batch size was 16, and the optimizer was AdamW with a learning rate of 5e-5, using mixed-precision training.
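
The three loss terms named in the introduction (teacher-student cross entropy, MLM, and cosine) can be written out as below. This is a sketch assuming a DistilBERT-style formulation; the weights $\alpha, \beta, \gamma$ and the temperature $\tau$ are placeholders, as this card does not state their values or how the terms were combined. Here $M$ is the set of masked positions, $z^{T}_i$ and $z^{S}_i$ are the teacher and student logits at position $i$, $y_i$ is the true residue, and $h^{T}_i$, $h^{S}_i$ are the teacher and student hidden states.

```latex
\begin{aligned}
\mathcal{L} &= \alpha\,\mathcal{L}_{\mathrm{ce}} + \beta\,\mathcal{L}_{\mathrm{mlm}} + \gamma\,\mathcal{L}_{\mathrm{cos}} \\
\mathcal{L}_{\mathrm{ce}} &= -\sum_{i \in M} \sum_{v} t_{i,v} \log s_{i,v},
  \quad t_i = \mathrm{softmax}\!\left(z^{T}_i / \tau\right),\;
        s_i = \mathrm{softmax}\!\left(z^{S}_i / \tau\right) \\
\mathcal{L}_{\mathrm{mlm}} &= -\sum_{i \in M} \log \mathrm{softmax}\!\left(z^{S}_i\right)_{y_i} \\
\mathcal{L}_{\mathrm{cos}} &= \sum_{i} \left(1 - \cos\!\left(h^{T}_i, h^{S}_i\right)\right)
\end{aligned}
```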

## Evaluation results

When fine-tuned on downstream tasks, this model achieves the following results:

| Task/Dataset | Secondary structure (3-states) | Membrane |
|:-----:|:-----:|:-----:|
| CASP12 | 72 |    |
| TS115 | 81 |    |
| CB513 | 79 |    |
| DeepLoc |    | 86 |

Distinguish between:

### BibTeX entry and citation info