---
language: protein
tags:
- protein language model
datasets:
- BFD
- Custom Rosetta
---
|
|
# ProtBert-BFD fine-tuned on the Rosetta 20AA dataset
|
|
|
This model is fine-tuned to predict Rosetta fold energy. It was trained on a dataset of 100k sequences of length 20 amino acids (20AA).
|
|
|
Current model in this repo: `prot_bert_bfd-finetuned-032722_1752` |
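
Below is a minimal inference sketch. It assumes the checkpoint was saved as a standard `transformers` sequence-regression model (`num_labels=1`) and is loaded from the local directory named above; the example sequence is an arbitrary 20AA string, not taken from the dataset.

```python
import re

import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Assumption: the checkpoint directory above contains both the fine-tuned
# regression model and its tokenizer files.
model_path = "prot_bert_bfd-finetuned-032722_1752"
tokenizer = BertTokenizer.from_pretrained(model_path, do_lower_case=False)
model = BertForSequenceClassification.from_pretrained(model_path, num_labels=1)
model.eval()

# ProtBert convention: uppercase residues separated by spaces, with the
# rare amino acids U, Z, O, B mapped to X.
sequence = "MKTAYIAKQRQISFVKSHFS"  # arbitrary 20AA example
spaced = " ".join(re.sub(r"[UZOB]", "X", sequence))

inputs = tokenizer(spaced, return_tensors="pt")
with torch.no_grad():
    energy = model(**inputs).logits.item()  # predicted Rosetta fold energy
print(f"{sequence}: {energy:.4f}")
```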
|
|
|
## Performance |
|
|
|
| Eval set | MAE | R² | MSE | RMSE |
| --- | --- | --- | --- | --- |
| 20AA sequences (1k) | 0.090115 | 0.991208 | 0.013034 | 0.114165 |
| 40AA sequences (10k) | 0.537456 | 0.659122 | 0.448607 | 0.669781 |
| 60AA sequences (10k) | 0.629267 | 0.506747 | 0.622476 | 0.788972 |
|
|
|
|
|
## `prot_bert_bfd` from ProtTrans |
|
The starting pretrained model is from ProtTrans, trained on 2.1 billion protein sequences from BFD using a masked language modeling (MLM) objective. It was introduced in [this paper](https://doi.org/10.1101/2020.07.12.199554) and first released in [this repository](https://github.com/agemagician/ProtTrans).
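
For reference, the base model can be queried directly on its MLM pretraining task. The sketch below assumes the public Hugging Face checkpoint `Rostlab/prot_bert_bfd`; note that ProtBert expects uppercase residues separated by spaces.

```python
from transformers import pipeline

# Masked-residue prediction with the pretrained (not fine-tuned) model.
unmasker = pipeline("fill-mask", model="Rostlab/prot_bert_bfd")
for pred in unmasker("D L I P T S S K L V V [MASK] D T S L Q V K K A F F A L V T"):
    print(pred["token_str"], round(pred["score"], 4))
```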
|
|
|
> Created by [Ladislav Rampasek](https://rampasek.github.io) |
|
|