ProteinBERT

A model pre-trained on protein sequences and Gene Ontology annotations with a combined language modeling and annotation prediction objective.

Disclaimer

This is an UNOFFICIAL implementation of ProteinBERT: a universal deep-learning model of protein sequence and function by Nadav Brandes et al.

The OFFICIAL repository of ProteinBERT is at nadavbra/protein_bert.

The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

The team releasing ProteinBERT did not write a model card for this model, so this model card has been written by the MultiMolecule team.

Model Details

ProteinBERT is a protein language model with coupled local residue representations and a global protein representation. It is pre-trained on UniRef90 with a sequence language modeling objective and a Gene Ontology annotation recovery objective. ProteinBERT uses convolutional local branches and global-attention layers instead of quadratic self-attention, so the architecture has no learned positional table and can be evaluated on variable sequence lengths.
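To make the local/global coupling concrete, here is a minimal PyTorch sketch of a local-track update. It is not the MultiMolecule or the original implementation: the class name `LocalGlobalBlockSketch`, the kernel size of 9, and the dilation of 5 are illustrative assumptions, and the attention-based update of the global representation is omitted. It only illustrates why convolutions combined with a broadcast global vector need no fixed sequence length.

```python
import torch
from torch import nn


class LocalGlobalBlockSketch(nn.Module):
    """Illustrative local-track update; hypothetical, not the released architecture."""

    def __init__(self, hidden_size: int = 128, global_hidden_size: int = 512):
        super().__init__()
        # Narrow and wide (dilated) convolutions over the residue axis.
        self.narrow_conv = nn.Conv1d(hidden_size, hidden_size, kernel_size=9, padding="same")
        self.wide_conv = nn.Conv1d(hidden_size, hidden_size, kernel_size=9, dilation=5, padding="same")
        # Projects the global protein vector into the local hidden size.
        self.global_to_local = nn.Linear(global_hidden_size, hidden_size)
        self.activation = nn.GELU()
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, local: torch.Tensor, global_: torch.Tensor) -> torch.Tensor:
        # local: (batch, length, hidden_size); global_: (batch, global_hidden_size)
        channels_first = local.transpose(1, 2)  # Conv1d expects (batch, channels, length)
        narrow = self.activation(self.narrow_conv(channels_first)).transpose(1, 2)
        wide = self.activation(self.wide_conv(channels_first)).transpose(1, 2)
        # The global vector is added to every residue position.
        broadcast = self.activation(self.global_to_local(global_)).unsqueeze(1)
        return self.norm(local + narrow + wide + broadcast)


block = LocalGlobalBlockSketch()
shapes = []
for length in (16, 300):
    out = block(torch.randn(2, length, 128), torch.randn(2, 512))
    shapes.append(tuple(out.shape))
```

The same weights process both sequence lengths, and the output keeps the input's length in each case.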

Model Specification

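| Num Layers | Hidden Size | Global Hidden Size | Num Heads | Num Parameters (M) | FLOPs (G) | MACs (G) | Max Num Tokens |
| ---------- | ----------- | ------------------ | --------- | ------------------ | --------- | -------- | -------------- |
| 6          | 128         | 512                | 4         | 15.98              | 7.16      | 3.54     | 1024           |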

Usage

The model file depends on the multimolecule library. You can install it using pip:

pip install multimolecule

Direct Use

Masked Language Modeling

You can use this model directly with a pipeline for masked language modeling:

import multimolecule  # you must import multimolecule to register models
from transformers import pipeline

predictor = pipeline("fill-mask", model="multimolecule/proteinbert")
output = predictor("MVLSPADKTNVKAAW<mask>KVGAHAGEYGAEALER")
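The fill-mask pipeline in Transformers returns a list of dictionaries with `score`, `token`, `token_str`, and `sequence` keys. The sketch below shows one way to build a masked input and rank such a result. The helper names `mask_residue` and `top_predictions` are hypothetical, and the scores are hand-written stand-ins rather than real model output.

```python
def mask_residue(sequence: str, position: int, mask_token: str = "<mask>") -> str:
    """Replace the residue at a 0-based position with the mask token."""
    if not 0 <= position < len(sequence):
        raise IndexError(f"position {position} is outside a sequence of length {len(sequence)}")
    return sequence[:position] + mask_token + sequence[position + 1:]


def top_predictions(results: list, k: int = 3) -> list:
    """Return (token_str, score) pairs for the k highest-scoring candidates."""
    ranked = sorted(results, key=lambda item: item["score"], reverse=True)
    return [(item["token_str"], round(item["score"], 3)) for item in ranked[:k]]


masked = mask_residue("MVLSPADKTNVKAAWGKVGAHAGEYGAEALER", 15)

# Stand-in for `predictor(masked)`: illustrative values only, not real model output.
example_output = [
    {"token_str": "N", "score": 0.095},
    {"token_str": "D", "score": 0.202},
    {"token_str": "B", "score": 0.139},
]
best = top_predictions(example_output, k=2)
```

With a loaded pipeline, `top_predictions(predictor(masked))` would be used in place of the stand-in list.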

Downstream Use

Extract Features

Here is how to use this model to get the features of a given sequence in PyTorch:

from multimolecule import ProteinTokenizer, ProteinBertModel


tokenizer = ProteinTokenizer.from_pretrained("multimolecule/proteinbert")
model = ProteinBertModel.from_pretrained("multimolecule/proteinbert")

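text = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALER"
# Tokenize the sequence into PyTorch tensors.
inputs = tokenizer(text, return_tensors="pt")

# Forward pass; the output holds the extracted features.
output = model(**inputs)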
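A common next step is to collapse per-residue features into one vector per protein. The sketch below does this with a mask-aware mean. It assumes the model output exposes a `last_hidden_state` tensor of shape (batch, length, hidden size), following the usual Transformers convention; that attribute name is an assumption here, so random-free stand-in tensors are used instead of a real forward pass. `masked_mean_pool` is a hypothetical helper.

```python
import torch


def masked_mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average residue embeddings over non-padding positions."""
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch, length, 1)
    summed = (hidden_states * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # guard against division by zero
    return summed / counts


# Stand-ins for the model's hidden states and the tokenizer's attention mask.
hidden_states = torch.ones(2, 5, 128)
hidden_states[1, 3:] = 100.0  # values at padded positions must not affect the result
attention_mask = torch.tensor([[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]])

pooled = masked_mean_pool(hidden_states, attention_mask)  # shape: (2, 128)
```

Masking matters once sequences of different lengths are batched together, because padded positions would otherwise skew the average.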

Sequence Classification / Regression

This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.

Here is how to use this model as backbone to fine-tune for a sequence-level task in PyTorch:

import torch
from multimolecule import ProteinTokenizer, ProteinBertForSequencePrediction


tokenizer = ProteinTokenizer.from_pretrained("multimolecule/proteinbert")
model = ProteinBertForSequencePrediction.from_pretrained("multimolecule/proteinbert")

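text = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALER"
# Tokenize the sequence into PyTorch tensors.
inputs = tokenizer(text, return_tensors="pt")
# One label per sequence (batch size 1).
label = torch.tensor([1])

# Passing labels makes the model compute a loss for fine-tuning.
output = model(**inputs, labels=label)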
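The forward pass above is only one half of fine-tuning. A single optimization step might look like the sketch below, assuming the model output exposes a `.loss` attribute in the usual Transformers style. To keep the example self-contained, `StandInClassifier` is a hypothetical placeholder with the same calling convention, not the real ProteinBERT model, and the batch is random.

```python
from types import SimpleNamespace

import torch
from torch import nn


class StandInClassifier(nn.Module):
    """Hypothetical placeholder for the real model; returns an object with .loss and .logits."""

    def __init__(self, vocab_size: int = 32, hidden_size: int = 16, num_labels: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None, labels=None):
        logits = self.head(self.embed(input_ids).mean(dim=1))
        loss = nn.functional.cross_entropy(logits, labels) if labels is not None else None
        return SimpleNamespace(loss=loss, logits=logits)


def training_step(model, optimizer, batch, labels) -> float:
    """Run one forward/backward pass and a single optimizer update."""
    model.train()
    optimizer.zero_grad()
    output = model(**batch, labels=labels)
    output.loss.backward()
    optimizer.step()
    return output.loss.item()


torch.manual_seed(0)
model = StandInClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
batch = {
    "input_ids": torch.randint(0, 32, (4, 10)),
    "attention_mask": torch.ones(4, 10, dtype=torch.long),
}
labels = torch.tensor([0, 1, 0, 1])
losses = [training_step(model, optimizer, batch, labels) for _ in range(20)]
```

With the real model, `batch` would be the tokenizer output and the loop would iterate over a dataset rather than repeating one batch.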

Token Classification / Regression

This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for token classification or regression.

Here is how to use this model as backbone to fine-tune for a residue-level task in PyTorch:

import torch
from multimolecule import ProteinTokenizer, ProteinBertForTokenPrediction


tokenizer = ProteinTokenizer.from_pretrained("multimolecule/proteinbert")
model = ProteinBertForTokenPrediction.from_pretrained("multimolecule/proteinbert")

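text = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALER"
# Tokenize the sequence into PyTorch tensors.
inputs = tokenizer(text, return_tensors="pt")
# One random binary label per residue, as a placeholder for real annotations.
label = torch.randint(2, (1, len(text)))

# Passing labels makes the model compute a loss for fine-tuning.
output = model(**inputs, labels=label)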
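The example above uses random labels. In practice, residue-level labels usually come from a per-residue annotation string. The sketch below converts such a string into integer class ids, assuming (as `len(text)` above suggests) one label per residue. The helper `encode_residue_labels`, the annotation, and the label mapping are all hypothetical.

```python
def encode_residue_labels(annotation: str, label_to_id: dict) -> list:
    """Map one annotation character per residue to an integer class id."""
    unknown = sorted(set(annotation) - set(label_to_id))
    if unknown:
        raise ValueError(f"unmapped annotation characters: {unknown}")
    return [label_to_id[char] for char in annotation]


sequence = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALER"
# Hypothetical annotation (H = helix, C = coil), written by hand for illustration.
annotation = "C" * 3 + "H" * 12 + "C" * 5 + "H" * 12
label_ids = encode_residue_labels(annotation, {"C": 0, "H": 1})

# Labels and residues must align one-to-one.
assert len(label_ids) == len(sequence)
```

Wrapping the result as `torch.tensor([label_ids])` gives a tensor of shape (1, sequence length), matching the shape of the random labels in the example.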

Training Details

Training Data

ProteinBERT is pre-trained on approximately 106 million protein sequences from UniRef90, together with their Gene Ontology annotations.

Training Procedure

ProteinBERT is trained with a combined objective over masked protein sequence recovery and Gene Ontology annotation prediction. Please refer to the original paper for details on the training setup.
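As a rough illustration of such a combined objective, the sketch below sums a per-residue cross-entropy with a multi-label binary cross-entropy over annotations. The function name, the tensor sizes, and the equal weighting of the two terms are assumptions made for this example; the original paper remains the reference for the actual setup.

```python
import math

import torch
from torch import nn


def combined_pretraining_loss(token_logits, token_targets, annotation_logits, annotation_targets):
    """Sum a per-residue cross-entropy with a multi-label binary cross-entropy."""
    # token_logits: (batch, length, vocab); token_targets: (batch, length), -100 marks ignored positions
    sequence_loss = nn.functional.cross_entropy(
        token_logits.flatten(0, 1), token_targets.flatten(), ignore_index=-100
    )
    # annotation_logits and annotation_targets: (batch, num_annotations), targets in {0, 1}
    annotation_loss = nn.functional.binary_cross_entropy_with_logits(
        annotation_logits, annotation_targets.float()
    )
    return sequence_loss + annotation_loss


# Hypothetical sizes chosen for illustration only.
vocab_size, num_annotations = 26, 8
loss = combined_pretraining_loss(
    torch.zeros(2, 10, vocab_size),          # uniform prediction over residues
    torch.randint(0, vocab_size, (2, 10)),
    torch.zeros(2, num_annotations),         # probability 0.5 for every annotation
    torch.randint(0, 2, (2, num_annotations)),
)
# Uninformative predictions give ln(vocab_size) + ln(2), whatever the targets are.
expected = math.log(vocab_size) + math.log(2)
```

The uniform-logit case is a useful sanity check: a freshly initialized model should start near this value and improve from there.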

Citation

@article{brandes2022proteinbert,
  title   = {ProteinBERT: a universal deep-learning model of protein sequence and function},
  author  = {Brandes, Nadav and Ofer, Dan and Peleg, Yam and Rappoport, Nadav and Linial, Michal},
  year    = {2022},
  journal = {Bioinformatics},
  volume  = {38},
  number  = {8},
  pages   = {2102--2110},
  doi     = {10.1093/bioinformatics/btac020},
  url     = {https://doi.org/10.1093/bioinformatics/btac020},
}

The artifacts distributed in this repository are part of the MultiMolecule project. If MultiMolecule supports your research, please cite the MultiMolecule project as follows:

@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}

Contact

Please use the GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the ProteinBERT paper for questions or comments on the paper/model.

License

This model implementation is licensed under the GNU Affero General Public License.

For additional terms and clarifications, please refer to our License FAQ.

SPDX-License-Identifier: AGPL-3.0-or-later