ProtBERT-Unmasking

This model is a fine-tuned version of ProtBERT specifically optimized for unmasking protein sequences. It can predict masked amino acids in protein sequences based on the surrounding context.

Model Description

  • Base Model: ProtBERT
  • Task: Protein Sequence Unmasking
  • Training: Fine-tuned on masked protein sequences
  • Use Case: Predicting missing or masked amino acids in protein sequences
  • Optimal Use: Best performance on E. coli sequences with known amino acids K, C, Y, H, S, M

For detailed information about the training methodology and approach, please refer to our paper: https://arxiv.org/abs/2408.00892

Usage

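import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained("your-username/protbert-sequence-unmasking")
tokenizer = AutoTokenizer.from_pretrained("your-username/protbert-sequence-unmasking")
model.eval()  # inference mode: disables dropout

# Example usage for E. coli sequence with known amino acids (K,C,Y,H,S,M)
# The base ProtBERT tokenizer expects residues separated by spaces, with [MASK]
# as its own token (assuming this fine-tuned tokenizer follows the same convention)
sequence = "M A L N [MASK] K F G P [MASK] L V R K"
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
predictions = outputs.logits  # raw scores, shape: (batch, sequence_length, vocab_size)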
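
The logits returned by the model are raw scores over the vocabulary at every position, so they still need to be turned into amino acid predictions. The following is a minimal sketch of one way to do that, written in plain Python so that it works on nested lists. The function name `top_k_at_masks` and its parameters are hypothetical and not part of the model or the `transformers` library.

```python
import math


def top_k_at_masks(logits_rows, input_ids, mask_token_id, id_to_token, k=3):
    """Return the k most probable tokens at each masked position.

    logits_rows:   one row of vocabulary logits per token position
    input_ids:     token ids of the same sequence, in the same order
    mask_token_id: id the tokenizer assigns to [MASK]
    id_to_token:   list mapping a vocabulary id to its token string
    """
    results = {}
    for pos, token_id in enumerate(input_ids):
        if token_id != mask_token_id:
            continue
        row = logits_rows[pos]
        # Numerically stable softmax over the vocabulary
        peak = max(row)
        exps = [math.exp(value - peak) for value in row]
        total = sum(exps)
        # Vocabulary ids sorted by descending logit, truncated to k
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        results[pos] = [(id_to_token[i], exps[i] / total) for i in ranked]
    return results
```

With the objects from the snippet above, the inputs could be obtained as `outputs.logits[0].tolist()`, `inputs["input_ids"][0].tolist()`, `tokenizer.mask_token_id`, and `tokenizer.convert_ids_to_tokens(list(range(len(tokenizer))))`, assuming the tokenizer and model share the same vocabulary size.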

Inference API

The model is optimized for:

  • Organism: E. coli
  • Known Amino Acids: K, C, Y, H, S, M
  • Task: Predicting unknown amino acids in a sequence
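
As an illustration of the setting described above, the sketch below builds a model input from a full sequence by keeping only the known residues and replacing every other position with a mask token. It assumes that "known amino acids" means residues whose identity is given while all others are to be predicted, and that residues are passed to the tokenizer separated by spaces, as the base ProtBERT tokenizer expects. The names `KNOWN_RESIDUES` and `mask_unknown_residues` are hypothetical.

```python
# Residues treated as known; taken from the list above
KNOWN_RESIDUES = frozenset("KCYHSM")


def mask_unknown_residues(sequence, known=KNOWN_RESIDUES, mask_token="[MASK]"):
    """Replace every residue outside `known` with the mask token.

    Returns a space-separated string, one token per residue.
    """
    # Drop any whitespace and normalize case before masking
    residues = "".join(sequence.split()).upper()
    tokens = [res if res in known else mask_token for res in residues]
    return " ".join(tokens)
```

For example, `mask_unknown_residues("KAYHSL")` keeps K, Y, H, and S and masks the A and the L.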

Example API usage:

from transformers import pipeline

unmasker = pipeline('fill-mask', model='your-username/protbert-sequence-unmasking')
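# Residues are space-separated, as the base ProtBERT tokenizer expects
sequence = "K [MASK] Y H S [MASK]"  # Example with known amino acids K,Y,H,S
results = unmasker(sequence)

# With more than one [MASK], the pipeline returns one list of candidates per mask
for mask_index, candidates in enumerate(results):
    for result in candidates:
        print(f"Mask {mask_index}: predicted amino acid: {result['token_str']}, Score: {result['score']:.3f}")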
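
To turn the pipeline output back into a complete sequence, each mask can be filled with its highest-scoring candidate. The sketch below is one way to do this under two assumptions: the input uses space-separated residues, and the results have the shape the `fill-mask` pipeline produces (a flat list of candidates for a single mask, or one list of candidates per mask for several). The function name `fill_masks` is hypothetical.

```python
def fill_masks(masked_sequence, results, mask_token="[MASK]"):
    """Fill each mask with its top-scoring candidate and return the sequence."""
    # A single mask yields a flat list of dicts; wrap it so both shapes match
    if results and isinstance(results[0], dict):
        results = [results]

    tokens = masked_sequence.split()
    mask_positions = [i for i, tok in enumerate(tokens) if tok == mask_token]
    if len(mask_positions) != len(results):
        raise ValueError("number of masks does not match number of result lists")

    for pos, candidates in zip(mask_positions, results):
        best = max(candidates, key=lambda cand: cand["score"])
        tokens[pos] = best["token_str"]
    return "".join(tokens)
```

Note that each mask is filled independently from its own candidate list, so the result is not guaranteed to be the jointly most likely sequence.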

Limitations and Biases

  • This model is specifically designed for protein sequence unmasking in E. coli
  • Optimal performance is achieved when working with sequences containing known amino acids K, C, Y, H, S, M
  • The model may not perform optimally for:
    • Sequences from other organisms
    • Sequences without the specified known amino acids
    • Other protein-related tasks

Training Details

The complete details of the training methodology, dataset preparation, and model evaluation can be found in our paper: https://arxiv.org/abs/2408.00892

Model size: 420M params (Safetensors, tensor type F32)