ProtBert-IS
Model Description
ProtBert-IS is a model fine-tuned from the pre-trained ProtBert model for sequence classification. It takes a protein sequence as input and predicts whether the protein is soluble or insoluble. ProtBert-IS has been fine-tuned using 3 different training datasets.
Finetuned from model: Rostlab/prot_bert
GitHub repository with relevant files: https://github.com/VitaRin/ProtBert-IS
Uses
It can be used directly with a text classification pipeline on single sequences:
from transformers import TextClassificationPipeline, AutoModelForSequenceClassification, AutoTokenizer
import re

pipeline = TextClassificationPipeline(
    model=AutoModelForSequenceClassification.from_pretrained("VitaRin/ProtBert-IS"),
    tokenizer=AutoTokenizer.from_pretrained("VitaRin/ProtBert-IS"),
    device=0  # GPU index; use device=-1 to run on CPU
)

# Sequences must be given as space-separated amino acids.
sequence = "A E T C Z A O"
# Map rare/ambiguous amino acids (U, Z, O, B) to X, as expected by ProtBert.
sequence = re.sub(r"[UZOB]", "X", sequence)

output = pipeline(sequence)
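The pipeline returns a list of dictionaries, each with a predicted label and a confidence score. The exact label names depend on this model's configuration, so the output below is illustrative only:

print(output)
# Illustrative output shape (label names depend on the model config):
# [{'label': 'LABEL_1', 'score': 0.97}]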
Or read multiple sequences from a .fasta file:
from transformers import TextClassificationPipeline, AutoModelForSequenceClassification, AutoTokenizer
import re

pipeline = TextClassificationPipeline(
    model=AutoModelForSequenceClassification.from_pretrained("VitaRin/ProtBert-IS"),
    tokenizer=AutoTokenizer.from_pretrained("VitaRin/ProtBert-IS"),
    device=0  # GPU index; use device=-1 to run on CPU
)

with open("input.fasta", "r") as f:
    data = f.read().split(">")[1:]  # drop the empty entry before the first ">"

sequences = []
for d in data:
    # Drop the FASTA header line, join the remaining lines and space-separate the residues.
    d = d.split("\n", 1)[-1].replace("\n", "")
    sequences.append(" ".join(d))

# Map rare/ambiguous amino acids (U, Z, O, B) to X.
sequences = [re.sub(r"[UZOB]", "X", sequence) for sequence in sequences]

print(pipeline(sequences))
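If you also need to match predictions back to the original records, a minimal sketch is shown below. It assumes the same input.fasta as above and that each record's first line is its header; the header handling and printed fields are illustrative, not part of the model itself.

from transformers import TextClassificationPipeline, AutoModelForSequenceClassification, AutoTokenizer
import re

pipeline = TextClassificationPipeline(
    model=AutoModelForSequenceClassification.from_pretrained("VitaRin/ProtBert-IS"),
    tokenizer=AutoTokenizer.from_pretrained("VitaRin/ProtBert-IS"),
    device=-1  # CPU; set to a GPU index such as 0 if available
)

headers, sequences = [], []
with open("input.fasta", "r") as f:
    for entry in f.read().split(">")[1:]:
        # Split each record into its header line and the sequence body.
        header, _, body = entry.partition("\n")
        headers.append(header.strip())
        seq = " ".join(body.replace("\n", ""))
        sequences.append(re.sub(r"[UZOB]", "X", seq))

# Print each record's header next to its predicted label and score.
for header, prediction in zip(headers, pipeline(sequences)):
    print(header, prediction["label"], round(prediction["score"], 3))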