## ProtBert-BFD-IS

### Model Description

ProtBert-BFD-IS is a sequence classification model fine-tuned from the pre-trained ProtBert-BFD model. It takes a protein sequence as input and predicts whether the protein is soluble or insoluble. ProtBert-BFD-IS has been fine-tuned using 3 different training datasets.

**Finetuned from model:** Rostlab/prot_bert_bfd

GitHub repository with relevant files: https://github.com/VitaRin/ProtBert-IS

## Uses

The model can be used directly with a pipeline on a single sequence:

```
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline
import re

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub.
pipeline = TextClassificationPipeline(
    model=AutoModelForSequenceClassification.from_pretrained("VitaRin/ProtBert-IS"),
    tokenizer=AutoTokenizer.from_pretrained("VitaRin/ProtBert-IS"),
    device=0  # first GPU; use device=-1 to run on CPU
)

# Residues are separated by spaces.
sequence = "A E T C Z A O"
# Replace the rare amino acids U, Z, O and B with X.
sequence = re.sub(r"[UZOB]", "X", sequence)
output = pipeline(sequence)
```

Or read multiple sequences from a .fasta file:

```
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline
import re

pipeline = TextClassificationPipeline(
    model=AutoModelForSequenceClassification.from_pretrained("VitaRin/ProtBert-IS"),
    tokenizer=AutoTokenizer.from_pretrained("VitaRin/ProtBert-IS"),
    device=0  # first GPU; use device=-1 to run on CPU
)

with open("input.fasta", "r") as f:
    # Split the file into records and drop the chunk before the first ">".
    data = f.read().split(">")
    data.remove(data[0])

sequences = []
for d in data:
    # Drop the header line, join the sequence lines, then space-separate the residues.
    d = d.split('\n', 1)[-1].replace('\n', '')
    sequences.append(' '.join(d))

# Replace the rare amino acids U, Z, O and B with X.
sequences = [re.sub(r"[UZOB]", "X", sequence) for sequence in sequences]
print(pipeline(sequences))
```
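The preprocessing in the examples above (space-separating residues and replacing U, Z, O and B with X) can also be factored into small helpers that keep each FASTA header next to its sequence. The following is a minimal sketch, not part of the model's repository: the names `prepare_sequence` and `read_fasta` are hypothetical, and upper-casing the input is an assumption based on the examples using upper-case residues.

```python
import re
from typing import Iterable, Iterator, Tuple


def prepare_sequence(raw: str) -> str:
    """Return space-separated residues with U, Z, O and B replaced by X."""
    # Remove all whitespace, including spaces already between residues.
    # Upper-casing is an assumption, not something the model card states.
    residues = re.sub(r"\s+", "", raw).upper()
    # Map the rare amino acids to X, as in the examples above.
    residues = re.sub(r"[UZOB]", "X", residues)
    # One residue per whitespace-separated token.
    return " ".join(residues)


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, prepared_sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, prepare_sequence("".join(chunks))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence data may span several lines.
            chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield header, prepare_sequence("".join(chunks))
```

With these helpers, the records could be collected with `records = list(read_fasta(open("input.fasta")))` and the sequences passed to the pipeline as `pipeline([seq for _, seq in records])`, so that each prediction can be matched back to its header by position.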