## ProtBert-BFD-IS
### Model Description
ProtBert-BFD-IS is a model fine-tuned from the pre-trained ProtBert-BFD model for sequence classification. It takes a protein sequence as input and predicts whether the protein is soluble or insoluble.
ProtBert-BFD-IS was fine-tuned on 3 different training datasets.
**Finetuned from model:** Rostlab/prot_bert_bfd
GitHub repository with relevant files: https://github.com/VitaRin/ProtBert-IS
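Inputs follow the ProtBert convention: residues are space-separated, and the rare amino acids U, Z, O and B are mapped to X before tokenization (the usage examples below apply the same substitution). A minimal preprocessing sketch; the helper name `preprocess_sequence` is illustrative, not part of the repository:

```python
import re

def preprocess_sequence(seq: str) -> str:
    """Space-separate the residues and map rare amino acids (U, Z, O, B) to X."""
    # Accept either a raw sequence ("AETC") or an already-spaced one ("A E T C").
    seq = " ".join(seq.replace(" ", ""))
    return re.sub(r"[UZOB]", "X", seq)

print(preprocess_sequence("AETCZAO"))  # -> "A E T C X A X"
```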
## Uses
It can be used directly with a pipeline on single sequences:
```
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline
import re

pipeline = TextClassificationPipeline(
    model=AutoModelForSequenceClassification.from_pretrained("VitaRin/ProtBert-IS"),
    tokenizer=AutoTokenizer.from_pretrained("VitaRin/ProtBert-IS"),
    device=0
)

# Residues must be space-separated; map rare amino acids (U, Z, O, B) to X.
sequence = "A E T C Z A O"
sequence = re.sub(r"[UZOB]", "X", sequence)

output = pipeline(sequence)
```
Or read multiple sequences from a .fasta file:
```
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline
import re

pipeline = TextClassificationPipeline(
    model=AutoModelForSequenceClassification.from_pretrained("VitaRin/ProtBert-IS"),
    tokenizer=AutoTokenizer.from_pretrained("VitaRin/ProtBert-IS"),
    device=0
)

with open("input.fasta", "r") as f:
    # Split the file into FASTA records, dropping the empty chunk before the first ">".
    data = f.read().split(">")[1:]

sequences = []
for d in data:
    # Drop the header line, join the sequence lines, and space-separate the residues.
    residues = "".join(d.split("\n")[1:])
    sequences.append(" ".join(residues))

# Map rare amino acids (U, Z, O, B) to X.
sequences = [re.sub(r"[UZOB]", "X", sequence) for sequence in sequences]

print(pipeline(sequences))
```
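The FASTA handling above can be factored into a small standalone helper that works on the file contents as a string, which makes the parsing easy to test without the model. The function name `fasta_to_sequences` is illustrative, not part of the repository:

```python
import re

def fasta_to_sequences(fasta_text: str) -> list[str]:
    """Parse FASTA records into space-separated sequences with U/Z/O/B mapped to X."""
    sequences = []
    for record in fasta_text.split(">")[1:]:  # skip any text before the first ">"
        # Drop the header line and join the remaining sequence lines.
        residues = "".join(record.split("\n")[1:])
        sequences.append(re.sub(r"[UZOB]", "X", " ".join(residues)))
    return sequences

fasta = ">example_header\nAETC\nZAO\n"
print(fasta_to_sequences(fasta))  # -> ["A E T C X A X"]
```

The returned list can be passed straight to the pipeline, as in the example above.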