VitaRin committed on
Commit
5cc40dc
1 Parent(s): ae5f393

Created README.md

Files changed (1)
  1. README.md +55 -0
README.md ADDED
@@ -0,0 +1,55 @@
## ProtBert-IS

### Model Description

ProtBert-IS is a model fine-tuned from the pre-trained ProtBert model for sequence classification. It takes a protein sequence as input and predicts whether the protein is soluble or insoluble.
ProtBert-IS has been fine-tuned using three different training datasets.

**Finetuned from model:** Rostlab/prot_bert

GitHub repository with relevant files: https://github.com/VitaRin/ProtBert-IS

## Uses

The model can be used directly with a `TextClassificationPipeline` on a single sequence:

```
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TextClassificationPipeline
import re

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub.
pipeline = TextClassificationPipeline(
    model=AutoModelForSequenceClassification.from_pretrained("VitaRin/ProtBert-IS"),
    tokenizer=AutoTokenizer.from_pretrained("VitaRin/ProtBert-IS"),
    device=0  # first GPU; use device=-1 to run on CPU
)

# Residues are separated by spaces; the rare amino acids U, Z, O and B are mapped to X.
sequence = "A E T C Z A O"
sequence = re.sub(r"[UZOB]", "X", sequence)
output = pipeline(sequence)
```

Or read multiple sequences from a .fasta file:

```
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline
import re

pipeline = TextClassificationPipeline(
    model=AutoModelForSequenceClassification.from_pretrained("VitaRin/ProtBert-IS"),
    tokenizer=AutoTokenizer.from_pretrained("VitaRin/ProtBert-IS"),
    device=0  # first GPU; use device=-1 to run on CPU
)

# Each FASTA record starts with ">"; the text before the first ">" is not a record.
with open("input.fasta", "r") as f:
    data = f.read().split(">")

data = data[1:]
sequences = []

for d in data:
    # Drop the header line, join the sequence lines, and separate residues with spaces.
    d = " ".join(d.split('\n', 1)[-1].replace('\n', ''))
    sequences.append(d)

sequences = [re.sub(r"[UZOB]", "X", sequence) for sequence in sequences]
print(pipeline(sequences))
```

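The input preparation shared by both usage examples (space-separated residues, with U, Z, O and B mapped to X) can also be factored into small helpers that are easy to check without loading the model. The following is only a sketch: `format_sequence` and `parse_fasta` are hypothetical names used for illustration and are not part of the ProtBert-IS repository or of `transformers`.

```python
import re
from typing import List


def format_sequence(raw: str) -> str:
    """Return residues separated by single spaces, with U, Z, O and B mapped to X."""
    residues = re.sub(r"\s+", "", raw)           # drop spaces and line breaks
    residues = re.sub(r"[UZOB]", "X", residues)  # map rare amino acids to X
    return " ".join(residues)


def parse_fasta(text: str) -> List[str]:
    """Return one formatted sequence per FASTA record found in `text`."""
    sequences = []
    # Anything before the first ">" is not a record, so it is skipped.
    for record in text.split(">")[1:]:
        # The first line of a record is its header; the rest is the sequence.
        _header, _, body = record.partition("\n")
        if body.strip():
            sequences.append(format_sequence(body))
    return sequences
```

The list returned by `parse_fasta` could then be passed to the pipeline in the same way as `sequences` in the .fasta example.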