willdampier committed 38e5bb9 (parent: 1ccfdd4)

adding better examples to the model card

README.md CHANGED
@@ -1,11 +1,18 @@
---
license: mit

datasets:
- damlab/HIV_FLT
metrics:
- accuracy

widget:
- text: 'C T R P N N N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C'
  example_title: 'V3'
- text: 'M E P V D P R L E P W K H P G S Q P K T A C T N C Y C K K C C F H C Q V C F I T K A L G I S Y G R K K R R Q R R R A H Q N S Q T H Q A S L S K Q P T S Q P R G D P T G P K E S K K K V E R E T E T D P F D'
  example_title: 'Tat'
- text: 'P Q I T L W Q R P L V T I K I G G Q L K E A L L D T G A D D T V L E E M N L P G R W K P K M I G G I G G F I K V R Q Y D Q I L I E I C G H K A I G T V L V G P T P V N I I G R N L L T Q I G C T L N F'
  example_title: 'PR'

---

@@ -31,13 +38,58 @@

Like the original [ProtBert-BFD model](https://huggingface.co/Rostlab/prot_bert_bfd), this model encodes each amino acid as an individual token. It was trained with Masked Language Modeling: a random subset of tokens is masked and the model is trained to predict them. Training used the damlab/hiv-flt dataset with 256-amino-acid chunks and a 15% mask rate.
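Because every amino acid is its own token, inputs are written as single-letter residues separated by spaces, exactly as in the widget examples above. The snippet below is a minimal sketch, not part of the original card, of how such a sequence is tokenized; it assumes the standard `transformers` AutoTokenizer API and reuses the `damlab/HIV_FLT` identifier from the card's own fill-mask example.

```python
from transformers import AutoTokenizer

# Load the tokenizer that ships with the checkpoint used in the example further down.
tokenizer = AutoTokenizer.from_pretrained("damlab/HIV_FLT")

# Residues are written as single-letter codes separated by spaces, one token per amino acid.
sequence = "C T R P N N N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
encoded = tokenizer(sequence)

# Should print one token per residue, plus the special [CLS]/[SEP] tokens.
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```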

## Intended Uses & Limitations

As a masked language model, this tool can be used to predict expected mutations with a masking approach. This could be used to identify highly mutated sequences, sequencing artifacts, and similar anomalies. As a BERT model, it can also serve as a base for transfer learning; this pretrained model could be used as the starting point when developing HIV-specific classification tasks.
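As a rough illustration of that transfer-learning route, the sketch below loads the checkpoint as the base of a sequence classifier. The two-class setup is hypothetical and the classification head is freshly initialized, so this is only a starting point, not a method described in the card.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical downstream task: a binary HIV-specific classifier started from this checkpoint.
# The classification head is newly initialized and must be fine-tuned before use.
model = AutoModelForSequenceClassification.from_pretrained("damlab/HIV_FLT", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("damlab/HIV_FLT")

# Sequences are passed as space-separated single-letter residues, as elsewhere in this card.
inputs = tokenizer(
    "C T R P N N N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C",
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 2); meaningful only after fine-tuning
```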

## How to use

As this is a BERT-style masked language model, it can be used to determine the most likely amino acid at a masked position.

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="damlab/HIV_FLT")

unmasker("C T R P N [MASK] N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C")

[
  {
    "score": 0.9581968188285828,
    "token": 17,
    "token_str": "N",
    "sequence": "C T R P N N N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
  },
  {
    "score": 0.022986575961112976,
    "token": 12,
    "token_str": "K",
    "sequence": "C T R P N K N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
  },
  {
    "score": 0.003997281193733215,
    "token": 14,
    "token_str": "D",
    "sequence": "C T R P N D N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
  },
  {
    "score": 0.003636382520198822,
    "token": 15,
    "token_str": "T",
    "sequence": "C T R P N T N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
  },
  {
    "score": 0.002701344434171915,
    "token": 10,
    "token_str": "S",
    "sequence": "C T R P N S N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
  }
]
```

## Training Data

@@ -53,9 +105,6 @@

Training was performed with the HuggingFace training module using the MaskedLM data loader with a 15% masking rate. The learning rate was set at E-5 with 50K warm-up steps and a cosine_with_restarts learning rate schedule; training continued until 3 consecutive epochs did not improve the loss on the held-out dataset.
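A rough sketch of that setup with the HuggingFace `Trainer` is given below. It is not the authors' training script: the split and column names of the damlab/HIV_FLT dataset, the output path, the epoch cap, and the reading of "E-5" as 1e-5 are all assumptions; only the masking rate, warm-up steps, scheduler, base checkpoint, and early-stopping patience come from the card.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

# Start from the ProtBert-BFD checkpoint that this model refines.
tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd")
model = AutoModelForMaskedLM.from_pretrained("Rostlab/prot_bert_bfd")

# damlab/HIV_FLT is the dataset named in the front matter; the column name is an assumption.
dataset = load_dataset("damlab/HIV_FLT")

def tokenize(batch):
    # 256-residue chunks approximated here by truncation.
    return tokenizer(batch["sequence"], truncation=True, max_length=256)

# Mask 15% of tokens, as described above.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="hiv-bert",               # assumed
    learning_rate=1e-5,                  # "E-5" read as 1e-5
    warmup_steps=50_000,
    lr_scheduler_type="cosine_with_restarts",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    num_train_epochs=100,                # upper bound; early stopping ends training sooner
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"].map(tokenize, batched=True),
    eval_dataset=dataset["test"].map(tokenize, batched=True),   # held-out split name assumed
    data_collator=collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # 3 epochs without improvement
)
trainer.train()
```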

-## Evaluation Results
-
-[Table of Prot-Bert and HIV-Bert loss on HIV sequence datasets]

## BibTeX Entry and Citation Info