willdampier committed
Commit 38e5bb9
1 Parent(s): 1ccfdd4

adding better examples to the model card

Files changed (1)
  1. README.md +56 -7
README.md CHANGED
@@ -1,11 +1,18 @@
  ---
  license: mit

+ datasets:
+ - damlab/HIV_FLT
+ metrics:
+ - accuracy
+
  widget:
  - text: 'C T R P N N N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C'
- - text: 'C T R P N N N T R K S I H I G P G R A F Y T T G Q I I G D I R Q A Y C'
- - text: 'C T R P N N N T R R S I R I G P G Q A F Y A T G D I I G D I R Q A H C'
- - text: 'C G R P N N H R I K G L R I G P G R A F F A M G A I G G G E I R Q A H C'
+   example_title: 'V3'
+ - text: 'M E P V D P R L E P W K H P G S Q P K T A C T N C Y C K K C C F H C Q V C F I T K A L G I S Y G R K K R R Q R R R A H Q N S Q T H Q A S L S K Q P T S Q P R G D P T G P K E S K K K V E R E T E T D P F D'
+   example_title: 'Tat'
+ - text: 'P Q I T L W Q R P L V T I K I G G Q L K E A L L D T G A D D T V L E E M N L P G R W K P K M I G G I G G F I K V R Q Y D Q I L I E I C G H K A I G T V L V G P T P V N I I G R N L L T Q I G C T L N F'
+   example_title: 'PR'

  ---

@@ -31,13 +38,58 @@ The HIV-BERT model was trained as a refinement of the [ProtBert-BFD model](https

  Like the original [ProtBert-BFD model](https://huggingface.co/Rostlab/prot_bert_bfd), this model encodes each amino acid as an individual token. This model was trained using Masked Language Modeling: a process in which a random set of tokens are masked with the model trained on their prediction. This model was trained using the damlab/hiv-flt dataset with 256 amino acid chunks and a 15% mask rate.

+
+
+
  ## Intended Uses & Limitations

  As a masked language model this tool can be used to predict expected mutations using a masking approach. This could be used to identify highly mutated sequences, sequencing artifacts, or other contexts. As a BERT model, this tool can also be used as the base for transfer learning. This pretrained model could be used as the base when developing HIV-specific classification tasks.

  ## How to use

- [Code snippet of AutoModelForMaskedLM prediction of V3 amino acids.]
+ As this is a BERT-style Masked Language learner, it can be used to determine the most likely amino acid at a masked position.
+
+ ```python
+ from transformers import pipeline
+
+ unmasker = pipeline("fill-mask", model="damlab/HIV_FLT")
+
+ unmasker(f"C T R P N [MASK] N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C")
+
+ [
+ {
+ "score": 0.9581968188285828,
+ "token": 17,
+ "token_str": "N",
+ "sequence": "C T R P N N N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
+ },
+ {
+ "score": 0.022986575961112976,
+ "token": 12,
+ "token_str": "K",
+ "sequence": "C T R P N K N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
+ },
+ {
+ "score": 0.003997281193733215,
+ "token": 14,
+ "token_str": "D",
+ "sequence": "C T R P N D N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
+ },
+ {
+ "score": 0.003636382520198822,
+ "token": 15,
+ "token_str": "T",
+ "sequence": "C T R P N T N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
+ },
+ {
+ "score": 0.002701344434171915,
+ "token": 10,
+ "token_str": "S",
+ "sequence": "C T R P N S N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
+ }
+ ]
+
+ ```

  ## Training Data

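The masking approach named under Intended Uses can be made concrete with the same `fill-mask` pipeline the commit adds. Below is a minimal sketch, not part of the commit: the `score_observed_residues` helper is hypothetical. It masks one position at a time and asks the model how probable the observed residue is there, so unusually low scores flag unexpected mutations or possible sequencing artifacts.

```python
from transformers import pipeline

# Checkpoint name taken from the usage example in the commit above.
unmasker = pipeline("fill-mask", model="damlab/HIV_FLT")

def score_observed_residues(sequence):
    """Hypothetical helper: mask each position in turn and report the
    model's probability for the residue actually observed there."""
    tokens = sequence.split()
    scores = []
    for i, observed in enumerate(tokens):
        masked = tokens.copy()
        masked[i] = unmasker.tokenizer.mask_token
        # `targets` restricts the prediction to the observed residue,
        # so its probability is returned directly.
        result = unmasker(" ".join(masked), targets=[observed])
        scores.append((i, observed, result[0]["score"]))
    return scores

v3 = "C T R P N N N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
for position, residue, score in score_observed_residues(v3):
    print(position, residue, round(score, 4))  # low score = unexpected residue
```

For the transfer-learning use named in the same paragraph, a classification sketch appears after the training notes below.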
@@ -53,9 +105,6 @@ As with the [rostlab/Prot-bert-bfd](https://huggingface.co/Rostlab/prot_bert_bfd

  Training was performed with the HuggingFace training module using the MaskedLM data loader with a 15% masking rate. The learning rate was set at E-5, 50K warm-up steps, and a cosine_with_restarts learning rate schedule and continued until 3 consecutive epochs did not improve the loss on the held-out dataset.

- ## Evaluation Results
-
- [Table of Prot-Bert and HIV-Bert loss on HIV sequence datasets]

  ## BibTeX Entry and Citation Info

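The training description above maps onto standard HuggingFace `Trainer` components. The following is a reproduction sketch rather than the authors' script: it reads the card's "E-5" as a 1e-5 learning rate, assumes the damlab/HIV_FLT dataset exposes a `sequence` column and a `train` split, and uses placeholder output and split settings.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

# Start from the ProtBert-BFD checkpoint the card says HIV-BERT refines.
tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd")
model = AutoModelForMaskedLM.from_pretrained("Rostlab/prot_bert_bfd")

# Dataset id from the commit's metadata; column name and split are assumptions.
dataset = load_dataset("damlab/HIV_FLT")["train"].train_test_split(test_size=0.1)

def tokenize(batch):
    # 256 amino acid chunks, per the card.
    return tokenizer(batch["sequence"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True,
                        remove_columns=dataset["train"].column_names)

# MaskedLM data loader with the card's 15% masking rate.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="hiv-bert-mlm",          # placeholder
    learning_rate=1e-5,                 # the card's "E-5"
    warmup_steps=50_000,                # 50K warm-up steps
    lr_scheduler_type="cosine_with_restarts",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=collator,
    # Stop once 3 consecutive epochs fail to improve held-out loss.
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
# trainer.train()
```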
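For the card's transfer-learning use, the pretrained weights can serve as the base of an HIV-specific classifier. A minimal sketch, assuming a hypothetical two-class task; the classification head is freshly initialized and must be fine-tuned on labeled sequences before its outputs mean anything:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("damlab/HIV_FLT")
model = AutoModelForSequenceClassification.from_pretrained(
    "damlab/HIV_FLT",
    num_labels=2,  # assumption: a binary label such as phenotype yes/no
)

inputs = tokenizer(
    "C T R P N N N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C",
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2]); untrained head, fine-tune first
```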