Update README.md
README.md
CHANGED
@@ -15,15 +15,15 @@
 
 ## Summary
 
-The HIV-BERT model was trained as a refinement of the ProtBert-BFD model (https://huggingface.co/Rostlab/prot_bert_bfd) for HIV centric tasks. It was refined with whole viral genomes from the Los Alamos HIV Sequence Database (https://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html). This pretraining is important for HIV related tasks as the original BFD database contains few viral proteins making it sub-optimal when used as the basis for transfer learning tasks. This model and other related HIV prediction tasks have been published
+[The HIV-BERT model was trained as a refinement of the ProtBert-BFD model (https://huggingface.co/Rostlab/prot_bert_bfd) for HIV centric tasks. It was refined with whole viral genomes from the Los Alamos HIV Sequence Database (https://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html). This pretraining is important for HIV related tasks as the original BFD database contains few viral proteins making it sub-optimal when used as the basis for transfer learning tasks. This model and other related HIV prediction tasks have been published (link).]
 
 ## Model Description
 
-Like the original ProtBert-BFD model, this model encodes each amino acid as an individual token. This model was trained using Masked Language Modeling: a process in which a random set of tokens are masked with the model trained on their prediction. This model was trained using the damlab/hiv_flt dataset with 256 amino acid chunks and a 15% mask rate.
+[Like the original ProtBert-BFD model, this model encodes each amino acid as an individual token. This model was trained using Masked Language Modeling: a process in which a random set of tokens are masked with the model trained on their prediction. This model was trained using the damlab/hiv_flt dataset with 256 amino acid chunks and a 15% mask rate.]
 
 ## Intended Uses & Limitations
 
-As a masked language model this tool can be used to predict expected mutations using a masking approach. This could be used to identify highly mutated sequences, sequencing artifacts, or other contexts. As a BERT model, this tool can also be used as the base for transfer learning. This pretrained model could be used as the base when developing HIV-specific classification tasks.
+[As a masked language model this tool can be used to predict expected mutations using a masking approach. This could be used to identify highly mutated sequences, sequencing artifacts, or other contexts. As a BERT model, this tool can also be used as the base for transfer learning. This pretrained model could be used as the base when developing HIV-specific classification tasks.]
 
 ## How to use
 
@@ -31,17 +31,17 @@ As a masked language model this tool can be used to predict expected mutations u
 
 ## Training Data
 
-The dataset damlab/HIV_FLT was used to refine the original rostlab/Prot-bert-bfd. This dataset contains 1790 full HIV genomes from across the globe. When translated, these genomes contain approximately 3.9 million amino-acid tokens.
+[The dataset damlab/HIV_FLT was used to refine the original rostlab/Prot-bert-bfd. This dataset contains 1790 full HIV genomes from across the globe. When translated, these genomes contain approximately 3.9 million amino-acid tokens.]
 
 ## Training Procedure
 
 ### Preprocessing
 
-As with the rostlab/Prot-bert-bfd model, the rare amino acids U, Z, O, and B were converted to X and spaces were added between each amino acid. All strings were concatenated and chunked into 256 token chunks for training. A random 20% of chunks were held for validation.
+[As with the rostlab/Prot-bert-bfd model, the rare amino acids U, Z, O, and B were converted to X and spaces were added between each amino acid. All strings were concatenated and chunked into 256 token chunks for training. A random 20% of chunks were held for validation.]
 
 ### Training
 
-Training was performed with the HuggingFace training module using the MaskedLM data loader with a 15% masking rate. The learning rate was set at E-5, 50K warm-up steps, and a cosine_with_restarts learning rate schedule and continued until 3 consecutive epochs did not improve the loss on the held-out dataset.
+[Training was performed with the HuggingFace training module using the MaskedLM data loader with a 15% masking rate. The learning rate was set at E-5, 50K warm-up steps, and a cosine_with_restarts learning rate schedule and continued until 3 consecutive epochs did not improve the loss on the held-out dataset.]
 
 ## Evaluation Results
 
@@ -50,7 +50,6 @@ Training was performed with the HuggingFace training module using the MaskedLM d
 ## BibTeX Entry and Citation Info
 
 [More Information Needed]
-
 ---
 license: mit
 ---
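
The Preprocessing paragraph in the README text of this diff describes three steps: mapping the rare amino acids U, Z, O, and B to X, space-separating the residues and cutting the concatenated sequence into 256-token chunks, and holding out a random 20% of chunks for validation. The following is a minimal sketch of those steps, not the authors' actual pipeline. The function names, the fixed random seed, the uppercasing of input, and the choice to keep a shorter final chunk are all assumptions made for illustration; the README does not say how the remainder was handled or whether separators were inserted between proteins.

```python
import random
import re

CHUNK_SIZE = 256           # tokens per chunk, as stated in the README
VALIDATION_FRACTION = 0.2  # share of chunks held out, as stated in the README


def to_tokens(protein):
    """Map the rare amino acids U, Z, O, B to X and return one token per residue."""
    return list(re.sub(r"[UZOB]", "X", protein.upper()))


def chunk_tokens(proteins, size=CHUNK_SIZE):
    """Concatenate all residues, cut into fixed-size chunks, space-separate each chunk.

    Assumption: the final chunk is kept even if it is shorter than `size`.
    """
    tokens = [tok for protein in proteins for tok in to_tokens(protein)]
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def split_chunks(chunks, fraction=VALIDATION_FRACTION, seed=0):
    """Shuffle a copy of the chunks and hold out `fraction` of them for validation.

    Assumption: a fixed seed is used here only to make the sketch reproducible.
    """
    shuffled = list(chunks)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * fraction)
    return shuffled[n_val:], shuffled[:n_val]
```

For example, `chunk_tokens(["MUZOB", "ACD"], size=4)` returns `["M X X X", "X A C D"]`, and `split_chunks` on ten chunks yields eight for training and two for validation.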