damlab committed on
Commit
7b1ce13
1 Parent(s): dd20c76

Update README.md

Files changed (1)
  1. README.md +6 -7
README.md CHANGED

## Summary

The HIV-BERT model is a refinement of the ProtBert-BFD model (https://huggingface.co/Rostlab/prot_bert_bfd) for HIV-centric tasks. It was refined with whole viral genomes from the Los Alamos HIV Sequence Database (https://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html). This pretraining is important for HIV-related tasks because the original BFD database contains few viral proteins, making it sub-optimal as a basis for transfer learning. This model and other related HIV prediction tasks have been published [link].

## Model Description

Like the original ProtBert-BFD model, this model encodes each amino acid as an individual token. It was trained with Masked Language Modeling, a process in which a random subset of tokens is masked and the model is trained to predict them. Training used the damlab/hiv_flt dataset, cut into 256-amino-acid chunks, with a 15% mask rate.
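
As a quick illustration of the per-residue tokenization (a sketch, not the authors' code, assuming the tokenizer is inherited unchanged from Rostlab/prot_bert_bfd):

```python
from transformers import AutoTokenizer

# ProtBert-style tokenizers expect one space between residues,
# so each amino acid becomes an individual token.
tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd")

print(tokenizer.tokenize("M G A R A S V L S"))
# ['M', 'G', 'A', 'R', 'A', 'S', 'V', 'L', 'S']
```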

## Intended Uses & Limitations

As a masked language model, this tool can predict expected mutations at masked positions. This can be used to identify highly mutated sequences, sequencing artifacts, or other anomalous contexts. As a BERT model, it can also serve as the base for transfer learning, for example when developing HIV-specific classification tasks.
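
A minimal sketch of that transfer-learning starting point (the hub ID and label count below are placeholders, not taken from this card):

```python
from transformers import AutoModelForSequenceClassification

# Hypothetical fine-tuning setup: swap in the real hub ID of this
# checkpoint and the label count of your downstream task.
model = AutoModelForSequenceClassification.from_pretrained(
    "damlab/hiv_bert",  # placeholder model ID
    num_labels=2,       # placeholder label count
)
```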

## How to use
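
As a minimal sketch (the model ID here is a placeholder, not confirmed by this card), masked-mutation prediction can be run with the transformers fill-mask pipeline:

```python
from transformers import pipeline

# Placeholder model ID; substitute the actual hub name of this checkpoint.
unmasker = pipeline("fill-mask", model="damlab/hiv_bert")

# Mask one residue in a spaced protein sequence and rank candidate residues.
for hit in unmasker("M G A R A S [MASK] L S G G E L D"):
    print(hit["token_str"], round(hit["score"], 3))
```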
 
## Training Data

The damlab/HIV_FLT dataset was used to refine the original rostlab/Prot-bert-bfd model. This dataset contains 1790 full HIV genomes from across the globe. When translated, these genomes contain approximately 3.9 million amino-acid tokens.
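
A quick way to pull the dataset from the Hub for inspection (a sketch; split and column names are not specified on this card and should be checked against the dataset card):

```python
from datasets import load_dataset

# Dataset ID as given above.
ds = load_dataset("damlab/HIV_FLT")
print(ds)
```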

## Training Procedure

### Preprocessing

As with the rostlab/Prot-bert-bfd model, the rare amino acids U, Z, O, and B were converted to X, and spaces were added between each amino acid. All strings were concatenated and split into 256-token chunks for training. A random 20% of the chunks were held out for validation.
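
A minimal sketch of that preprocessing under the stated rules (not the authors' exact script; the random seed is arbitrary):

```python
import re
import random

def make_chunks(protein_seqs, chunk_size=256):
    """Replace rare residues, concatenate, and cut into fixed-size chunks."""
    # Map the rare amino acids U, Z, O, B to X.
    cleaned = [re.sub(r"[UZOB]", "X", s) for s in protein_seqs]
    # Concatenate everything, then space-separate residues so each
    # amino acid is one token, in 256-token chunks.
    residues = list("".join(cleaned))
    return [
        " ".join(residues[i : i + chunk_size])
        for i in range(0, len(residues), chunk_size)
    ]

chunks = make_chunks(["MGARASVLSGGELD", "MGARASBLSGGULD"])  # toy input
random.seed(42)  # arbitrary seed, not from the card
random.shuffle(chunks)
n_val = int(0.2 * len(chunks))  # hold out a random 20% for validation
val_chunks, train_chunks = chunks[:n_val], chunks[n_val:]
```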

### Training

Training was performed with the HuggingFace training module using the MaskedLM data collator with a 15% masking rate. The learning rate was set at E-5 with 50K warm-up steps and a cosine_with_restarts learning-rate schedule; training continued until 3 consecutive epochs failed to improve the loss on the held-out dataset.
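
A hedged reconstruction of that setup with the HuggingFace Trainer; the masking rate, learning rate, warm-up, schedule, and patience come from this card, while the output directory and dataset variables are placeholders:

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd")
model = AutoModelForMaskedLM.from_pretrained("Rostlab/prot_bert_bfd")

# Mask 15% of tokens on the fly, as described above.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="hiv_bert",                # placeholder
    learning_rate=1e-5,                   # "E-5" on the card, read as 1e-5
    warmup_steps=50_000,
    lr_scheduler_type="cosine_with_restarts",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,          # required for early stopping
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=train_dataset,          # placeholder: tokenized 256-token chunks
    eval_dataset=val_dataset,             # placeholder: held-out 20%
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```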

## Evaluation Results

## BibTeX Entry and Citation Info

[More Information Needed]

---
license: mit
---
 