---
license: mit
---

# Model Card for HIV_BERT



## Table of Contents

- [Table of Contents](#table-of-contents)

- [Summary](#summary)

- [Model Description](#model-description)

- [Intended Uses & Limitations](#intended-uses--limitations)

- [How to Use](#how-to-use)

- [Training Data](#training-data)

- [Training Procedure](#training-procedure)

  - [Preprocessing](#preprocessing)

  - [Training](#training)

- [Evaluation Results](#evaluation-results)

- [BibTeX Entry and Citation Info](#bibtex-entry-and-citation-info)



## Summary



The HIV-BERT model was trained as a refinement of the ProtBert-BFD model (https://huggingface.co/Rostlab/prot_bert_bfd) for HIV-centric tasks. It was refined with whole viral genomes from the Los Alamos HIV Sequence Database (https://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html). This pretraining is important for HIV-related tasks because the original BFD database contains few viral proteins, making it sub-optimal as a basis for transfer learning. This model and other related HIV prediction tasks have been published (link).



## Model Description



Like the original ProtBert-BFD model, this model encodes each amino acid as an individual token. It was trained using masked language modeling, a process in which a random subset of tokens is masked and the model is trained to predict them. Training used the damlab/HIV_FLT dataset, split into 256-amino-acid chunks, with a 15% mask rate.

## Intended Uses & Limitations

As a masked language model, this tool can be used to predict expected mutations at masked positions. This could be used to identify highly mutated sequences, sequencing artifacts, or other unusual contexts. As a BERT model, it can also serve as the base for transfer learning; this pretrained model can be used as the starting point when developing HIV-specific classification tasks (see the sketch below).
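As an illustrative sketch of the transfer-learning use case, the pretrained weights could be loaded under a randomly initialized classification head and then fine-tuned on a labeled HIV task. The hub id `damlab/HIV_BERT` and the two-label setup below are assumptions, not details stated in this card.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed hub id for this model; replace with the actual repository path if it differs.
MODEL_ID = "damlab/HIV_BERT"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# The classification head is randomly initialized and must be fine-tuned
# on a labeled, HIV-specific dataset before use.
classifier = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)
```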

## How to use

Below is a minimal sketch of masked-residue prediction on a V3 loop fragment with AutoModelForMaskedLM via the fill-mask pipeline. The hub id `damlab/HIV_BERT` and the example sequence are illustrative assumptions; substitute the actual repository path and a sequence of interest.
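```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Assumed hub id for this model; replace with the actual repository path if it differs.
MODEL_ID = "damlab/HIV_BERT"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)

unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Illustrative V3-loop fragment with one residue masked.
# Amino acids are space-separated, matching the ProtBert-BFD tokenization.
v3_sequence = "C T R P N N N T R K S I R I [MASK] R G P G R A F V T I G K I G N M R Q A H C"

for prediction in unmasker(v3_sequence):
    print(prediction["token_str"], round(prediction["score"], 3))
```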

## Training Data

The damlab/HIV_FLT dataset was used to refine the original Rostlab/prot_bert_bfd model. This dataset contains 1,790 full HIV genomes from across the globe. When translated, these genomes contain approximately 3.9 million amino-acid tokens.
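The dataset can be pulled directly from the Hugging Face Hub. The snippet below is a minimal sketch that only assumes the default configuration and split layout.

```python
from datasets import load_dataset

# Load the full-length HIV genome dataset used to refine the model.
dataset = load_dataset("damlab/HIV_FLT")
print(dataset)  # inspect the available splits and columns
```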



## Training Procedure



### Preprocessing



As with the Rostlab/prot_bert_bfd model, the rare amino acids U, Z, O, and B were converted to X, and spaces were added between each amino acid. All strings were concatenated and split into 256-token chunks for training. A random 20% of chunks were held out for validation.
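A minimal sketch of this preprocessing, assuming plain amino-acid strings as input; the function and variable names are illustrative and not taken from the actual training code.

```python
import random
import re

def preprocess_sequences(sequences, chunk_size=256, val_fraction=0.2, seed=42):
    """Replace rare residues, concatenate, chunk, and split into train/validation."""
    # Replace rare amino acids (U, Z, O, B) with X, as in ProtBert-BFD.
    cleaned = [re.sub(r"[UZOB]", "X", seq.upper()) for seq in sequences]

    # Concatenate all residues, then cut into fixed-size chunks of
    # space-separated amino acids (one token per residue).
    residues = "".join(cleaned)
    chunks = [
        " ".join(residues[i:i + chunk_size])
        for i in range(0, len(residues), chunk_size)
    ]

    # Hold out a random fraction of chunks for validation.
    random.Random(seed).shuffle(chunks)
    n_val = int(len(chunks) * val_fraction)
    return chunks[n_val:], chunks[:n_val]

train_chunks, val_chunks = preprocess_sequences(["MKVLIVGG", "MGARASVLSG"])  # toy input
```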



### Training



Training was performed with the Hugging Face training module using the masked-LM data collator with a 15% masking rate. The learning rate was set at E-5 with 50K warm-up steps and a cosine_with_restarts learning-rate schedule, and training continued until 3 consecutive epochs did not improve the loss on the held-out dataset.
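A hedged sketch of what this setup could look like with the Trainer API. The exact learning-rate coefficient (the card only says "E-5"), output path, batch size, epoch cap, and the `train_dataset` / `val_dataset` variables (tokenized 256-token chunks from the preprocessing step) are assumptions not stated in this card.

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd")
model = AutoModelForMaskedLM.from_pretrained("Rostlab/prot_bert_bfd")

# Mask 15% of tokens for the masked-language-modeling objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="hiv_bert",               # assumed output path
    learning_rate=1e-5,                  # coefficient assumed; card says "E-5"
    warmup_steps=50_000,
    lr_scheduler_type="cosine_with_restarts",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    per_device_train_batch_size=8,       # assumed; not stated in the card
    num_train_epochs=100,                # upper bound; early stopping ends training
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,         # tokenized training chunks (assumed to exist)
    eval_dataset=val_dataset,            # held-out 20% of chunks (assumed to exist)
    data_collator=collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

trainer.train()
```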



## Evaluation Results



[Table of Prot-Bert and HIV-Bert loss on HIV sequence datasets]



## BibTeX Entry and Citation Info



[More Information Needed]
