---
license: mit
widget:
- text: 'C T R P N N N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C'
- text: 'C T R P N N N T R K S I H I G P G R A F Y T T G Q I I G D I R Q A Y C'
- text: 'C T R P N N N T R R S I R I G P G Q A F Y A T G D I I G D I R Q A H C'
- text: 'C G R P N N H R I K G L R I G P G R A F F A M G A I G G G E I R Q A H C'
---

# HIV_BERT model

## Table of Contents

- [Summary](#summary)
- [Model Description](#model-description)
- [Intended Uses & Limitations](#intended-uses--limitations)
- [How to Use](#how-to-use)
- [Training Data](#training-data)
- [Training Procedure](#training-procedure)
  - [Preprocessing](#preprocessing)
  - [Training](#training)
- [Evaluation Results](#evaluation-results)
- [BibTeX Entry and Citation Info](#bibtex-entry-and-citation-info)

## Summary

The HIV-BERT model was trained as a refinement of the [ProtBert-BFD model](https://huggingface.co/Rostlab/prot_bert_bfd) for HIV-centric tasks. It was refined with whole viral genomes from the [Los Alamos HIV Sequence Database](https://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html). This pretraining is important for HIV-related tasks because the original BFD database contains few viral proteins, making it sub-optimal as a basis for transfer learning. This model and other related HIV prediction tasks have been published (link).

## Model Description

Like the original [ProtBert-BFD model](https://huggingface.co/Rostlab/prot_bert_bfd), this model encodes each amino acid as an individual token. It was trained using masked language modeling: a random subset of tokens is masked and the model is trained to predict them. Training used the [damlab/HIV_FLT](https://huggingface.co/datasets/damlab/HIV_FLT) dataset with 256-amino-acid chunks and a 15% mask rate.

## Intended Uses & Limitations

As a masked language model, this tool can be used to predict expected mutations using a masking approach, which could help identify highly mutated sequences, sequencing artifacts, and other anomalies. As a BERT model, it can also serve as the base for transfer learning; this pretrained model could be used as the starting point when developing HIV-specific classification tasks.

## How to use
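The following is a minimal sketch of masked-token prediction on a V3 loop sequence with `AutoModelForMaskedLM`; the model id `damlab/HIV_BERT`, the example sequence, and the masked position are illustrative and may need to be adjusted to this repository's actual id.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Model id is assumed; replace with this repository's actual id if it differs.
model_name = "damlab/HIV_BERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# A V3 loop written as space-separated amino acids (the format used in training),
# with one position replaced by the mask token; the masked index is illustrative.
v3 = "C T R P N N N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C".split()
v3[20] = tokenizer.mask_token
inputs = tokenizer(" ".join(v3), return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Report the five most likely amino acids at the masked position.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top5 = logits[0, mask_pos].softmax(dim=-1).topk(5)
for prob, token_id in zip(top5.values[0], top5.indices[0]):
    print(tokenizer.convert_ids_to_tokens(int(token_id)), f"{float(prob):.3f}")
```

The same masking can be slid across a full sequence to flag positions where the observed residue receives low probability, e.g. to screen for highly mutated sequences or sequencing artifacts.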
## Training Data

The dataset [damlab/HIV_FLT](https://huggingface.co/datasets/damlab/HIV_FLT) was used to refine the original [Rostlab/prot_bert_bfd](https://huggingface.co/Rostlab/prot_bert_bfd) model. This dataset contains 1,790 full HIV genomes from across the globe. When translated, these genomes contain approximately 3.9 million amino-acid tokens.

## Training Procedure

### Preprocessing

As with the [Rostlab/prot_bert_bfd](https://huggingface.co/Rostlab/prot_bert_bfd) model, the rare amino acids U, Z, O, and B were converted to X, and spaces were added between each amino acid. All strings were concatenated and split into 256-token chunks for training, with a random 20% of chunks held out for validation.
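A minimal sketch of that preprocessing, assuming plain Python over translated protein strings; the function and variable names are illustrative, not taken from the original pipeline.

```python
import re
from random import random

def preprocess(protein_sequences, chunk_size=256, val_fraction=0.2):
    """Map rare amino acids to X, space-separate residues, and chunk for MLM training."""
    # Replace U, Z, O, and B with X, then insert a space between every amino acid.
    spaced = [" ".join(re.sub(r"[UZOB]", "X", seq)) for seq in protein_sequences]
    # Concatenate all residues and cut them into fixed-size chunks of `chunk_size` tokens.
    residues = " ".join(spaced).split()
    chunks = [" ".join(residues[i:i + chunk_size])
              for i in range(0, len(residues), chunk_size)]
    # Hold out a random ~20% of chunks for validation.
    train, val = [], []
    for chunk in chunks:
        (val if random() < val_fraction else train).append(chunk)
    return train, val
```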
### Training

Training was performed with the Hugging Face training module, using the masked-LM data loader with a 15% masking rate. The learning rate was set to 1E-5 with 50K warm-up steps and a cosine_with_restarts learning-rate schedule, and training continued until 3 consecutive epochs failed to improve the loss on the held-out dataset.
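A minimal sketch of that setup with the `transformers` `Trainer` API, assuming a tokenized dataset named `tokenized` with `train` and `validation` splits built from the chunks above; the output directory, epoch cap, and other unlisted arguments are illustrative.

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, EarlyStoppingCallback,
                          Trainer, TrainingArguments)

# Start from ProtBert-BFD and continue masked-LM pretraining on the HIV chunks.
tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd")
model = AutoModelForMaskedLM.from_pretrained("Rostlab/prot_bert_bfd")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="hiv_bert",                # illustrative
    learning_rate=1e-5,
    warmup_steps=50_000,
    lr_scheduler_type="cosine_with_restarts",
    num_train_epochs=100,                 # upper bound; early stopping ends training sooner
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

# `tokenized` is assumed to be a DatasetDict of tokenized 256-token chunks.
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=collator,
    # Stop once the held-out loss has not improved for 3 consecutive epochs.
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```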
## Evaluation Results

[Table of Prot-Bert and HIV-Bert loss on HIV sequence datasets]

## BibTeX Entry and Citation Info

[More Information Needed]