
HIV_BERT model

Summary

The HIV-BERT model was trained as a refinement of the ProtBert-BFD model for HIV-centric tasks. It was refined with whole viral genomes from the Los Alamos HIV Sequence Database. This pretraining is important for HIV-related tasks because the original BFD database contains few viral proteins, making it sub-optimal as a basis for transfer learning. This model and related HIV prediction tasks have been published (link).

Model Description

Like the original ProtBert-BFD model, this model encodes each amino acid as an individual token. It was trained using Masked Language Modeling, a process in which a random subset of tokens is masked and the model is trained to predict them. Training used the damlab/HIV_FLT dataset with 256 amino-acid chunks and a 15% masking rate.
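
A minimal sketch of this per-residue tokenization, assuming the tokenizer published with this checkpoint follows the ProtBert convention of space-separated amino acids (the checkpoint name damlab/HIV_BERT is taken from this card):

from transformers import AutoTokenizer

# Load the tokenizer shipped with this checkpoint (name assumed from this card)
tokenizer = AutoTokenizer.from_pretrained("damlab/HIV_BERT")

# ProtBert-style input: amino acids separated by spaces, one token per residue
print(tokenizer.tokenize("M K V L I"))  # expected: ['M', 'K', 'V', 'L', 'I']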

Intended Uses & Limitations

As a masked language model, this tool can be used to predict expected mutations using a masking approach. This could help identify highly mutated sequences, sequencing artifacts, and similar anomalies. As a BERT model, it can also serve as a base for transfer learning; this pretrained model could be used as the starting point when developing HIV-specific classification tasks.
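
As an illustration of the transfer-learning use case, the sketch below loads this checkpoint behind a freshly initialized classification head. The two-label task, the checkpoint name damlab/HIV_BERT, and the example sequence are assumptions for demonstration only:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical downstream task: two-class classification of HIV protein fragments
tokenizer = AutoTokenizer.from_pretrained("damlab/HIV_BERT")
model = AutoModelForSequenceClassification.from_pretrained("damlab/HIV_BERT", num_labels=2)

# Space-separated amino acids, as in the fill-mask example below
inputs = tokenizer("C T R P N N N T R K S I R I Q R G P G R A F V T I G K I G", return_tensors="pt")
logits = model(**inputs).logits  # fine-tune on labeled data before using these scores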

How to use

As this is a BERT-style masked language model, it can be used to determine the most likely amino acid at a masked position.

from transformers import pipeline

# Load the fill-mask pipeline with this model's checkpoint
unmasker = pipeline("fill-mask", model="damlab/HIV_FLT")

# Predict the most likely amino acid at the [MASK] position
unmasker("C T R P N [MASK] N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C")

[
  {
    "score": 0.9581968188285828,
    "token": 17,
    "token_str": "N",
    "sequence": "C T R P N N N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
  },
  {
    "score": 0.022986575961112976,
    "token": 12,
    "token_str": "K",
    "sequence": "C T R P N K N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
  },
  {
    "score": 0.003997281193733215,
    "token": 14,
    "token_str": "D",
    "sequence": "C T R P N D N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
  },
  {
    "score": 0.003636382520198822,
    "token": 15,
    "token_str": "T",
    "sequence": "C T R P N T N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
  },
  {
    "score": 0.002701344434171915,
    "token": 10,
    "token_str": "S",
    "sequence": "C T R P N S N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
  }
]

Training Data

The dataset damlab/HIV_FLT was used to refine the original rostlab/Prot-bert-bfd. This dataset contains 1790 full HIV genomes from across the globe. When translated, these genomes contain approximately 3.9 million amino-acid tokens.
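
A minimal sketch for pulling this dataset from the Hugging Face Hub; since the exact schema is not described on this card, inspect the splits and column names before building training chunks:

from datasets import load_dataset

# Load the full-length HIV genome dataset used for refinement
hiv_flt = load_dataset("damlab/HIV_FLT")
print(hiv_flt)  # check the available splits and column names before preprocessing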

Training Procedure

Preprocessing

As with the rostlab/Prot-bert-bfd model, the rare amino acids U, Z, O, and B were converted to X, and spaces were added between each amino acid. All strings were concatenated and chunked into 256-token chunks for training. A random 20% of chunks were held out for validation.
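
A rough sketch of that preprocessing, assuming the translated protein sequences are available as plain strings; the function and parameter names here are illustrative, not taken from the actual training code:

import random
import re

def make_chunks(raw_sequences, chunk_size=256, val_fraction=0.2, seed=42):
    # Replace the rare amino acids U, Z, O, and B with X, and space-separate the residues
    cleaned = [" ".join(re.sub(r"[UZOB]", "X", seq)) for seq in raw_sequences]

    # Concatenate everything and cut it into fixed-size chunks of residues
    residues = " ".join(cleaned).split()
    chunks = [" ".join(residues[i:i + chunk_size])
              for i in range(0, len(residues), chunk_size)]

    # Hold out a random 20% of chunks for validation
    random.seed(seed)
    random.shuffle(chunks)
    n_val = int(len(chunks) * val_fraction)
    return {"train": chunks[n_val:], "validation": chunks[:n_val]}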

Training

Training was performed with the HuggingFace training module using the MaskedLM data collator with a 15% masking rate. The learning rate was set at E-5 with 50K warm-up steps and a cosine_with_restarts learning-rate schedule, and training continued until 3 consecutive epochs showed no improvement in the loss on the held-out dataset.
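
The sketch below shows how such a run might be set up with the Trainer API. The base checkpoint name, the exact learning-rate value, and the tokenized train_chunks / val_chunks datasets (produced by preprocessing like the sketch above) are assumptions; only the 15% masking rate, warm-up steps, schedule, and 3-epoch patience are stated on this card:

from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, EarlyStoppingCallback,
                          Trainer, TrainingArguments)

# Base checkpoint being refined (Hub name assumed; see the card text above)
tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd")
model = AutoModelForMaskedLM.from_pretrained("Rostlab/prot_bert_bfd")

# Mask 15% of tokens for the masked-language-modeling objective
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="hiv_bert_refinement",
    learning_rate=1e-5,                  # "E-5" in the card, read here as 1e-5
    warmup_steps=50_000,
    lr_scheduler_type="cosine_with_restarts",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    num_train_epochs=100,                # upper bound; early stopping ends the run sooner
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_chunks,          # pre-tokenized 256-token chunks (placeholder name)
    eval_dataset=val_chunks,             # held-out 20% of chunks (placeholder name)
    data_collator=collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()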

BibTeX Entry and Citation Info

[More Information Needed]
