---

license: mit

datasets:
  - damlab/HIV_FLT
metrics:
  - accuracy

widget:
 - text: 'C T R P N N N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C'
   example_title: 'V3'
 - text: 'M E P V D P R L E P W K H P G S Q P K T A C T N C Y C K K C C F H C Q V C F I T K A L G I S Y G R K K R R Q R R R A H Q N S Q T H Q A S L S K Q P T S Q P R G D P T G P K E S K K K V E R E T E T D P F D' 
   example_title: 'Tat'
 - text: 'P Q I T L W Q R P L V T I K I G G Q L K E A L L D T G A D D T V L E E M N L P G R W K P K M I G G I G G F I K V R Q Y D Q I L I E I C G H K A I G T V L V G P T P V N I I G R N L L T Q I G C T L N F'
   example_title: 'PR'

---


# HIV_BERT model



## Table of Contents

- [Summary](#summary)

- [Model Description](#model-description)

- [Intended Uses & Limitations](#intended-uses--limitations)

- [How to Use](#how-to-use)

- [Training Data](#training-data)

- [Training Procedure](#training-procedure)

  - [Preprocessing](#preprocessing)

  - [Training](#training)

- [Evaluation Results](#evaluation-results)

- [BibTeX Entry and Citation Info](#bibtex-entry-and-citation-info)



## Summary



The HIV-BERT model was trained as a refinement of the [ProtBert-BFD model](https://huggingface.co/Rostlab/prot_bert_bfd) for HIV-centric tasks. It was refined with whole viral genomes from the [Los Alamos HIV Sequence Database](https://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html). This pretraining is important for HIV-related tasks because the original BFD database contains few viral proteins, making it sub-optimal as the basis for transfer-learning tasks. This model and other related HIV prediction tasks have been published (link).



## Model Description



Like the original [ProtBert-BFD model](https://huggingface.co/Rostlab/prot_bert_bfd), this model encodes each amino acid as an individual token. It was trained using masked language modeling: a random subset of tokens is masked and the model is trained to predict them. Training used the [damlab/HIV_FLT](https://huggingface.co/datasets/damlab/HIV_FLT) dataset with 256 amino-acid chunks and a 15% mask rate.
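
Concretely, each residue in a space-separated sequence becomes its own token. Below is a minimal sketch of inspecting that encoding, assuming the model is hosted at `damlab/HIV_BERT` (adjust the name if your copy lives elsewhere):

```python
from transformers import AutoTokenizer

# Load the tokenizer shipped with this model (inherited from ProtBert-BFD).
tokenizer = AutoTokenizer.from_pretrained("damlab/HIV_BERT")

# Space-separated amino acids map one-to-one onto tokens,
# bracketed by the usual [CLS] and [SEP] special tokens.
v3 = "C T R P N N N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
ids = tokenizer(v3)["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', 'C', 'T', 'R', 'P', 'N', ..., 'H', 'C', '[SEP]']
```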









## Intended Uses & Limitations



As a masked language model, this tool can be used to predict expected mutations at masked positions. This can help identify highly mutated sequences, sequencing artifacts, and similar anomalies. As a BERT model, it can also serve as the base for transfer learning: this pretrained model can be used as the starting point when developing HIV-specific classification tasks, as sketched below.
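
The following is only a sketch of the transfer-learning use case: the two-label setup and the `damlab/HIV_BERT` checkpoint name are assumptions, and the classification head is untrained until fine-tuned on labelled sequences.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical two-class task built on top of the HIV-BERT encoder.
model = AutoModelForSequenceClassification.from_pretrained("damlab/HIV_BERT", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("damlab/HIV_BERT")

inputs = tokenizer(
    "C T R P N N N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C",
    return_tensors="pt",
)
logits = model(**inputs).logits  # meaningless until the head is fine-tuned
```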



## How to Use



As this is a BERT-style masked language model, it can be used to determine the most likely amino acid at a masked position.



```python
from transformers import pipeline

# Load this model as a fill-mask pipeline.
unmasker = pipeline("fill-mask", model="damlab/HIV_BERT")

unmasker("C T R P N [MASK] N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C")

[
  {
    "score": 0.9581968188285828,
    "token": 17,
    "token_str": "N",
    "sequence": "C T R P N N N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
  },
  {
    "score": 0.022986575961112976,
    "token": 12,
    "token_str": "K",
    "sequence": "C T R P N K N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
  },
  {
    "score": 0.003997281193733215,
    "token": 14,
    "token_str": "D",
    "sequence": "C T R P N D N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
  },
  {
    "score": 0.003636382520198822,
    "token": 15,
    "token_str": "T",
    "sequence": "C T R P N T N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
  },
  {
    "score": 0.002701344434171915,
    "token": 10,
    "token_str": "S",
    "sequence": "C T R P N S N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
  }
]
```
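
Building on the pipeline above, one way to flag highly mutated sequences or sequencing artifacts is to mask each position in turn and compare the observed residue with the model's top prediction. This is a sketch, not part of the released tooling; it reuses the `unmasker` pipeline defined above, and the `scan_sequence` helper is hypothetical.

```python
def scan_sequence(unmasker, sequence):
    """Mask each residue in turn and report positions where the observed
    amino acid differs from the model's most likely prediction."""
    residues = sequence.split()
    flagged = []
    for i, observed in enumerate(residues):
        masked = residues.copy()
        masked[i] = "[MASK]"
        top = unmasker(" ".join(masked), top_k=1)[0]
        if top["token_str"] != observed:
            flagged.append((i, observed, top["token_str"], top["score"]))
    return flagged

# Positions where the V3 loop deviates from the model's expectation.
v3 = "C T R P N N N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C"
print(scan_sequence(unmasker, v3))
```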



## Training Data



The dataset [damlab/HIV_FLT](https://huggingface.co/datasets/damlab/HIV_FLT) was used to refine the original [Rostlab/prot_bert_bfd](https://huggingface.co/Rostlab/prot_bert_bfd) model. This dataset contains 1,790 full HIV genomes from across the globe. When translated, these genomes contain approximately 3.9 million amino-acid tokens.
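
The refinement corpus can be pulled straight from the Hub with the `datasets` library; the split name and record fields below are assumptions, so inspect the loaded object before use.

```python
from datasets import load_dataset

# Download the full-length HIV genome dataset used for refinement.
dataset = load_dataset("damlab/HIV_FLT")
print(dataset)              # available splits and row counts
print(dataset["train"][0])  # fields of a single record (assuming a "train" split)
```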



## Training Procedure



### Preprocessing



As with the [Rostlab/prot_bert_bfd](https://huggingface.co/Rostlab/prot_bert_bfd) model, the rare amino acids U, Z, O, and B were converted to X, and spaces were added between each amino acid. All strings were concatenated and split into 256-token chunks for training. A random 20% of chunks were held out for validation.
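
A rough sketch of that preprocessing, assuming the raw records are unspaced protein strings (the helper names are illustrative, not the released training code):

```python
import re

def preprocess(protein: str) -> str:
    """Replace the rare amino acids U, Z, O, and B with X and insert spaces
    between residues, matching the ProtBert-BFD input convention."""
    protein = re.sub(r"[UZOB]", "X", protein.upper())
    return " ".join(protein)

def chunk(tokens: list[str], size: int = 256) -> list[list[str]]:
    """Split a concatenated token stream into fixed-size chunks for training."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

spaced = preprocess("MEPVDPRLEPWKHPGSQPKTACTNCYCKKCCFHCQVCFITKALGISYGRKKRRQRRRB")
chunks = chunk(spaced.split())
```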



### Training



Training was performed with the HuggingFace training module using the MaskedLM data loader with a 15% masking rate. The learning rate was set to 1E-5 with 50K warm-up steps and a cosine_with_restarts learning rate schedule, and training continued until 3 consecutive epochs did not improve the loss on the held-out dataset.
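
A hedged sketch of that setup with the Hugging Face `Trainer`: hyperparameters not stated above (batch size, epoch cap) are placeholders, and `train_chunks` / `eval_chunks` stand for the tokenized 256-token chunks and the 20% hold-out described under Preprocessing.

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd")
model = AutoModelForMaskedLM.from_pretrained("Rostlab/prot_bert_bfd")

# Mask 15% of tokens on the fly for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="hiv_bert_refinement",
    learning_rate=1e-5,
    warmup_steps=50_000,
    lr_scheduler_type="cosine_with_restarts",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    per_device_train_batch_size=8,   # placeholder, not stated in the card
    num_train_epochs=100,            # upper bound; early stopping ends training sooner
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_chunks,      # tokenized 256-token chunks (see Preprocessing)
    eval_dataset=eval_chunks,        # the held-out 20% of chunks
    data_collator=collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```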





## BibTeX Entry and Citation Info



[More Information Needed]