willdampier committed on
Commit
b888430
2 Parent(s): 7784411 a2dfe01

Merge branch 'main' of https://huggingface.co/damlab/HIV_V3_Coreceptor

Files changed (1)
  1. README.md +12 -7
README.md CHANGED
@@ -1,11 +1,16 @@
1
  ---
2
  license: mit
 
 
 
 
 
 
3
  ---
4
 
5
- # Model Card for [HIV_V3_coreceptor]
6
 
7
  ## Table of Contents
8
- - [Table of Contents](#table-of-contents)
9
  - [Summary](#model-summary)
10
  - [Model Description](#model-description)
11
  - [Intended Uses & Limitations](#intended-uses-&-limitations)
@@ -19,7 +24,7 @@ license: mit
19
 
20
  ## Summary
21
 
22
- The HIV-BERT-Coreceptor model was trained as a refinement of the HIV-BERT model (insert link) and serves to better predict HIV V3 coreceptor tropism. HIV-BERT is a model refined from the ProtBert-BFD model (https://huggingface.co/Rostlab/prot_bert_bfd) to better fulfill HIV-centric tasks. This model was then trained using HIV V3 sequences from the Los Alamos HIV Sequence Database (https://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html), allowing even more precise prediction of V3 coreceptor tropism than the HIV-BERT model can provide.
23
 
24
  ## Model Description
25
 
@@ -29,7 +34,7 @@ The HIV-BERT-Coreceptor model is intended to predict the Co-receptor tropism of
29
 
30
  This tool can be used as a predictor of HIV tropism from the Env-V3 loop. It can natively recognize R5, X4, and dual-tropic viruses. It should not be considered a clinical diagnostic tool.
31
 
32
- This tool was trained using the Los Alamos HIV sequence dataset (https://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html). Because of how this database is sampled, it is predominantly composed of subtype B sequences from North America and Europe, with only minor contributions from subtypes C, A, and D. To date, no effort has been made to balance performance across these subtypes. As such, one should consider refinement with additional sequences to perform well on non-B sequences.
33
 
34
  ## How to use
35
 
@@ -37,17 +42,17 @@ This tool was trained using the Los Alamos HIV sequence dataset (https://www.hiv
37
 
38
  ## Training Data
39
 
40
- This model was trained on the 0th fold of the damlab/HIV_V3_coreceptor dataset. The dataset consists of 2935 V3 sequences (approximately 35 tokens each) extracted from the Los Alamos HIV Sequence database.
41
 
42
  ## Training Procedure
43
 
44
  ### Preprocessing
45
 
46
- As with the rostlab/Prot-bert-bfd model, the rare amino acids U, Z, O, and B were converted to X and spaces were added between each amino acid. All strings were concatenated and chunked into 256-token chunks for training. A random 20% of chunks were held out for validation.
47
 
48
  ### Training
49
 
50
- The damlab/HIV-BERT model was used as the initial weights for an AutoModelForSequenceClassification. The model was trained with a learning rate of 1E-5, 50K warm-up steps, and a cosine_with_restarts learning rate schedule, and training continued until 3 consecutive epochs did not improve the loss on the held-out dataset. As this is a multi-label classification task (a protein can bind to CCR5, CXCR4, neither, or both), the loss was calculated as the Binary Cross Entropy for each category. The BCE was weighted by the inverse of the class ratio to compensate for class imbalance.
51
 
52
  ## Evaluation Results
53
 
 
1
  ---
2
  license: mit
3
+ widget:
4
+ - text: 'C T R P N N N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C'
5
+ - text: 'C T R P N N N T R K S I H I G P G R A F Y T T G Q I I G D I R Q A Y C'
6
+ - text: 'C T R P N N N T R R S I R I G P G Q A F Y A T G D I I G D I R Q A H C'
7
+ - text: 'C G R P N N H R I K G L R I G P G R A F F A M G A I G G G E I R Q A H C'
8
+
9
  ---
10
 
11
+ # HIV_V3_coreceptor model
12
 
13
  ## Table of Contents
 
14
  - [Summary](#model-summary)
15
  - [Model Description](#model-description)
16
  - [Intended Uses & Limitations](#intended-uses-&-limitations)
 
24
 
25
  ## Summary
26
 
27
+ The HIV-BERT-Coreceptor model was trained as a refinement of the [HIV-BERT model](https://huggingface.co/damlab/HIV_BERT) and serves to better predict HIV V3 coreceptor tropism. HIV-BERT is a model refined from the [ProtBert-BFD model](https://huggingface.co/Rostlab/prot_bert_bfd) to better fulfill HIV-centric tasks. This model was then trained using HIV V3 sequences from the [Los Alamos HIV Sequence Database](https://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html), allowing even more precise prediction of V3 coreceptor tropism than the HIV-BERT model can provide.
28
 
29
  ## Model Description
30
 
 
34
 
35
  This tool can be used as a predictor of HIV tropism from the Env-V3 loop. It can natively recognize R5, X4, and dual-tropic viruses. It should not be considered a clinical diagnostic tool.
36
 
37
+ This tool was trained using the [Los Alamos HIV sequence dataset](https://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html). Because of how this database is sampled, it is predominantly composed of subtype B sequences from North America and Europe, with only minor contributions from subtypes C, A, and D. To date, no effort has been made to balance performance across these subtypes. As such, one should consider refinement with additional sequences to perform well on non-B sequences.
38
 
39
  ## How to use
40
 
 
42
 
43
  ## Training Data
44
 
45
+ This model was trained on the 0th fold of the [damlab/HIV_V3_coreceptor dataset](https://huggingface.co/datasets/damlab/HIV_V3_coreceptor). The dataset consists of 2935 V3 sequences (approximately 35 tokens each) extracted from the [Los Alamos HIV Sequence database](https://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html).
46
 
47
  ## Training Procedure
48
 
49
  ### Preprocessing
50
 
51
+ As with the [rostlab/Prot-bert-bfd model](https://huggingface.co/Rostlab/prot_bert_bfd), the rare amino acids U, Z, O, and B were converted to X and spaces were added between each amino acid. All strings were concatenated and chunked into 256-token chunks for training. A random 20% of chunks were held out for validation.
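
The residue-level preprocessing described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `preprocess_v3` is hypothetical, and upper-casing and whitespace stripping are assumptions added for robustness. The output format matches the space-separated strings in the `widget` examples of this README.

```python
import re


def preprocess_v3(sequence: str) -> str:
    """Map rare amino acids to X and put a space between residues.

    Hypothetical helper: the README only states that U, Z, O, and B
    become X and that spaces separate the amino acids.
    """
    # Assumption: drop any existing whitespace and normalize case first.
    compact = re.sub(r"\s+", "", sequence).upper()
    # Replace the rare amino acids U, Z, O, B with the unknown token X.
    cleaned = re.sub(r"[UZOB]", "X", compact)
    # One space between each residue, as ProtBert-style tokenizers expect.
    return " ".join(cleaned)


if __name__ == "__main__":
    print(preprocess_v3("CTRPNNNTRKSIRIQRGPGRAFVTIGKIGNMRQAHC"))
```

Because existing whitespace is stripped before joining, running the function on an already space-separated sequence leaves it unchanged.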
52
 
53
  ### Training
54
 
55
+ The [damlab/HIV-BERT model](https://huggingface.co/damlab/HIV_BERT) was used as the initial weights for an AutoModelForSequenceClassification. The model was trained with a learning rate of 1E-5, 50K warm-up steps, and a cosine_with_restarts learning rate schedule, and training continued until 3 consecutive epochs did not improve the loss on the held-out dataset. As this is a multi-label classification task (a protein can bind to CCR5, CXCR4, neither, or both), the loss was calculated as the Binary Cross Entropy for each category. The BCE was weighted by the inverse of the class ratio to compensate for class imbalance.
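
To make the weighted loss concrete, here is a small pure-Python sketch of a per-category binary cross-entropy with inverse-frequency weights. The README does not say exactly how the weights enter the loss, so this sketch assumes the weight for each category is the inverse of its positive fraction and that it scales only the positive term. The function names and the toy label matrix are hypothetical.

```python
import math


def inverse_ratio_weights(labels):
    """Per-category weight = 1 / (fraction of positive examples).

    `labels` is a list of rows, one 0/1 entry per category
    (for example, [CCR5, CXCR4]). Assumed weighting scheme.
    """
    n_examples = len(labels)
    n_categories = len(labels[0])
    weights = []
    for c in range(n_categories):
        positives = sum(row[c] for row in labels)
        # Fall back to 1.0 if a category has no positive examples.
        weights.append(n_examples / positives if positives else 1.0)
    return weights


def weighted_bce(logits, targets, weights):
    """Mean over categories of BCE with the positive term up-weighted."""
    total = 0.0
    for z, y, w in zip(logits, targets, weights):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid turns a logit into a probability
        total += -(w * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(logits)


if __name__ == "__main__":
    # Toy labels: every sequence binds CCR5, one of four also binds CXCR4.
    toy_labels = [[1, 0], [1, 1], [1, 0], [1, 0]]
    w = inverse_ratio_weights(toy_labels)
    print(w, weighted_bce([0.0, 0.0], [1, 1], w))
```

In this toy case the rarer CXCR4 label gets four times the weight of the CCR5 label, so missing a CXCR4-tropic sequence costs more than missing a CCR5-tropic one.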
56
 
57
  ## Evaluation Results
58