librarian-bot committed on
Commit
5d7deea
1 Parent(s): cc42f68

Librarian Bot: Update dataset YAML metadata for model

This is a pull request to add a dataset, [`damlab/HIV_PI`](https://huggingface.co/datasets/damlab/HIV_PI), to the metadata for your model (defined in the `YAML` block of your model's `README.md`).
The pull request was made by [librarian-bot](https://huggingface.co/librarian-bot) and used a combination of rules and/or machine learning to suggest this additional metadata.
If this suggestion is incorrect, feel free to close this pull request.

Librarian Bot was made by [@davanstrien](https://huggingface.co/davanstrien); feel free to get in touch with feedback.

Files changed (1)
  1. README.md +55 -56
README.md CHANGED
@@ -1,57 +1,56 @@
-
- ---
- license: mit
-
- ---
-
- # HIV_PR_resist model
-
- ## Table of Contents
- - [Summary](#summary)
- - [Model Description](#model-description)
- - [Intended Uses & Limitations](#intended-uses-&-limitations)
- - [How to Use](#how-to-use)
- - [Training Data](#training-data)
- - [Training Procedure](#training-procedure)
-   - [Preprocessing](#preprocessing)
-   - [Training](#training)
- - [Evaluation Results](#evaluation-results)
- - [BibTeX Entry and Citation Info](#bibtex-entry-and-citation-info)
-
- ## Summary
-
- The HIV-BERT-Protease-Resistance model was trained as a refinement of the [HIV-BERT model](https://huggingface.co/damlab/HIV_BERT) and serves to better predict whether an HIV protease sequence will be resistant to certain protease inhibitors. HIV-BERT is a model refined from the [ProtBert-BFD model](https://huggingface.co/Rostlab/prot_bert_bfd) to better fulfill HIV-centric tasks. This model was then trained using HIV protease sequences from the [Stanford HIV Genotype-Phenotype Database](https://hivdb.stanford.edu/pages/genotype-phenotype.html), allowing more precise prediction of protease inhibitor resistance than the HIV-BERT model can provide.
-
- ## Model Description
-
- The HIV-BERT-Protease-Resistance model is intended to predict the likelihood that an HIV protease sequence will be resistant to protease inhibitors. The HIV protease is responsible for cleaving viral proteins into their active states, and as such is an ideal target for antiretroviral therapy. Annotation programs designed to predict and identify protease resistance using known mutations already exist, but with varied results. The HIV-BERT-Protease-Resistance model is designed to provide an alternative, NLP-based mechanism for predicting resistance mutations when provided with an HIV protease sequence.
-
- ## Intended Uses & Limitations
-
- This tool can be used as a predictor of protease resistance mutations within an HIV genomic sequence. It should not be considered a clinical diagnostic tool.
-
- ## How to Use
-
- *Prediction example of protease sequences*
-
- ## Training Data
-
- This model was trained on fold 0 of the [damlab/HIV_PI dataset](https://huggingface.co/datasets/damlab/HIV_PI). The dataset consists of 1959 sequences (approximately 99 tokens each) extracted from the Stanford HIV Genotype-Phenotype Database.
-
- ## Training Procedure
-
- ### Preprocessing
-
- As with the [Rostlab/prot_bert_bfd model](https://huggingface.co/Rostlab/prot_bert_bfd), the rare amino acids U, Z, O, and B were converted to X and spaces were added between each amino acid. All strings were concatenated and chunked into 256-token chunks for training. A random 20% of chunks were held out for validation.
-
- ### Training
-
- The [damlab/HIV_BERT model](https://huggingface.co/damlab/HIV_BERT) was used as the initial weights for an `AutoModelForSequenceClassification`. The model was trained with a learning rate of 1E-5, 50K warm-up steps, and a cosine_with_restarts learning rate schedule, and training continued until 3 consecutive epochs did not improve the loss on the held-out dataset. As this is a multi-label classification task (a protein can be resistant to multiple drugs), the loss was calculated as the Binary Cross Entropy (BCE) for each category. The BCE was weighted by the inverse of the class ratio to compensate for class imbalance.
-
- ## Evaluation Results
-
- *Need to add*
-
- ## BibTeX Entry and Citation Info
-
+ ---
+ license: mit
+ datasets: damlab/HIV_PI
+ ---
+
+ # HIV_PR_resist model
+
+ ## Table of Contents
+ - [Summary](#summary)
+ - [Model Description](#model-description)
+ - [Intended Uses & Limitations](#intended-uses-&-limitations)
+ - [How to Use](#how-to-use)
+ - [Training Data](#training-data)
+ - [Training Procedure](#training-procedure)
+   - [Preprocessing](#preprocessing)
+   - [Training](#training)
+ - [Evaluation Results](#evaluation-results)
+ - [BibTeX Entry and Citation Info](#bibtex-entry-and-citation-info)
+
+ ## Summary
+
+ The HIV-BERT-Protease-Resistance model was trained as a refinement of the [HIV-BERT model](https://huggingface.co/damlab/HIV_BERT) and serves to better predict whether an HIV protease sequence will be resistant to certain protease inhibitors. HIV-BERT is a model refined from the [ProtBert-BFD model](https://huggingface.co/Rostlab/prot_bert_bfd) to better fulfill HIV-centric tasks. This model was then trained using HIV protease sequences from the [Stanford HIV Genotype-Phenotype Database](https://hivdb.stanford.edu/pages/genotype-phenotype.html), allowing more precise prediction of protease inhibitor resistance than the HIV-BERT model can provide.
+
+ ## Model Description
+
+ The HIV-BERT-Protease-Resistance model is intended to predict the likelihood that an HIV protease sequence will be resistant to protease inhibitors. The HIV protease is responsible for cleaving viral proteins into their active states, and as such is an ideal target for antiretroviral therapy. Annotation programs designed to predict and identify protease resistance using known mutations already exist, but with varied results. The HIV-BERT-Protease-Resistance model is designed to provide an alternative, NLP-based mechanism for predicting resistance mutations when provided with an HIV protease sequence.
+
+ ## Intended Uses & Limitations
+
+ This tool can be used as a predictor of protease resistance mutations within an HIV genomic sequence. It should not be considered a clinical diagnostic tool.
+
+ ## How to Use
+
+ *Prediction example of protease sequences*
+
+ ## Training Data
+
+ This model was trained on fold 0 of the [damlab/HIV_PI dataset](https://huggingface.co/datasets/damlab/HIV_PI). The dataset consists of 1959 sequences (approximately 99 tokens each) extracted from the Stanford HIV Genotype-Phenotype Database.
+
+ ## Training Procedure
+
+ ### Preprocessing
+
+ As with the [Rostlab/prot_bert_bfd model](https://huggingface.co/Rostlab/prot_bert_bfd), the rare amino acids U, Z, O, and B were converted to X and spaces were added between each amino acid. All strings were concatenated and chunked into 256-token chunks for training. A random 20% of chunks were held out for validation.
+
+ ### Training
+
+ The [damlab/HIV_BERT model](https://huggingface.co/damlab/HIV_BERT) was used as the initial weights for an `AutoModelForSequenceClassification`. The model was trained with a learning rate of 1E-5, 50K warm-up steps, and a cosine_with_restarts learning rate schedule, and training continued until 3 consecutive epochs did not improve the loss on the held-out dataset. As this is a multi-label classification task (a protein can be resistant to multiple drugs), the loss was calculated as the Binary Cross Entropy (BCE) for each category. The BCE was weighted by the inverse of the class ratio to compensate for class imbalance.
+
+ ## Evaluation Results
+
+ *Need to add*
+
+ ## BibTeX Entry and Citation Info
+
  [More Information Needed]
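
The preprocessing and loss weighting described in the README's Training Procedure section can be sketched in plain Python. This is a minimal illustration, not the authors' training code: the function names `preprocess`, `positive_weights`, and `weighted_bce` are hypothetical, and reading "inverse of the class ratio" as the number of negatives divided by the number of positives for each drug is an assumption.

```python
import math
import re
from typing import List, Sequence

# Rare amino acids that the README says are mapped to X.
RARE_AMINO_ACIDS = re.compile(r"[UZOB]")


def preprocess(sequence: str) -> str:
    """Replace U, Z, O, B with X and put a space between residues."""
    cleaned = RARE_AMINO_ACIDS.sub("X", sequence.strip().upper())
    return " ".join(cleaned)


def positive_weights(labels: Sequence[Sequence[int]]) -> List[float]:
    """Per-drug weight for the positive class (assumed: negatives / positives)."""
    n_rows = len(labels)
    n_drugs = len(labels[0])
    weights = []
    for j in range(n_drugs):
        positives = sum(row[j] for row in labels)
        negatives = n_rows - positives
        # Fall back to 1.0 when a drug has no resistant examples.
        weights.append(negatives / positives if positives else 1.0)
    return weights


def weighted_bce(
    probs: Sequence[float],
    targets: Sequence[int],
    pos_weight: Sequence[float],
) -> float:
    """Mean binary cross entropy over drugs, up-weighting the positive term."""
    eps = 1e-12
    total = 0.0
    for p, y, w in zip(probs, targets, pos_weight):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(w * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# Toy usage: four sequences labelled for two hypothetical drugs.
toy_labels = [[1, 0], [0, 0], [0, 1], [0, 0]]
weights = positive_weights(toy_labels)
loss = weighted_bce([0.5, 0.5], [1, 0], weights)
spaced = preprocess("PQITLUB")
```

In a PyTorch setup, the same per-drug weights could be passed as the `pos_weight` argument of `torch.nn.BCEWithLogitsLoss` rather than computing the loss by hand.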