rsuresh2011 committed on
Commit
f17110c
1 Parent(s): 40e9d76

Update README.md

Files changed (1)
  1. README.md +71 -2
README.md CHANGED
@@ -1,3 +1,72 @@
1
  ---
2
- license: apache-2.0
3
- ---
1
  ---
2
+ # Model Card for Antibody Generator (Based on ProGen2)
3
+
4
+ ## Model Details
5
+ - Model Name: Antibody Generator
6
+ - Version: 1.0
7
+ - Release Date: 12/15/2023
8
+ - Model Developers: Joseph Roberts, David Noble, Rahul Suresh, Neel Patel
9
+ - Model Type: Protein Generation, based on the ProGen2 architecture.
10
+ - License: Apache 2.0
11
+ - Code Repository: https://github.com/joethequant/docker_protein_generator, https://github.com/joethequant/docker_streamlit_antibody_protein_generation
12
+ - Baseline Model Reference: [ProGen2 Paper](https://arxiv.org/pdf/2206.13517.pdf)
13
+
14
+ ## Model Overview
15
+ The Antibody Generator is a protein generation model specialized for designing therapeutic antibodies. It is based on ProGen2, a family of protein language models developed by Salesforce as the successor to the original ProGen model released in 2020. According to the ProGen2 paper, the models scale up to 6.4B parameters, are pre-trained on sequences drawn from over a billion proteins, and achieve state-of-the-art performance in generating novel, viable protein sequences and predicting protein fitness.
16
+
17
+ ## Intended Use
18
+ - Primary Use Case: Generation of therapeutic antibody sequences for use in immunology, vaccine development, and medical treatments.
19
+ - Target Users: Researchers and practitioners in bioinformatics, molecular biology, and related fields.
20
+
21
+ ## Training Data
22
+ - Baseline Model Data: ProGen2 was trained on protein sequences from genomic, metagenomic, and immune repertoire databases, drawn from over a billion proteins in total according to the ProGen2 paper.
23
+ - Fine-tuning Data: The Structural Antibody Database (SAbDab) was used for fine-tuning. The fine-tuning set comprises approximately 5,000 experimentally resolved crystal structures of antibodies and their antigens.
24
+
25
+ ## Model Variants
26
+
27
+ Models are labeled as `progen2_<size>_<finetuning_type>_<prompting_type>`.
28
+
29
+ 1. Size: The size of the ProGen2 base model that was used. There are 4 variants available:
30
+    1. Small: 151M params
31
+    2. Medium: 764M params
32
+    3. Large: 2.7B params
33
+    4. xLarge: 6.4B params
34
+ 2. Finetuning_type: How the base model was finetuned. 3 types are supported:
35
+    1. No finetuning
36
+    2. Simple finetuning: The base model is finetuned on approximately 5,000 experimentally resolved crystal structures of antibodies and their antigens, using the hyperparameters listed under Model Hyperparameters.
37
+    3. Frozen layer finetuning: The base model is finetuned on the same data with the same hyperparameters, but all layers except the last 3 are frozen to avoid overfitting.
38
+ 3. Prompting_type: Whether the model was given a prompt during inference.
39
+    1. Prompted: Prompt engineering is used to steer generation toward therapeutic antibody sequences.
40
+    2. Zeroshot: No prompt is provided.
41
+
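The naming scheme above can be handled programmatically. The following is a minimal Python sketch, not code from the project repositories: the helper names (`ModelVariant`, `build_name`, `parse_name`) and the exact label strings for each field (for example `simple` or `frozen`) are assumptions made for illustration. Only the `progen2_<size>_<finetuning_type>_<prompting_type>` pattern comes from this card.

```python
from dataclasses import dataclass

# Label sets are assumptions for illustration: the card names the variants
# but does not list the exact strings that appear in model names.
SIZES = ("small", "medium", "large", "xlarge")
FINETUNING_TYPES = ("nofinetune", "simple", "frozen")  # hypothetical labels
PROMPTING_TYPES = ("prompted", "zeroshot")


@dataclass(frozen=True)
class ModelVariant:
    size: str
    finetuning_type: str
    prompting_type: str


def build_name(variant: ModelVariant) -> str:
    """Join the three fields into a progen2_<size>_<finetuning>_<prompting> label."""
    return (
        f"progen2_{variant.size}_{variant.finetuning_type}_{variant.prompting_type}"
    )


def parse_name(name: str) -> ModelVariant:
    """Split a label back into its fields, rejecting anything off-pattern."""
    parts = name.split("_")
    if len(parts) != 4 or parts[0] != "progen2":
        raise ValueError(f"not a progen2 variant label: {name!r}")
    _, size, finetuning_type, prompting_type = parts
    if size not in SIZES:
        raise ValueError(f"unknown size: {size!r}")
    if finetuning_type not in FINETUNING_TYPES:
        raise ValueError(f"unknown finetuning type: {finetuning_type!r}")
    if prompting_type not in PROMPTING_TYPES:
        raise ValueError(f"unknown prompting type: {prompting_type!r}")
    return ModelVariant(size, finetuning_type, prompting_type)
```

Validating each field on parse keeps a mistyped label from silently selecting the wrong checkpoint. Adjust the label tuples to match the names actually published.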
42
+ ## Model Hyperparameters
43
+ - Batch size: 40
44
+ - Epochs: 10
45
+ - Learning rate: 0.00001
46
+
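To make the scale of a finetuning run concrete, the sketch below collects the hyperparameters above in a small config object and derives the number of optimizer steps. It is an illustrative assumption, not the project's training code: `FinetuneConfig`, `total_steps`, and `trainable_layer_indices` are hypothetical names, and the step count assumes one optimizer step per batch with no gradient accumulation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneConfig:
    # Values taken from the hyperparameter list in this card.
    batch_size: int = 40
    epochs: int = 10
    learning_rate: float = 1e-5
    # Hypothetical field: number of final layers left trainable.
    # 0 means every layer is trained (simple finetuning).
    trainable_last_n_layers: int = 0


def total_steps(cfg: FinetuneConfig, n_samples: int) -> int:
    """Optimizer steps for a full run, assuming one step per batch."""
    steps_per_epoch = math.ceil(n_samples / cfg.batch_size)
    return steps_per_epoch * cfg.epochs


def trainable_layer_indices(cfg: FinetuneConfig, n_layers: int) -> list[int]:
    """Indices of layers that stay trainable; all other layers are frozen."""
    if cfg.trainable_last_n_layers <= 0:
        return list(range(n_layers))
    first = max(0, n_layers - cfg.trainable_last_n_layers)
    return list(range(first, n_layers))
```

With roughly 5,000 training samples, `total_steps(FinetuneConfig(), 5000)` gives 125 steps per epoch and 1,250 steps over 10 epochs. For frozen layer finetuning, `FinetuneConfig(trainable_last_n_layers=3)` on a hypothetical 12-layer model leaves layers 9, 10, and 11 trainable.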
47
+ ## Evaluation and Performance
48
+ Evaluation Tools:
49
+ 1. ANARCI: Generated sequences are checked with [ANARCI](https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabpred/anarci/), a tool for antibody numbering and receptor classification. ANARCI is used to test whether each generated sequence conforms to known antibody sequence patterns and frameworks. This check helps ensure that generated sequences are not only novel but also biologically plausible and potentially functional.
50
+ 2. Diversity Score: We measure how diverse each model's outputs are by computing the sequence similarity for every possible pair of generated candidates. The average of this distribution indicates how widely the model's outputs vary, which is useful for downstream evaluation of the candidates. We compute the average similarity both over the entire variable sequence and over the HCDR3 region alone.
51
+
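The diversity score described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the evaluation code used for this card: the card does not say which similarity measure was used, so the sketch substitutes `difflib.SequenceMatcher.ratio()` from the standard library as a stand-in, and the function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two sequences (stand-in metric)."""
    # autojunk=False disables difflib's heuristic that ignores very
    # frequent characters in long sequences.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def mean_pairwise_similarity(sequences: list[str]) -> float:
    """Average similarity over every unordered pair of sequences."""
    pairs = list(combinations(sequences, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return mean(similarity(a, b) for a, b in pairs)
```

A lower average means more diverse outputs. The same function applies to full variable-region sequences or to HCDR3 substrings. Extracting the HCDR3 region is a separate step, for example from a numbering tool's output, and is not shown here.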
52
+ Performance and analytics:
53
+
54
+ ![Model_Performance](model_perf.png)
55
+
56
+
57
+ ## Ethical Considerations
58
+ - Use Case Limitations: Generated antibodies should be validated experimentally before clinical or research applications.
59
+ - Misuse Potential: Users should be aware of the potential misuse of generated sequences in harmful applications.
60
+
61
+ ## How to Use
62
+ Instructions on how to use the model, including example prompts and API documentation, are available in the [Code Repository](https://github.com/joethequant/docker_streamlit_antibody_protein_generation).
63
+
64
+ ## Limitations and Future Work
65
+ - Predictions require experimental validation for practical use.
66
+ - Future work will focus on incorporating more diverse training data and on better predicting the efficacy of generated antibodies.
67
+
68
+ ## Contact Information
69
+ For questions or feedback regarding this model, please contact [XYZ].
70
+
71
+
72
+