probablybots committed
Commit 6749405
Parent: 04d6b6b

Update README.md

Files changed (1):
  1. README.md +31 -35
README.md CHANGED
@@ -2,9 +2,9 @@
 tags:
 - biology
 ---
- # AIDO.Protein 16B
+ # AIDO.Protein-16B

- AIDO.Protein stands as the largest protein foundation model in the world to date, trained on 1.2 trillion amino acids sourced from UniRef90 and ColabFoldDB.
+ AIDO.Protein-16B is a protein language model trained on 1.2 trillion amino acids sourced from UniRef90 and ColabFoldDB.

 By leveraging MoE layers, AIDO.Protein efficiently scales to 16 billion parameters, delivering strong performance across a wide variety of protein sequence understanding and generation tasks. Remarkably, AIDO.Protein achieves this despite being trained solely on single protein sequences: across over 280 DMS protein fitness prediction tasks, our model outperforms previous state-of-the-art protein sequence models that do not use MSAs and reaches 99% of the performance of models that do, highlighting the strength of its learned representations.
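To make the MoE idea in the paragraph above concrete, here is a brief, editor-added sketch of a sparsely routed feed-forward block in PyTorch. It is a generic top-2 routing layer with illustrative settings (`d_model=1024`, `n_experts=8`, and so on are assumptions, not AIDO.Protein's actual configuration), intended only to show why total parameters grow with the number of experts while per-token compute stays close to a dense layer:

```python
# Illustrative sketch only -- a generic top-2 routed mixture-of-experts (MoE)
# feed-forward block; dimensions and expert counts are made up, not AIDO.Protein's.
import torch
import torch.nn as nn


class MoEFeedForward(nn.Module):
    def __init__(self, d_model: int = 1024, d_ff: int = 4096, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        # Each expert is an ordinary feed-forward network.
        self.experts = nn.ModuleList(
            [
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            ]
        )
        self.router = nn.Linear(d_model, n_experts)  # per-token routing scores
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq_len, d_model]
        gate = torch.softmax(self.router(x), dim=-1)      # [batch, seq_len, n_experts]
        weights, indices = gate.topk(self.top_k, dim=-1)  # each token picks its top_k experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[..., k] == e               # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] = out[mask] + weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        # Total parameters grow with n_experts, but each token only runs top_k experts,
        # so per-token compute stays close to that of a single dense feed-forward block.
        return out


if __name__ == "__main__":
    layer = MoEFeedForward().eval()
    hidden = torch.randn(2, 16, 1024)   # [batch, seq_len, d_model]
    print(layer(hidden).shape)          # torch.Size([2, 16, 1024])
```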
 
@@ -23,7 +23,7 @@ More architecture details are shown below:
 | Vocab Size | 44 |
 | Context Length | 2048 |

- ## Pre-training of AIDO.Protein 16B
+ ## Pre-training of AIDO.Protein-16B
 Here we briefly introduce the details of pre-training AIDO.Protein 16B. For more information, please refer to [our paper](https://www.biorxiv.org/content/10.1101/2024.11.29.625425v1).
 ### Data
 Inspired by previous work, we initially trained AIDO.Protein on 1.2 trillion amino acids drawn from the combined UniRef90 and ColabFoldDB databases. Given the effectiveness of UniRef90 for previous protein language models and the observed benefit of continued training on domain-specific data for downstream task performance, AIDO.Protein was then trained on an additional 100 billion amino acids from UniRef90.
@@ -60,12 +60,19 @@ We assess the advantages of pretraining AIDO.Protein 16B through experiments acr


 ## How to Use
- ### Build any downstream models from this backbone
+ ### Build any downstream models from this backbone with ModelGenerator
+ For more information, visit: [Model Generator](https://github.com/genbio-ai/modelgenerator)
+ ```bash
+ mgen fit --model SequenceClassification --model.backbone aido_protein_16b --data SequenceClassificationDataModule --data.path <hf_or_local_path_to_your_dataset>
+ mgen test --model SequenceClassification --model.backbone aido_protein_16b --data SequenceClassificationDataModule --data.path <hf_or_local_path_to_your_dataset>
+ ```
+
+ ### Or use directly in Python
 #### Embedding
 ```python
- from genbio_finetune.tasks import Embed
- model = Embed.from_config({"model.backbone": "proteinfm"}).eval()
- collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
+ from modelgenerator.tasks import Embed
+ model = Embed.from_config({"model.backbone": "aido_protein_16b"}).eval()
+ collated_batch = model.collate({"sequences": ["HLLQ", "WRLD"]})
 embedding = model(collated_batch)
 print(embedding.shape)
 print(embedding)
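# Editor's note (illustrative addition, not from the original README): assuming the
# embedding above is a per-token tensor of shape [batch, sequence_length, hidden_dim],
# a fixed-size per-protein vector can be obtained by pooling over the sequence
# dimension, e.g.:
# per_protein_embedding = embedding.mean(dim=1)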
@@ -73,9 +80,9 @@ print(embedding)
 #### Sequence Level Classification
 ```python
 import torch
- from genbio_finetune.tasks import SequenceClassification
- model = SequenceClassification.from_config({"model.backbone": "proteinfm", "model.n_classes": 2}).eval()
- collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
+ from modelgenerator.tasks import SequenceClassification
+ model = SequenceClassification.from_config({"model.backbone": "aido_protein_16b", "model.n_classes": 2}).eval()
+ collated_batch = model.collate({"sequences": ["HLLQ", "WRLD"]})
 logits = model(collated_batch)
 print(logits)
 print(torch.argmax(logits, dim=-1))
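# Editor's note (illustrative addition, not from the original README): the model
# returns raw logits; per-class probabilities, if needed, follow from a softmax:
# probs = torch.softmax(logits, dim=-1)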
@@ -83,44 +90,33 @@ print(torch.argmax(logits, dim=-1))
 #### Token Level Classification
 ```python
 import torch
- from genbio_finetune.tasks import TokenClassification
- model = TokenClassification.from_config({"model.backbone": "proteinfm", "model.n_classes": 3}).eval()
- collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
+ from modelgenerator.tasks import TokenClassification
+ model = TokenClassification.from_config({"model.backbone": "aido_protein_16b", "model.n_classes": 3}).eval()
+ collated_batch = model.collate({"sequences": ["HLLQ", "WRLD"]})
 logits = model(collated_batch)
 print(logits)
 print(torch.argmax(logits, dim=-1))
 ```
 #### Regression
 ```python
- from genbio_finetune.tasks import SequenceRegression
- model = SequenceRegression.from_config({"model.backbone": "proteinfm"}).eval()
- collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
+ from modelgenerator.tasks import SequenceRegression
+ model = SequenceRegression.from_config({"model.backbone": "aido_protein_16b"}).eval()
+ collated_batch = model.collate({"sequences": ["HLLQ", "WRLD"]})
 logits = model(collated_batch)
 print(logits)
 ```
- #### Protein-Protein Interaction
-
- #### Or use our one-liner CLI to finetune or evaluate any of the above!
- ```
- gbft fit --model SequenceClassification --model.backbone proteinfm --data SequenceClassification --data.path <hf_or_local_path_to_your_dataset>
- gbft test --model SequenceClassification --model.backbone proteinfm --data SequenceClassification --data.path <hf_or_local_path_to_your_dataset>
- ```
- For more information, visit: [Model Generator](https://github.com/genbio-ai/modelgenerator)
-
-
- # Or use our one-liner CLI to finetune or evaluate any of the above
-
- For more information, visit: Model Generator

 # Citation
 Please cite AIDO.Protein using the following BibTeX code:
 ```
- @inproceedings{Sun2024mixture,
- title={Mixture of Experts Enable Efficient and Effective
- Protein Understanding and Design},
- author={Ning Sun, Shuxian Zou, Tianhua Tao, Sazan Mahbub, Dian Li, Yonghao Zhuang, Hongyi Wang, Le Song, Eric P. Xing},
- booktitle={NeurIPS 2024 Workshop on AI for New Drug Modalities},
- year={2024}
+ @inproceedings{sun_mixture_2024,
+ title = {Mixture of Experts Enable Efficient and Effective Protein Understanding and Design},
+ url = {https://www.biorxiv.org/content/10.1101/2024.11.29.625425v1},
+ doi = {10.1101/2024.11.29.625425},
+ publisher = {bioRxiv},
+ author = {Sun, Ning and Zou, Shuxian and Tao, Tianhua and Mahbub, Sazan and Li, Dian and Zhuang, Yonghao and Wang, Hongyi and Cheng, Xingyi and Song, Le and Xing, Eric P.},
+ year = {2024},
+ booktitle = {NeurIPS 2024 Workshop on AI for New Drug Modalities},
 }
 ```
 
 