fm4bio-ning committed on
Commit 8ded19c
1 Parent(s): 5c4b303

Update README.md

Files changed (1)
  1. README.md +6 -6
README.md CHANGED
@@ -6,10 +6,10 @@ tags:
 
  AIDO.Protein stands as the largest protein foundation model in the world to date, trained on 1.2 trillion amino acids sourced from UniRef90 and ColabFoldDB.
 
- By leveraging MoE layers, AIDO.Protein efficiently scales to 16 billion parameters, delivering exceptional performance across a vast variety of tasks in protein sequence understanding and sequence generation. Remarkably, ADIO.Protein demonstrates exceptional capability despite being trained solely on single protein sequences. Across over 280 DMS protein fitness prediction tasks, our model outperforms previous state-of-the-art protein sequence models without MSA and achieves 99% of the performance of models that utilize MSA, , highlighting the strength of its learned representations.
+ By leveraging MoE layers, AIDO.Protein efficiently scales to 16 billion parameters, delivering strong performance across a wide variety of protein sequence understanding and generation tasks. Remarkably, AIDO.Protein achieves this despite being trained solely on single protein sequences. Across more than 280 DMS protein fitness prediction tasks, our model outperforms previous state-of-the-art protein sequence models that do not use MSA and reaches 99% of the performance of models that do, highlighting the strength of its learned representations.
 
  ## Model Architecture Details
- ADIO.Protein is a transformer encoder-only architecture with the dense MLP layer in each transformer block replaced by a sparse MoE layer. It uses single amino acid tokenization and is optimized using a masked languange modeling (MLM) training objective. For each token, 2 experts will be selectively activated by the top-2 rounting mechiansim.
+ AIDO.Protein is an encoder-only transformer in which the dense MLP layer of each transformer block is replaced by a sparse MoE layer. It uses single-amino-acid tokenization and is optimized with a masked language modeling (MLM) objective. For each token, two experts are selectively activated by a top-2 routing mechanism.
  <center><img src="proteinmoe_architecture.png" alt="An Overview of AIDO.Protein" style="width:70%; height:auto;" /></center>
  More architecture details are shown below:
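To make the top-2 routing concrete, the sketch below shows a sparse MoE feed-forward block of the kind described above. It is a minimal illustration only: the module name `Top2MoE`, the expert count, and the hidden sizes are placeholder assumptions, not the released AIDO.Protein configuration.

```python
# Minimal sketch of a top-2 routed sparse MoE feed-forward block.
# All dimensions and the expert count are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)  # per-token expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                                   # x: (batch, seq_len, d_model)
        tokens = x.reshape(-1, x.shape[-1])                  # flatten to (n_tokens, d_model)
        gates = F.softmax(self.router(tokens), dim=-1)       # routing probabilities
        weights, picks = gates.topk(self.k, dim=-1)          # top-2 experts per token
        weights = weights / weights.sum(-1, keepdim=True)    # renormalize the two gates
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):            # each expert sees only its routed tokens
            for slot in range(self.k):
                hit = picks[:, slot] == e
                if hit.any():
                    out[hit] += weights[hit, slot].unsqueeze(-1) * expert(tokens[hit])
        return out.reshape_as(x)
```

In the full 16B model this sparse layer stands in for the dense MLP of every transformer block, so only two experts' parameters are active per token even though all experts contribute capacity.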
 
@@ -23,8 +23,8 @@ More architecture details are shown below:
  | Vocab Size | 44 |
  | Context Length | 2048 |
 
- ## Pre-training of ADIO.Protein 16B
- Here we briefly introduce the details of pre-training of ADIO.Protein 16B. For more information, please refer to [our paper](/https://openreview.net/pdf?id=6VldeCDKpH)
+ ## Pre-training of AIDO.Protein 16B
+ Here we briefly introduce the pre-training of AIDO.Protein 16B. For more details, please refer to [our paper](https://openreview.net/pdf?id=6VldeCDKpH).
  ### Data
  Inspired by previous work, we initially trained AIDO.Protein on 1.2 trillion amino acids sourced from the combined UniRef90 and ColabFoldDB databases. Given the effectiveness of UniRef90 for previous protein language models and the observed benefit of continued training on domain-specific data for downstream task performance, AIDO.Protein was further trained on an additional 100 billion amino acids from UniRef90.
 
@@ -45,7 +45,7 @@ Inspired by previous work, We initially trained AIDO.Protein with 1.2 trillion a
  ### Tokenization
  We encode protein sequences at single-amino-acid resolution with a 44-token vocabulary, where 24 tokens represent amino acid types and 20 are special tokens. Sequences are also prefixed with a `[CLS]` token as a hook for downstream tasks.
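As a rough illustration of this scheme, the snippet below maps a sequence to one token id per residue and prepends `[CLS]`. The specific token names and ids are hypothetical; only the overall layout (24 residue tokens plus 20 special tokens, 44 in total) follows the description above.

```python
# Illustrative single-amino-acid tokenizer. Token names/ids are hypothetical;
# the real vocabulary has 44 entries (24 residue tokens + 20 special tokens).
SPECIAL_TOKENS = ["[CLS]", "[PAD]", "[MASK]", "[UNK]"]   # assumed subset of the 20 special tokens
RESIDUES = list("ACDEFGHIKLMNPQRSTVWY")                  # the 20 canonical amino acids
VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + RESIDUES)}

def tokenize(sequence: str) -> list[int]:
    """Prefix with [CLS], then emit one token id per amino acid."""
    ids = [VOCAB["[CLS]"]]
    ids += [VOCAB.get(res, VOCAB["[UNK]"]) for res in sequence.upper()]
    return ids

print(tokenize("MKTAYIAKQR"))  # 11 ids: [CLS] followed by 10 residue tokens
```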
 
- ## Evaluation of ADIO.Protein 16B
+ ## Evaluation of AIDO.Protein 16B
  We assess the advantages of pretraining AIDO.Protein 16B through experiments on more than 300 tasks from two important protein benchmarks, the xTrimoPGLM benchmark and the ProteinGym DMS benchmark, encompassing residue-level, sequence-level, and protein-protein interaction (PPI) level tasks. We further adapted our model for structure-conditioned protein sequence generation tasks.
 
  ## Results
@@ -113,7 +113,7 @@ For more information, visit: [Model Generator](https://github.com/genbio-ai/mode
  For more information, visit: Model Generator
 
  # Citation
- Please cite ADIO.Protein using the following BibTex code:
+ Please cite AIDO.Protein using the following BibTeX entry:
  ```
  @inproceedings{Sun2024mixture,
  title={Mixture of Experts Enable Efficient and Effective
 