fm4bio-ning committed
Commit 831b1b4 • Parent(s): 8ded19c
Update README.md
README.md CHANGED
@@ -24,7 +24,7 @@ More architecture details are shown below:
 | Context Length | 2048 |

 ## Pre-training of AIDO.Protein 16B
-Here we briefly introduce the details of pre-training of AIDO.Protein 16B. For more information, please refer to [our paper](
+Here we briefly introduce the details of pre-training of AIDO.Protein 16B. For more information, please refer to [our paper](https://www.biorxiv.org/content/10.1101/2024.11.29.625425v1).
 ### Data
 Inspired by previous work, we initially trained AIDO.Protein with 1.2 trillion amino acids sourced from the combination of the UniRef90 and ColabFoldDB databases. Given the effectiveness of UniRef90 for previous protein language models and the observed benefits of continued training on domain-specific data for enhancing downstream task performance, AIDO.Protein was further trained on an additional 100 billion amino acids from UniRef90.

@@ -43,7 +43,7 @@ Inspired by previous work, We initially trained AIDO.Protein with 1.2 trillion a
 | 3rd Stage Num Tokens | 100 billion |

 ### Tokenization
-We encode protein sequence with single amino acid resolution with 44 vocabularies, where 24 tokens represent amino acid types and 20 are special tokens. Sequences were also
+We encode protein sequences at single amino acid resolution with a 44-token vocabulary, where 24 tokens represent amino acid types and 20 are special tokens. Sequences are also suffixed with a `[SEP]` token as a hook for downstream tasks.

 ## Evaluation of AIDO.Protein 16B
 We assess the advantages of pretraining AIDO.Protein 16B through experiments across more than 300 tasks from two important protein benchmarks, the xTrimoPGLM benchmark and the ProteinGym DMS benchmark, encompassing residue-level, sequence-level, and protein-protein interaction (PPI) level tasks. We further adapted our model for structure-conditioned protein sequence generation tasks.
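As an illustration of the tokenization scheme described above (one token per amino acid, a 44-token vocabulary, and a trailing `[SEP]` hook), here is a minimal sketch. It assumes the checkpoint is published on the Hugging Face Hub under an ID such as `genbio-ai/AIDO.Protein-16B` and that it loads through the standard `transformers` `AutoTokenizer` interface; both are assumptions, not stated in this commit.

```python
# Minimal sketch of single-amino-acid tokenization as described in the README.
# Assumptions (not confirmed by this commit): the checkpoint lives on the
# Hugging Face Hub under "genbio-ai/AIDO.Protein-16B" and ships a tokenizer
# loadable via transformers; trust_remote_code may or may not be required.
from transformers import AutoTokenizer

MODEL_ID = "genbio-ai/AIDO.Protein-16B"  # assumed model ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy amino acid sequence
encoded = tokenizer(sequence)

# One token per residue, plus any special tokens the tokenizer appends
# (the README notes a trailing [SEP] used as a hook for downstream tasks).
print(len(tokenizer))  # vocabulary size; the README states 44 tokens total
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```

If the released tokenizer handles special tokens differently, only the inspection lines need to change; the encode call itself is the standard `transformers` interface.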