fm4bio-ning committed
Commit 75a364e
Parent: 4a40b55

Update README.md

Files changed (1): README.md (+7, -2)
README.md CHANGED
@@ -28,13 +28,19 @@ Here we briefly introduce the details of pre-training of AIDO.Protein 16B. For m
 ### Data
 Inspired by previous work, we initially trained AIDO.Protein on 1.2 trillion amino acids sourced from a combination of the UniRef90 and ColabFoldDB databases. Given the effectiveness of UniRef90 for previous protein language models and the observed benefits of continued training on domain-specific data for enhancing downstream task performance, AIDO.Protein was further trained on an additional 100 billion amino acids from UniRef90.
 
-###Training Details
+### Training Details
 The weights of our 16-billion-parameter model occupy over 200 GB of memory in 32-bit precision. To train a model of this size, we used model and tensor parallelism to split training across 256 H100 GPUs using the Megatron-LM framework. We also employed bfloat16 mixed-precision training to allow training with a large context length at scale. With this configuration, AIDO.Protein 16B took 25 days to train.
 | Hyper-params | Value |
 | ------------- |:-------------:|
 | Global Batch Size | 2048 |
 | Per Device Micro Batch Size | 8 |
 | Precision | Mixed FP32-BF16 |
+| 1st Stage LR | [2e-6, 2e-4] |
+| 2nd Stage LR | [1e-6, 1e-5] |
+| 3rd Stage LR | [1e-6, 1e-5] |
+| 1st Stage Num Tokens | 1 trillion |
+| 2nd Stage Num Tokens | 200 billion |
+| 3rd Stage Num Tokens | 100 billion |
 
 ### Tokenization
 We encode protein sequences at single-amino-acid resolution with a 44-token vocabulary, where 24 tokens represent amino acid types and 20 are special tokens. Sequences are also prefixed with a `[CLS]` token as a hook for downstream tasks.
@@ -48,7 +54,6 @@ We assess the advantages of pretraining AIDO.Protein 16B through experiments acr
 - ProteinGym DMS Benchmark
 
 - Inverse Folding Generation
-<center><img src="" alt="Downstream results of DNA FM 7B" style="width:70%; height:auto;" /></center>
 
 ## How to Use
 ### Build any downstream models from this backbone
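
To make the parallelism setup in the training-details paragraph above concrete, here is a minimal arithmetic sketch of how a global batch size of 2048 can decompose across 256 GPUs under Megatron-LM-style model/tensor parallelism. The tensor- and pipeline-parallel degrees are assumptions chosen for illustration; only the GPU count, global batch size, and per-device micro batch size come from the table.

```python
# Illustrative sketch only -- not the authors' training configuration.
# Values marked "assumed" are not stated in the README.

NUM_GPUS = 256            # from the README
GLOBAL_BATCH_SIZE = 2048  # from the hyper-parameter table
MICRO_BATCH_SIZE = 8      # per-device micro batch size, from the table

TENSOR_PARALLEL = 8       # assumed: each 16B replica sharded over 8 GPUs
PIPELINE_PARALLEL = 1     # assumed: no pipeline parallelism

# Each model replica occupies TENSOR_PARALLEL * PIPELINE_PARALLEL GPUs;
# the remaining factor of the GPU count is the data-parallel width.
data_parallel = NUM_GPUS // (TENSOR_PARALLEL * PIPELINE_PARALLEL)      # 32

# Gradient-accumulation steps needed so that
# data_parallel * MICRO_BATCH_SIZE * grad_accum == GLOBAL_BATCH_SIZE.
grad_accum = GLOBAL_BATCH_SIZE // (data_parallel * MICRO_BATCH_SIZE)   # 8

print(f"{data_parallel} data-parallel replicas, {grad_accum} grad-accumulation steps")
```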
 
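The tokenization paragraph in the diff above (one token per amino acid, a `[CLS]` prefix, 44 tokens in total) can be illustrated with a small sketch. The exact token strings and ID ordering of AIDO.Protein's vocabulary are not given in this commit, so the residue alphabet, the handful of special tokens shown, and all IDs below are assumptions.

```python
# Toy tokenizer sketch for a [CLS]-prefixed, per-residue encoding.
# The real AIDO.Protein vocabulary has 44 tokens (24 residue types + 20 specials);
# the alphabet and the few special tokens below are assumed for illustration.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWYBXZU"  # 24 residue symbols (assumed set)
CLS, PAD, MASK, UNK = "[CLS]", "[PAD]", "[MASK]", "[UNK]"  # assumed special tokens

vocab = {tok: i for i, tok in enumerate([CLS, PAD, MASK, UNK, *AMINO_ACIDS])}

def encode(sequence: str) -> list[int]:
    """Prefix with [CLS], then map each residue to a token ID."""
    return [vocab[CLS]] + [vocab.get(aa, vocab[UNK]) for aa in sequence.upper()]

print(encode("MKTAYIAKQR"))  # 11 IDs: [CLS] plus one per residue (IDs are illustrative)
```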