aapot committed on
Commit caaf02d
1 Parent(s): 958a0dc

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -146,7 +146,7 @@ Contrary to BERT, the masking is done dynamically during pretraining (e.g., it c
 
 ### Pretraining
 
-The model was trained on a TPUv3-8 VM, sponsored by the [Google TPU Research Cloud](https://sites.research.google/trc/about/), for 270k steps (a bit over 1 epoch) with a sequence length of 128, continuing for 180k steps with a sequence length of 512. The optimizer used was Adafactor (to save memory). The learning rate was 2e-4 with \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.98\\) and \\(\epsilon = 1e-6\\), warmed up over 2500 steps and decayed linearly afterwards.
+The model was trained on a TPUv3-8 VM, sponsored by the [Google TPU Research Cloud](https://sites.research.google/trc/about/), for 270k steps (a bit over 1 epoch, batch size 512) with a sequence length of 128, continuing for 180k steps (batch size 64) with a sequence length of 512. The optimizer used was Adafactor (to save memory). The learning rate was 2e-4 with \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.98\\) and \\(\epsilon = 1e-6\\), warmed up over 2500 steps and decayed linearly afterwards.
 
 ## Evaluation results
 
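For readers who want to reproduce the learning-rate schedule described in the changed paragraph, the hyperparameters translate into a short optimizer setup. This is a minimal sketch assuming an optax/JAX training stack (the commit does not include the actual training script); the constant names are illustrative.

```python
import optax

WARMUP_STEPS = 2_500    # warmup length from the model card
TOTAL_STEPS = 270_000   # first pretraining phase (sequence length 128)
PEAK_LR = 2e-4          # peak learning rate from the model card

# Linear warmup for 2500 steps, then linear decay to zero for the rest
# of training, composed from two linear segments.
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(0.0, PEAK_LR, transition_steps=WARMUP_STEPS),
        optax.linear_schedule(PEAK_LR, 0.0, transition_steps=TOTAL_STEPS - WARMUP_STEPS),
    ],
    boundaries=[WARMUP_STEPS],
)

# Adafactor as the memory-saving optimizer. Note: the beta/epsilon values
# quoted in the model card are Adam-style hyperparameters and are not direct
# optax.adafactor arguments, so only the learning-rate schedule is wired here.
optimizer = optax.adafactor(learning_rate=schedule)
```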