Update README.md
README.md
CHANGED
@@ -146,7 +146,7 @@ Contrary to BERT, the masking is done dynamically during pretraining (e.g., it c

### Pretraining

-The model was trained on TPUv3-8 VM, sponsored by the [Google TPU Research Cloud](https://sites.research.google/trc/about/), for 270k steps (a bit over 1 epoch) with a sequence length of 128 and continuing for 180k steps with a sequence length of 512. The optimizer used was Adafactor (to save memory). Learning rate was 2e-4, \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.98\\) and \\(\epsilon = 1e-6\\), learning rate warmup for 2500 steps and linear decay of the learning rate after.
+The model was trained on a TPUv3-8 VM, sponsored by the [Google TPU Research Cloud](https://sites.research.google/trc/about/), for 270k steps (a bit over 1 epoch, batch size 512) with a sequence length of 128, continuing for 180k steps (batch size 64) with a sequence length of 512. The optimizer used was Adafactor (to save memory). The learning rate was 2e-4 with \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.98\\) and \\(\epsilon = 1e-6\\), warmed up over 2500 steps and decayed linearly afterwards.

## Evaluation results

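As an illustration of the schedule described in the updated pretraining paragraph, here is a minimal sketch using optax; this is an assumption, not the authors' training script, which is not part of this commit. The learning rate, warmup length, and step counts are taken from the README; Adafactor's factored second-moment statistics do not map one-to-one onto the listed \\(\beta_{2}\\) and \\(\epsilon\\), so those are omitted.

```python
# Illustrative sketch only, not the authors' training code.
# Assumes JAX + optax; values are taken from the README paragraph above.
import optax

PEAK_LR = 2e-4
WARMUP_STEPS = 2_500
TOTAL_STEPS = 270_000  # first phase (sequence length 128); whether the
                       # 180k-step 512-length phase reuses the same schedule
                       # is an assumption

# Linear warmup to the peak learning rate, then linear decay to zero.
warmup = optax.linear_schedule(
    init_value=0.0, end_value=PEAK_LR, transition_steps=WARMUP_STEPS)
decay = optax.linear_schedule(
    init_value=PEAK_LR, end_value=0.0,
    transition_steps=TOTAL_STEPS - WARMUP_STEPS)
schedule = optax.join_schedules([warmup, decay], boundaries=[WARMUP_STEPS])

# Adafactor is used to save optimizer memory, as stated in the README.
optimizer = optax.adafactor(learning_rate=schedule)
```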