Update README.md
README.md
CHANGED
@@ -146,7 +146,7 @@ Contrary to BERT, the masking is done dynamically during pretraining (e.g., it c

### Pretraining

-The model was trained on TPUv3-8 VM, sponsored by the [Google TPU Research Cloud](https://sites.research.google/trc/about/), for 270k steps (a bit over 1 epoch) with a sequence length of 128 and continuing for 180k steps with a sequence length of 512. The optimizer used was Adafactor (to save memory). Learning rate was 2e-4, \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.98\\) and \\(\epsilon = 1e-6\\), learning rate warmup for 2500 steps and linear decay of the learning rate after.
+The model was trained on a TPUv3-8 VM, sponsored by the [Google TPU Research Cloud](https://sites.research.google/trc/about/), for 270k steps (a bit over 1 epoch, batch size 512) with a sequence length of 128, continuing for 180k steps (batch size 64) with a sequence length of 512. The optimizer used was Adafactor (to save memory). The learning rate was 2e-4 with \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.98\\) and \\(\epsilon = 1e-6\\), warmed up over 2500 steps and decayed linearly afterwards.

## Evaluation results

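As an illustration of the schedule described in the updated pretraining paragraph, here is a minimal sketch using optax; this is an assumption, not the authors' training script, which is not part of this commit. The learning rate, warmup length, and step counts are taken from the README; Adafactor's factored second-moment statistics do not map one-to-one onto the listed \\(\beta_{2}\\) and \\(\epsilon\\), so those are omitted.

```python
# Illustrative sketch only, not the authors' training code.
# Assumes JAX + optax; values are taken from the README paragraph above.
import optax

PEAK_LR = 2e-4
WARMUP_STEPS = 2_500
TOTAL_STEPS = 270_000  # first phase (sequence length 128); whether the
                       # 180k-step 512-length phase reuses the same schedule
                       # is an assumption

# Linear warmup to the peak learning rate, then linear decay to zero.
warmup = optax.linear_schedule(
    init_value=0.0, end_value=PEAK_LR, transition_steps=WARMUP_STEPS)
decay = optax.linear_schedule(
    init_value=PEAK_LR, end_value=0.0,
    transition_steps=TOTAL_STEPS - WARMUP_STEPS)
schedule = optax.join_schedules([warmup, decay], boundaries=[WARMUP_STEPS])

# Adafactor is used to save optimizer memory, as stated in the README.
optimizer = optax.adafactor(learning_rate=schedule)
```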