## Training procedure

### Preprocessing
The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 50265. The inputs of
the model are pieces of 512 contiguous tokens that may span over documents. The beginning of a new document is marked
with `<s>` and the end of one with `</s>`.
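The packing step described above can be sketched as follows. This is a minimal illustration, not the actual preprocessing code: the token IDs for `<s>`/`</s>` and the tiny block size are assumptions (the real setup uses blocks of 512 tokens).

```python
# Sketch of packing documents into fixed-length inputs.
# BOS/EOS IDs are assumed (RoBERTa-style); BLOCK is 512 in the real setup.
BOS, EOS = 0, 2
BLOCK = 8  # small value for illustration only

def pack(documents):
    """Concatenate documents with <s>/</s> markers, then cut the stream
    into contiguous blocks that may span document boundaries."""
    stream = []
    for doc in documents:
        stream.append(BOS)   # <s> marks the start of a document
        stream.extend(doc)
        stream.append(EOS)   # </s> marks the end of a document
    # Cut into full blocks; a trailing partial block is dropped here.
    return [stream[i:i + BLOCK] for i in range(0, len(stream) - BLOCK + 1, BLOCK)]
```

Note that a single block can contain the end of one document and the start of the next, which is what "may span over documents" refers to.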
The details of the masking procedure for each sentence are the following:
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `<mask>`.
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the 10% remaining cases, the masked tokens are left as is.
Contrary to BERT, the masking is done dynamically during pretraining (i.e., it changes at each epoch and is not fixed).
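The 15% / 80-10-10 dynamic masking scheme above can be sketched as below. This is an illustrative implementation, not the training code; the `<mask>` token ID and the `-100` ignore-label convention are assumptions.

```python
import random

MASK = 50264    # hypothetical <mask> token id (assumed)
VOCAB = 50265   # vocabulary size from the tokenizer

def dynamic_mask(tokens, rng):
    """Apply the 15% / 80-10-10 masking scheme to a list of token ids.
    Re-run each epoch, so the masked positions change dynamically."""
    out = list(tokens)
    labels = [-100] * len(tokens)  # -100 = position ignored by the loss
    for i, tok in enumerate(tokens):
        if rng.random() < 0.15:        # 15% of tokens are selected
            labels[i] = tok            # the model must predict the original
            r = rng.random()
            if r < 0.8:                # 80%: replace with <mask>
                out[i] = MASK
            elif r < 0.9:              # 10%: random token different from tok
                out[i] = rng.choice([t for t in range(VOCAB) if t != tok])
            # remaining 10%: keep the original token unchanged
    return out, labels
```

Because the selection is redrawn on every call, each epoch sees a different mask over the same text, unlike BERT's static preprocessing-time masking.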
### Pretraining
The model was trained on a Google Cloud Engine TPUv3-8 machine (8 TPU v3 cores, 335 GB of RAM, 1000 GB of hard drive, 96 CPU cores) for 42K steps with a batch size of 128 and a sequence length of 128. The
optimizer used is Adam with a learning rate of 6e-4, \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.98\\) and
\\(\epsilon = 1e-6\\), a weight decay of 0.01, learning rate warmup for 24,000 steps and linear decay of the learning
rate after.
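The learning-rate schedule above can be written out explicitly. A minimal sketch, assuming the decay runs linearly to zero at the 42K-step total (the card only says "linear decay", so the end point is an assumption):

```python
PEAK_LR = 6e-4    # peak learning rate from the card
WARMUP = 24_000   # warmup steps from the card
TOTAL = 42_000    # total training steps (decay-to-zero end point assumed)

def lr_at(step):
    """Linear warmup to PEAK_LR over WARMUP steps, then linear decay
    toward zero at TOTAL steps."""
    if step < WARMUP:
        return PEAK_LR * step / WARMUP
    return PEAK_LR * max(0.0, (TOTAL - step) / (TOTAL - WARMUP))
```

For example, halfway through warmup the rate is half the peak, and it reaches its maximum exactly at step 24,000 before decaying.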