
Miscellaneous remarks

  • Use loss regularization if you train with bfloat16 (more info below)
  • Beware of the dropout rate: check what it has been set to in a model's config.json. Unless you intend to run many epochs on the same data, it is worth trying a training run without dropout. If you want to compare losses, be sure to set the dropout rate equal across runs. The smaller models can probably always be trained without dropout.
  • For the translation task, I am not sure that a 'deep-narrow' model (e.g. base-nl36) is better than a normal model or even a 'wide-deep' model.
  • Training with more layers is much slower than you would expect from the increase in model size alone. It is also more difficult to get the batch size and learning rate right. Below is a section about finding the right hyperparameters for the base-36L training.
  • The 'larger' models are not only harder to pre-train, but also harder to fine-tune. The optimizer state takes up a lot of memory, and the amount of memory required also depends on the length of the source and target sequences.
  • PyCharm's remote-debugging features are useful for inspecting variables on either a TPU VM or your deep-learning rig.
  • When increasing the batch size, also increase the learning rate: bs * 2 -> lr * sqrt(2) is a good heuristic, but your mileage may vary.
  • Translation evaluation: the low score of the 128-seq-len models on OPUS Books may be due to the brevity penalty, since books may contain sentences longer than 128 tokens.
  • Dataset quality is a key success factor. Do not expect a model to magically turn mediocre data into good output. This holds for the pre-training data as well as for fine-tuning and evaluation.
  • A good BLEU score does not necessarily mean fluent text. Evaluation loss on a large translation dataset might be better suited for model comparison.
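
The loss regularization mentioned in the first remark can be sketched as an auxiliary "z-loss" term on the log partition function, which is the approach T5X takes for bfloat16 stability. This is a minimal NumPy sketch, not the actual T5X implementation, and the coefficient `z_loss=1e-4` is an illustrative assumption:

```python
import numpy as np

def cross_entropy_with_z_loss(logits, onehot_targets, z_loss=1e-4):
    """Per-token cross-entropy with an auxiliary z-loss term.

    The z-loss penalizes large values of the log partition function,
    which keeps logits in a numerically safe range when training in
    bfloat16. The coefficient is illustrative, not a tuned value.
    """
    m = logits.max(axis=-1, keepdims=True)  # subtract max for stability
    log_z = np.squeeze(m, -1) + np.log(np.exp(logits - m).sum(axis=-1))
    log_probs = logits - log_z[..., None]   # log-softmax
    ce = -(onehot_targets * log_probs).sum(axis=-1)
    return ce + z_loss * np.square(log_z)   # regularized loss
```

Setting `z_loss=0.0` recovers plain cross-entropy, so the term can be toggled when comparing losses between runs.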
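
The batch-size heuristic above can be written as a small helper; the base learning rate and batch sizes in the example are illustrative, not values from these training runs:

```python
import math

def scaled_lr(base_lr: float, base_bs: int, new_bs: int) -> float:
    """Scale the learning rate by the square root of the batch-size ratio.

    Heuristic: doubling the batch size suggests multiplying the
    learning rate by sqrt(2). Mileage may vary per model and task.
    """
    return base_lr * math.sqrt(new_bs / base_bs)

# Illustrative values: going from batch size 64 to 128
print(scaled_lr(5e-3, 64, 128))
```

Treat the result as a starting point for a sweep rather than a final value.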