Pre-training Dutch T5 models

TL;DR: see below for the list of pre-trained Dutch and Dutch+English models.

A few months ago, I was given access to Google's TPU Research Cloud (TRC). My goal was to train several Dutch and Dutch+English T5 models, limited to model sizes that can run on a single GPU. T5 is a text-to-text transfer transformer, a neural network model with natural language text as input and output. It can be fine-tuned on a wide range of tasks.

Background on Google's TPU-VM and how to use the Huggingface transformers library to pre-train models can be found at the following pages:

This project is a continuation of the work I performed together with Dat Nguyen during the Flax/JAX Community Week to create a T5 model pre-trained from scratch on Dutch.

Pre-training

mC4 dataset

The multilingual C4 (mC4) dataset was created by the original T5 authors. It was prepared and released by AllenNLP on the Huggingface Dataset hub. Our team cleaned Dutch mC4 with code adapted from the C4 TensorFlow dataset, and used the resulting text files in the pre-training scripts. We also verified that Dutch C4 was deduplicated.

To be able to easily reuse this dataset for more pre-training sessions with Huggingface's scripts, a Huggingface dataset was created: mc4_nl_cleaned. For Dutch and English training, a couple of additional configs were added to the generation script. These configs produce interleaved Dutch and English texts with a 1:1 ratio. For instance, the micro_en_nl config mixes Dutch with English samples. The cleaned English C4 dataset is about 5 times larger (in compressed bytes) than the Dutch part, so 1:1 interleaving with Dutch discards about 80% of English C4. The full cleaned Dutch mC4 dataset is 151GB and is still (as of June '22) the largest cleaned Dutch corpus available on the HF Hub.
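As a minimal sketch, loading one of these configs with the Huggingface datasets library looks roughly like this (the hub id "yhavinga/mc4_nl_cleaned" is an assumption; adjust it if the dataset lives under a different namespace):

```python
from datasets import load_dataset

# "micro_en_nl" interleaves Dutch and English samples at a 1:1 ratio.
dataset = load_dataset("yhavinga/mc4_nl_cleaned", "micro_en_nl", split="train")
print(dataset[0]["text"][:200])
```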

Unsupervised Training Objective

The Dutch and Dutch+English T5 models are pre-trained with the masked language modeling (MLM) "span corruption" objective. During pre-training, 15% of the tokens are masked and each span of masked tokens is replaced by a sentinel token.
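To illustrate, span corruption turns a sentence into an input/target pair like the hand-made example below (not output of the actual pre-processing code), with <extra_id_N> sentinel tokens marking the masked spans:

```python
# Hand-made illustration of the T5 span-corruption objective.
original = "De kat zat op de mat in de zon"

# Roughly 15% of the tokens are masked; each contiguous masked span is
# replaced by a single sentinel token in the input, and the target lists
# the dropped-out spans separated by the same sentinel tokens.
inputs  = "De kat <extra_id_0> de mat in de <extra_id_1>"
targets = "<extra_id_0> zat op <extra_id_1> zon <extra_id_2>"
```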

Why are some models trained for multiple epochs on a smaller config?

When I was using an old version of the Flax T5 MLM pretraining script, I noticed that the per-batch training speed seemed slower at the beginning of epochs when a larger dataset config was used. Also, on large configs, batch shuffling would fail with a TPU out-of-memory error. For these reasons, I started experimenting with training for more epochs on smaller configs.

This should be fine. In the original T5 paper, downstream performance after training on 2^35 tokens was compared with training for multiple epochs on a smaller subset: 64 repeats of a 2^29-token subset did not result in degraded downstream performance. The model yhavinga/t5-v1_1-base-dutch-english-cased is trained on the small config for 10 epochs.

In the end, a change to the pre-training script to perform batch shuffling (permuting an array) on the CPU instead of on the accelerator device solved both problems, and larger configs could be used without further trouble.
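A sketch of the idea behind that change (names and structure are illustrative, not the actual script): compute the shuffled batch order with NumPy on the host, so only small index arrays ever reach the device.

```python
import numpy as np

def shuffled_batch_indices(num_samples: int, batch_size: int, seed: int) -> np.ndarray:
    """Host-side (CPU) permutation of sample indices, reshaped into batches."""
    rng = np.random.default_rng(seed)
    steps_per_epoch = num_samples // batch_size
    perm = rng.permutation(num_samples)[: steps_per_epoch * batch_size]
    return perm.reshape(steps_per_epoch, batch_size)  # one row of indices per batch
```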

Which optimizer and lr to use

During the Flax/JAX Community Week we quickly settled on Adafactor with learning rate 5e-3. I was sure that with more time a better setting could be found. After performing seven sweeps with Adafactor, AdamW and Distributed Shampoo (the experimental PJIT version from DALL-E mini), I gave up on finding better settings. The graph below shows the runs from all seven sweeps combined. Apologies for the legend: the optimizer cannot be shown in it, because the initial version of the training script had --adafactor as a boolean flag, which I later changed to a string holding the optimizer name. All runs in the graph below that get the loss below 4 use Adafactor. Peach-sweep-6 is dashed orange and has learning rate 5e-3.

Adafactor vs Adam vs Shampoo

While there probably is a setting that would allow Adam and Shampoo to also converge quickly below loss 4.0, I was unable to find it. In a recent tweet, Lucas Nestler had more success with Shampoo (https://twitter.com/_clashluke/status/1535994026876252160), so maybe I need to revisit the attempt with the latest upstream code bases.
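For reference, the setting that worked, expressed with optax (a sketch; the actual training script may construct the optimizer differently):

```python
import optax

# Adafactor with the learning rate that consistently brought the loss below 4.
optimizer = optax.adafactor(learning_rate=5e-3)
```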

Bfloat16 datatype and learning rate schedule

I had some additional options in the pre-training script that I wanted to use. An exponential-decay learning rate schedule would allow me to pre-train for as long as desired, instead of for a fixed number of steps. I was also keen to pre-train with bfloat16, for its reduced memory footprint and higher speed. This failed. The graph below shows the different attempts, with the legend listing the optimizer, dtype, learning rate, total batch size and lr schedule used to train t5-small-24L-dutch-english.

Bfloat16 vs Float32

In the end, all models released on the hub were trained with Flax in float32. For reference, I've run Stas Bekman's script that detects whether a model was pre-trained in bf16, fp16 or fp32.


     
| name | abs min | abs max |
|------|-----------|-----------|
| yhavinga/t5-base-dutch | 1.757e-09 | 6.792e+01 |
| yhavinga/t5-v1.1-base-dutch-uncased | 1.218e-09 | 6.708e+02 |
| yhavinga/t5-v1.1-base-dutch-cased | 3.009e-09 | 8.821e+02 |
| yhavinga/t5-v1.1-large-dutch-cased | 0.000e+00 | 5.053e+03 |
| yhavinga/t5-v1_1-base-dutch-english-cased | 5.140e-09 | 3.111e+03 |
| yhavinga/t5-v1_1-base-dutch-english-cased-1024 | 9.359e-10 | 1.308e+02 |
| yhavinga/t5-small-24L-dutch-english | 1.577e-09 | 1.276e+02 |
| yhavinga/t5-xl-4L-dutch-english-cased | 3.234e-11 | 3.986e+01 |
| yhavinga/t5-base-36L-dutch-english-cased | 2.409e-10 | 6.104e+01 |
| yhavinga/t5-eff-xl-8l-dutch-english-cased | 5.530e-10 | 8.912e+02 |
| yhavinga/t5-eff-large-8l-dutch-english-cased | 1.086e-10 | 5.128e+02 |
| yhavinga/t5-base-36L-ccmatrix-multi | 1.715e-11 | 3.746e+01 |
| yhavinga/t5-small-24L-ccmatrix-multi | 7.086e-10 | 1.053e+02 |
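For completeness, an exponential-decay schedule of the kind mentioned at the start of this section can be sketched with optax as follows; the warmup length, decay rate and transition steps below are illustrative, not the values used for the released models.

```python
import optax

# Warmup followed by exponential decay, plugged into Adafactor.
schedule = optax.warmup_exponential_decay_schedule(
    init_value=0.0,
    peak_value=5e-3,
    warmup_steps=10_000,
    transition_steps=100_000,
    decay_rate=0.8,
)
optimizer = optax.adafactor(learning_rate=schedule)
```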

Fine-tuning

Training t5-base-36L-dutch-english

The following image shows the loss curves of the sessions in which I was trying to find the right combination of total batch size (adjusted via gradient accumulation), learning rate and datatype. Unfortunately, again I could not find a good setting for bfloat16. The three green runs are the ones that end up in t5-base-36L-dutch-english. The numbers shown are the learning rate, dtype and total batch size.

t5 base 36L training losses
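Adjusting the total batch size through gradient accumulation can be sketched in optax with MultiSteps (illustrative numbers; not necessarily how the training script implements it):

```python
import optax

# Accumulate gradients over 4 steps, so an on-device batch of 32 behaves
# like a total batch size of 128.
inner = optax.adafactor(learning_rate=5e-3)
optimizer = optax.MultiSteps(inner, every_k_schedule=4)
```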

Evaluation

Optimizer and learning rate for summarization

Fine-tuning for summarization requires more memory than for translation, due to the longer sequence lengths involved. I wondered whether I could use Adafactor instead of Adam and ran a sweep to test this. The sweep was configured with Hyperband, so not all training runs ran to completion.
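A sketch of such a sweep definition with Weights & Biases and Hyperband early termination (the entries under "parameters" are illustrative and depend on the training script's arguments):

```python
import wandb

sweep_config = {
    "method": "bayes",
    "metric": {"name": "eval/loss", "goal": "minimize"},
    "early_terminate": {"type": "hyperband", "min_iter": 3},
    "parameters": {
        "optimizer": {"values": ["adam", "adafactor"]},
        "learning_rate": {"values": [1e-4, 3e-4, 7e-4, 1e-3, 3e-3]},
    },
}
sweep_id = wandb.sweep(sweep_config, project="t5-summarization-sweep")
```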

Optimizer Learning rate for summarization

The training losses are graphed below:

Training losses for summarization sweep

While the Adafactor run with learning rate 7e-4 came close to the Adam runs, the consistent stability of training with Adam made me stick with Adam as the optimizer for the evaluation runs on the various models. For translation the results were similar, though in the end I needed to configure a lower learning rate for all models to converge during fine-tuning.

Running evaluation runs

The original T5 paper evaluated by fine-tuning on downstream tasks with a constant learning rate of 0.001. According to the sweep, 0.001 indeed works nicely with the Adam optimizer for summarization. A single model evaluation consisted of fine-tuning the model, followed by running predictions and calculating metrics on the test split. Fine-tuning for evaluation was done on a limited set of examples from the fine-tuning datasets, with the settings listed in the table below.

|  | Summarization | Translation |
|---|---|---|
| Dataset | CNN Dailymail NL | CCMatrix en -> nl |
| #Samples | 50K | 50K |
| Optimizer | Adam | Adam |
| learning rate | 0.001 | 0.0005 |
| source length | 1024 | 128 |
| target length | 142 | 128 |
| #eval samples | 1000 | 1000 |
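Expressed with the transformers Trainer API, the summarization column above corresponds roughly to the following (a sketch; the per-device batch size is illustrative and the source length is applied during tokenization):

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="eval-summarization",
    learning_rate=1e-3,              # constant lr, as in the T5 paper
    lr_scheduler_type="constant",
    per_device_train_batch_size=8,   # illustrative; not listed in the table
    predict_with_generate=True,
    generation_max_length=142,       # target length for CNN Dailymail NL
)
# max_source_length=1024 is handled while tokenizing the articles.
```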

The graph below shows the train loss curves for the summarization runs:

Train loss evaluation T5 summarization

The graph below shows the train loss curves for the translation runs:

Train loss evaluation T5 translation

The figure below shows the evaluation scores, where the x-axis shows the translation Bleu score (higher is better) and the y-axis the summarization Rouge1 score (higher is better). Point size is proportional to the model size. Models with faster inference speed are plotted in green, models with slower inference speed in blue.

Evaluation T5 Dutch English

While it is clear that the model t5-base-36L-dutch-english-cased (with 729M parameters) has the best scores, it is also among the slowest models. The model t5-eff-large-8l-dutch-english-cased (with 335M parameters) has the second-best training loss after 390 steps in both tasks, but with 4 times faster inference. Surprising is the difference between t5-v1_1-base-dutch-english-cased and t5-v1_1-base-dutch-english-cased-1024, most notably on the summarization task. This might be due to the difference in pre-training sequence length:

Sequence length 512 or 1024

The models t5-v1_1-base-dutch-english-cased and t5-v1_1-base-dutch-english-cased-1024 have the same model dimensions, but are pre-trained on different sequence lengths: 512 and 1024 respectively. The evaluation loss and accuracy of the two models do not look too different. Since training of the 1024 sequence length model was very slow and didn't converge as fast, I stopped it early. The figure below shows the evaluation loss and accuracy.

T5 v1.1 base Dutch-English evaluation loss and accuracy

The 512 sequence length model was trained for 10 epochs of the small nl+en config (186B tokens total) and the 1024 sequence length model for about 2 epochs of the large nl+en config (100B tokens total). While I expected both models to perform similarly on downstream tasks, the 1024 sequence length model has better scores for both summarization and translation.

Some final notes:

Acknowledgements

This project would not have been possible without compute generously provided by Google through the TPU Research Cloud. The HuggingFace 🤗 ecosystem was instrumental in all parts of the training. Weights & Biases made it possible to keep track of many training sessions and orchestrate hyper-parameter sweeps with insightful visualizations.

Created by Yeb Havinga

Pre-trained Dutch and Dutch+English T5 models

Three types of T5 models have been trained. t5-base-dutch is the only model with an original T5 config. The other model types, t5-v1.1 and t5-eff, have gated-gelu instead of relu as activation function, and were trained with a dropout of 0.0 unless training would diverge (t5-v1.1-large-dutch-cased). The t5-eff models differ in their number of layers. The table below lists the dimensions of these models. Not all t5-eff models are efficient; the best example is the inefficient t5-xl-4L-dutch-english-cased.

|  | t5-base-dutch | t5-v1.1-base-dutch-uncased | t5-v1.1-base-dutch-cased | t5-v1.1-large-dutch-cased | t5-v1_1-base-dutch-english-cased | t5-v1_1-base-dutch-english-cased-1024 | t5-small-24L-dutch-english | t5-xl-4L-dutch-english-cased | t5-base-36L-dutch-english-cased | t5-eff-xl-8l-dutch-english-cased | t5-eff-large-8l-dutch-english-cased |
|---|---|---|---|---|---|---|---|---|---|---|---|
| type | t5 | t5-v1.1 | t5-v1.1 | t5-v1.1 | t5-v1.1 | t5-v1.1 | t5 eff | t5 eff | t5 eff | t5 eff | t5 eff |
| d_model | 768 | 768 | 768 | 1024 | 768 | 768 | 512 | 2048 | 768 | 1024 | 1024 |
| d_ff | 3072 | 2048 | 2048 | 2816 | 2048 | 2048 | 1920 | 5120 | 2560 | 16384 | 4096 |
| num_heads | 12 | 12 | 12 | 16 | 12 | 12 | 8 | 32 | 12 | 32 | 16 |
| d_kv | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 128 | 64 |
| num_layers | 12 | 12 | 12 | 24 | 12 | 12 | 24 | 4 | 36 | 8 | 8 |
| num parameters | 223M | 248M | 248M | 783M | 248M | 248M | 250M | 585M | 729M | 1241M | 335M |
| feed_forward_proj | relu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu |
| dropout | 0.1 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 |
| dataset | mc4_nl_cleaned | mc4_nl_cleaned full | mc4_nl_cleaned full | mc4_nl_cleaned | mc4_nl_cleaned small_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl |
| tr. seq len | 512 | 1024 | 1024 | 512 | 512 | 1024 | 512 | 512 | 512 | 512 | 512 |
| batch size | 128 | 64 | 64 | 64 | 128 | 64 | 128 | 512 | 512 | 64 | 128 |
| total steps | 527500 | 1014525 | 1210154 | 1120k/2427498 | 2839630 | 1520k/3397024 | 851852 | 212963 | 212963 | 538k/1703705 | 851850 |
| epochs | 1 | 2 | 2 | 2 | 10 | 4 | 1 | 1 | 1 | 1 | 1 |
| duration | 2d9h | 5d5h | 6d6h | 8d13h | 11d18h | 9d1h | 4d10h | 6d1h | 17d15h | 4d19h | 3d23h |
| optimizer | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor |
| lr | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.009 | 0.005 | 0.005 |
| warmup | 10000 | 10000 | 10000 | 10000 | 10000 | 5000 | 20000 | 2500 | 1000 | 1500 | 1500 |
| eval loss | 1.38 | 1.20 | 0.96 | 1.07 | 1.11 | 1.13 | 1.18 | 1.27 | 1.05 | 1.3019 | 1.15 |
| eval acc | 0.70 | 0.73 | 0.78 | 0.76 | 0.75 | 0.74 | 0.74 | 0.72 | 0.76 | 0.71 | 0.74 |
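Any of the checkpoints above can be loaded for fine-tuning with the transformers library, for example (a sketch using the PyTorch classes; Flax equivalents exist as well):

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("yhavinga/t5-base-dutch")
model = T5ForConditionalGeneration.from_pretrained("yhavinga/t5-base-dutch")
```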

Fine-tuned translation models

The models t5-small-24L-dutch-english and t5-base-36L-dutch-english have been fine-tuned for both language directions on the first 25M samples from CCMatrix, giving a total of 50M training samples. Evaluation is performed on out-of-sample CCMatrix and also on Tatoeba and Opus Books. The _bp columns list the brevity penalty. The avg_bleu score is the Bleu score averaged over all three evaluation datasets. The best scores are displayed in bold for both translation directions.

|  | t5-base-36L-ccmatrix-multi | t5-base-36L-ccmatrix-multi | t5-small-24L-ccmatrix-multi | t5-small-24L-ccmatrix-multi |
|---|---|---|---|---|
| source_lang | en | nl | en | nl |
| target_lang | nl | en | nl | en |
| source_prefix | translate English to Dutch: | translate Dutch to English: | translate English to Dutch: | translate Dutch to English: |
| ccmatrix_bleu | 56.8 | 62.8 | **57.4** | **63.1** |
| tatoeba_bleu | **46.6** | **52.8** | 46.4 | 51.7 |
| opus_books_bleu | **13.5** | **24.9** | 12.9 | 23.4 |
| ccmatrix_bp | 0.95 | 0.96 | 0.95 | 0.96 |
| tatoeba_bp | 0.97 | 0.94 | 0.98 | 0.94 |
| opus_books_bp | 0.8 | 0.94 | 0.77 | 0.89 |
| avg_bleu | **38.96** | **46.86** | 38.92 | 46.06 |
| max_source_length | 128 | 128 | 128 | 128 |
| max_target_length | 128 | 128 | 128 | 128 |
| adam_beta1 | 0.9 | 0.9 | 0.9 | 0.9 |
| adam_beta2 | 0.997 | 0.997 | 0.997 | 0.997 |
| weight_decay | 0.05 | 0.05 | 0.002 | 0.002 |
| lr | 5e-05 | 5e-05 | 0.0005 | 0.0005 |
| label_smoothing_factor | 0.15 | 0.15 | 0.1 | 0.1 |
| train_batch_size | 128 | 128 | 128 | 128 |
| warmup_steps | 2000 | 2000 | 2000 | 2000 |
| total steps | 390625 | 390625 | 390625 | 390625 |
| duration | 4d 5h | 4d 5h | 3d 2h | 3d 2h |
| num parameters | 729M | 729M | 250M | 250M |
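A usage sketch for the fine-tuned translation models, using the task prefix from the table above (generation settings are illustrative):

```python
from transformers import pipeline

translate = pipeline("text2text-generation", model="yhavinga/t5-base-36L-ccmatrix-multi")
result = translate("translate English to Dutch: The cat sat on the mat.", max_length=128)
print(result[0]["generated_text"])
```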