Pre-training Dutch T5 models

TL;DR: see below for the list of pre-trained Dutch and Dutch+English models.

A few months ago, I was given access to Google's TPU Research Cloud (TRC). My goal was to train several Dutch and Dutch+English T5 models, limited to model sizes that can run on a single GPU. T5 is a text-to-text transfer transformer, a neural network model with natural language text as input and output. It can be fine-tuned on a wide range of tasks.

Background on Google's TPU-VM and how to use the Huggingface transformers library to pre-train models can be found at the following pages:

This project is a continuation of the work I performed together with Dat Nguyen during the Flax/JAX Community Week to create a T5 model pre-trained from scratch on Dutch.

Pre-training

mC4 dataset

The multilingual C4 (mC4) dataset was created by the original T5 authors. It was prepared and released by AllenNLP on the Huggingface Dataset hub. Our team cleaned Dutch mC4 with code adapted from the C4 TensorFlow dataset, and used the resulting text files in the pre-training scripts. We also verified that Dutch C4 was deduplicated.

To be able to easily reuse this dataset for more pre-training sessions with Huggingface's scripts, a Huggingface dataset was created: mc4_nl_cleaned. For Dutch and English training, a couple of additional configs were added to the generation script. These configs produce interleaved Dutch and English texts with a 1:1 ratio. For instance, the micro_en_nl config mixes Dutch with English samples. The cleaned English C4 dataset is about 5 times larger (in compressed bytes) than the Dutch part, so 1:1 interleaving with Dutch discards about 80% of English C4. The full cleaned Dutch mC4 dataset is 151GB and is still (as of June '22) the largest cleaned Dutch corpus available on the HF Hub.
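As a minimal sketch, loading one of these configs with the Huggingface datasets library looks roughly like this (the hub id "yhavinga/mc4_nl_cleaned" is an assumption; adjust it if the dataset lives under a different namespace):

```python
from datasets import load_dataset

# "micro_en_nl" interleaves Dutch and English samples at a 1:1 ratio.
dataset = load_dataset("yhavinga/mc4_nl_cleaned", "micro_en_nl", split="train")
print(dataset[0]["text"][:200])
```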

Unsupervised Training Objective

The Dutch and Dutch+English T5 models are pre-trained with the masked language modeling (MLM) "span corruption" objective. During pre-training, 15% of the tokens are masked and each span of masked tokens is replaced by a sentinel token.
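To illustrate, span corruption turns a sentence into an input/target pair like the hand-made example below (not output of the actual pre-processing code), with <extra_id_N> sentinel tokens marking the masked spans:

```python
# Hand-made illustration of the T5 span-corruption objective.
original = "De kat zat op de mat in de zon"

# Roughly 15% of the tokens are masked; each contiguous masked span is
# replaced by a single sentinel token in the input, and the target lists
# the dropped-out spans separated by the same sentinel tokens.
inputs  = "De kat <extra_id_0> de mat in de <extra_id_1>"
targets = "<extra_id_0> zat op <extra_id_1> zon <extra_id_2>"
```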

Why are some models trained for multiple epochs on a smaller config?

When I was using an old version of the Flax T5 MLM pretraining script, I noticed that the per-batch training speed seemed slower at the beginning of epochs when a larger dataset config was used. Also, on large configs, batch shuffling would fail with a TPU out-of-memory error. For these reasons, I started experimenting with training for more epochs on smaller configs.

This should be fine. In the original T5 paper, downstream performance after training on 2^35 tokens was compared with training for multiple epochs on a smaller subset: 64 repeats of a 2^29-token subset did not result in degraded downstream performance. The model yhavinga/t5-v1_1-base-dutch-english-cased is trained on the small config for 10 epochs.

In the end, a change to the pre-training script to perform batch shuffling (permuting an array) on the CPU instead of on the accelerator device solved both problems, and larger configs could be used without further trouble.
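A sketch of the idea behind that change (names and structure are illustrative, not the actual script): compute the shuffled batch order with NumPy on the host, so only small index arrays ever reach the device.

```python
import numpy as np

def shuffled_batch_indices(num_samples: int, batch_size: int, seed: int) -> np.ndarray:
    """Host-side (CPU) permutation of sample indices, reshaped into batches."""
    rng = np.random.default_rng(seed)
    steps_per_epoch = num_samples // batch_size
    perm = rng.permutation(num_samples)[: steps_per_epoch * batch_size]
    return perm.reshape(steps_per_epoch, batch_size)  # one row of indices per batch
```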

Which optimizer and lr to use

During the Flax/JAX Community Week we quickly settled on Adafactor with learning rate 5e-3. I was sure that with more time a better setting could be found. After performing seven sweeps with Adafactor, AdamW and Distributed Shampoo (the experimental PJIT version from DALL-E mini), I gave up on finding better settings. The graph below shows the runs from all seven sweeps combined. Apologies for the legend: the optimizer cannot be shown in it, because the initial version of the training script had --adafactor as a boolean flag, which I later changed to a string holding the optimizer name. All runs in the graph below that get the loss below 4 use Adafactor. Peach-sweep-6 is dashed orange and has learning rate 5e-3.

Adafactor vs Adam vs Shampoo

While there probably is a setting that would allow Adam and Shampoo to also converge quickly below loss 4.0, I was unable to find it. In a recent tweet, Lucas Nestler had more success with Shampoo (https://twitter.com/_clashluke/status/1535994026876252160), so maybe I need to revisit the attempt with the latest upstream code bases.
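For reference, the setting that worked, expressed with optax (a sketch; the actual training script may construct the optimizer differently):

```python
import optax

# Adafactor with the learning rate that consistently brought the loss below 4.
optimizer = optax.adafactor(learning_rate=5e-3)
```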

Bfloat16 datatype and learning rate schedule

I had some additional options in the pre-training script that I wanted to use. An exponential-decay learning rate schedule would allow me to pre-train for as long as desired, instead of for a fixed number of steps. I was also keen to pre-train with bfloat16, for its reduced memory footprint and higher speed. This failed. The graph below shows the different attempts, with the legend listing the optimizer, dtype, learning rate, total batch size and lr schedule used to train t5-small-24L-dutch-english.

Bfloat16 vs Float32

In the end, all models released on the hub were trained with Flax in float32. For reference, I've run Stas Bekman's script that detects whether a model was pre-trained in bf16, fp16 or fp32.


     
| name | abs min | abs max |
|------|-----------|-----------|
| yhavinga/t5-base-dutch | 1.757e-09 | 6.792e+01 |
| yhavinga/t5-v1.1-base-dutch-uncased | 1.218e-09 | 6.708e+02 |
| yhavinga/t5-v1.1-base-dutch-cased | 3.009e-09 | 8.821e+02 |
| yhavinga/t5-v1.1-large-dutch-cased | 0.000e+00 | 5.053e+03 |
| yhavinga/t5-v1_1-base-dutch-english-cased | 5.140e-09 | 3.111e+03 |
| yhavinga/t5-v1_1-base-dutch-english-cased-1024 | 9.359e-10 | 1.308e+02 |
| yhavinga/t5-small-24L-dutch-english | 1.577e-09 | 1.276e+02 |
| yhavinga/t5-xl-4L-dutch-english-cased | 3.234e-11 | 3.986e+01 |
| yhavinga/t5-base-36L-dutch-english-cased | 2.409e-10 | 6.104e+01 |
| yhavinga/t5-eff-xl-8l-dutch-english-cased | 5.530e-10 | 8.912e+02 |
| yhavinga/t5-eff-large-8l-dutch-english-cased | 1.086e-10 | 5.128e+02 |
| yhavinga/t5-base-36L-ccmatrix-multi | 1.715e-11 | 3.746e+01 |
| yhavinga/t5-small-24L-ccmatrix-multi | 7.086e-10 | 1.053e+02 |
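For completeness, an exponential-decay schedule of the kind mentioned at the start of this section can be sketched with optax as follows; the warmup length, decay rate and transition steps below are illustrative, not the values used for the released models.

```python
import optax

# Warmup followed by exponential decay, plugged into Adafactor.
schedule = optax.warmup_exponential_decay_schedule(
    init_value=0.0,
    peak_value=5e-3,
    warmup_steps=10_000,
    transition_steps=100_000,
    decay_rate=0.8,
)
optimizer = optax.adafactor(learning_rate=schedule)
```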

Fine-tuning

Training t5-base-36L-dutch-english

The following image shows the loss curves of the sessions in which I was trying to find the right combination of total batch size (adjusted via gradient accumulation), learning rate and datatype. Unfortunately, again I could not find a good setting for bfloat16. The three green runs are the ones that end up in t5-base-36L-dutch-english. The numbers shown are the learning rate, dtype and total batch size.

t5 base 36L training losses
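Adjusting the total batch size through gradient accumulation can be sketched in optax with MultiSteps (illustrative numbers; not necessarily how the training script implements it):

```python
import optax

# Accumulate gradients over 4 steps, so an on-device batch of 32 behaves
# like a total batch size of 128.
inner = optax.adafactor(learning_rate=5e-3)
optimizer = optax.MultiSteps(inner, every_k_schedule=4)
```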

Evaluation

Optimizer and learning rate for summarization

Fine-tuning for summarization requires more memory than for translation, due to the longer sequence lengths involved. I wondered whether I could use Adafactor instead of Adam and ran a sweep to test this. The sweep was configured with Hyperband, so not all training runs ran to completion.
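A sketch of such a sweep definition with Weights & Biases and Hyperband early termination (the entries under "parameters" are illustrative and depend on the training script's arguments):

```python
import wandb

sweep_config = {
    "method": "bayes",
    "metric": {"name": "eval/loss", "goal": "minimize"},
    "early_terminate": {"type": "hyperband", "min_iter": 3},
    "parameters": {
        "optimizer": {"values": ["adam", "adafactor"]},
        "learning_rate": {"values": [1e-4, 3e-4, 7e-4, 1e-3, 3e-3]},
    },
}
sweep_id = wandb.sweep(sweep_config, project="t5-summarization-sweep")
```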

Optimizer Learning rate for summarization

The training losses are graphed below:

Training losses for summarization sweep

While the Adafactor run with learning rate 7e-4 came close to the Adam runs, the consistent stability of training with Adam made me stick with Adam as the optimizer for the evaluation runs on the various models. For translation the results were similar, though in the end I needed to configure a lower learning rate for all models to converge during fine-tuning.

Running evaluation runs

The original T5 paper evaluated by fine-tuning on downstream tasks with a constant learning rate of 0.001. According to the sweep, 0.001 indeed works nicely with the Adam optimizer for summarization. A single model evaluation consisted of fine-tuning the model, followed by running predictions and calculating metrics on the test split. Fine-tuning for evaluation was done on a limited set of examples from the fine-tuning datasets, with the settings listed in the table below.

|  | Summarization | Translation |
|---|---|---|
| Dataset | CNN Dailymail NL | CCMatrix en -> nl |
| #Samples | 50K | 50K |
| Optimizer | Adam | Adam |
| learning rate | 0.001 | 0.0005 |
| source length | 1024 | 128 |
| target length | 142 | 128 |
| #eval samples | 1000 | 1000 |
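Expressed with the transformers Trainer API, the summarization column above corresponds roughly to the following (a sketch; the per-device batch size is illustrative and the source length is applied during tokenization):

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="eval-summarization",
    learning_rate=1e-3,              # constant lr, as in the T5 paper
    lr_scheduler_type="constant",
    per_device_train_batch_size=8,   # illustrative; not listed in the table
    predict_with_generate=True,
    generation_max_length=142,       # target length for CNN Dailymail NL
)
# max_source_length=1024 is handled while tokenizing the articles.
```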

The graph below shows the train loss curves for the summarization runs:

Train loss evaluation T5 summarization

The graph below shows the train loss curves for the translation runs:

Train loss evaluation T5 translation

The figure below shows the evaluation scores, where the x-axis shows the translation Bleu score (higher is better) and the y-axis the summarization Rouge1 score (higher is better). Point size is proportional to the model size. Models with faster inference speed are plotted in green, models with slower inference speed in blue.

Evaluation T5 Dutch English

While it is clear that the model t5-base-36L-dutch-english-cased (with 729M parameters) has the best scores, it is also among the slowest models. The model t5-eff-large-8l-dutch-english-cased (with 335M parameters) has the second-best training loss after 390 steps in both tasks, but with 4 times faster inference. Surprising is the difference between t5-v1_1-base-dutch-english-cased and t5-v1_1-base-dutch-english-cased-1024, most notably on the summarization task. This might be due to the difference in pre-training sequence length:

Sequence length 512 or 1024

The models t5-v1_1-base-dutch-english-cased and t5-v1_1-base-dutch-english-cased-1024 have the same model dimensions, but are pre-trained on different sequence lengths: 512 and 1024 respectively. The evaluation loss and accuracy of the two models do not look too different. Since training of the 1024 sequence length model was very slow and didn't converge as fast, I stopped it early. The figure below shows the evaluation loss and accuracy.

T5 v1.1 base Dutch-English evaluation loss and accuracy

The 512 sequence length model was trained for 10 epochs of the small nl+en config (186B tokens total) and the 1024 sequence length model for about 2 epochs of the large nl+en config (100B tokens total). While I expected both models to perform similarly on downstream tasks, the 1024 sequence length model has better scores for both summarization and translation.

Some final notes:

Acknowledgements

This project would not have been possible without compute generously provided by Google through the TPU Research Cloud. The HuggingFace 🤗 ecosystem was instrumental in all parts of the training. Weights & Biases made it possible to keep track of many training sessions and orchestrate hyper-parameter sweeps with insightful visualizations.

Created by Yeb Havinga

Pre-trained Dutch and Dutch+English T5 models

Three types of T5 models have been trained. t5-base-dutch is the only model with an original T5 config. The other model types, t5-v1.1 and t5-eff, have gated-gelu instead of relu as activation function, and were trained with a dropout of 0.0 unless training would diverge (t5-v1.1-large-dutch-cased). The t5-eff models differ in their number of layers. The table below lists the dimensions of these models. Not all t5-eff models are efficient; the best example is the inefficient t5-xl-4L-dutch-english-cased.

|  | t5-base-dutch | t5-v1.1-base-dutch-uncased | t5-v1.1-base-dutch-cased | t5-v1.1-large-dutch-cased | t5-v1_1-base-dutch-english-cased | t5-v1_1-base-dutch-english-cased-1024 | t5-small-24L-dutch-english | t5-xl-4L-dutch-english-cased | t5-base-36L-dutch-english-cased | t5-eff-xl-8l-dutch-english-cased | t5-eff-large-8l-dutch-english-cased |
|---|---|---|---|---|---|---|---|---|---|---|---|
| type | t5 | t5-v1.1 | t5-v1.1 | t5-v1.1 | t5-v1.1 | t5-v1.1 | t5 eff | t5 eff | t5 eff | t5 eff | t5 eff |
| d_model | 768 | 768 | 768 | 1024 | 768 | 768 | 512 | 2048 | 768 | 1024 | 1024 |
| d_ff | 3072 | 2048 | 2048 | 2816 | 2048 | 2048 | 1920 | 5120 | 2560 | 16384 | 4096 |
| num_heads | 12 | 12 | 12 | 16 | 12 | 12 | 8 | 32 | 12 | 32 | 16 |
| d_kv | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 128 | 64 |
| num_layers | 12 | 12 | 12 | 24 | 12 | 12 | 24 | 4 | 36 | 8 | 8 |
| num parameters | 223M | 248M | 248M | 783M | 248M | 248M | 250M | 585M | 729M | 1241M | 335M |
| feed_forward_proj | relu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu |
| dropout | 0.1 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 |
| dataset | mc4_nl_cleaned | mc4_nl_cleaned full | mc4_nl_cleaned full | mc4_nl_cleaned | mc4_nl_cleaned small_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl |
| tr. seq len | 512 | 1024 | 1024 | 512 | 512 | 1024 | 512 | 512 | 512 | 512 | 512 |
| batch size | 128 | 64 | 64 | 64 | 128 | 64 | 128 | 512 | 512 | 64 | 128 |
| total steps | 527500 | 1014525 | 1210154 | 1120k/2427498 | 2839630 | 1520k/3397024 | 851852 | 212963 | 212963 | 538k/1703705 | 851850 |
| epochs | 1 | 2 | 2 | 2 | 10 | 4 | 1 | 1 | 1 | 1 | 1 |
| duration | 2d9h | 5d5h | 6d6h | 8d13h | 11d18h | 9d1h | 4d10h | 6d1h | 17d15h | 4d19h | 3d23h |
| optimizer | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor |
| lr | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.009 | 0.005 | 0.005 |
| warmup | 10000 | 10000 | 10000 | 10000 | 10000 | 5000 | 20000 | 2500 | 1000 | 1500 | 1500 |
| eval loss | 1.38 | 1.20 | 0.96 | 1.07 | 1.11 | 1.13 | 1.18 | 1.27 | 1.05 | 1.3019 | 1.15 |
| eval acc | 0.70 | 0.73 | 0.78 | 0.76 | 0.75 | 0.74 | 0.74 | 0.72 | 0.76 | 0.71 | 0.74 |
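Any of the checkpoints above can be loaded for fine-tuning with the transformers library, for example (a sketch using the PyTorch classes; Flax equivalents exist as well):

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("yhavinga/t5-base-dutch")
model = T5ForConditionalGeneration.from_pretrained("yhavinga/t5-base-dutch")
```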

Fine-tuned translation models

The models t5-small-24L-dutch-english and t5-base-36L-dutch-english have been fine-tuned for both language directions on the first 25M samples from CCMatrix, giving a total of 50M training samples. Evaluation is performed on out-of-sample CCMatrix and also on Tatoeba and Opus Books. The _bp columns list the brevity penalty. The avg_bleu score is the Bleu score averaged over all three evaluation datasets. The best scores are displayed in bold for both translation directions.

|  | t5-base-36L-ccmatrix-multi | t5-base-36L-ccmatrix-multi | t5-small-24L-ccmatrix-multi | t5-small-24L-ccmatrix-multi |
|---|---|---|---|---|
| source_lang | en | nl | en | nl |
| target_lang | nl | en | nl | en |
| source_prefix | translate English to Dutch: | translate Dutch to English: | translate English to Dutch: | translate Dutch to English: |
| ccmatrix_bleu | 56.8 | 62.8 | **57.4** | **63.1** |
| tatoeba_bleu | **46.6** | **52.8** | 46.4 | 51.7 |
| opus_books_bleu | **13.5** | **24.9** | 12.9 | 23.4 |
| ccmatrix_bp | 0.95 | 0.96 | 0.95 | 0.96 |
| tatoeba_bp | 0.97 | 0.94 | 0.98 | 0.94 |
| opus_books_bp | 0.8 | 0.94 | 0.77 | 0.89 |
| avg_bleu | **38.96** | **46.86** | 38.92 | 46.06 |
| max_source_length | 128 | 128 | 128 | 128 |
| max_target_length | 128 | 128 | 128 | 128 |
| adam_beta1 | 0.9 | 0.9 | 0.9 | 0.9 |
| adam_beta2 | 0.997 | 0.997 | 0.997 | 0.997 |
| weight_decay | 0.05 | 0.05 | 0.002 | 0.002 |
| lr | 5e-05 | 5e-05 | 0.0005 | 0.0005 |
| label_smoothing_factor | 0.15 | 0.15 | 0.1 | 0.1 |
| train_batch_size | 128 | 128 | 128 | 128 |
| warmup_steps | 2000 | 2000 | 2000 | 2000 |
| total steps | 390625 | 390625 | 390625 | 390625 |
| duration | 4d 5h | 4d 5h | 3d 2h | 3d 2h |
| num parameters | 729M | 729M | 250M | 250M |
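A usage sketch for the fine-tuned translation models, using the task prefix from the table above (generation settings are illustrative):

```python
from transformers import pipeline

translate = pipeline("text2text-generation", model="yhavinga/t5-base-36L-ccmatrix-multi")
result = translate("translate English to Dutch: The cat sat on the mat.", max_length=128)
print(result[0]["generated_text"])
```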