
Evaluation

Running evaluation runs

Each pre-trained model was evaluated by fine-tuning on summarization and translation. The learning rate was set to a constant schedule after a short warmup of 32 steps. Fine-tuning for evaluation was done on a limited set of 50K examples from the fine-tuning datasets.

|                | Summarization    | Translation       |
|----------------|------------------|-------------------|
| Dataset        | CNN Dailymail NL | CCMatrix en -> nl |
| #train samples | 50K              | 50K               |
| Optimizer      | AdamW            | AdamW             |
| learning rate  | 0.001            | 0.0005            |
| source length  | 1024             | 128               |
| target length  | 142              | 128               |
| #eval samples  | 1000             | 1000              |
| wandb link     | eval_summ        | eval_transl       |
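
The constant learning rate after a short warmup can be set up in a few lines. The sketch below is illustrative rather than the actual training script: it uses Optax, with the warmup length and the summarization learning rate taken from the table above.

```python
# Minimal sketch (not the actual training script): AdamW with a learning rate
# that ramps up linearly for 32 steps and then stays constant.
import optax

WARMUP_STEPS = 32
LEARNING_RATE = 1e-3   # 0.001 for summarization; 0.0005 was used for translation

schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=LEARNING_RATE,
                              transition_steps=WARMUP_STEPS),
        optax.constant_schedule(LEARNING_RATE),
    ],
    boundaries=[WARMUP_STEPS],
)

optimizer = optax.adamw(learning_rate=schedule)
```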

The graph below shows the Rouge1 score for the summarization runs, evaluated after 25K and 50K examples on the CNN Dailymail Dutch dataset:

Rouge1 summarization

  • The Flan models perform well on the summarization task almost immediately, with flan-t5-small showing performance comparable to the Dutch T5 base models.
  • After 50K examples, the ul2 models exhibit similar performance to the flan models.
  • I am surprised by the consistently poor scores of the long-t5 runs. I retried fine-tuning these models with float32 instead of bfloat16, but the results were the same. Perhaps this is normal behaviour for models targeted at longer sequence lengths.
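
As an aside, a Rouge1 score such as the one plotted above can be computed with the Hugging Face evaluate library. The snippet below is a minimal sketch with placeholder sentences, not the actual evaluation data or script:

```python
# Minimal sketch: Rouge1 between generated and reference summaries.
import evaluate

rouge = evaluate.load("rouge")

predictions = ["de kat zat op de mat"]   # placeholder generated summary
references = ["de kat lag op de mat"]    # placeholder reference summary

scores = rouge.compute(predictions=predictions, references=references)
print(scores["rouge1"])
```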

The graph below shows the Bleu score for the translation runs, evaluated at step 25K and 50K on the CCMatrix dataset, from English to Dutch:

Bleu score translation

  • For the translation task from English to Dutch, the Dutch+English pre-trained models perform well. The ul2 pre-trained models are also consistently better than their Flan, T5 Dutch and mT5 counterparts.
  • As with the summarization task, the long-t5 models show poor performance, even after 50K examples. I cannot explain this for the translation task: with input and output sequence lengths of 128 tokens, the sliding attention window of the long-t5 models, with a radius of 127 tokens, should be able to handle this.
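
A corpus-level Bleu score like the ones in this graph can be obtained with sacreBLEU through the evaluate library. The sketch below uses placeholder sentences rather than the 1000 CCMatrix evaluation examples:

```python
# Minimal sketch: corpus-level Bleu with sacreBLEU.
import evaluate

sacrebleu = evaluate.load("sacrebleu")

predictions = ["de hond rent door het park"]     # placeholder model translation
references = [["de hond rent door het park"]]    # one list of references per prediction

result = sacrebleu.compute(predictions=predictions, references=references)
print(result["score"])
```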

The figure below shows the evaluation scores for most models, with the summarization Rouge1 score on the x-axis (higher is better) and the English-to-Dutch translation Bleu score on the y-axis (higher is better). Point size is proportional to model size. UL2 models are blue, Flan models red, mT5 models green and the other models black.

Evaluation T5 Dutch English

  • For clarity, not all models are shown. t5-base-36L-dutch-english-cased is a model with scores comparable to ul2-large-dutch-english, but with slower inference. All long-t5 runs are left out, as is the t5-v1.1-large-dutch-cased model, whose translation fine-tuning diverged.
  • Across the board, the models pre-trained on Dutch+English or Dutch converge faster on translation than the other models. I was surprised to see t5-xl-4l among the best models on translation, as it has only 4 layers and previous tests showed very poor performance (in those tests I had forgotten to force the dropout rate to 0.0, and apparently this model is very sensitive to dropout).
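
A scatter plot like this one can be reproduced with matplotlib. The sketch below is illustrative only, with placeholder scores and parameter counts instead of the actual evaluation results:

```python
# Illustrative sketch of the evaluation scatter plot: Rouge1 vs. Bleu,
# point size proportional to model size, one colour per model family.
import matplotlib.pyplot as plt

models = [
    # (name, rouge1, bleu, params in millions, colour) -- placeholder values
    ("ul2-base-dutch", 30.0, 35.0, 250, "blue"),
    ("flan-t5-base",   31.0, 25.0, 250, "red"),
    ("mt5-base",       28.0, 30.0, 580, "green"),
    ("t5-base-dutch",  29.0, 28.0, 220, "black"),
]

fig, ax = plt.subplots()
for name, rouge1, bleu, params, colour in models:
    ax.scatter(rouge1, bleu, s=params, color=colour, alpha=0.6)
    ax.annotate(name, (rouge1, bleu), fontsize=8)

ax.set_xlabel("Summarization Rouge1 (higher is better)")
ax.set_ylabel("Translation en->nl Bleu (higher is better)")
plt.show()
```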