Evaluation
Running evaluation runs
Each pre-trained model was evaluated by fine-tuning it on summarization and translation. The learning rate was held constant after a short warmup of 32 steps. Fine-tuning for evaluation used a limited set of 50K examples from each fine-tuning dataset.
| | Summarization | Translation |
|---|---|---|
| Dataset | CNN Dailymail NL | CCMatrix en -> nl |
| #train samples | 50K | 50K |
| Optimizer | AdamW | AdamW |
| learning rate | 0.001 | 0.0005 |
| source length (tokens) | 1024 | 128 |
| target length (tokens) | 142 | 128 |
| #eval samples | 1000 | 1000 |
| wandb link | eval_summ | eval_transl |
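
The learning-rate schedule described above can be sketched in a few lines. This is a minimal illustration, not the actual training code: the function name `constant_with_warmup` is hypothetical, and the linear shape of the warmup is an assumption, since the text only states that there is a 32-step warmup followed by a constant rate.

```python
def constant_with_warmup(step: int, base_lr: float, warmup_steps: int = 32) -> float:
    """Learning rate at a given (0-indexed) optimizer step.

    Assumption: the warmup ramps linearly from 0 to base_lr.
    After the warmup the rate stays constant at base_lr.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr


# Peak learning rates taken from the table.
SUMMARIZATION_LR = 0.001
TRANSLATION_LR = 0.0005

if __name__ == "__main__":
    for step in (0, 16, 32, 1000):
        print(step, constant_with_warmup(step, SUMMARIZATION_LR))
```

With only 32 warmup steps, nearly all fine-tuning happens at the peak rate, so differences between models are not masked by a decaying schedule.
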
The graph below shows the Rouge1 score for the summarization runs, evaluated after 25K and 50K examples on the CNN Dailymail Dutch dataset:
- Flan models perform well on the summarization task almost immediately, with `flan-t5-small` showing performance comparable to the Dutch T5 base models.
- After 50K examples, the `ul2` models exhibit performance similar to the `flan` models.
- I am surprised by the consistently bad scores for the `long-t5` runs. I've retried the fine-tuning of these models with `float32` instead of `bfloat16`, but the results were the same. Maybe this is normal behaviour for these models, which are targeted at longer sequence lengths.
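
For readers unfamiliar with the metric, Rouge1 measures unigram overlap between a generated summary and a reference summary. The sketch below is a simplified illustration only: it assumes whitespace tokenization and lowercasing, and the function name `rouge1_f1` is hypothetical. The scores reported here were presumably computed with a standard library implementation, which may tokenize and stem differently.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a predicted and a reference summary."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(rouge1_f1("de kat", "de kat zit op de mat"))
```
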
The graph below shows the Bleu score for the translation runs, evaluated at step 25K and 50K on the CCMatrix dataset, from English to Dutch:
- For the translation task from English to Dutch, the Dutch+English pre-trained models perform well. Also, the `ul2` pre-trained models are consistently better than their `Flan`, `T5 Dutch` and `mT5` counterparts.
- As with the summarization task, the `long-t5` models show bad performance, even after 50K examples. I cannot explain this for the translation task: with a sequence length of 128 input and output tokens, the sliding attention window with a radius of 127 in the `long-t5` models should be able to handle this.
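
The reasoning about the attention window can be checked with a small sketch. It models the sliding window at token level, where token `i` may attend to token `j` when `|i - j| <= radius`. This is an assumption made for illustration: the actual `long-t5` implementation may organize the window differently (for example in blocks), and the helper names below are hypothetical.

```python
from typing import List


def local_attention_mask(seq_len: int, radius: int) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j."""
    return [[abs(i - j) <= radius for j in range(seq_len)] for i in range(seq_len)]


def covers_full_sequence(seq_len: int, radius: int) -> bool:
    """True when the local window already lets every token see every other token."""
    return all(all(row) for row in local_attention_mask(seq_len, radius))


if __name__ == "__main__":
    # Sequence length 128: the largest distance between two tokens is 127.
    print(covers_full_sequence(128, 127))
    # One token longer and the first and last token no longer see each other.
    print(covers_full_sequence(129, 127))
```

Under this simplified model, a radius of 127 is equivalent to full attention for sequences of 128 tokens, which is why the poor translation scores are hard to attribute to the window size alone.
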
The figure below shows the evaluation scores for most models, with summarization Rouge1 on the x-axis (higher is better), and translation English to Dutch Bleu score on the y-axis (higher is better). The point size is proportional to the model size. UL2 models are blue, Flan models red, mT5 green and the other models black.
- For clarity, not all models are shown. `t5-base-36L-dutch-english-cased` is a model with scores comparable to `ul2-large-dutch-english`, but with slower inference. All long-t5 runs are left out, as well as the `t5-v1.1-large-dutch-cased` model, whose translation fine-tuning diverged.
- Across the board, for translation the models pre-trained on Dutch+English or Dutch converge faster than the other models.
I was surprised to see `t5-xl-4l` among the best models on translation, as it has only 4 layers, and previous tests showed very bad performance. (In those tests I had forgotten to force the dropout rate to 0.0, and apparently this model is very sensitive to dropout.)