
Evaluation

Running evaluation runs

Each pre-trained model was evaluated by fine-tuning on summarization and translation. The learning rate was set to a constant schedule after a short warmup of 32 steps. Fine-tuning for evaluation was done on a limited set of 50K examples from the fine-tuning datasets.

|                | Summarization    | Translation       |
|----------------|------------------|-------------------|
| Dataset        | CNN Dailymail NL | CCMatrix en -> nl |
| #train samples | 50K              | 50K               |
| Optimizer      | AdamW            | AdamW             |
| learning rate  | 0.001            | 0.0005            |
| source length  | 1024             | 128               |
| target length  | 142              | 128               |
| #eval samples  | 1000             | 1000              |
| wandb link     | eval_summ        | eval_transl       |
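
The constant learning rate after a short warmup can be set up in a few lines. The sketch below is illustrative rather than the actual training script: it uses Optax, with the warmup length and the summarization learning rate taken from the table above.

```python
# Minimal sketch (not the actual training script): AdamW with a learning rate
# that ramps up linearly for 32 steps and then stays constant.
import optax

WARMUP_STEPS = 32
LEARNING_RATE = 1e-3   # 0.001 for summarization; 0.0005 was used for translation

schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=LEARNING_RATE,
                              transition_steps=WARMUP_STEPS),
        optax.constant_schedule(LEARNING_RATE),
    ],
    boundaries=[WARMUP_STEPS],
)

optimizer = optax.adamw(learning_rate=schedule)
```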

The graph below shows the Rouge1 score for the summarization runs, evaluated after 25K and 50K examples on the CNN Dailymail Dutch dataset:

Rouge1 summarization

  • The Flan models perform well on the summarization task almost immediately, with flan-t5-small showing performance comparable to the Dutch T5 base models.
  • After 50K examples, the ul2 models exhibit similar performance to the flan models.
  • I am surprised by the consistently poor scores of the long-t5 runs. I retried fine-tuning these models with float32 instead of bfloat16, but the results were the same. Perhaps this is normal behaviour for models targeted at longer sequence lengths.
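
As an aside, a Rouge1 score such as the one plotted above can be computed with the Hugging Face evaluate library. The snippet below is a minimal sketch with placeholder sentences, not the actual evaluation data or script:

```python
# Minimal sketch: Rouge1 between generated and reference summaries.
import evaluate

rouge = evaluate.load("rouge")

predictions = ["de kat zat op de mat"]   # placeholder generated summary
references = ["de kat lag op de mat"]    # placeholder reference summary

scores = rouge.compute(predictions=predictions, references=references)
print(scores["rouge1"])
```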

The graph below shows the Bleu score for the translation runs, evaluated at step 25K and 50K on the CCMatrix dataset, from English to Dutch:

Bleu score translation

  • For the translation task from English to Dutch, the Dutch+English pre-trained models perform well. The ul2 pre-trained models are also consistently better than their Flan, T5 Dutch and mT5 counterparts.
  • As with the summarization task, the long-t5 models show poor performance, even after 50K examples. I cannot explain this for the translation task: with input and output sequence lengths of 128 tokens, the sliding attention window of the long-t5 models, with a radius of 127 tokens, should be able to handle this.
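
A corpus-level Bleu score like the ones in this graph can be obtained with sacreBLEU through the evaluate library. The sketch below uses placeholder sentences rather than the 1000 CCMatrix evaluation examples:

```python
# Minimal sketch: corpus-level Bleu with sacreBLEU.
import evaluate

sacrebleu = evaluate.load("sacrebleu")

predictions = ["de hond rent door het park"]     # placeholder model translation
references = [["de hond rent door het park"]]    # one list of references per prediction

result = sacrebleu.compute(predictions=predictions, references=references)
print(result["score"])
```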

The figure below shows the evaluation scores for most models, with the summarization Rouge1 score on the x-axis (higher is better) and the English-to-Dutch translation Bleu score on the y-axis (higher is better). Point size is proportional to model size. UL2 models are blue, Flan models red, mT5 models green and the other models black.

Evaluation T5 Dutch English

  • For clarity, not all models are shown. t5-base-36L-dutch-english-cased is a model with scores comparable to ul2-large-dutch-english, but with slower inference. All long-t5 runs are left out, as is the t5-v1.1-large-dutch-cased model, whose translation fine-tuning diverged.
  • Across the board, the models pre-trained on Dutch+English or Dutch converge faster on translation than the other models. I was surprised to see t5-xl-4l among the best models on translation, as it has only 4 layers and previous tests showed very poor performance (in those tests I had forgotten to force the dropout rate to 0.0, and apparently this model is very sensitive to dropout).
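
A scatter plot like this one can be reproduced with matplotlib. The sketch below is illustrative only, with placeholder scores and parameter counts instead of the actual evaluation results:

```python
# Illustrative sketch of the evaluation scatter plot: Rouge1 vs. Bleu,
# point size proportional to model size, one colour per model family.
import matplotlib.pyplot as plt

models = [
    # (name, rouge1, bleu, params in millions, colour) -- placeholder values
    ("ul2-base-dutch", 30.0, 35.0, 250, "blue"),
    ("flan-t5-base",   31.0, 25.0, 250, "red"),
    ("mt5-base",       28.0, 30.0, 580, "green"),
    ("t5-base-dutch",  29.0, 28.0, 220, "black"),
]

fig, ax = plt.subplots()
for name, rouge1, bleu, params, colour in models:
    ax.scatter(rouge1, bleu, s=params, color=colour, alpha=0.6)
    ax.annotate(name, (rouge1, bleu), fontsize=8)

ax.set_xlabel("Summarization Rouge1 (higher is better)")
ax.set_ylabel("Translation en->nl Bleu (higher is better)")
plt.show()
```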