## Evaluation

### Running evaluation runs

Each pre-trained model was evaluated by fine-tuning on summarization and translation. The learning rate was set to a constant schedule after a small warmup of 32 steps. Fine-tuning for evaluation was done on a limited set of 50K examples from the fine-tuning datasets.

|                  | Summarization    | Translation       |
|-----------------:|------------------|-------------------|
| Dataset          | CNN Dailymail NL | CCMatrix en -> nl |
| #train samples   | 50K              | 50K               |
| Optimizer        | AdamW            | AdamW             |
| learning rate    | 0.001            | 0.0005            |
| source length    | 1024             | 128               |
| target length    | 142              | 128               |
| #eval samples    | 1000             | 1000              |
| wandb link       | [eval_summ](https://wandb.ai/yepster/eval_dutch_cnndaily_202302_flax) | [eval_transl](https://wandb.ai/yepster/eval_dutch_ccmatrix_202302_flax) |

The graph below shows the Rouge1 score for the summarization runs, evaluated after 25K and 50K examples on the [CNN Dailymail Dutch](https://huggingface.co/datasets/yhavinga/cnn_dailymail_dutch) dataset:

![Rouge1 summarization](eval_summ_rouge1_202302.png)

* The Flan models perform well on the summarization task almost instantly, with `flan-t5-small` showing performance comparable to the Dutch T5 base models.
* After 50K examples, the `ul2` models exhibit performance similar to the `flan` models.
* I am surprised by the consistently poor scores of the `long-t5` runs. I retried fine-tuning these models with `float32` instead of `bfloat16`, but the results were the same. Perhaps this is normal behaviour for these models, which are targeted at longer sequence lengths.

The graph below shows the Bleu score for the translation runs, evaluated at step 25K and 50K on the [CCMatrix](https://huggingface.co/datasets/yhavinga/ccmatrix_en_nl) dataset, from English to Dutch:

![Bleu score translation](eval_transl_bleu_202302.png)

* For the translation task from English to Dutch, the Dutch+English pre-trained models perform well. The `ul2` pre-trained models are also consistently better than their `Flan`, `T5 Dutch` and `mT5` counterparts.
* As with the summarization task, the `long-t5` models perform poorly, even after 50K examples. I cannot explain this for the translation task: with a sequence length of 128 input and output tokens, the sliding attention window with radius 127 of the `long-t5` models should be able to handle this.

The figure below shows the evaluation scores for most models, with the summarization Rouge1 score on the x-axis (higher is better) and the translation English-to-Dutch Bleu score on the y-axis (higher is better). The point size is proportional to the model size. UL2 models are blue, Flan models red, mT5 green and the other models black.

![Evaluation T5 Dutch English](eval_t5_dutch_english.png)

* For clarity, not all models are shown. `t5-base-36L-dutch-english-cased` has scores comparable to `ul2-large-dutch-english`, but with slower inference. All `long-t5` runs are left out, as well as the `t5-v1.1-large-dutch-cased` model, whose translation fine-tuning diverged.
* Across the board, the models pre-trained on Dutch+English or Dutch converge faster on translation than the other models. I was surprised to see `t5-xl-4l` among the best models on translation, as it has only 4 layers and previous tests showed very poor performance. (In those tests I had forgotten to force the dropout rate to 0.0, and this model is apparently very sensitive to dropout.)
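
As a rough illustration of the fine-tuning setup in the table above, the sketch below builds the constant learning-rate schedule with a 32-step warmup and the AdamW optimizer. It assumes an optax/Flax training loop; the helper name `warmup_constant_schedule` is hypothetical, and only the peak learning rates and warmup length come from the table.

```python
import optax

def warmup_constant_schedule(peak_lr: float, warmup_steps: int = 32):
    """Linear warmup for `warmup_steps`, then a constant learning rate."""
    return optax.join_schedules(
        schedules=[
            optax.linear_schedule(init_value=0.0, end_value=peak_lr,
                                  transition_steps=warmup_steps),
            optax.constant_schedule(peak_lr),
        ],
        boundaries=[warmup_steps],
    )

# Peak learning rates from the table above: 0.001 (summarization), 0.0005 (translation).
summarization_optimizer = optax.adamw(learning_rate=warmup_constant_schedule(0.001))
translation_optimizer = optax.adamw(learning_rate=warmup_constant_schedule(0.0005))
```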
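
The Rouge1 and Bleu scores shown in the plots can be computed with the Hugging Face `evaluate` library. The snippet below is a minimal sketch with hypothetical prediction and reference strings, not the exact evaluation script used for these runs.

```python
import evaluate

rouge = evaluate.load("rouge")          # Rouge1/Rouge2/RougeL for summarization
sacrebleu = evaluate.load("sacrebleu")  # Bleu for en -> nl translation

# Hypothetical model outputs and gold targets, for illustration only.
predictions = ["de kat zat op de mat"]
references = ["de kat lag op de mat"]

rouge_scores = rouge.compute(predictions=predictions, references=references)
bleu_scores = sacrebleu.compute(predictions=predictions,
                                references=[[r] for r in references])

print(f"Rouge1: {rouge_scores['rouge1']:.4f}, Bleu: {bleu_scores['score']:.2f}")
```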