yhavinga committed
Commit 6b8d575 • 1 Parent(s): 840c6e7

Small text updates.

Files changed (2)
  1. README.md +1 -1
  2. app.py +7 -5
README.md CHANGED
@@ -1,5 +1,5 @@
 ---
-title: Pre-training Dutch T5 Models, evaluation and model lists
+title: Pre-training Dutch T5 and UL2 Models, evaluation and model lists
 emoji: 🚀
 colorFrom: blue
 colorTo: pink
app.py CHANGED
@@ -320,10 +320,11 @@ mT5 green and the other models black.
 * For the translation task from English to Dutch, the Dutch+English pre-trained models perform well. Also
 `UL2 Dutch` pre-trained Dutch models are consistently better than their `Flan`, `T5 Dutch` and
 `mT5` counterparts of the comparable size.
-* Fine-tuning of `t5-v1.1-large-dutch-cased` failed with the fixed hyperparameters across all models.
+* Fine-tuning of `t5-v1.1-large-dutch-cased` failed with the hyperparameters that were fixed to the same value for the
+evaluation of every model.
 Since the `UL2` models are better across the board, I've disabled this model on the hub.
 * The `long-t5` models show bad performance on both tasks.
-I cannot explain this the translation task. With a sequence length of 128 input and output
+I cannot explain this, especially for the translation task. With a sequence length of 128 input and output
 tokens, the sliding attention window with radius length 127 of the `long-t5` models should be able to handle this.
 I've retried the fine-tuning of these models with
 `float32` instead of `bfloat16`, but the results were the same. Maybe this is normal behaviour for these models
@@ -388,10 +389,11 @@ mT5 green and the other models black.
 """## Miscellaneous remarks
 
 * Use loss regularization when training with `bfloat16` for better results (more info below).
-* Be cautious of the dropout rate in the config.json file and consider training without it.
+* Be cautious of the dropout rate in the config.json file, as besides learning rate it is probably the most important
+hyperparameter.
+If you are evaluating different pre-trained models, be sure to fine-tune with dropout set equal.
 Check in a model's `config.json` what the dropout rate has been set to. Unless you
 intend to run many epochs on the same data, its worth to try a training run without dropout.
-If you want to compare losses, be sure to set the dropout rate equal.
 The smaller models can probably always be trained without.
 * Training with more layers is much slower than you'd expect from the increased model size.
 It is also more difficult to get batch size and learning rate right. Below is a section
@@ -628,7 +630,7 @@ I am grateful to the [https://huggingface.co/Finnish-NLP](Finnish-NLP) authors f
 definitions, and to [Stephenn Fernandes](https://huggingface.co/StephennFernandes) for his support in getting me started with the T5X framework.
 Lastly, I want to express my gratitude to Google for their openness and generosity in releasing T5X and related repositories.
 
-Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)
+Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/).
 Some of the sentences were reworded by ChatGPT.
 """
 )
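The `float32` retry mentioned in the first hunk comes down to changing only the parameter dtype of the fine-tuning run. Below is a minimal sketch of that switch using Hugging Face `transformers` in PyTorch rather than the author's T5X setup; the checkpoint id and the translation prompt are placeholders for illustration, not the Dutch `long-t5` models from the diff.

```python
# Sketch only: a bfloat16 run versus a float32 retry for a long-t5 checkpoint.
# The checkpoint id below is a placeholder, not one of the models discussed above.
import torch
from transformers import AutoTokenizer, LongT5ForConditionalGeneration

checkpoint = "google/long-t5-local-base"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
batch = tokenizer(
    ["translate English to Dutch: The weather is nice today."],
    max_length=128,  # same 128-token budget as in the experiments above
    truncation=True,
    return_tensors="pt",
)

# bfloat16 weights: the setting used in the original fine-tuning runs.
model_bf16 = LongT5ForConditionalGeneration.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16
)

# float32 retry: identical code path, only the dtype changes.
model_fp32 = LongT5ForConditionalGeneration.from_pretrained(
    checkpoint, torch_dtype=torch.float32
)
```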
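The loss-regularization bullet in the second hunk is only hinted at here ("more info below"). One common form of such regularization is an auxiliary z-loss, as used in T5X-style training, which keeps the softmax normalizer from drifting when logits are computed in low precision. A small PyTorch sketch follows; the function name and the 1e-4 coefficient are illustrative assumptions, not values taken from the commit.

```python
import torch
import torch.nn.functional as F

def cross_entropy_with_z_loss(logits, labels, z_loss_coef=1e-4, ignore_index=-100):
    """Token-level cross-entropy plus an auxiliary z-loss term.

    The z-loss penalizes (log Z)^2, where log Z is the logsumexp over the
    vocabulary, nudging the softmax normalizer towards 1 and stabilizing
    training when logits are computed in bfloat16.
    """
    vocab_size = logits.size(-1)
    ce = F.cross_entropy(
        logits.view(-1, vocab_size), labels.view(-1), ignore_index=ignore_index
    )
    log_z = torch.logsumexp(logits, dim=-1)          # shape: (batch, seq_len)
    mask = (labels != ignore_index).float()           # ignore padded positions
    z_loss = (log_z.pow(2) * mask).sum() / mask.sum().clamp(min=1.0)
    return ce + z_loss_coef * z_loss
```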
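The dropout advice is straightforward to act on, because `dropout_rate` is a plain field in every T5 checkpoint's `config.json`. A minimal sketch with Hugging Face `transformers` that inspects the stored value and overrides it for a fine-tuning run; the repo id is only an example.

```python
from transformers import AutoConfig, T5ForConditionalGeneration

checkpoint = "yhavinga/t5-base-dutch"  # example repo id, substitute the model under evaluation

# Inspect what the checkpoint was saved with.
config = AutoConfig.from_pretrained(checkpoint)
print("stored dropout_rate:", config.dropout_rate)

# Override for this run: disable dropout for a short fine-tune, or pin it to
# one shared value when comparing losses across different pre-trained models.
config.dropout_rate = 0.0
model = T5ForConditionalGeneration.from_pretrained(checkpoint, config=config)
```

Whichever value is chosen, using the same dropout rate for every model keeps the fine-tuning losses comparable, which is the point of the bullet in the diff above.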