---
language:
- nl
datasets:
- yhavinga/mc4_nl_cleaned
tags:
- seq2seq
- lm-head
license: apache-2.0
inference: false
---

# Work in progress. Dec 2021.

# A collection of Dutch T5 models

* Many thanks to the [Google TPU Research Cloud](https://sites.research.google/trc/about/) for providing access to a TPU cluster!
* Continuation of the work started during the [Hugging Face community week](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104), organized by [HuggingFace](https://huggingface.co/) with TPU usage sponsored by Google, for the project [Pre-train T5 from scratch in Dutch](https://discuss.huggingface.co/t/pretrain-t5-from-scratch-in-dutch/8109).
* Trained with an improved training script: no more exceptions during training, so no restarts are required.
* All models are trained with TensorFlow metrics.
* Thanks to @gsarti for creating the [t5-flax-gcp repository](https://github.com/gsarti/t5-flax-gcp)!

|                         | `t5-base-dutch`           | `t5-v1.1-base-dutch`      | `t5-v1.1-large-dutch-cased`   | `t5-v1.1-base-dutch-uncased`  |
|-------------------------|---------------------------|---------------------------|-------------------------------|-------------------------------|
| `tokenizer`             | `cased`                   | `uncased`                 | `cased`                       | `uncased`                     |
| `source model config`   | `google/t5-base`          | `google/t5-v1_1-base`     | `google/t5-v1_1-large`        | `google/t5-v1_1-base`         |
| `dataset`               | `yhavinga/mc4_nl_cleaned` | `yhavinga/mc4_nl_cleaned` | `yhavinga/mc4_nl_cleaned`     | `yhavinga/mc4_nl_cleaned`     |
| `tpu vm`                | two                       | one                       | three                         | one                           |
| `finished`              |                           | YES                       |                               |                               |
| *Hyperparameters*       |                           |                           |                               |                               |
| `epochs`                | 1                         | 1                         | 4                             | 2                             |
| `per-device batch size` | 16                        | 16                        | 2                             | 8                             |
| `tot. batch size`       | 128                       | 128                       | 16                            | ?                             |
| `steps`                 | 508 976                   | 508 976                   | 8 428 012                     | ?                             |
| `max seq. length`       | 512                       | 512                       | 1024                          | 1024                          |
| `tot. tok. trained on`  | 33B                       | 33B                       | 138B                          | ?                             |
| `optimizer`             | adafactor                 | adafactor                 | adafactor                     | adafactor                     |
| `warmup steps`          | 10000                     | 10000                     | 10000                         | 10000                         |
| `learning rate`         | 0.005                     | 0.005                     | 0.005                         | 0.005                         |
| `weight decay`          | 0.01                      | 0.01                      | 0.01                          | 0.001                         |
| `tie embeds`            | `false`                   | `false`                   | `false`                       | `false`                       |
| `validation split size` | 15K examples              | 15K examples              | 15K examples                  | 15K examples                  |
| *Model config*          |                           |                           |                               |                               |
| `d_ff`                  | 3072                      | 2048                      | 2816                          | 2048                          |
| `d_kv`                  | 64                        | 64                        | 64                            | 64                            |
| `d_model`               | 768                       | 768                       | 1024                          | 768                           |
| `dropout rate`          | 0.1                       | 0.1                       | 0.1 (0.0 during pre-training) | 0.1 (0.0 during pre-training) |
| `ff projection`         | `relu`                    | `gated-gelu`              | `gated-gelu`                  | `gated-relu`                  |
| `num decoder layers`    | 12                        | 12                        | 24                            | 12                            |
| `num heads`             | 12                        | 12                        | 16                            | 12                            |
| `num layers`            | 12                        | 12                        | 24                            | 12                            |
| `rel. attn. buckets`    | 32                        | 32                        | 32                            | 32                            |
| `vocab size`            | 32103                     | 32103                     | 32103                         | 32103                         |
| *Training time*         | ~ 100 hours               | 101 hours                 | ~ 370 hours                   | ?                             |
| *Evaluation*            |                           |                           |                               |                               |
| `accuracy`              |                           | 0.6976                    |                               |                               |
| `loss`                  |                           | 1.379                     |                               |                               |
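
The token totals in the table follow from steps × total batch size × maximum sequence length. A quick check, assuming sequences are packed to the maximum length (padding would lower the real count):

```python
# Total tokens seen during pre-training, in billions, assuming every
# sequence is packed to the maximum length (an assumption; padded
# batches would see fewer real tokens).
def total_tokens_billions(steps: int, batch_size: int, seq_len: int) -> float:
    return steps * batch_size * seq_len / 1e9

print(total_tokens_billions(508_976, 128, 512))    # ~33.4 -> the 33B base models
print(total_tokens_billions(8_428_012, 16, 1024))  # ~138.1 -> the 138B large model
```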
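For reference, here is a minimal `optax` sketch matching the optimizer rows of the table (Adafactor, 10 000 warmup steps, learning rate 0.005, weight decay 0.01). The exact schedule shape used by the original training script is not stated in this card; a linear warmup to a constant rate is assumed below.

```python
import optax

# Assumption: linear warmup over 10 000 steps to a constant learning
# rate of 0.005, matching the "warmup steps" and "learning rate" rows.
# Whether the t5-flax-gcp script used exactly this schedule shape is
# not stated in the card.
learning_rate = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=0.005,
                              transition_steps=10_000),
        optax.constant_schedule(0.005),
    ],
    boundaries=[10_000],
)

# Adafactor with the weight decay from the table (0.01 for most models).
optimizer = optax.adafactor(
    learning_rate=learning_rate,
    weight_decay_rate=0.01,
)
```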
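Finally, a minimal sketch of loading one of these checkpoints with 🤗 Transformers for downstream fine-tuning. The repository id `yhavinga/t5-v1.1-base-dutch` is an assumption, derived from the model names in the table and the namespace of the training dataset. The inference widget is disabled (`inference: false`) because these are plain span-corruption pre-trained checkpoints, not task-tuned models.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Assumed repository id: the table's model name under the same
# namespace as the yhavinga/mc4_nl_cleaned dataset.
repo_id = "yhavinga/t5-v1.1-base-dutch"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = T5ForConditionalGeneration.from_pretrained(repo_id)

# These checkpoints are pre-trained with span corruption only, so they
# should be fine-tuned on a downstream task (e.g. summarization or
# translation) before being used for generation.
```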