---
language:
- nl
datasets:
- yhavinga/mc4_nl_cleaned
tags:
- seq2seq
- lm-head
license: apache-2.0
inference: false
---
12
+
13
+ # Work in progress. Dec 2021.
14
+
15
+ # A collection of Dutch T5 models
16
+
17
+ * Many thanks to the [Google TPU Research Cloud](https://sites.research.google/trc/about/) for providing access to a TPU cluster!
18
+ * Continuation of work started during the [Hugging Face community week](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104), organized by [HuggingFace](https://huggingface.co/) and TPU usage sponsored by Google, for the project [Pre-train T5 from scratch in Dutch](https://discuss.huggingface.co/t/pretrain-t5-from-scratch-in-dutch/8109).
19
+ * Using improved training script - no more exceptions during training, so no restarting required.
20
+ * All models trained with tensorflow metrics.
21
+ * Thanks to @gsarti for creating the [t5-flax-gcp repository](https://github.com/gsarti/t5-flax-gcp)!
22
+
23
+
| |`t5-base-dutch` |`t5-v1.1-base-dutch` |`t5-v1.1-large-dutch-cased`| `t5-v1.1-base-dutch-uncased`|
|-----------------------|-------------------------|-------------------------|---------------------------|-----------------------------|
|`tokenizer` |`cased` |`uncased` |`cased` |`uncased` |
|`source model config` |`google/t5-base` |`google/t5-v1_1-base` |`google/t5-v1_1-large` |`google/t5-v1_1-base` |
|`dataset` |`yhavinga/mc4_nl_cleaned`|`yhavinga/mc4_nl_cleaned`|`yhavinga/mc4_nl_cleaned` |`yhavinga/mc4_nl_cleaned` |
|`tpu vm` | two | one | three | one |
|`finished` | | YES | | |
|*Hyperparameters* | | | | |
|`epochs` | 1 | 1 | 4 | 2 |
|`per-device batch size`| 16 | 16 | 2 | 8 |
|`tot. batch size` | 128 | 128 | 16 | ? |
|`steps` | 508 976 | 508 976 | 8 428 012 | ? |
|`max seq. length` | 512 | 512 | 1024 | 1024 |
|`tot. tok. trained on` | 33B | 33B | 138B | ? |
|`optimizer` | adafactor | adafactor | adafactor | adafactor |
|`warmup steps` | 10000 | 10000 | 10000 | 10000 |
|`learning rate` | 0.005 | 0.005 | 0.005 | 0.005 |
|`weight decay` | 0.01 | 0.01 | 0.01 | 0.001 |
|`tie embeds` |`false` |`false` |`false` |`false` |
|`validation split size`| 15K examples | 15K examples | 15K examples | 15K examples |
|*Model config* | | | | |
|`d_ff` | 3072 | 2048 | 2816 | 2048 |
|`d_kv` | 64 | 64 | 64 | 64 |
|`d_model` | 768 | 768 | 1024 | 768 |
|`dropout rate` | 0.1 | 0.1 | 0.1 (0.0 in pre-training) | 0.1 (0.0 in pre-training) |
|`ff projection` |`relu` |`gated-gelu` |`gated-gelu` |`gated-relu` |
|`num decoder layers` | 12 | 12 | 24 | 12 |
|`num heads` | 12 | 12 | 16 | 12 |
|`num layers` | 12 | 12 | 24 | 12 |
|`rel. attn. buckets` | 32 | 32 | 32 | 32 |
|`vocab size` | 32103 | 32103 | 32103 | 32103 |
|*Training time* | ~ 100 hours | 101 hours | ~ 370 hours | ? |
|*Evaluation* | | | | |
|`accuracy` | | 0.6976 | | |
|`loss` | | 1.379 | | |
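
The derived rows of the table can be cross-checked: total batch size is the per-device batch size times the number of devices, and the total tokens trained on is steps × total batch size × max sequence length. A minimal sketch of this arithmetic, assuming 8 TPU cores per VM (the core count is my assumption, not stated in the table):

```python
# Cross-check the base-model column of the table above.
per_device_batch = 16   # `per-device batch size` for t5-base-dutch
tpu_cores = 8           # assumed cores per TPU VM (not stated in the table)

total_batch = per_device_batch * tpu_cores
print(total_batch)      # 128, matching the `tot. batch size` row

steps = 508_976         # `steps` row
seq_len = 512           # `max seq. length` row
tokens = steps * total_batch * seq_len
print(f"{tokens / 1e9:.1f}B")  # 33.4B, consistent with `tot. tok. trained on` = 33B
```

The same check works for the large model: 8 428 012 steps × 16 batch × 1024 tokens ≈ 138B.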