---
language:
  - nl
datasets:
  - yhavinga/mc4_nl_cleaned
tags:
  - seq2seq
  - lm-head
license: apache-2.0
inference: false
---

# A collection of Dutch T5 models

Work in progress, December 2021.

|                     | t5-base-dutch           | t5-v1.1-base-dutch      | t5-v1.1-large-dutch-cased | t5-v1.1-base-dutch-uncased |
|---------------------|-------------------------|-------------------------|---------------------------|----------------------------|
| tokenizer           | cased                   | uncased                 | cased                     | uncased                    |
| source model config | google/t5-base          | google/t5-v1_1-base     | google/t5-v1_1-large      | google/t5-v1_1-base        |
| dataset             | yhavinga/mc4_nl_cleaned | yhavinga/mc4_nl_cleaned | yhavinga/mc4_nl_cleaned   | yhavinga/mc4_nl_cleaned    |
| TPU VM              | two                     | one                     | three                     | one                        |
| finished            | YES                     |                         |                           |                            |
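
The checkpoints can be loaded with Hugging Face Transformers. Below is a minimal usage sketch; the hub repository ids (e.g. `yhavinga/t5-base-dutch`) are assumed from the author's namespace and are not confirmed by this card. Since these models are pretrained only (no downstream fine-tuning yet), the example probes the span-corruption objective with a sentinel token rather than performing a real task.

```python
# Minimal sketch, assuming the checkpoint is published as
# "yhavinga/t5-base-dutch" (the repo id is an assumption).
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("yhavinga/t5-base-dutch")
model = T5ForConditionalGeneration.from_pretrained("yhavinga/t5-base-dutch")

# The models are pretrained with span corruption only, so we ask the
# model to fill in a masked span marked by a sentinel token.
text = "Het weer is vandaag <extra_id_0> en zonnig."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```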
## Hyperparameters

|                         | t5-base-dutch | t5-v1.1-base-dutch | t5-v1.1-large-dutch-cased | t5-v1.1-base-dutch-uncased |
|-------------------------|---------------|--------------------|---------------------------|----------------------------|
| epochs                  | 1             | 1                  | 4                         | 2                          |
| per-device batch size   | 16            | 16                 | 2                         | 8                          |
| total batch size        | 128           | 128                | 16                        | ?                          |
| steps                   | 508 976       | 508 976            | 8 428 012                 | ?                          |
| max seq. length         | 512           | 512                | 1024                      | 1024                       |
| total tokens trained on | 33B           | 33B                | 138B                      | ?                          |
| optimizer               | adafactor     | adafactor          | adafactor                 | adafactor                  |
| warmup steps            | 10000         | 10000              | 10000                     | 10000                      |
| learning rate           | 0.005         | 0.005              | 0.005                     | 0.005                      |
| weight decay            | 0.01          | 0.01               | 0.01                      | 0.001                      |
| tie embeds              | false         | false              | false                     | false                      |
| validation split size   | 15K examples  | 15K examples       | 15K examples              | 15K examples               |
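
The optimizer rows above translate roughly to the sketch below, using the PyTorch Adafactor implementation in `transformers` with a fixed learning rate of 0.005 and 10000 warmup steps (t5-base-dutch column). The actual pretraining presumably ran with Flax/JAX on the TPU VMs listed earlier, so this is an illustration of the listed settings, not the author's training script.

```python
# Hedged optimizer sketch; the numbers come from the hyperparameter table,
# everything else (the stand-in model, the schedule choice) is assumed.
from transformers import Adafactor, T5Config, T5ForConditionalGeneration
from transformers import get_constant_schedule_with_warmup

model = T5ForConditionalGeneration(T5Config())  # stand-in model

optimizer = Adafactor(
    model.parameters(),
    lr=0.005,               # fixed learning rate from the table
    weight_decay=0.01,      # 0.001 for t5-v1.1-base-dutch-uncased
    scale_parameter=False,  # disable relative-step heuristics so the
    relative_step=False,    # explicit lr is actually used
    warmup_init=False,
)
# 10000 linear warmup steps up to the fixed lr, then constant.
scheduler = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=10000)
```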
## Model config

|                    | t5-base-dutch | t5-v1.1-base-dutch | t5-v1.1-large-dutch-cased     | t5-v1.1-base-dutch-uncased    |
|--------------------|---------------|--------------------|-------------------------------|-------------------------------|
| d_ff               | 3072          | 2048               | 2816                          | 2048                          |
| d_kv               | 64            | 64                 | 64                            | 64                            |
| d_model            | 768           | 768                | 1024                          | 768                           |
| dropout rate       | 0.1           | 0.1                | 0.1 (0.0 during pre-training) | 0.1 (0.0 during pre-training) |
| ff projection      | relu          | gated-gelu         | gated-gelu                    | gated-relu                    |
| num decoder layers | 12            | 12                 | 24                            | 12                            |
| num heads          | 12            | 12                 | 16                            | 12                            |
| num layers         | 12            | 12                 | 24                            | 12                            |
| rel. attn. buckets | 32            | 32                 | 32                            | 32                            |
| vocab size         | 32103         | 32103              | 32103                         | 32103                         |
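
For reference, the t5-v1.1-base-dutch column maps onto a `transformers` `T5Config` roughly as follows; the parameter names are the library's, and `tie_word_embeddings=False` mirrors the "tie embeds" row in the hyperparameter table.

```python
from transformers import T5Config

# Sketch of the t5-v1.1-base-dutch column as a T5Config.
config = T5Config(
    vocab_size=32103,
    d_model=768,
    d_kv=64,
    d_ff=2048,
    num_layers=12,
    num_decoder_layers=12,
    num_heads=12,
    relative_attention_num_buckets=32,
    dropout_rate=0.1,
    feed_forward_proj="gated-gelu",
    tie_word_embeddings=False,
)
```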
## Training time

| t5-base-dutch | t5-v1.1-base-dutch | t5-v1.1-large-dutch-cased | t5-v1.1-base-dutch-uncased |
|---------------|--------------------|---------------------------|----------------------------|
| ~100 hours    | 101 hours          | ~370 hours                | ?                          |
## Evaluation

| metric   | value  |
|----------|--------|
| accuracy | 0.6976 |
| loss     | 1.379  |
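
The card does not state how these metrics were computed; a plausible reading is token-level accuracy and cross-entropy loss on the 15K-example validation split under the span-corruption pretraining objective. A hedged sketch of that computation on a single toy example follows (the repo id and the metric definition are assumptions):

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("yhavinga/t5-base-dutch")
model = T5ForConditionalGeneration.from_pretrained("yhavinga/t5-base-dutch")

inputs = tokenizer("Het weer is vandaag <extra_id_0>.", return_tensors="pt")
labels = tokenizer("<extra_id_0> mooi", return_tensors="pt").input_ids

with torch.no_grad():
    out = model(**inputs, labels=labels)

# Cross-entropy loss over the target tokens, and token-level accuracy
# of the greedy (argmax) predictions against the labels.
accuracy = (out.logits.argmax(-1) == labels).float().mean()
print(f"loss={out.loss:.3f} accuracy={accuracy:.4f}")
```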