# T5-base pre-trained on cleaned Dutch mC4 🇳🇱

A T5 v1.0 base model pre-trained from scratch on Dutch mC4.

## Tokenizer

• Tokenizer trained from scratch for Dutch on cleaned Dutch mC4, using scripts from the Hugging Face Transformers Flax examples.

## Dataset

All models listed below are trained on the full configuration (39B tokens) of cleaned Dutch mC4, which is the original mC4 with the following changes:

• Documents that contain words from a selection of the Dutch and English List of Dirty, Naughty, Obscene, and Otherwise Bad Words are removed
• Sentences with fewer than 3 words are removed
• Sentences containing a word of more than 1000 characters are removed
• Documents with fewer than 5 sentences are removed
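
The cleaning rules above can be sketched as a small filter. This is a minimal illustration, not the script that produced the dataset: the sentence splitter (a regex on `.`, `!`, `?`), the order in which the rules are applied, and the placeholder `BAD_WORDS` set are all assumptions made for the example.

```python
import re
from typing import Iterable, Optional

# Hypothetical placeholder: the actual selection of Dutch and English words
# is not reproduced here.
BAD_WORDS = frozenset({"badword"})

MIN_WORDS_PER_SENTENCE = 3
MAX_WORD_LENGTH = 1000
MIN_SENTENCES_PER_DOCUMENT = 5

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
_WORD = re.compile(r"\w+")


def clean_document(text: str, bad_words: Iterable[str] = BAD_WORDS) -> Optional[str]:
    """Return the cleaned document, or None if the whole document is dropped."""
    # Rule 1: drop documents that contain any listed word.
    tokens = {w.lower() for w in _WORD.findall(text)}
    if not tokens.isdisjoint(bad_words):
        return None

    kept = []
    for sentence in _SENTENCE_END.split(text.strip()):
        words = sentence.split()
        # Rule 2: drop sentences with fewer than 3 words.
        if len(words) < MIN_WORDS_PER_SENTENCE:
            continue
        # Rule 3: drop sentences containing a word of more than 1000 characters.
        if any(len(w) > MAX_WORD_LENGTH for w in words):
            continue
        kept.append(sentence)

    # Rule 4 (assumed to apply after sentence filtering): drop short documents.
    if len(kept) < MIN_SENTENCES_PER_DOCUMENT:
        return None
    return " ".join(kept)
```

Calling `clean_document(text)` on each document and discarding the `None` results would apply all four rules in one pass.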

## Models

• The first model, t5-base-dutch, is a re-training of the Dutch T5 base v1.0 model trained during the Flax/Jax community week. With training complete, accuracy improved from 0.64 to 0.70.
• The next two models are an uncased and a cased version of t5-v1.1-base, again pre-trained from scratch on Dutch, with a tokenizer also trained from scratch. The T5 v1.1 models differ slightly from the original T5 models, and the base models are trained with a dropout of 0.0. For fine-tuning, dropout is intended to be set back to 0.1.
• The large cased model is a pre-trained Dutch version of t5-v1.1-large. Training t5-v1.1-large proved difficult: without dropout regularization, training would diverge at a certain point. With dropout, training went better, albeit much more slowly than training the t5 model. At some point convergence was too slow to warrant further training. The latest checkpoint, training scripts, and metrics are available for reference. For actual fine-tuning, the cased base model is probably the better choice.
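
As a rough sketch of restoring dropout for fine-tuning, the helper below overrides `dropout_rate` while loading a checkpoint with the Hugging Face Transformers library. The function name `load_for_finetuning` and the constant `FINETUNE_DROPOUT` are hypothetical, the checkpoint id must be supplied by the caller because this card does not state it, and loading the model assumes PyTorch weights are available.

```python
FINETUNE_DROPOUT = 0.1  # value suggested above for fine-tuning


def load_for_finetuning(model_id: str, dropout_rate: float = FINETUNE_DROPOUT):
    """Load a seq2seq checkpoint with its dropout rate overridden.

    `model_id` is a Hub id or local path chosen by the caller; no id is
    hard-coded because the card does not list one.
    """
    # Imported inside the function so the helper can be defined even when
    # transformers is not installed yet.
    from transformers import AutoConfig, AutoModelForSeq2SeqLM

    # Keyword arguments that match config attributes override the stored
    # values, so the pre-training dropout of 0.0 is replaced here.
    config = AutoConfig.from_pretrained(model_id, dropout_rate=dropout_rate)
    return AutoModelForSeq2SeqLM.from_pretrained(model_id, config=config)
```

Passing the returned model to a fine-tuning loop then trains with dropout active whenever the model is in training mode.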

| | model | train seq len | acc | loss | batch size | epochs | steps | dropout | optim | lr | duration |
|---|---|---|---|---|---|---|---|---|---|---|---|
| t5-base-dutch | T5 | 512 | 0.70 | 1.38 | 128 | 1 | 528481 | 0.1 | adafactor | 5e-3 | 2d 9h |
| t5-v1.1-base-dutch-uncased | t5-v1.1 | 1024 | 0.73 | 1.20 | 64 | 2 | 1014525 | 0.0 | adafactor | 5e-3 | 5d 5h |
| t5-v1.1-base-dutch-cased | t5-v1.1 | 1024 | 0.78 | 0.96 | 64 | 2 | 1210000 | 0.0 | adafactor | 5e-3 | 6d 6h |
| t5-v1.1-large-dutch-cased | t5-v1.1 | 512 | 0.76 | 1.07 | 64 | 1 | 1120000 | 0.1 | adafactor | 5e-3 | 8d 13h |
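
To relate the step counts to the 39B-token dataset, one can multiply batch size, training sequence length, and steps. The sketch below does this for the four runs. It assumes that every sequence is filled to the full training length and that `steps` counts optimizer steps at the listed batch size, so the result is only an upper bound on encoder input positions, not an exact token count.

```python
def encoder_positions(batch_size: int, seq_len: int, steps: int) -> int:
    """Upper bound on encoder input positions seen during training."""
    return batch_size * seq_len * steps


# (batch size, train seq len, steps) as reported for each pre-training run.
RUNS = {
    "t5-base-dutch": (128, 512, 528_481),
    "t5-v1.1-base-dutch-uncased": (64, 1024, 1_014_525),
    "t5-v1.1-base-dutch-cased": (64, 1024, 1_210_000),
    "t5-v1.1-large-dutch-cased": (64, 512, 1_120_000),
}

for name, (batch_size, seq_len, steps) in RUNS.items():
    total = encoder_positions(batch_size, seq_len, steps)
    print(f"{name}: {total / 1e9:.1f}B positions")
```

Under these assumptions each run processes on the order of tens of billions of positions, which is the same order of magnitude as the dataset size times the reported number of epochs.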

The cased t5-v1.1 Dutch models were fine-tuned for summarization on the CNN Daily Mail dataset.

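| | model | input len | target len | Rouge1 | Rouge2 | RougeL | RougeLsum | Test Gen Len | epochs | batch size | steps | duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| t5-v1.1-base-dutch-cnn-test | t5-v1.1 | 1024 | 96 | 34.8 | 13.6 | 25.2 | 32.1 | 79 | 6 | 64 | 26916 | 2h 40m |
| t5-v1.1-large-dutch-cnn-test | t5-v1.1 | 1024 | 96 | 34.4 | 13.6 | 25.3 | 31.7 | 81 | 5 | 16 | 89720 | 11h |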

## Acknowledgements

This project would not have been possible without compute generously provided by Google through the TPU Research Cloud. The HuggingFace 🤗 ecosystem was also instrumental in many, if not all, parts of the training. The following repositories were helpful in setting up the TPU-VM and in getting an idea of sensible hyper-parameters for training gpt2 from scratch.

Created by Yeb Havinga
