metadata

language:
  - nl
datasets:
  - yhavinga/mc4_nl_cleaned
tags:
  - t5
  - seq2seq
inference: false
license: apache-2.0

t5-v1_1-base-dutch-english-cased-1024

A T5 sequence-to-sequence model pre-trained from scratch on cleaned Dutch 🇳🇱🇧🇪 mC4 and cleaned English 🇬🇧 C4.

This t5-v1.1 model has 247M parameters. It was pre-trained on the dataset mc4_nl_cleaned, config large_en_nl, for 4 epochs over 9d1h, with a sequence length of 1024, a batch size of 64, and 1520k/3397024 total steps. Pre-training evaluation loss and accuracy are 1.13 and 0.74. After fine-tuning on 25K samples of Dutch CNN summarization, the Rouge1 score is 34.6 (note: this evaluation model was not saved).
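As a rough sanity check on the training budget above, the step count, batch size and sequence length can be multiplied out. This is a back-of-the-envelope sketch only: it assumes that "1520k/3397024" means training ran for 1,520,000 of 3,397,024 scheduled steps, and it counts encoder input positions, which is an upper bound on the real tokens seen because padding is included.

```python
# Figures taken from the paragraph above. Reading "1520k/3397024" as
# "steps run / steps scheduled" is an assumption, not a documented fact.
STEPS_RUN = 1_520_000
STEPS_SCHEDULED = 3_397_024
BATCH_SIZE = 64
SEQ_LEN = 1024


def input_positions(steps, batch_size, seq_len):
    """Upper bound on the encoder input positions processed during training."""
    return steps * batch_size * seq_len


fraction_done = STEPS_RUN / STEPS_SCHEDULED
positions = input_positions(STEPS_RUN, BATCH_SIZE, SEQ_LEN)

print(f"fraction of schedule completed: {fraction_done:.3f}")
print(f"input positions processed: {positions:,}")
```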

Pre-trained T5 models need to be fine-tuned before they can be used for downstream tasks, so the inference widget on the right has been turned off.
For a demo of the Dutch CNN summarization models, head over to the Hugging Face Spaces for the Netherformer 📰 example application!

Please refer to the original T5 paper and the Scale Efficiently paper for more information about the T5 architecture and configs. Note, however, that this model (t5-v1_1-base-dutch-english-cased-1024) is unrelated to these projects and is not an 'official' checkpoint.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.
Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers by Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, Donald Metzler.

Tokenizer

The model uses a cased SentencePiece tokenizer with 32003 tokens, configured with the Nmt, NFKC, and Replace (multiple spaces to a single space) normalizers. It was trained on Dutch and English with scripts from the Hugging Face Transformers Flax examples. See ./raw/main/tokenizer.json for details.
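To give a feel for what two of these normalizers do, here is a small standard-library approximation. It is a sketch, not the tokenizer's actual pipeline: the Nmt step is not reproduced, and the function name approx_normalize is hypothetical. The authoritative configuration remains the tokenizer.json file mentioned above.

```python
import re
import unicodedata


def approx_normalize(text):
    """Approximate the NFKC and multi-space steps; the Nmt step is NOT reproduced."""
    # NFKC folds compatibility characters (ligatures, full-width letters) to plain forms.
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of two or more spaces into a single space.
    return re.sub(r" {2,}", " ", text)


# "\ufb01" is the "fi" ligature and "\uff21" is a full-width capital A.
print(approx_normalize("\ufb01ets   in  \uff21msterdam"))
```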

Dataset

All models listed below are trained on cleaned Dutch mC4, which is the original mC4 with the following exceptions:

Documents that contain words from a selection of the Dutch and English List of Dirty Naughty Obscene and Otherwise Bad Words are removed
Sentences with fewer than 3 words are removed
Sentences with a word of more than 1000 characters are removed
Documents with fewer than 5 sentences are removed
Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies", "use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed
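The rules above can be sketched as a single filter function. This is an illustrative reading, not the script that produced mc4_nl_cleaned: the sentence splitter, the order in which the rules run, the case-insensitive matching, and the names clean_document and bad_words are all assumptions, and the bad-word list itself is left for the caller to supply.

```python
import re

# Phrases copied from the list above; a document containing any of them is dropped.
BANNED_PHRASES = (
    "javascript", "lorum ipsum", "terms of use", "privacy policy",
    "cookie policy", "uses cookies", "use of cookies", "use cookies",
    "elementen ontbreken", "deze printversie",
)


def clean_document(text, bad_words=frozenset()):
    """Return the cleaned document, or None if the document is rejected.

    bad_words is a hypothetical, caller-supplied set of lowercase words.
    """
    lowered = text.lower()
    if any(phrase in lowered for phrase in BANNED_PHRASES):
        return None
    if set(re.findall(r"\w+", lowered)) & bad_words:
        return None

    # Naive sentence split on ., ! or ? followed by whitespace (an assumption).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = []
    for sentence in sentences:
        words = sentence.split()
        if len(words) < 3:
            continue  # sentence too short
        if any(len(word) > 1000 for word in words):
            continue  # sentence contains an implausibly long word
        kept.append(sentence)

    # Assumed order: the 5-sentence minimum is checked after sentence filtering.
    if len(kept) < 5:
        return None
    return " ".join(kept)


example = "Dit is een zin. " * 5
print(clean_document(example))
```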

The Dutch and English models are trained on a 50/50% mix of Dutch mC4 and English C4.

Models

Three types of models have been trained. t5-base-dutch is the only model with an original T5 config. The other model types, t5-v1.1 and t5-eff, use gated-gelu instead of relu as the activation function and were trained with a dropout of 0.0 unless training would diverge (t5-v1.1-large-dutch-cased). The t5-eff models mostly differ in their number of layers. The table below lists the dimensions of these models. Note that efficient is a misnomer for models with few layers: t5-xl-4L-dutch-english-cased, for example, is not efficient and is one of the worst models on downstream summarization.

	t5-base-dutch	t5-v1.1-base-dutch-uncased	t5-v1.1-base-dutch-cased	t5-v1.1-large-dutch-cased	t5-v1_1-base-dutch-english-cased	t5-v1_1-base-dutch-english-cased-1024	t5-small-24L-dutch-english	t5-xl-4L-dutch-english-cased	t5-base-36L-dutch-english-cased	t5-eff-xl-8l-dutch-english-cased	t5-eff-large-8l-dutch-english-cased
type	t5	t5-v1.1	t5-v1.1	t5-v1.1	t5-v1.1	t5-v1.1	t5 eff	t5 eff	t5 eff	t5 eff	t5 eff
d_model	768	768	768	1024	768	768	512	2048	768	1024	1024
d_ff	3072	2048	2048	2816	2048	2048	1920	5120	2560	16384	4096
num_heads	12	12	12	16	12	12	8	32	12	32	16
d_kv	64	64	64	64	64	64	64	64	64	128	64
num_layers	12	12	12	24	12	12	24	4	36	8	8
num parameters	223M	248M	248M	783M	248M	248M	250M	585M	729M	1241M	335M
feed_forward_proj	relu	gated-gelu	gated-gelu	gated-gelu	gated-gelu	gated-gelu	gated-gelu	gated-gelu	gated-gelu	gated-gelu	gated-gelu
dropout	0.1	0.0	0.0	0.1	0.0	0.0	0.0	0.1	0.0	0.0	0.0
dataset	mc4_nl_cleaned	mc4_nl_cleaned full	mc4_nl_cleaned full	mc4_nl_cleaned	mc4_nl_cleaned small_en_nl	mc4_nl_cleaned large_en_nl	mc4_nl_cleaned large_en_nl	mc4_nl_cleaned large_en_nl	mc4_nl_cleaned large_en_nl	mc4_nl_cleaned large_en_nl	mc4_nl_cleaned large_en_nl
tr. seq len	512	1024	1024	512	512	1024	512	512	512	512	512
batch size	128	64	64	64	128	64	128	512	512	64	128
total steps	527500	1014525	1210154	2427498	2839630	1520k/3397024	851852	212963	212963	538k/1703705	851850
epochs	1	2	2	2	10	4	1	1	1	1	1
duration	2d9h	5d5h	6d6h	8d13h	11d18h	9d1h	4d10h	6d1h	17d15h	4d19h	3d23h
optimizer	adafactor	adafactor	adafactor	adafactor	adafactor	adafactor	adafactor	adafactor	adafactor	adafactor	adafactor
lr	0.005	0.005	0.005	0.005	0.005	0.005	0.005	0.005	0.009	0.005	0.005
warmup	10000.0	10000.0	10000.0	10000.0	10000.0	5000.0	20000.0	2500.0	1000.0	1500.0	1500.0
eval loss	1.38	1.20	0.96	1.07	1.11	1.13	1.18	1.27	1.05	1.3019	1.15
eval acc	0.70	0.73	0.78	0.76	0.75	0.74	0.74	0.72	0.76	0.71	0.74
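The dimension rows in the table determine the parameter count, and the relationship can be sketched in a few lines. The function below is an illustration under stated assumptions, not code from the training scripts: it assumes the standard T5 layout (scale-only layer norms, one relative-attention bias table with 32 buckets per stack), untied input and output embeddings, and an embedding size of 32103. That embedding size is an assumption; it is the value that reproduces the two exact num_parameters figures listed in the translation table further down (249991680 and 728928000). It is not guaranteed to match every rounded figure in the table above.

```python
def t5_param_count(d_model, d_ff, num_heads, d_kv, num_layers,
                   vocab_size=32103, gated=True, tied_embeddings=False,
                   num_buckets=32):
    """Estimate the parameter count of a T5 encoder-decoder model."""
    inner = num_heads * d_kv
    attention = 4 * d_model * inner              # q, k, v and output projections
    feed_forward = (3 if gated else 2) * d_model * d_ff
    layer_norm = d_model                         # T5 layer norm: scale only, no bias
    encoder_layer = attention + feed_forward + 2 * layer_norm
    decoder_layer = 2 * attention + feed_forward + 3 * layer_norm
    stacks = num_layers * (encoder_layer + decoder_layer)
    rel_bias = 2 * num_buckets * num_heads       # one bias table per stack
    final_norms = 2 * layer_norm
    embeddings = (1 if tied_embeddings else 2) * vocab_size * d_model
    return stacks + rel_bias + final_norms + embeddings


# Dimensions of t5-small-24L-dutch-english and t5-base-36L-dutch-english-cased.
print(t5_param_count(d_model=512, d_ff=1920, num_heads=8, d_kv=64, num_layers=24))
print(t5_param_count(d_model=768, d_ff=2560, num_heads=12, d_kv=64, num_layers=36))
```

Passing gated=False and tied_embeddings=True gives the corresponding estimate for an original T5 config such as t5-base-dutch.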

Evaluation on summarization

The models below were evaluated on the downstream summarization task, using 50K samples from the CNN/DailyMail dataset. All models were fine-tuned with the AdamW optimizer, a batch size of 128, and a constant learning rate of 1e-3 after a warmup of 64 steps, with a label smoothing factor of 0.05. Article and summary token lengths were set to 1024 and 142.
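The learning-rate schedule described above can be written as a small function. This is a sketch under one assumption: the warmup is taken to be linear, which the text does not state. The name finetune_lr is hypothetical.

```python
def finetune_lr(step, peak_lr=1e-3, warmup_steps=64):
    """Learning rate at a 0-based step: linear warmup (assumed), then constant."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


for step in (0, 31, 63, 64, 1000):
    print(step, finetune_lr(step))
```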

	t5-base-dutch	t5-v1.1-base-dutch-uncased	t5-v1.1-base-dutch-cased	t5-v1_1-base-dutch-english-cased	t5-v1_1-base-dutch-english-cased-1024	t5-small-24L-dutch-english	t5-xl-4L-dutch-english-cased	t5-base-36L-dutch-english-cased	t5-eff-large-8l-dutch-english-cased	mt5-base
rouge1	33.0313	33.8432	34.0906	33.1116	34.6465	34.376	30.8983	35.0931	33.9293	33.6466
rouge2	12.9452	13.7706	13.6203	13.275	13.8525	13.8939	11.6005	14.3823	13.6274	13.1085
rougeL	23.7204	24.5642	24.7304	24.3561	24.721	25.2496	22.6536	25.3213	24.5595	23.909
rougeLsum	29.842	30.7783	31.1438	30.0548	31.6104	31.3838	27.8467	32.3526	30.952	30.5054
gen_len	90.488	91.832	92.122	89.583	98.333	90.442	92.342	96.832	95.057	96.312
num parameters	223M	248M	248M	248M	248M	250M	585M	729M	335M	582M
samples_per_second	3.195	3.039	3.0	3.216	2.974	1.594	2.47	0.623	3.087	1.201

Translation models

The small 24L and base 36L models have been fine-tuned for translation on the CCMatrix dataset only. The models named *-multi support both directions of translation. Because CCMatrix is very large, with over 100M Dutch-English sentence pairs, the models are trained on a fraction of it; the total steps and duration rows in the table below show how long each model was trained. Evaluation is performed on a held-out CCMatrix section, and also on Tatoeba and Opus Books. The _bp rows list the brevity penalty. The avg_bleu score is the BLEU score averaged over all three evaluation datasets.

The translation metrics are listed in the table below:

	t5-base-36L-ccmatrix-en-nl	t5-base-36L-ccmatrix-multi	t5-base-36L-ccmatrix-multi	t5-small-24L-ccmatrix-multi	t5-small-24L-ccmatrix-multi
id	0	14	15	16	20
source_lang	en	en	nl	en	nl
target_lang	nl	nl	en	nl	en
source_prefix	translate English to Dutch:	translate English to Dutch:	translate Dutch to English:	translate English to Dutch:	translate Dutch to English:
tatoeba_bp	0.9897614370103832	0.9736173618072754	0.943521164106552	0.9760983304454847	0.9406676405486575
ccmatrix_bp	0.9590750786190209	0.9536276245543676	0.9635673583308255	0.9517934939463099	0.9585648049711814
opus_books_bp	0.7478011343203491	0.7950194726093107	0.9362852511299413	0.770498474692027	0.8870675076932444
tatoeba_score	50.63006965176505	46.580601850286214	52.82030981131822	46.419809813946046	51.67887417355214
ccmatrix_score	60.33227938980884	56.81297258845844	62.836646082246254	57.404319674892406	63.08633155239932
opus_books_score	10.405013868050663	13.477997378535864	24.93113308798125	12.927244801365507	23.418552148252047
avg_bleu	40.455787636541515	38.95719060576017	46.86269632718191	38.91712476340132	46.0612526247345
total steps	78125	390625	390625	390625	390625
duration	14h	101h	101h	74h	74h
num_parameters	728928000	728928000	728928000	249991680	249991680
label_smoothing_factor	0.09	0.15	0.15	0.1	0.1
learning_rate	0.0001	5e-05	5e-05	0.0005	0.0005
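The avg_bleu row can be recomputed from the three score rows, and the _bp rows follow the usual BLEU brevity-penalty definition. The snippet below is a minimal sketch: the brevity penalty is the textbook formula and is assumed, not confirmed, to be what the evaluation tooling used; the function names are hypothetical.

```python
import math


def brevity_penalty(candidate_len, reference_len):
    """Textbook BLEU brevity penalty: 1 unless the output is shorter than the reference."""
    if candidate_len == 0:
        return 0.0
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)


def average_bleu(scores):
    """Unweighted mean of the per-dataset BLEU scores."""
    return sum(scores) / len(scores)


# Tatoeba, CCMatrix and Opus Books scores for t5-base-36L-ccmatrix-en-nl (id 0),
# copied from the table above.
id0_scores = [50.63006965176505, 60.33227938980884, 10.405013868050663]
print(average_bleu(id0_scores))  # compare with the avg_bleu row for id 0
```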

Acknowledgements

This project would not have been possible without compute generously provided by Google through the TPU Research Cloud. The Hugging Face 🤗 ecosystem was also instrumental in all parts of the training. Logging metrics to Weights & Biases made it possible to keep track of many models and orchestrate hyper-parameter sweeps with insightful visualizations. I cannot imagine how I would have completed this project otherwise. The following repositories were helpful in setting up the TPU-VM and in getting an idea of what sensible hyper-parameters are for training gpt2 from scratch.

Created by Yeb Havinga