language:
  - nl
datasets:
  - yhavinga/mc4_nl_cleaned
tags:
  - t5
  - seq2seq
inference: false
license: apache-2.0

t5-v1_1-base-dutch-english-cased-1024

A T5 sequence to sequence model pre-trained from scratch on cleaned Dutch 🇳🇱🇧🇪 mC4 and cleaned English 🇬🇧 C4.

This t5-v1.1 model has 247M parameters. It was pre-trained on the dataset mc4_nl_cleaned config large_en_nl for 4 epochs and a duration of 9d1h, with a sequence length of 1024, a batch size of 64, and 1520k/3397024 total steps. Pre-training evaluation loss and accuracy are 1.13 and 0.74. After fine-tuning on 25K samples of Dutch CNN summarization, the Rouge1 score is 34.6 (note: this evaluation model was not saved).

  • Pre-trained T5 models need to be fine-tuned before they can be used for downstream tasks. For that reason, the inference widget on the right has been turned off.
  • For a demo of the Dutch CNN summarization models, head over to the Hugging Face Spaces for the Netherformer 📰 example application!

Please refer to the original T5 papers and Scale Efficiently papers for more information about the T5 architecture and configs, though it must be noted that this model (t5-v1_1-base-dutch-english-cased-1024) is unrelated to these projects and not an 'official' checkpoint.

Tokenizer

The model uses a cased SentencePiece tokenizer with 32003 tokens, configured with the Nmt, NFKC, and Replace (multi-space to single-space) normalizers. It was trained on Dutch and English with scripts from the Hugging Face Transformers Flax examples. See ./raw/main/tokenizer.json for details.
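
As a rough illustration of what two of these normalizers do to input text, the sketch below applies NFKC folding and multi-space collapsing with the Python standard library only. It is an approximation, not the tokenizer's own code: the Nmt step is omitted, and the function name `approx_normalize` is made up for this example.

```python
import re
import unicodedata


def approx_normalize(text: str) -> str:
    """Approximate the NFKC and multi-space steps of the normalizer chain.

    Assumption: the Nmt normalizer is not reproduced here.
    """
    # NFKC folds compatibility characters, e.g. the ligature U+FB01 becomes "fi".
    text = unicodedata.normalize("NFKC", text)
    # Collapse every run of two or more spaces into a single space.
    return re.sub(r" {2,}", " ", text)


print(approx_normalize("Een \ufb01jne   dag"))
```

For exact behaviour, the normalizer section of tokenizer.json remains the reference.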

Dataset

All models listed below are trained on cleaned Dutch mC4, which is the original mC4 with the following exceptions:

  • Documents that contain words from a selection of the Dutch and English List of Dirty, Naughty, Obscene, and Otherwise Bad Words are removed.
  • Sentences with fewer than 3 words are removed.
  • Sentences with a word of more than 1000 characters are removed.
  • Documents with fewer than 5 sentences are removed.
  • Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies", "use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed.
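
The rules above can be sketched as a single filter function. This is a minimal illustration rather than the actual cleaning script. Several details are assumptions: the naive punctuation-based sentence splitter, case-insensitive phrase matching, counting sentences after the sentence-level filters, and the placeholder `BAD_WORDS` set, which stands in for the real word list.

```python
import re
from typing import Optional

# Phrases taken from the list above; matching on lowercased text is an assumption.
BANNED_PHRASES = (
    "javascript", "lorum ipsum", "terms of use", "privacy policy",
    "cookie policy", "uses cookies", "use of cookies", "use cookies",
    "elementen ontbreken", "deze printversie",
)
# Hypothetical placeholder for the selection of bad words.
BAD_WORDS = frozenset({"badword"})


def clean_document(text: str) -> Optional[str]:
    """Return the cleaned document, or None if the document is dropped."""
    lowered = text.lower()
    if any(phrase in lowered for phrase in BANNED_PHRASES):
        return None
    if BAD_WORDS & set(re.findall(r"\w+", lowered)):
        return None
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = []
    for sentence in sentences:
        words = sentence.split()
        if len(words) < 3:
            continue  # sentence with fewer than 3 words
        if any(len(word) > 1000 for word in words):
            continue  # sentence containing a very long word
        kept.append(sentence)
    if len(kept) < 5:
        return None  # document with fewer than 5 remaining sentences
    return " ".join(kept)


print(clean_document("Ja. " + "Dit is een zin. " * 5))
```

In this sketch the short sentence "Ja." is dropped while the document itself survives, because five sentences remain after filtering.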

The Dutch and English models are trained on a 50/50% mix of Dutch mC4 and English C4.

Models

Three types of models have been trained. t5-base-dutch is the only model with an original T5 config. The other model types, t5-v1.1 and t5-eff, use gated-gelu instead of relu as the activation function and were trained with a dropout of 0.0 unless training would diverge (t5-v1.1-large-dutch-cased). The t5-eff models differ mostly in their number of layers. The table below lists the dimensions of these models. Note that "efficient" is a misnomer for models with few layers: t5-xl-4L-dutch-english-cased, for example, is not efficient and is one of the worst models on downstream summarization.

| | t5-base-dutch | t5-v1.1-base-dutch-uncased | t5-v1.1-base-dutch-cased | t5-v1.1-large-dutch-cased | t5-v1_1-base-dutch-english-cased | t5-v1_1-base-dutch-english-cased-1024 | t5-small-24L-dutch-english | t5-xl-4L-dutch-english-cased | t5-base-36L-dutch-english-cased | t5-eff-xl-8l-dutch-english-cased | t5-eff-large-8l-dutch-english-cased |
|---|---|---|---|---|---|---|---|---|---|---|---|
| type | t5 | t5-v1.1 | t5-v1.1 | t5-v1.1 | t5-v1.1 | t5-v1.1 | t5 eff | t5 eff | t5 eff | t5 eff | t5 eff |
| d_model | 768 | 768 | 768 | 1024 | 768 | 768 | 512 | 2048 | 768 | 1024 | 1024 |
| d_ff | 3072 | 2048 | 2048 | 2816 | 2048 | 2048 | 1920 | 5120 | 2560 | 16384 | 4096 |
| num_heads | 12 | 12 | 12 | 16 | 12 | 12 | 8 | 32 | 12 | 32 | 16 |
| d_kv | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 128 | 64 |
| num_layers | 12 | 12 | 12 | 24 | 12 | 12 | 24 | 4 | 36 | 8 | 8 |
| num parameters | 223M | 248M | 248M | 783M | 248M | 248M | 250M | 585M | 729M | 1241M | 335M |
| feed_forward_proj | relu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu |
| dropout | 0.1 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 |
| dataset | mc4_nl_cleaned | mc4_nl_cleaned full | mc4_nl_cleaned full | mc4_nl_cleaned | mc4_nl_cleaned small_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl |
| tr. seq len | 512 | 1024 | 1024 | 512 | 512 | 1024 | 512 | 512 | 512 | 512 | 512 |
| batch size | 128 | 64 | 64 | 64 | 128 | 64 | 128 | 512 | 512 | 64 | 128 |
| total steps | 527500 | 1014525 | 1210154 | 2427498 | 2839630 | 1520k/3397024 | 851852 | 212963 | 212963 | 538k/1703705 | 851850 |
| epochs | 1 | 2 | 2 | 2 | 10 | 4 | 1 | 1 | 1 | 1 | 1 |
| duration | 2d9h | 5d5h | 6d6h | 8d13h | 11d18h | 9d1h | 4d10h | 6d1h | 17d15h | 4d19h | 3d23h |
| optimizer | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor |
| lr | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.009 | 0.005 | 0.005 |
| warmup | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 5000.0 | 20000.0 | 2500.0 | 1000.0 | 1500.0 | 1500.0 |
| eval loss | 1.38 | 1.20 | 0.96 | 1.07 | 1.11 | 1.13 | 1.18 | 1.27 | 1.05 | 1.3019 | 1.15 |
| eval acc | 0.70 | 0.73 | 0.78 | 0.76 | 0.75 | 0.74 | 0.74 | 0.72 | 0.76 | 0.71 | 0.74 |
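
To see how the dimensions in the table relate to the parameter counts, the following sketch estimates the size of a T5 model from its config values. It rests on assumptions that are not stated in this card: the layer layout of the Hugging Face T5 implementation (no bias terms, weight-only layer norms, 32 relative attention buckets in the first layer of each stack), equal encoder and decoder depth, an embedding matrix of 32103 rows, and untied input/output embeddings for the gated variants. The function name `t5_param_count` is made up for this example.

```python
def t5_param_count(d_model, d_ff, num_heads, d_kv, num_layers,
                   vocab_size=32103, gated=True, tied_embeddings=False,
                   num_buckets=32):
    """Estimate the parameter count of a T5 model (assumed layout, see text)."""
    inner = num_heads * d_kv
    attention = 4 * d_model * inner                       # q, k, v, o projections
    feed_forward = (3 if gated else 2) * d_model * d_ff   # gated FF has two input projections
    encoder_layer = attention + feed_forward + 2 * d_model      # plus 2 layer norms
    decoder_layer = 2 * attention + feed_forward + 3 * d_model  # self + cross attention, 3 layer norms
    per_stack_extra = num_buckets * num_heads + d_model   # relative attention bias + final layer norm
    embeddings = vocab_size * d_model * (1 if tied_embeddings else 2)
    return num_layers * (encoder_layer + decoder_layer) + 2 * per_stack_extra + embeddings


# t5-small-24L-dutch-english and t5-base-36L-dutch-english-cased
print(t5_param_count(512, 1920, 8, 64, 24))
print(t5_param_count(768, 2560, 12, 64, 36))
# t5-base-dutch: original T5 config with relu and tied embeddings (assumption)
print(round(t5_param_count(768, 3072, 12, 64, 12, gated=False, tied_embeddings=True) / 1e6))
```

Under these assumptions the first two calls give 249991680 and 728928000, the exact num_parameters values reported for the small 24L and base 36L models in the translation table, and the third rounds to 223M. The estimate does not hold for every column: t5-eff-xl-8l-dutch-english-cased comes out larger than the listed 1241M, so at least one assumption does not apply to that model.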

Evaluation on summarization

The models below have been evaluated on the summarization downstream task on 50K samples from the CNN Dailymail dataset. All models were fine-tuned with the AdamW optimizer with a batch size of 128 and constant learning rate of 1e-3 after a warmup of 64 steps, with a label smoothing factor of 0.05. Article and summary token lengths were set to 1024 and 142.

| | t5-base-dutch | t5-v1.1-base-dutch-uncased | t5-v1.1-base-dutch-cased | t5-v1_1-base-dutch-english-cased | t5-v1_1-base-dutch-english-cased-1024 | t5-small-24L-dutch-english | t5-xl-4L-dutch-english-cased | t5-base-36L-dutch-english-cased | t5-eff-large-8l-dutch-english-cased | mt5-base |
|---|---|---|---|---|---|---|---|---|---|---|
| rouge1 | 33.0313 | 33.8432 | 34.0906 | 33.1116 | 34.6465 | 34.376 | 30.8983 | 35.0931 | 33.9293 | 33.6466 |
| rouge2 | 12.9452 | 13.7706 | 13.6203 | 13.275 | 13.8525 | 13.8939 | 11.6005 | 14.3823 | 13.6274 | 13.1085 |
| rougeL | 23.7204 | 24.5642 | 24.7304 | 24.3561 | 24.721 | 25.2496 | 22.6536 | 25.3213 | 24.5595 | 23.909 |
| rougeLsum | 29.842 | 30.7783 | 31.1438 | 30.0548 | 31.6104 | 31.3838 | 27.8467 | 32.3526 | 30.952 | 30.5054 |
| gen_len | 90.488 | 91.832 | 92.122 | 89.583 | 98.333 | 90.442 | 92.342 | 96.832 | 95.057 | 96.312 |
| num parameters | 223M | 248M | 248M | 248M | 248M | 250M | 585M | 729M | 335M | 582M |
| samples_per_second | 3.195 | 3.039 | 3.0 | 3.216 | 2.974 | 1.594 | 2.47 | 0.623 | 3.087 | 1.201 |
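
The fine-tuning schedule described in this section (a constant learning rate of 1e-3 after 64 warmup steps) can be written as a small function. The shape of the warmup is not stated here, so the linear ramp from zero is an assumption, and `learning_rate` is an illustrative name rather than part of any training script.

```python
def learning_rate(step: int, peak: float = 1e-3, warmup_steps: int = 64) -> float:
    """Learning rate at a given optimizer step.

    Assumption: linear warmup from 0 to `peak`, then constant.
    """
    if step < warmup_steps:
        return peak * step / warmup_steps
    return peak


for step in (0, 32, 64, 1000):
    print(step, learning_rate(step))
```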

Translation models

The small 24L and base 36L models have been fine-tuned for translation on the CCMatrix dataset only. The models named *-multi support both directions of translation. Because CCMatrix is a very large dataset with over 100M Dutch-English sentence pairs, the models are trained on only a fraction of it; the table below lists the total steps and duration for each model. Evaluation is performed on a CCMatrix section that was not used for training, and also on Tatoeba and Opus Books. The _bp rows list the brevity penalty. The avg_bleu score is the BLEU score averaged over all three evaluation datasets.

The translation metrics are listed in the table below:

| | t5-base-36L-ccmatrix-en-nl | t5-base-36L-ccmatrix-multi | t5-base-36L-ccmatrix-multi | t5-small-24L-ccmatrix-multi | t5-small-24L-ccmatrix-multi |
|---|---|---|---|---|---|
| id | 0 | 14 | 15 | 16 | 20 |
| source_lang | en | en | nl | en | nl |
| target_lang | nl | nl | en | nl | en |
| source_prefix | translate English to Dutch: | translate English to Dutch: | translate Dutch to English: | translate English to Dutch: | translate Dutch to English: |
| tatoeba_bp | 0.9897614370103832 | 0.9736173618072754 | 0.943521164106552 | 0.9760983304454847 | 0.9406676405486575 |
| ccmatrix_bp | 0.9590750786190209 | 0.9536276245543676 | 0.9635673583308255 | 0.9517934939463099 | 0.9585648049711814 |
| opus_books_bp | 0.7478011343203491 | 0.7950194726093107 | 0.9362852511299413 | 0.770498474692027 | 0.8870675076932444 |
| tatoeba_score | 50.63006965176505 | 46.580601850286214 | 52.82030981131822 | 46.419809813946046 | 51.67887417355214 |
| ccmatrix_score | 60.33227938980884 | 56.81297258845844 | 62.836646082246254 | 57.404319674892406 | 63.08633155239932 |
| opus_books_score | 10.405013868050663 | 13.477997378535864 | 24.93113308798125 | 12.927244801365507 | 23.418552148252047 |
| avg_bleu | 40.455787636541515 | 38.95719060576017 | 46.86269632718191 | 38.91712476340132 | 46.0612526247345 |
| total steps | 78125 | 390625 | 390625 | 390625 | 390625 |
| duration | 14h | 101h | 101h | 74h | 74h |
| num_parameters | 728928000 | 728928000 | 728928000 | 249991680 | 249991680 |
| label_smoothing_factor | 0.09 | 0.15 | 0.15 | 0.1 | 0.1 |
| learning_rate | 0.0001 | 5e-05 | 5e-05 | 0.0005 | 0.0005 |
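
The two derived quantities in the table can be checked with a few lines of Python. The sketch below recomputes avg_bleu for the first two columns as the plain mean of the three dataset scores, and shows the standard BLEU brevity penalty formula. The function names are illustrative, and the sentence lengths passed to `brevity_penalty` are made-up inputs, not values from the evaluation.

```python
import math


def avg_bleu(tatoeba: float, ccmatrix: float, opus_books: float) -> float:
    """Mean BLEU over the three evaluation datasets."""
    return (tatoeba + ccmatrix + opus_books) / 3


def brevity_penalty(hyp_len: int, ref_len: int) -> float:
    """Standard BLEU brevity penalty: 1 if the hypothesis is longer than the reference."""
    if hyp_len == 0:
        return 0.0
    if hyp_len > ref_len:
        return 1.0
    return math.exp(1 - ref_len / hyp_len)


# Scores copied from the table: id 0 (en-nl) and id 14 (multi, en to nl).
print(avg_bleu(50.63006965176505, 60.33227938980884, 10.405013868050663))
print(avg_bleu(46.580601850286214, 56.81297258845844, 13.477997378535864))
# Hypothetical lengths: a hypothesis shorter than its reference is penalized.
print(brevity_penalty(hyp_len=90, ref_len=100))
```

The recomputed means agree with the avg_bleu row up to floating-point rounding. A brevity penalty well below 1, as in the opus_books_bp row for the English to Dutch models, indicates translations that are shorter than the references.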

Acknowledgements

This project would not have been possible without compute generously provided by Google through the TPU Research Cloud. The Hugging Face 🤗 ecosystem was also instrumental in all parts of the training. Logging metrics to Weights & Biases made it possible to keep track of many models and orchestrate hyper-parameter sweeps with insightful visualizations. I cannot imagine how I would have completed this project otherwise. The following repositories were helpful in setting up the TPU-VM and in getting an idea of sensible hyper-parameters for training gpt2 from scratch.

Created by Yeb Havinga