yhavinga/t5-base-dutch

@@ -4,34 +4,58 @@ language:
 datasets:
 - yhavinga/mc4_nl_cleaned
 tags:
 - seq2seq
-- lm-head
-license: apache-2.0
 inference: false
 ---
-# T5-base pre-trained on cleaned Dutch mC4 🇳🇱
-A [T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) v1.0 base model pre-trained from scratch on [Dutch mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned).
-* NB! The model [yhavinga/t5-v1.1-base-dutch-cased](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-cased) has a higher accuracy.
-* This model and the [flax-community/t5-base-dutch model](https://huggingface.co/flax-community/t5-base-dutch) now have the same latest checkpoint with accuracy 0.70 and loss 1,38 on the validation split.
 * Pre-trained T5 models need to be finetuned before they can be used for downstream tasks, therefore the inference widget on the right has been turned off.
-* For a fine-tuned version for summarization, see [yhavinga/t5-v1.1-base-dutch-cnn-test](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-cnn-test).
 * For a demo of the Dutch CNN summarization models, head over to the Hugging Face Spaces for
 the **[Netherformer 📰](https://huggingface.co/spaces/flax-community/netherformer)** example application!
-* T5 paper: [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf)
 ![model image](https://camo.githubusercontent.com/623b4dea0b653f2ad3f36c71ebfe749a677ac0a1/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f343030362f312a44304a31674e51663876727255704b657944387750412e706e67)
-## Tokenizer
-* Tokenizer trained from scratch for Dutch on mC4 nl cleaned with scripts from the Huggingface
-  Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling).
 ## Dataset
-All models listed below are trained on of the `full` configuration (39B tokens) of
 [cleaned Dutch mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned),
 which is the original mC4, except
@@ -42,42 +66,99 @@ which is the original mC4, except
   * Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies",
     "use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed.
-## Models
-* The first model, `t5-base-dutch` is a re-training of the Dutch T5 base v1.0 model trained during the Flax/Jax community
-  week. With training complete, accuracy was improved from 0,64 to 0,70.
-* The second two models are a uncased and cased version of `t5-v1.1-base`, again pre-trained from scratch on Dutch,
-  with a tokenizer also trained from scratch. The t5 v1.1 models are slightly different from the t5 models, and the
-  base models are trained with a dropout of 0.0. For fine-tuning it is intended to set this back to 0.1.
-* The large cased model is a pre-trained Dutch version of `t5-v1.1-large`. Training of t5-v1.1-large proved difficult.
-  Without dropout regularization, the training would diverge at a certain point. With dropout training went better,
-  be it much slower than training the t5-model. At some point convergance was too slow to warrant further training.
-  The latest checkpoint, training scripts and metrics are available for reference. For actual fine-tuning the cased
-  base model is probably the better choice.
-|                            | model   | train seq len | acc      | loss     | batch size | epochs | steps   | dropout | optim     | lr   | duration |
-|----------------------------|---------|---------------|----------|----------|------------|--------|---------|---------|-----------|------|----------|
-| t5-base-dutch              | T5      | 512           | 0,70     | 1,38     | 128        | 1      | 528481  | 0.1     | adafactor | 5e-3 | 2d 9h    |
-| t5-v1.1-base-dutch-uncased | t5-v1.1 | 1024          | 0,73     | 1,20     | 64         | 2      | 1014525 | 0.0     | adafactor | 5e-3 | 5d 5h    |
-| t5-v1.1-base-dutch-cased   | t5-v1.1 | 1024          | **0,78** | **0,96** | 64         | 2      | 1210000 | 0.0     | adafactor | 5e-3 | 6d 6h    |
-| t5-v1.1-large-dutch-cased  | t5-v1.1 | 512           | 0,76     | 1,07     | 64         | 1      | 1120000 | 0.1     | adafactor | 5e-3 | 86 13h   |
-The cased t5-v1.1 Dutch models were fine-tuned on summarizing the CNN Daily Mail dataset.
-|                              | model   | input len | target len | Rouge1 | Rouge2 | RougeL | RougeLsum | Test Gen Len | epochs | batch size | steps | duration |
-|------------------------------|---------|-----------|------------|--------|--------|--------|-----------|--------------|--------|------------|-------|----------|
-| t5-v1.1-base-dutch-cnn-test  | t5-v1.1 | 1024      | 96         | 34,8   | 13,6   | 25,2   | 32,1      | 79           | 6      | 64         | 26916 | 2h 40m   |
-| t5-v1.1-large-dutch-cnn-test | t5-v1.1 | 1024      | 96         | 34,4   | 13,6   | 25,3   | 31,7      | 81           | 5      | 16         | 89720 | 11h      |
 ## Acknowledgements
 This project would not have been possible without compute generously provided by Google through the
-[TPU Research Cloud](https://sites.research.google/trc/). The HuggingFace 🤗 ecosystem was also
-instrumental in many, if not all parts of the training. The following repositories where helpful in setting up the TPU-VM,
 and getting an idea what sensible hyper-parameters are for training gpt2 from scratch.
 * [Gsarti's Pretrain and Fine-tune a T5 model with Flax on GCP](https://github.com/gsarti/t5-flax-gcp)
 * [Flax/Jax Community week t5-base-dutch](https://huggingface.co/flax-community/t5-base-dutch)
-Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)

 datasets:
 - yhavinga/mc4_nl_cleaned
 tags:
+- t5
 - seq2seq
 inference: false
+license: apache-2.0
 ---
+# t5-base-dutch
+Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)
+& [Dat Nguyen](https://www.linkedin.com/in/dat-nguyen-49a641138/) during the [Hugging Face community week](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104), organized by [HuggingFace](https://huggingface.co/) with TPU usage sponsored by Google, for the project [Pre-train T5 from scratch in Dutch](https://discuss.huggingface.co/t/pretrain-t5-from-scratch-in-dutch/8109).
+See also the fine-tuned [t5-base-dutch-demo](https://huggingface.co/flax-community/t5-base-dutch-demo) model,
+and the demo application **[Netherformer 📰](https://huggingface.co/spaces/flax-community/netherformer)**,
+that are based on this model.
+**5 jan 2022: Model updated. Evaluation accuracy increased from 0.64 to 0.70.**
+**11 jan 2022: See also [yhavinga/t5-v1.1-base-dutch-cased](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-cased) with eval acc 0.78**
+This **t5** model has **223M** parameters.
+It was pre-trained on the `full` config of the dataset
+`mc4_nl_cleaned` for **1** epoch, taking **2d9h**,
+with a sequence length of **512**, batch size **128** and **527500** total steps.
+Pre-training evaluation loss and accuracy are **1.38** and **0.70**.
+After fine-tuning on 25K samples of Dutch CNN summarization, the Rouge1 score is **33.0**
+(note: this evaluation model was not saved).
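As a rough sanity check on the pre-training figures above, the number of sequences and input tokens processed can be derived from the step count, batch size and sequence length. This is a minimal sketch that assumes every step consumes a full batch of sequences packed to exactly 512 input tokens, so the token figure is an upper bound rather than a measured value.

```python
# Figures quoted in the model card text above.
total_steps = 527_500
batch_size = 128
seq_len = 512

# Assumption: each step processes batch_size sequences of exactly seq_len
# input tokens (no padding), so the token count is an upper bound.
sequences_seen = total_steps * batch_size
input_tokens_seen = sequences_seen * seq_len

print(f"sequences seen: {sequences_seen:,}")
print(f"input tokens (upper bound): {input_tokens_seen / 1e9:.1f}B")
```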
 * Pre-trained T5 models need to be finetuned before they can be used for downstream tasks, therefore the inference widget on the right has been turned off.
 * For a demo of the Dutch CNN summarization models, head over to the Hugging Face Spaces for
 the **[Netherformer 📰](https://huggingface.co/spaces/flax-community/netherformer)** example application!
+Please refer to the original T5 papers and Scale Efficiently papers for more information about the T5 architecture
+and configs, though it must be noted that this model (t5-base-dutch) is unrelated to these projects and not an 'official' checkpoint.
+* **[Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf)** by *Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu*.
+* **[Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers](https://arxiv.org/abs/2109.10686)** by *Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, Donald Metzler*.
 ![model image](https://camo.githubusercontent.com/623b4dea0b653f2ad3f36c71ebfe749a677ac0a1/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f343030362f312a44304a31674e51663876727255704b657944387750412e706e67)
+## Tokenizer
+The model uses a cased SentencePiece tokenizer configured with the `Nmt, NFKC, Replace multi-space to single-space` normalizers
+and has 32003 tokens.
+It was trained on Dutch mc4 with scripts from the Huggingface Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling).
+See [tokenizer.json](./raw/main/tokenizer.json) for details.
 ## Dataset
+All models listed below are trained on
 [cleaned Dutch mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned),
 which is the original mC4, except
   * Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies",
     "use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed.
+The Dutch-English models are trained on a 50/50 mix of Dutch mC4 and English C4.
+## Models
+Three types of models have been trained. `t5-base-dutch` is the only model with an original T5 config.
+The other model types, t5-v1.1 and t5-eff, use `gated-gelu` instead of `relu` as the activation function
+and were trained with a dropout of `0.0` unless training would diverge (`t5-v1.1-large-dutch-cased`).
+The t5-eff models differ mostly in their number of layers. The table below lists
+the dimensions of all models. Note that `efficient` is a misnomer for models with few layers:
+`t5-xl-4L-dutch-english-cased`, for example, is not efficient and is one of the worst models on downstream summarization.
+|                   | t5-base-dutch   | t5-v1.1-base-dutch-uncased   | t5-v1.1-base-dutch-cased   | t5-v1.1-large-dutch-cased   | t5-v1_1-base-dutch-english-cased   | t5-v1_1-base-dutch-english-cased-1024   | t5-small-24L-dutch-english   | t5-xl-4L-dutch-english-cased   | t5-base-36L-dutch-english-cased   | t5-eff-xl-8l-dutch-english-cased   | t5-eff-large-8l-dutch-english-cased   |
+|:------------------|:----------------|:-----------------------------|:---------------------------|:----------------------------|:-----------------------------------|:----------------------------------------|:-----------------------------|:-------------------------------|:----------------------------------|:-----------------------------------|:--------------------------------------|
+| type              | t5              | t5-v1.1                      | t5-v1.1                    | t5-v1.1                     | t5-v1.1                            | t5-v1.1                                 | t5 eff                       | t5 eff                         | t5 eff                            | t5 eff                             | t5 eff                                |
+| d_model           | 768             | 768                          | 768                        | 1024                        | 768                                | 768                                     | 512                          | 2048                           | 768                               | 1024                               | 1024                                  |
+| d_ff              | 3072            | 2048                         | 2048                       | 2816                        | 2048                               | 2048                                    | 1920                         | 5120                           | 2560                              | 16384                              | 4096                                  |
+| num_heads         | 12              | 12                           | 12                         | 16                          | 12                                 | 12                                      | 8                            | 32                             | 12                                | 32                                 | 16                                    |
+| d_kv              | 64              | 64                           | 64                         | 64                          | 64                                 | 64                                      | 64                           | 64                             | 64                                | 128                                | 64                                    |
+| num_layers        | 12              | 12                           | 12                         | 24                          | 12                                 | 12                                      | 24                           | 4                              | 36                                | 8                                  | 8                                     |
+| num parameters    | 223M            | 248M                         | 248M                       | 783M                        | 248M                               | 248M                                    | 250M                         | 585M                           | 729M                              | 1241M                              | 335M                                  |
+| feed_forward_proj | relu            | gated-gelu                   | gated-gelu                 | gated-gelu                  | gated-gelu                         | gated-gelu                              | gated-gelu                   | gated-gelu                     | gated-gelu                        | gated-gelu                         | gated-gelu                            |
+| dropout           | 0.1             | 0.0                          | 0.0                        | 0.1                         | 0.0                                | 0.0                                     | 0.0                          | 0.1                            | 0.0                               | 0.0                                | 0.0                                   |
+| dataset           | mc4_nl_cleaned  | mc4_nl_cleaned full          | mc4_nl_cleaned full        | mc4_nl_cleaned              | mc4_nl_cleaned small_en_nl         | mc4_nl_cleaned large_en_nl              | mc4_nl_cleaned large_en_nl   | mc4_nl_cleaned large_en_nl     | mc4_nl_cleaned large_en_nl        | mc4_nl_cleaned large_en_nl         | mc4_nl_cleaned large_en_nl            |
+| tr. seq len       | 512             | 1024                         | 1024                       | 512                         | 512                                | 1024                                    | 512                          | 512                            | 512                               | 512                                | 512                                   |
+| batch size        | 128             | 64                           | 64                         | 64                          | 128                                | 64                                      | 128                          | 512                            | 512                               | 64                                 | 128                                   |
+| total steps       | 527500          | 1014525                      | 1210154                    | 2427498                     | 2839630                            | 1520k/3397024                           | 851852                       | 212963                         | 212963                            | 538k/1703705                       | 851850                                |
+| epochs            | 1               | 2                            | 2                          | 2                           | 10                                 | 4                                       | 1                            | 1                              | 1                                 | 1                                  | 1                                     |
+| duration          | 2d9h            | 5d5h                         | 6d6h                       | 8d13h                       | 11d18h                             | 9d1h                                    | 4d10h                        | 6d1h                           | 17d15h                            | 4d 19h                             | 3d 23h                                |
+| optimizer         | adafactor       | adafactor                    | adafactor                  | adafactor                   | adafactor                          | adafactor                               | adafactor                    | adafactor                      | adafactor                         | adafactor                          | adafactor                             |
+| lr                | 0.005           | 0.005                        | 0.005                      | 0.005                       | 0.005                              | 0.005                                   | 0.005                        | 0.005                          | 0.009                             | 0.005                              | 0.005                                 |
+| warmup            | 10000.0         | 10000.0                      | 10000.0                    | 10000.0                     | 10000.0                            | 5000.0                                  | 20000.0                      | 2500.0                         | 1000.0                            | 1500.0                             | 1500.0                                |
+| eval loss         | 1.38            | 1.20                         | 0.96                       | 1.07                        | 1.11                               | 1.13                                    | 1.18                         | 1.27                           | 1.05                              | 1.3019                             | 1.15                                  |
+| eval acc          | 0.70            | 0.73                         | 0.78                       | 0.76                        | 0.75                               | 0.74                                    | 0.74                         | 0.72                           | 0.76                              | 0.71                               | 0.74                                  |
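The `num parameters` row can be approximated from the dimension rows of the table. The sketch below is a hand-written estimate, not code from the training scripts. It assumes bias-free attention and feed-forward projections, scale-only layer norms, 32 relative attention buckets per stack, tied input/output embeddings for the original T5 config and an untied LM head for the gated variants. The vocabulary sizes are also assumptions: 32003 is the tokenizer size from the Tokenizer section (the model's embedding matrix may be padded to a different size), and 32103 is a value chosen because it reproduces the `num_parameters` figure reported for the 36-layer model in the translation table.

```python
def t5_param_count(d_model, d_ff, num_heads, d_kv, num_layers, vocab_size,
                   gated=False, tied_lm_head=True, rel_attn_buckets=32):
    """Estimate the parameter count of a T5-style encoder-decoder."""
    inner = num_heads * d_kv
    attn = 4 * d_model * inner                  # q, k, v, o projections (no biases)
    ff = (3 if gated else 2) * d_model * d_ff   # gated FF has one extra input projection
    enc_layer = attn + ff + 2 * d_model         # self-attn + FF, two layer norms
    dec_layer = 2 * attn + ff + 3 * d_model     # self-attn + cross-attn + FF, three layer norms
    # Per stack: final layer norm + relative attention bias (first layer only).
    stack_extra = d_model + rel_attn_buckets * num_heads
    embeddings = vocab_size * d_model
    lm_head = 0 if tied_lm_head else vocab_size * d_model
    return (embeddings + lm_head
            + num_layers * (enc_layer + dec_layer)
            + 2 * stack_extra)


# t5-base-dutch column; vocab_size=32003 is an assumption (tokenizer size).
base = t5_param_count(768, 3072, 12, 64, 12, vocab_size=32003)

# t5-base-36L-dutch-english-cased column; vocab_size=32103 is an assumption.
base_36l = t5_param_count(768, 2560, 12, 64, 36, vocab_size=32103,
                          gated=True, tied_lm_head=False)

print(f"t5-base-dutch: {base / 1e6:.0f}M")
print(f"t5-base-36L:   {base_36l / 1e6:.0f}M")
```

Under these assumptions the estimates round to 223M and 729M, in line with the table.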
+## Evaluation on summarization
+The models below have been evaluated on the summarization downstream task on 50K samples from the CNN Dailymail dataset.
+All models were fine-tuned with the AdamW optimizer with a batch size of 128 and constant learning rate of 1e-3 after a
+warmup of 64 steps, with a label smoothing factor of 0.05.
+Article and summary token lengths were set to 1024 and 142.
+|                    | t5-base-dutch   | t5-v1.1-base-dutch-uncased   | t5-v1.1-base-dutch-cased   | t5-v1_1-base-dutch-english-cased   | t5-v1_1-base-dutch-english-cased-1024   | t5-small-24L-dutch-english   | t5-xl-4L-dutch-english-cased   | t5-base-36L-dutch-english-cased   | t5-eff-large-8l-dutch-english-cased   | mt5-base   |
+|:-------------------|:----------------|:-----------------------------|:---------------------------|:-----------------------------------|:----------------------------------------|:-----------------------------|:-------------------------------|:----------------------------------|:--------------------------------------|:-----------|
+| rouge1             | 33.0313         | 33.8432                      | 34.0906                    | 33.1116                            | 34.6465                                 | 34.376                       | 30.8983                        | 35.0931                           | 33.9293                               | 33.6466    |
+| rouge2             | 12.9452         | 13.7706                      | 13.6203                    | 13.275                             | 13.8525                                 | 13.8939                      | 11.6005                        | 14.3823                           | 13.6274                               | 13.1085    |
+| rougeL             | 23.7204         | 24.5642                      | 24.7304                    | 24.3561                            | 24.721                                  | 25.2496                      | 22.6536                        | 25.3213                           | 24.5595                               | 23.909     |
+| rougeLsum          | 29.842          | 30.7783                      | 31.1438                    | 30.0548                            | 31.6104                                 | 31.3838                      | 27.8467                        | 32.3526                           | 30.952                                | 30.5054    |
+| gen_len            | 90.488          | 91.832                       | 92.122                     | 89.583                             | 98.333                                  | 90.442                       | 92.342                         | 96.832                            | 95.057                                | 96.312     |
+| num parameters     | 223M            | 248M                         | 248M                       | 248M                               | 248M                                    | 250M                         | 585M                           | 729M                              | 335M                                  | 582M       |
+| samples_per_second | 3.195           | 3.039                        | 3.0                        | 3.216                              | 2.974                                   | 1.594                        | 2.47                           | 0.623                             | 3.087                                 | 1.201      |
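The fine-tuning setup described above (constant learning rate of 1e-3 after 64 warmup steps, label smoothing of 0.05) can be illustrated in plain Python. This is only a sketch: the linear shape of the warmup is an assumption, since the text gives only its length, and the function names are hypothetical rather than taken from the actual fine-tuning script.

```python
def learning_rate(step, peak_lr=1e-3, warmup_steps=64):
    """Warm up to peak_lr, then stay constant.

    Assumption: the warmup is linear; the model card only states its length.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


def smoothed_targets(target_index, num_classes, factor=0.05):
    """Label-smoothed target distribution: (1 - factor) on the true class,
    with factor spread uniformly over all classes."""
    uniform = factor / num_classes
    probs = [uniform] * num_classes
    probs[target_index] += 1.0 - factor
    return probs


for step in (0, 31, 63, 1000):
    print(step, learning_rate(step))

print(smoothed_targets(target_index=2, num_classes=4))
```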
+## Translation models
+The small 24L and base 36L models have been fine-tuned for translation on the CCMatrix dataset.
+The models named `*-multi` support both directions of translation. The models are trained on CCMatrix only. As this is
+a very large dataset with over 100M Dutch-English sentence pairs, the models are trained on only a fraction of it;
+the `total steps` and `duration` rows in the table below show how long each model was trained. Evaluation is performed on a CCMatrix section not trained on,
+and also on Tatoeba and Opus Books. The `_bp` rows list the *brevity penalty*. The `avg_bleu` score is the BLEU score
+averaged over all three evaluation datasets.
+The translation metrics are listed in the table below:
+|                        | t5-base-36L-ccmatrix-en-nl   | t5-base-36L-ccmatrix-multi   | t5-base-36L-ccmatrix-multi   | t5-small-24L-ccmatrix-multi   | t5-small-24L-ccmatrix-multi   |
+|:-----------------------|:-----------------------------|:-----------------------------|:-----------------------------|:------------------------------|:------------------------------|
+| id                     | 0                            | 14                           | 15                           | 16                            | 20                            |
+| source_lang            | en                           | en                           | nl                           | en                            | nl                            |
+| target_lang            | nl                           | nl                           | en                           | nl                            | en                            |
+| source_prefix          | translate English to Dutch:  | translate English to Dutch:  | translate Dutch to English:  | translate English to Dutch:   | translate Dutch to English:   |
+| tatoeba_bp             | 0.9897614370103832           | 0.9736173618072754           | 0.943521164106552            | 0.9760983304454847            | 0.9406676405486575            |
+| ccmatrix_bp            | 0.9590750786190209           | 0.9536276245543676           | 0.9635673583308255           | 0.9517934939463099            | 0.9585648049711814            |
+| opus_books_bp          | 0.7478011343203491           | 0.7950194726093107           | 0.9362852511299413           | 0.770498474692027             | 0.8870675076932444            |
+| tatoeba_score          | 50.63006965176505            | 46.580601850286214           | 52.82030981131822            | 46.419809813946046            | 51.67887417355214             |
+| ccmatrix_score         | 60.33227938980884            | 56.81297258845844            | 62.836646082246254           | 57.404319674892406            | 63.08633155239932             |
+| opus_books_score       | 10.405013868050663           | 13.477997378535864           | 24.93113308798125            | 12.927244801365507            | 23.418552148252047            |
+| avg_bleu               | 40.455787636541515           | 38.95719060576017            | 46.86269632718191            | 38.91712476340132             | 46.0612526247345              |
+| total steps            | 78125                        | 390625                       | 390625                       | 390625                        | 390625                        |
+| duration               | 14h                          | 101h                         | 101h                         | 74h                           | 74h                           |
+| num_parameters         | 728928000                    | 728928000                    | 728928000                    | 249991680                     | 249991680                     |
+| label_smoothing_factor | 0.09                         | 0.15                         | 0.15                         | 0.1                           | 0.1                           |
+| learning_rate          | 0.0001                       | 5e-05                        | 5e-05                        | 0.0005                        | 0.0005                        |
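To make the `avg_bleu` and `_bp` rows concrete, the sketch below recomputes the average for the first column (id 0) from its three per-dataset scores and shows the standard BLEU brevity penalty formula. The helper names are hypothetical, and the brevity penalty function is the textbook definition rather than the exact code used for this evaluation.

```python
import math


def brevity_penalty(hyp_len, ref_len):
    """Standard BLEU brevity penalty: 1 when the hypothesis is longer than
    the reference, exp(1 - ref_len / hyp_len) otherwise."""
    if hyp_len == 0:
        return 0.0
    if hyp_len > ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


def average_bleu(scores):
    """Unweighted mean of per-dataset BLEU scores."""
    return sum(scores) / len(scores)


# Scores for t5-base-36L-ccmatrix-en-nl (id 0), copied from the table above.
scores_id0 = {
    "tatoeba": 50.63006965176505,
    "ccmatrix": 60.33227938980884,
    "opus_books": 10.405013868050663,
}

avg_id0 = average_bleu(list(scores_id0.values()))
print(f"avg_bleu (id 0): {avg_id0:.4f}")

# A hypothesis 10% shorter than the reference is penalised:
print(f"brevity penalty for 90 vs 100 tokens: {brevity_penalty(90, 100):.4f}")
```

A brevity penalty well below 1, as in the `opus_books_bp` row, indicates that the generated translations are on average shorter than the references for that dataset.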
 ## Acknowledgements
 This project would not have been possible without compute generously provided by Google through the
+[TPU Research Cloud](https://sites.research.google/trc/). The HuggingFace 🤗 ecosystem was also
+instrumental in all parts of the training. Logging metrics to Weights & Biases made it possible to keep track of many
+models and orchestrate hyper-parameter sweeps with insightful visualizations. I cannot imagine how I would
+have completed this project otherwise.
+The following repositories were helpful in setting up the TPU-VM,
 and getting an idea what sensible hyper-parameters are for training gpt2 from scratch.
 * [Gsarti's Pretrain and Fine-tune a T5 model with Flax on GCP](https://github.com/gsarti/t5-flax-gcp)
 * [Flax/Jax Community week t5-base-dutch](https://huggingface.co/flax-community/t5-base-dutch)
+Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)