yhavinga committed on
Commit d4f764b
1 Parent(s): 34972d6

Update README.md

Files changed (1)
  1. README.md +69 -46
README.md CHANGED
@@ -10,49 +10,72 @@ license: apache-2.0
  inference: false
  ---
 
- # Work in progress. Dec 2021.
-
- # A collection of Dutch T5 models
-
- * Many thanks to the [Google TPU Research Cloud](https://sites.research.google/trc/about/) for providing access to a TPU cluster!
- * Continuation of work started during the [Hugging Face community week](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104), organized by [HuggingFace](https://huggingface.co/) and TPU usage sponsored by Google, for the project [Pre-train T5 from scratch in Dutch](https://discuss.huggingface.co/t/pretrain-t5-from-scratch-in-dutch/8109).
- * Using improved training script - no more exceptions during training, so no restarting required.
- * All models trained with tensorflow metrics.
- * Thanks to @gsarti for creating the [t5-flax-gcp repository](https://github.com/gsarti/t5-flax-gcp)!
-
-
- | |`t5-base-dutch` |`t5-v1.1-base-dutch` |`t5-v1.1-large-dutch-cased`| `t5-v1.1-base-dutch-uncased`|
- |-----------------------|-------------------------|-------------------------|---------------------------|-----------------------------|
- |`tokenizer` |`cased` |`uncased` |`cased` |`uncased` |
- |`source model config` |`google/t5-base` |`google/t5-v1_1-base` |`google/t5-v1_1-large` |`google/t5-v1_1_base` |
- |`dataset` |`yhavinga/mc4_nl_cleaned`|`yhavinga/mc4_nl_cleaned`|`yhavinga/mc4_nl_cleaned` |`yhavinga/mc4_nl_cleaned` |
- |`tpu vm` | two | one | three | one |
- |`finished` | | YES | | |
- |*Hyperparameters* | | | | |
- |`epochs` | 1 | 1 | 4 | 2 |
- |`per-device batch size`| 16 | 16 | 2 | 8 |
- |`tot. batch size` | 128 | 128 | 16 | ? |
- |`steps` | 508 976 | 508 976 | 8 428 012 | ? |
- |`max seq. length` | 512 | 512 | 1024 | 1024 |
- |`tot. tok. trained on` | 33B | 33B | 138B | ? |
- |`optimizer` | adafactor | adafactor | adafactor | adafactor |
- |`warmup steps` | 10000 | 10000 | 10000 | 10000 |
- |`learning rate` | 0.005 | 0.005 | 0.005 | 0.005 |
- |`weigth decay` | 0.01 | 0.01 | 0.01 | 0.001 |
- |`tie embeds` |`false` |`false` |`false` |`false` |
- |`validation split size`| 15K examples | 15K examples | 15K examples | 15K examples |
- |*Model config* | | | | |
- |`d_ff` | 3072 | 2048 | 2816 | 2048 |
- |`d_kv` | 64 | 64 | 64 | 64 |
- |`d_model` | 768 | 768 | 1024 | 768 |
- |`dropout rate` | 0.1 | 0.1 | 0.1 (0.0 wh. pre-train.) | 0.1 (0.0 wh. pre-train.) |
- |`ff projection` |`relu` |`gated-gelu` |`gated-gelu` |`gated-relu` |
- |`num decoder layers` | 12 | 12 | 24 | 12 |
- |`num heads` | 12 | 12 | 16 | 12 |
- |`num layers` | 12 | 12 | 24 | 12 |
- |`rel. attn. buckets` | 32 | 32 | 32 | 32 |
- |`vocab size` | 32103 | 32103 | 32103 | 32103 |
- |*Training time* | ~ 100 hours | 101 hours | ~ 370 hours | ? |
- |*Evaluation* | | | | |
- |`accuracy` | | 0.6976 | | |
- |`loss` | | 1.379 | | |
+ # T5-base pre-trained on cleaned Dutch mC4 🇳🇱
+
+ A [T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) v1.1 base model pre-trained from scratch on [Dutch mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned).
+
+ * Pre-trained T5 models need to be fine-tuned before they can be used for downstream tasks, so the inference widget on the right has been turned off (a minimal loading sketch for fine-tuning is shown below the model image).
+ * T5 paper: [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf)
+
+ ![model image](https://camo.githubusercontent.com/623b4dea0b653f2ad3f36c71ebfe749a677ac0a1/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f343030362f312a44304a31674e51663876727255704b657944387750412e706e67)
+
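+ A minimal fine-tuning sketch with 🤗 Transformers (the model name, task prefix and example texts below are only placeholders; pick any of the checkpoints listed further down):
+
+ ```python
+ from transformers import AutoTokenizer, T5ForConditionalGeneration
+
+ model_name = "yhavinga/t5-v1.1-base-dutch-cased"  # any of the checkpoints listed below
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ # add from_flax=True if a checkpoint only ships Flax weights
+ model = T5ForConditionalGeneration.from_pretrained(model_name)
+
+ # Text-to-text fine-tuning example: the prefix and texts are made up.
+ inputs = tokenizer("vat samen: Dit is een lange voorbeeldtekst.", return_tensors="pt")
+ labels = tokenizer("Een voorbeeld.", return_tensors="pt").input_ids
+
+ # A forward pass with labels returns the cross-entropy loss to optimize.
+ loss = model(**inputs, labels=labels).loss
+ print(float(loss))
+ ```
+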
+ ## Tokenizer
+
+ * SentencePiece tokenizer trained from scratch for Dutch on mC4 nl cleaned with scripts from the Hugging Face
+ Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling).
+
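+ The tokenizer can be inspected on its own; a small sketch (the example sentence is arbitrary):
+
+ ```python
+ from transformers import AutoTokenizer
+
+ # Cased and uncased checkpoints each ship their own SentencePiece vocabulary.
+ tokenizer = AutoTokenizer.from_pretrained("yhavinga/t5-v1.1-base-dutch-cased")
+ print(tokenizer.tokenize("Appelsap is lekker."))  # SentencePiece subword pieces
+ ```
+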
+ ## Dataset
+
+ All models listed below are trained on the `full` configuration (39B tokens) of
+ [cleaned Dutch mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned),
+ which is the original mC4, except that
+
+ * Documents that contained words from a selection of the Dutch and English [List of Dirty, Naughty, Obscene, and Otherwise Bad Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) are removed
+ * Sentences with fewer than 3 words are removed
+ * Sentences with a word of more than 1000 characters are removed
+ * Documents with fewer than 5 sentences are removed
+ * Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies",
+ "use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed.
+
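+ The cleaned dataset can be streamed with 🤗 Datasets; a minimal sketch (the `full` configuration name comes from the dataset card, and the `text` field is assumed to match the original mC4 layout):
+
+ ```python
+ from itertools import islice
+ from datasets import load_dataset
+
+ # Stream the 'full' configuration so the 39B-token corpus is not downloaded up front.
+ dataset = load_dataset("yhavinga/mc4_nl_cleaned", "full", split="train", streaming=True)
+ for example in islice(dataset, 2):
+     print(example["text"][:200])  # first 200 characters of each document
+ ```
+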
+ ## Models
+
+ TL;DR: [yhavinga/t5-v1.1-base-dutch-cased](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-cased) is the best model.
+
+ * `yhavinga/t5-base-dutch` is a re-training of the Dutch T5 base v1.0 model trained during the summer 2021
+ Flax/JAX community week. Accuracy was improved from 0.64 to 0.70.
+ * The two T5 v1.1 base models are an uncased and a cased version of `t5-v1.1-base`, again pre-trained from scratch on Dutch,
+ with a tokenizer also trained from scratch. The T5 v1.1 models are slightly different from the original T5 models, and the
+ base models are trained with a dropout of 0.0. For fine-tuning, it is intended to set this back to 0.1 (see the sketch after this list).
+ * The large cased model is a pre-trained Dutch version of `t5-v1.1-large`. Training of t5-v1.1-large proved difficult.
+ Without dropout regularization, the training would diverge at a certain point. With dropout, training went better,
+ albeit much slower than training the T5 model. At some point convergence was too slow to warrant further training.
+ The latest checkpoint, training scripts and metrics are available for reference. For actual fine-tuning, the cased
+ base model is probably the better choice.
+
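+ A minimal sketch of restoring dropout for fine-tuning (the 0.1 value comes from the note above; `dropout_rate` is the corresponding `T5Config` attribute):
+
+ ```python
+ from transformers import T5ForConditionalGeneration
+
+ # Override the pre-training dropout of 0.0 when loading a v1.1 base checkpoint.
+ model = T5ForConditionalGeneration.from_pretrained(
+     "yhavinga/t5-v1.1-base-dutch-cased", dropout_rate=0.1
+ )
+ print(model.config.dropout_rate)  # 0.1
+ ```
+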
+ | | model | train seq len | acc | loss | batch size | epochs | steps | dropout | optim | lr | duration |
+ |---------------------------------------------------------------------------------------------------|---------|---------------|----------|----------|------------|--------|---------|---------|-----------|------|----------|
+ | [yhavinga/t5-base-dutch](https://huggingface.co/yhavinga/t5-base-dutch) | T5 | 512 | 0.70 | 1.38 | 128 | 1 | 528481 | 0.1 | adafactor | 5e-3 | 2d 9h |
+ | [yhavinga/t5-v1.1-base-dutch-uncased](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-uncased) | t5-v1.1 | 1024 | 0.73 | 1.20 | 64 | 2 | 1014525 | 0.0 | adafactor | 5e-3 | 5d 5h |
+ | [yhavinga/t5-v1.1-base-dutch-cased](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-cased) | t5-v1.1 | 1024 | **0.78** | **0.96** | 64 | 2 | 1210000 | 0.0 | adafactor | 5e-3 | 6d 6h |
+ | [yhavinga/t5-v1.1-large-dutch-cased](https://huggingface.co/yhavinga/t5-v1.1-large-dutch-cased) | t5-v1.1 | 512 | 0.76 | 1.07 | 64 | 1 | 1120000 | 0.1 | adafactor | 5e-3 | 86 13h |
+
+ The cased t5-v1.1 Dutch models were fine-tuned on summarizing the CNN Daily Mail dataset.
+
+ | | model | input len | target len | Rouge1 | Rouge2 | RougeL | RougeLsum | Test Gen Len | epochs | batch size | steps | duration |
+ |-------------------------------------------------------------------------------------------------------|---------|-----------|------------|--------|--------|--------|-----------|--------------|--------|------------|-------|----------|
+ | [yhavinga/t5-v1.1-base-dutch-cnn-test](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-cnn-test) | t5-v1.1 | 1024 | 96 | 34.8 | 13.6 | 25.2 | 32.1 | 79 | 6 | 64 | 26916 | 2h 40m |
+ | [yhavinga/t5-v1.1-large-dutch-cnn-test](https://huggingface.co/yhavinga/t5-v1.1-large-dutch-cnn-test) | t5-v1.1 | 1024 | 96 | 34.4 | 13.6 | 25.3 | 31.7 | 81 | 5 | 16 | 89720 | 11h |
+
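+ A minimal generation sketch for the summarization checkpoints (the example text and generation settings are placeholders; input and target lengths follow the table above):
+
+ ```python
+ from transformers import AutoTokenizer, T5ForConditionalGeneration
+
+ model_name = "yhavinga/t5-v1.1-base-dutch-cnn-test"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = T5ForConditionalGeneration.from_pretrained(model_name)
+
+ article = "Hier komt een lang Nederlands nieuwsartikel ..."  # placeholder text
+ inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt")
+ summary_ids = model.generate(**inputs, max_length=96, num_beams=4)
+ print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
+ ```
+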
+ ## Acknowledgements
+
+ This project would not have been possible without compute generously provided by Google through the
+ [TPU Research Cloud](https://sites.research.google/trc/). The Hugging Face 🤗 ecosystem was also
+ instrumental in many, if not all, parts of the training. The following repositories were helpful in setting up the TPU-VM
+ and training the models:
+
+ * [Gsarti's Pretrain and Fine-tune a T5 model with Flax on GCP](https://github.com/gsarti/t5-flax-gcp)
+ * [Hugging Face Flax MLM examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)
+ * [Flax/JAX Community week t5-base-dutch](https://huggingface.co/flax-community/t5-base-dutch)
+
+ Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)