yhavinga committed on
Commit de550c0 • 1 Parent(s): e9fb21a

Update README.md

Files changed (1)
  1. README.md +61 -8
README.md CHANGED
@@ -10,18 +10,71 @@ license: apache-2.0
- # Work in progress. Jan 2022
- These models need to be finetuned, therefore the inference widget on the right has been turned off.
- # A collection of Dutch T5 models
- * Many thanks to the [Google TPU Research Cloud](https://sites.research.google/trc/about/) for providing access to a TPU cluster!
- * Continuation of work started during the [Hugging Face community week](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104), organized by [HuggingFace](https://huggingface.co/) and TPU usage sponsored by Google, for the project [Pre-train T5 from scratch in Dutch](https://discuss.huggingface.co/t/pretrain-t5-from-scratch-in-dutch/8109).
- * Using improved training script - no more exceptions during training, so no restarting required.
- * All models trained with tensorboard metrics.
- * Thanks to @gsarti for creating the [t5-flax-gcp repository](https://github.com/gsarti/t5-flax-gcp)!
  inference: false
  ---

+ # T5-base pre-trained on cleaned Dutch mC4 🇳🇱

+ A T5-base model trained from scratch on Dutch.
  This model is a re-training of the original [t5-base-dutch](https://huggingface.co/flax-community/t5-base-dutch) model that was trained during the summer 2021 HuggingFace Flax/Jax community week. The original training suffered from errors and did not run on the full dataset. These errors have been fixed and training has now been completed. Both this model and the flax-community t5-base-dutch model now share the same latest checkpoint, with an accuracy of 0.70 and a loss of 1.38 on the validation split.

  NB! Consider using [yhavinga/t5-v1.1-base-dutch-cased](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-cased), which reaches an accuracy of 0.78 and a loss of 0.96 on the validation split.

+ These models need to be fine-tuned, so the inference widget on the right has been turned off. For a demo of the Dutch CNN summarization models, head over to the
+ Hugging Face Spaces for the **[Netherformer 📰](https://huggingface.co/spaces/flax-community/netherformer)** example application!

+ ## Tokenizer
+
+ * Tokenizer trained from scratch for Dutch on the cleaned Dutch mC4 dataset with scripts from the Hugging Face
+ Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling).
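A trained tokenizer splits Dutch text into subword units from a learned vocabulary. As a rough illustration only — the real tokenizer is a SentencePiece model with a vocabulary learned from the corpus, and the tiny vocabulary below is invented for this sketch — here is a greedy longest-match segmentation:

```python
# Toy greedy longest-match subword segmentation. Illustrative only: the real
# tokenizer is a SentencePiece model trained on cleaned Dutch mC4, and this
# tiny vocabulary is invented for the example.
TOY_VOCAB = {"fiets", "fietsen", "band", "en", "wiel", "de"}

def toy_segment(word: str, vocab: set) -> list:
    """Split `word` into the longest matching vocabulary pieces, left to right."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest possible piece first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Character not covered by the vocabulary: emit it on its own.
            pieces.append(word[i])
            i += 1
    return pieces

print(toy_segment("fietsenwiel", TOY_VOCAB))  # → ['fietsen', 'wiel']
```

The real SentencePiece model scores alternative segmentations against learned piece probabilities instead of always taking the longest match.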
+
+ ## Dataset
+
+ All models listed below are trained on the `full` configuration (39B tokens) of
+ [cleaned Dutch mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned),
+ which is the original mC4, except that
+
+ * Documents that contained words from a selection of the Dutch and English [List of Dirty Naughty Obscene and Otherwise Bad Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) are removed
+ * Sentences with fewer than 3 words are removed
+ * Sentences containing a word of more than 1000 characters are removed
+ * Documents with fewer than 5 sentences are removed
+ * Documents containing the phrases "javascript", "lorem ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies",
+ "use of cookies", "use cookies", "elementen ontbreken" or "deze printversie" are removed
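The document-level rules above can be sketched in plain Python. This is an illustrative re-implementation, not the actual cleaning script behind `mc4_nl_cleaned` — details such as sentence splitting and the bad-word lists differ:

```python
# Illustrative sketch of the cleaning rules described above; the real scripts
# for mc4_nl_cleaned differ in details. BADWORDS is a stand-in for the
# Dutch/English LDNOOBW lists.
import re

BADWORDS = {"badword1", "badword2"}  # stand-in for the real LDNOOBW lists
BOILERPLATE = ("javascript", "lorem ipsum", "terms of use", "privacy policy",
               "cookie policy", "uses cookies", "use of cookies", "use cookies",
               "elementen ontbreken", "deze printversie")

def keep_document(text: str) -> bool:
    lowered = text.lower()
    # Drop documents containing listed bad words or boilerplate phrases.
    if set(re.findall(r"\w+", lowered)) & BADWORDS:
        return False
    if any(phrase in lowered for phrase in BOILERPLATE):
        return False
    # Naive sentence split; the real pipeline is more careful.
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    # Drop sentences that are too short or contain absurdly long "words".
    sentences = [s for s in sentences
                 if len(s.split()) >= 3 and max(len(w) for w in s.split()) <= 1000]
    # Keep only documents that still have at least 5 sentences.
    return len(sentences) >= 5

print(keep_document("Dit is een zin. " * 6))  # → True
```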
+
+ ## Models
+
+ * The first model, `t5-base-dutch`, is a re-training of the Dutch T5 base v1.0 model trained during the Flax/Jax community
+ week. With training complete, accuracy improved from 0.64 to 0.70.
+ * The next two models are uncased and cased versions of `t5-v1.1-base`, again pre-trained from scratch on Dutch,
+ with a tokenizer also trained from scratch. The t5 v1.1 models are slightly different from the t5 models, and the
+ base models are trained with a dropout of 0.0. For fine-tuning, this should be set back to 0.1.
+ * The large cased model is a pre-trained Dutch version of `t5-v1.1-large`. Training of t5-v1.1-large proved difficult.
+ Without dropout regularization, training would diverge at a certain point. With dropout, training went better,
+ albeit much slower than training the t5 model. At some point convergence was too slow to warrant further training.
+ The latest checkpoint, training scripts and metrics are available for reference. For actual fine-tuning the cased
+ base model is probably the better choice.
+
+ |                            | model   | train seq len | acc      | loss     | batch size | epochs | steps   | dropout | optim     | lr   | duration |
+ |----------------------------|---------|---------------|----------|----------|------------|--------|---------|---------|-----------|------|----------|
+ | t5-base-dutch              | T5      | 512           | 0.70     | 1.38     | 128        | 1      | 528481  | 0.1     | adafactor | 5e-3 | 2d 9h    |
+ | t5-v1.1-base-dutch-uncased | t5-v1.1 | 1024          | 0.73     | 1.20     | 64         | 2      | 1014525 | 0.0     | adafactor | 5e-3 | 5d 5h    |
+ | t5-v1.1-base-dutch-cased   | t5-v1.1 | 1024          | **0.78** | **0.96** | 64         | 2      | 1210000 | 0.0     | adafactor | 5e-3 | 6d 6h    |
+ | t5-v1.1-large-dutch-cased  | t5-v1.1 | 512           | 0.76     | 1.07     | 64         | 1      | 1120000 | 0.1     | adafactor | 5e-3 | 8d 13h   |
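All of these models are pre-trained with T5's span-corruption objective: random spans of the input are replaced by sentinel tokens, and the model learns to generate the dropped spans. A minimal illustration — the spans are hard-coded here for clarity, whereas real pre-training samples them randomly, and real T5 also appends a closing sentinel to the target:

```python
# Minimal illustration of T5's span-corruption pre-training objective.
# Spans are hard-coded for clarity; pre-training samples them randomly.
def corrupt(tokens: list, spans: list):
    """Replace each (start, end) span with a sentinel and build the target."""
    inputs, targets = [], []
    prev = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(tokens[prev:start])   # keep text before the span
        inputs.append(sentinel)             # mask the span with a sentinel
        targets.append(sentinel)            # target: sentinel + dropped tokens
        targets.extend(tokens[start:end])
        prev = end
    inputs.extend(tokens[prev:])
    return inputs, targets

tokens = "de fiets staat in de schuur".split()
inputs, targets = corrupt(tokens, [(1, 2), (4, 6)])
print(inputs)   # → ['de', '<extra_id_0>', 'staat', 'in', '<extra_id_1>']
print(targets)  # → ['<extra_id_0>', 'fiets', '<extra_id_1>', 'de', 'schuur']
```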
+
+ The cased t5-v1.1 Dutch models were fine-tuned on summarization of the CNN Daily Mail dataset.
+
+ |                              | model   | input len | target len | Rouge1 | Rouge2 | RougeL | RougeLsum | Test Gen Len | epochs | batch size | steps | duration |
+ |------------------------------|---------|-----------|------------|--------|--------|--------|-----------|--------------|--------|------------|-------|----------|
+ | t5-v1.1-base-dutch-cnn-test  | t5-v1.1 | 1024      | 96         | 34.8   | 13.6   | 25.2   | 32.1      | 79           | 6      | 64         | 26916 | 2h 40m   |
+ | t5-v1.1-large-dutch-cnn-test | t5-v1.1 | 1024      | 96         | 34.4   | 13.6   | 25.3   | 31.7      | 81           | 5      | 16         | 89720 | 11h      |
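The Rouge columns measure n-gram overlap between generated and reference summaries. As a bare-bones illustration of what ROUGE-1 F1 computes — the scores in the table come from a standard ROUGE package with its own tokenization, stemming and test-set aggregation, not from this sketch:

```python
# Bare-bones ROUGE-1 F1: clipped unigram overlap between a candidate summary
# and a reference. Illustration only; reported scores use a standard ROUGE
# implementation with its own tokenization and aggregation.
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("de kat zit op de mat", "de kat slaapt op de mat"), 2))  # → 0.83
```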
+
+ ## Acknowledgements
+
+ This project would not have been possible without compute generously provided by Google through the
+ [TPU Research Cloud](https://sites.research.google/trc/). The Hugging Face 🤗 ecosystem was also
+ instrumental in many, if not all, parts of the training. The following repositories were helpful in setting up the TPU-VM
+ and in getting an idea of sensible hyper-parameters for training t5 from scratch.
+
+ * [Gsarti's Pretrain and Fine-tune a T5 model with Flax on GCP](https://github.com/gsarti/t5-flax-gcp)
+ * [Flax/Jax Community week t5-base-dutch](https://huggingface.co/flax-community/t5-base-dutch)
+
+ Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)