---
language:
- nl
datasets:
- yhavinga/mc4_nl_cleaned
tags:
- seq2seq
- lm-head
license: apache-2.0
inference: false
---
12
+
13
+ # Work in progress. Dec 2021.
14
+
15
+ # A collection of Dutch T5 models
16
+
17
+ * Many thanks to the [Google TPU Research Cloud](https://sites.research.google/trc/about/) for providing access to a TPU cluster!
18
+ * Continuation of work started during the [Hugging Face community week](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104), organized by [HuggingFace](https://huggingface.co/) and TPU usage sponsored by Google, for the project [Pre-train T5 from scratch in Dutch](https://discuss.huggingface.co/t/pretrain-t5-from-scratch-in-dutch/8109).
19
+ * Using improved training script - no more exceptions during training, so no restarting required.
20
+ * All models trained with tensorflow metrics.
21
+ * Thanks to @gsarti for creating the [t5-flax-gcp repository](https://github.com/gsarti/t5-flax-gcp)!
22
+
23
+
| |`t5-base-dutch` |`t5-v1.1-base-dutch` |`t5-v1.1-large-dutch-cased`| `t5-v1.1-base-dutch-uncased`|
|-----------------------|-------------------------|-------------------------|---------------------------|-----------------------------|
|`tokenizer` |`cased` |`uncased` |`cased` |`uncased` |
|`source model config` |`google/t5-base` |`google/t5-v1_1-base` |`google/t5-v1_1-large` |`google/t5-v1_1-base` |
|`dataset` |`yhavinga/mc4_nl_cleaned`|`yhavinga/mc4_nl_cleaned`|`yhavinga/mc4_nl_cleaned` |`yhavinga/mc4_nl_cleaned` |
|`tpu vm` | two | one | three | one |
|`finished` | | YES | | |
|*Hyperparameters* | | | | |
|`epochs` | 1 | 1 | 4 | 2 |
|`per-device batch size`| 16 | 16 | 2 | 8 |
|`tot. batch size` | 128 | 128 | 16 | ? |
|`steps` | 508 976 | 508 976 | 8 428 012 | ? |
|`max seq. length` | 512 | 512 | 1024 | 1024 |
|`tot. tok. trained on` | 33B | 33B | 138B | ? |
|`optimizer` | adafactor | adafactor | adafactor | adafactor |
|`warmup steps` | 10000 | 10000 | 10000 | 10000 |
|`learning rate` | 0.005 | 0.005 | 0.005 | 0.005 |
|`weight decay` | 0.01 | 0.01 | 0.01 | 0.001 |
|`tie embeds` |`false` |`false` |`false` |`false` |
|`validation split size`| 15K examples | 15K examples | 15K examples | 15K examples |
|*Model config* | | | | |
|`d_ff` | 3072 | 2048 | 2816 | 2048 |
|`d_kv` | 64 | 64 | 64 | 64 |
|`d_model` | 768 | 768 | 1024 | 768 |
|`dropout rate` | 0.1 | 0.1 | 0.1 (0.0 in pre-training) | 0.1 (0.0 in pre-training) |
|`ff projection` |`relu` |`gated-gelu` |`gated-gelu` |`gated-relu` |
|`num decoder layers` | 12 | 12 | 24 | 12 |
|`num heads` | 12 | 12 | 16 | 12 |
|`num layers` | 12 | 12 | 24 | 12 |
|`rel. attn. buckets` | 32 | 32 | 32 | 32 |
|`vocab size` | 32103 | 32103 | 32103 | 32103 |
|*Training time* | ~ 100 hours | 101 hours | ~ 370 hours | ? |
|*Evaluation* | | | | |
|`accuracy` | | 0.6976 | | |
|`loss` | | 1.379 | | |
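
The derived rows of the table can be cross-checked: total batch size is the per-device batch size times the number of devices, and the total tokens trained on is steps × total batch size × max sequence length. A minimal sketch of this arithmetic, assuming 8 TPU cores per VM (the core count is my assumption, not stated in the table):

```python
# Cross-check the base-model column of the table above.
per_device_batch = 16   # `per-device batch size` for t5-base-dutch
tpu_cores = 8           # assumed cores per TPU VM (not stated in the table)

total_batch = per_device_batch * tpu_cores
print(total_batch)      # 128, matching the `tot. batch size` row

steps = 508_976         # `steps` row
seq_len = 512           # `max seq. length` row
tokens = steps * total_batch * seq_len
print(f"{tokens / 1e9:.1f}B")  # 33.4B, consistent with `tot. tok. trained on` = 33B
```

The same check works for the large model: 8 428 012 steps × 16 batch × 1024 tokens ≈ 138B.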