yhavinga committed on
Commit
a7c6ef0
1 Parent(s): ce0eea3
Files changed (32)
  1. README.md +194 -0
  2. added_tokens.json +1 -0
  3. config.gin +150 -0
  4. config.json +32 -0
  5. flax_model.msgpack +3 -0
  6. model-info.txt +0 -0
  7. pytorch_model.bin +3 -0
  8. special_tokens_map.json +107 -0
  9. spiece.model +3 -0
  10. spiece.vocab +0 -0
  11. tokenizer_config.json +113 -0
  12. train/events.out.tfevents.1671009166.t1v-n-a765f9c4-w-0.1571998.0.v2 +3 -0
  13. train/events.out.tfevents.1671181223.t1v-n-a765f9c4-w-0.3622139.0.v2 +3 -0
  14. train/events.out.tfevents.1671182911.t1v-n-a765f9c4-w-0.3644916.0.v2 +3 -0
  15. train/events.out.tfevents.1671186809.t1v-n-a765f9c4-w-0.15690.0.v2 +3 -0
  16. train/events.out.tfevents.1672218495.t1v-n-a765f9c4-w-0.1252081.0.v2 +3 -0
  17. train/events.out.tfevents.1672354765.t1v-n-a765f9c4-w-0.1407790.0.v2 +3 -0
  18. train/events.out.tfevents.1673086500.t1v-n-a765f9c4-w-0.2293433.0.v2 +3 -0
  19. training_eval/mc4_nl_ul2_denoising/events.out.tfevents.1671009166.t1v-n-a765f9c4-w-0.1571998.1.v2 +3 -0
  20. training_eval/mc4_nl_ul2_denoising/events.out.tfevents.1671181223.t1v-n-a765f9c4-w-0.3622139.1.v2 +3 -0
  21. training_eval/mc4_nl_ul2_denoising/events.out.tfevents.1671182911.t1v-n-a765f9c4-w-0.3644916.1.v2 +3 -0
  22. training_eval/mc4_nl_ul2_denoising/events.out.tfevents.1671186810.t1v-n-a765f9c4-w-0.15690.1.v2 +3 -0
  23. training_eval/mc4_nl_ul2_denoising/events.out.tfevents.1672218495.t1v-n-a765f9c4-w-0.1252081.1.v2 +3 -0
  24. training_eval/mc4_nl_ul2_denoising/events.out.tfevents.1672354765.t1v-n-a765f9c4-w-0.1407790.1.v2 +3 -0
  25. training_eval/mc4_nl_ul2_denoising/events.out.tfevents.1673086500.t1v-n-a765f9c4-w-0.2293433.1.v2 +3 -0
  26. training_eval/ul2_mc4_nedd_wiki_news_mix_1/events.out.tfevents.1671009166.t1v-n-a765f9c4-w-0.1571998.2.v2 +3 -0
  27. training_eval/ul2_mc4_nedd_wiki_news_mix_1/events.out.tfevents.1671181223.t1v-n-a765f9c4-w-0.3622139.2.v2 +3 -0
  28. training_eval/ul2_mc4_nedd_wiki_news_mix_1/events.out.tfevents.1671182911.t1v-n-a765f9c4-w-0.3644916.2.v2 +3 -0
  29. training_eval/ul2_mc4_nedd_wiki_news_mix_1/events.out.tfevents.1671186810.t1v-n-a765f9c4-w-0.15690.2.v2 +3 -0
  30. training_eval/ul2_mc4_nedd_wiki_news_mix_1/events.out.tfevents.1672218495.t1v-n-a765f9c4-w-0.1252081.2.v2 +3 -0
  31. training_eval/ul2_mc4_nedd_wiki_news_mix_1/events.out.tfevents.1672354765.t1v-n-a765f9c4-w-0.1407790.2.v2 +3 -0
  32. training_eval/ul2_mc4_nedd_wiki_news_mix_1/events.out.tfevents.1673086500.t1v-n-a765f9c4-w-0.2293433.2.v2 +3 -0
README.md ADDED
@@ -0,0 +1,194 @@
1
+
2
+ ---
3
+ language:
4
+ - nl
5
+ license: apache-2.0
6
+ tags:
7
+ - dutch
8
+ - t5
9
+ - t5x
10
+ - ul2
11
+ - seq2seq
12
+ datasets:
13
+ - yhavinga/mc4_nl_cleaned
14
+ - yhavinga/nedd_wiki_news
15
+ inference: false
16
+ ---
17
+
18
+ # ul2-base-nl36-dutch for Dutch
19
+
20
+ Pretrained T5 model on Dutch using a UL2 (Mixture-of-Denoisers) objective.
21
+ The T5 model was introduced in
22
+ [this paper](https://arxiv.org/abs/1910.10683)
23
+ and first released at [this page](https://github.com/google-research/text-to-text-transfer-transformer).
24
+ The UL2 objective was introduced in
25
+ [this paper](https://arxiv.org/abs/2205.05131)
26
+ and first released at [this page](https://github.com/google-research/google-research/tree/master/ul2).
27
+
28
+ **Note:** The Hugging Face inference widget is deactivated because this model needs text-to-text fine-tuning on
29
+ a specific downstream task to be useful in practice.
30
+
31
+ ## Model description
32
+
33
+ T5 is an encoder-decoder model and treats all NLP problems in a text-to-text format.
34
+ `ul2-base-nl36-dutch` T5 is a transformers model pretrained on a very large corpus of
35
+ Dutch data in a self-supervised fashion.
36
+ This means it was pretrained on the raw texts only, with no humans labelling them in any way
37
+ (which is why it can use lots of publicly available data) with an automatic process to generate
38
+ inputs and outputs from those texts.
39
+
40
+
41
+ This model used the [T5 v1.1](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) improvements compared to the original T5 model during the pretraining:
42
+ - GEGLU activation in the feed-forward hidden layer, rather than ReLU - see [here](https://arxiv.org/abs/2002.05202)
43
+ - Dropout was turned off during pre-training. Dropout should be re-enabled during fine-tuning
44
+ - Pre-trained on self-supervised objective only without mixing in the downstream tasks
45
+ - No parameter sharing between embedding and classifier layer
46
+
47
+ The "efficient" T5 architecture findings presented in [this paper](https://arxiv.org/abs/2109.10686) were also applied,
48
+ which suggest that a Deep-Narrow model architecture is favorable for downstream performance compared to other model
49
+ architectures of similar parameter count. Specifically, the model depth is defined as the number of transformer blocks
50
+ that are stacked sequentially.
51
+ This model uses the [t5-efficient-base-nl36](https://huggingface.co/google/t5-efficient-base-nl36) architecture's
52
+ layer depth, which means both the encoder and the decoder have 36 transformer layers compared to the original T5 "base"
53
+ model's architecture of 12 transformer layers.
54
+
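The effect of the deeper 36-layer stacks on model size can be estimated with a few lines of arithmetic. The sketch below is a rough count, not an official figure: the dimensions are taken from `config.json` in this commit, and the layer layout (bias-free projections, scale-only layer norms, one relative attention bias table per stack) is an assumption based on the standard Hugging Face T5 v1.1 implementation.

```python
# Rough parameter count for the architecture described above.
# Dimensions come from config.json in this commit; the layer layout is
# assumed to match the standard Hugging Face T5 v1.1 implementation.
D_MODEL, D_FF, D_KV, NUM_HEADS = 768, 3072, 64, 12
NUM_LAYERS, NUM_DECODER_LAYERS = 36, 36
VOCAB_SIZE, REL_BUCKETS = 32128, 32


def count_parameters() -> int:
    inner = NUM_HEADS * D_KV                 # attention inner dimension
    attention = 4 * D_MODEL * inner          # q, k, v, o projections (no biases)
    feed_forward = 3 * D_MODEL * D_FF        # gated feed-forward: wi_0, wi_1, wo
    layer_norm = D_MODEL                     # scale vector only

    # Encoder layer: self-attention + feed-forward, each with a layer norm.
    encoder_layer = attention + feed_forward + 2 * layer_norm
    # Decoder layer: self-attention + cross-attention + feed-forward.
    decoder_layer = 2 * attention + feed_forward + 3 * layer_norm

    stacks = NUM_LAYERS * encoder_layer + NUM_DECODER_LAYERS * decoder_layer
    final_norms = 2 * layer_norm             # one after each stack
    rel_bias = 2 * REL_BUCKETS * NUM_HEADS   # one bias table per stack (assumed)
    embeddings = 2 * VOCAB_SIZE * D_MODEL    # input embedding + untied LM head
    return stacks + final_norms + rel_bias + embeddings


total = count_parameters()
print(f"{total:,} parameters, about {total * 4 / 1e9:.2f} GB in float32")
```

Under these assumptions the estimate is roughly 814M parameters, which at 4 bytes per parameter is close to the 3,255,640,203-byte `flax_model.msgpack` listed in this commit.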
55
+ ### UL2 pretraining objective
56
+
57
+ This model was pretrained with UL2's Mixture-of-Denoisers (MoD) objective, which combines diverse pre-training
58
+ paradigms together. UL2 frames different objective functions for training language models as denoising tasks, where
59
+ the model has to recover missing sub-sequences of a given input. During pre-training it uses a novel mixture-of-denoisers
60
+ that samples from a varied set of such objectives, each with different configurations. UL2 is trained using a mixture of
61
+ three denoising tasks:
62
+
63
+ 1. R-denoising (or regular span corruption), which emulates the standard T5 span corruption objective;
64
+ 2. X-denoising (or extreme span corruption); and
65
+ 3. S-denoising (or sequential PrefixLM).
66
+
67
+ During pre-training, we sample from the available denoising tasks based on user-specified ratios.
68
+ UL2 introduces a notion of mode switching, wherein downstream fine-tuning is associated with a specific pre-training
69
+ denoising task. During pre-training, a paradigm token is inserted into the input
70
+ (`[NLU]` for R-denoising, `[NLG]` for X-denoising, or `[S2S]` for S-denoising) indicating the denoising task at hand.
71
+ Then, during fine-tuning the same input token should be inserted to get the best performance for different downstream
72
+ fine-tuning tasks.
73
+
74
+ ## Intended uses & limitations
75
+
76
+ This model was pretrained in a self-supervised way only, without any supervised training.
77
+ Therefore, this model has to be fine-tuned before it is usable on a downstream task,
78
+ like text classification, unlike Google's original T5 model.
79
+
80
+ **Note:** You most likely need to fine-tune these T5/UL2 models without mixed precision,
81
+ so fine-tune them with full fp32 precision. Fine-tuning with Flax in bf16 - `model.to_bf16()` - is possible
82
+ if you set the mask correctly to exclude layernorm and embedding layers. Also note that the T5x pre-training
83
+ and fine-tuning configs set `z_loss` to 1e-4, which is used to keep the loss scale from underflowing.
84
+ You can also find more fine-tuning tips [here](https://discuss.huggingface.co/t/t5-finetuning-tips), for example.
85
+
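The note above mentions `z_loss` without defining it. In the T5 family the term usually refers to an auxiliary penalty of the form `z_loss * log(Z)^2`, where `Z` is the softmax normalizer of the output logits. The following is a minimal plain-Python sketch of that idea for a single token, assuming this definition; it is not the t5x implementation, and the function name is hypothetical.

```python
import math

Z_LOSS = 1e-4  # value of Z_LOSS in config.gin


def cross_entropy_with_z_loss(logits, target, z_loss=Z_LOSS):
    """Cross-entropy for one token plus the auxiliary z-loss penalty.

    Hypothetical helper for illustration; returns (total_loss, penalty).
    """
    # log Z = logsumexp(logits), computed stably by subtracting the maximum.
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    cross_entropy = log_z - logits[target]
    # The penalty grows as the softmax normalizer Z drifts away from 1.
    penalty = z_loss * log_z ** 2
    return cross_entropy + penalty, penalty


loss, penalty = cross_entropy_with_z_loss([2.0, 0.5, -1.0], target=0)
print(f"loss={loss:.6f} penalty={penalty:.8f}")
```

With a coefficient of 1e-4 the penalty is small compared to the cross-entropy, so it mainly discourages the logits from drifting to large magnitudes.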
86
+ **Note**: For fine-tuning, most likely you can get better results if you insert a prefix token
87
+ of `[NLU]`, `[NLG]`, or `[S2S]` to your input texts.
88
+ For general language understanding fine-tuning tasks, you could use the `[NLU]` token.
89
+ For GPT-style causal language generation, you could use the `[S2S]` token.
90
+ The token `[NLG]` of the X-denoising pretrain task is somewhat of a mix between language understanding and causal language
91
+ generation, so `[NLG]` could possibly be used for language generation fine-tuning too.
92
+
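The prefix-token advice above can be sketched as a small preprocessing helper for fine-tuning inputs. The helper name `add_paradigm_token` and the single space between token and text are assumptions made for illustration; the token-to-denoiser mapping follows the UL2 pretraining objective section.

```python
# Paradigm tokens and the pretraining denoiser each one was paired with.
PARADIGM_TOKENS = {
    "nlu": "[NLU]",  # R-denoising: general language understanding tasks
    "nlg": "[NLG]",  # X-denoising: mix of understanding and generation
    "s2s": "[S2S]",  # S-denoising: GPT-style causal generation
}


def add_paradigm_token(text: str, mode: str) -> str:
    """Prefix `text` with the paradigm token for `mode` (hypothetical helper)."""
    try:
        token = PARADIGM_TOKENS[mode.lower()]
    except KeyError:
        raise ValueError(
            f"unknown mode {mode!r}; expected one of {sorted(PARADIGM_TOKENS)}"
        ) from None
    # Assumption: a single space separates the token from the input text.
    return f"{token} {text}"


print(add_paradigm_token("Dit is een voorbeeldzin.", "nlu"))
```

Applying the same helper to every training and inference input keeps the prefix consistent between fine-tuning and use.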
93
+ ### How to use
94
+
95
+ Here is how to use this model in PyTorch:
96
+
97
+ ```python
98
+ from transformers import T5Tokenizer, T5ForConditionalGeneration
99
+
100
+ tokenizer = T5Tokenizer.from_pretrained("yhavinga/ul2-base-nl36-dutch", use_fast=False)
101
+ model = T5ForConditionalGeneration.from_pretrained("yhavinga/ul2-base-nl36-dutch")
102
+ ```
103
+
104
+ and in Flax:
105
+
106
+ ```python
107
+ from transformers import T5Tokenizer, FlaxT5ForConditionalGeneration
108
+
109
+ tokenizer = T5Tokenizer.from_pretrained("yhavinga/ul2-base-nl36-dutch", use_fast=False)
110
+ model = FlaxT5ForConditionalGeneration.from_pretrained("yhavinga/ul2-base-nl36-dutch")
111
+ ```
112
+
113
+
114
+ ### Limitations and bias
115
+
116
+ The training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral.
117
+ Therefore, the model can have biased predictions. This bias will also affect all fine-tuned versions of this model.
118
+
119
+ ## Training data
120
+
121
+ The `ul2-base-nl36-dutch` T5 model was pre-trained simultaneously on a combination of several datasets,
122
+ including the full version of the "mc4_nl_cleaned" dataset, which is a cleaned version of Common Crawl's web
123
+ crawl corpus, Dutch books, the Dutch subset of Wikipedia (2022-03-20), and a subset of "mc4_nl_cleaned"
124
+ containing only texts from Dutch and Belgian newspapers. This last dataset is oversampled to bias the model
125
+ towards descriptions of events in the Netherlands and Belgium.
126
+
127
+
128
+
129
+ ## Training procedure
130
+
131
+ ### Preprocessing
132
+
133
+ The ul2-base-nl36-dutch T5 model uses a SentencePiece unigram tokenizer with a vocabulary of 32,000 tokens.
134
+ The tokenizer includes the special tokens `<pad>`, `</s>`, `<unk>`, known from the original T5 paper,
135
+ `[NLU]`, `[NLG]` and `[S2S]` for the MoD pre-training, and `<n>` for newline.
136
+ During pre-training with the UL2 objective, input and output sequences consist of 512 consecutive tokens.
137
+ The tokenizer does not lowercase texts and is therefore case-sensitive; it distinguishes
138
+ between `dutch` and `Dutch`.
139
+ Additionally, 100+28 extra tokens were added for pre-training tasks, resulting in a total of 32,128 tokens.
140
+
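The vocabulary arithmetic above can be checked against `added_tokens.json` in this commit, which maps `[new_id_0]` through `[new_id_27]` to ids 32100 through 32127. The sketch below reproduces that mapping; placing the 100 `<extra_id_*>` sentinel tokens in the range 32000 to 32099 is an assumption inferred from those ids, not something the commit states directly.

```python
BASE_VOCAB = 32_000   # SentencePiece unigram pieces
EXTRA_IDS = 100       # <extra_id_0> ... <extra_id_99> sentinel tokens
NEW_IDS = 28          # [new_id_0] ... [new_id_27]

# Ids as listed in added_tokens.json: [new_id_i] -> 32100 + i.
new_id_tokens = {
    f"[new_id_{i}]": BASE_VOCAB + EXTRA_IDS + i for i in range(NEW_IDS)
}

total_vocab = BASE_VOCAB + EXTRA_IDS + NEW_IDS
print(total_vocab)
print(new_id_tokens["[new_id_0]"], new_id_tokens["[new_id_27]"])
```

The highest id plus one equals the `vocab_size` of 32128 in `config.json`, so the embedding matrix has exactly one row per token.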
141
+ ### Pretraining
142
+ The model was trained on a TPUv3-8 VM, sponsored by the [Google TPU Research Cloud](https://sites.research.google/trc/about/),
143
+ for 2000000 steps with a batch size of 64
144
+ (in total 65 B tokens).
145
+ The optimizer used was AdaFactor with a constant learning rate of 1e-2 during the first 10K (warmup) steps,
146
+ followed by an inverse square root decay of the learning rate.
147
+ The model was trained with Google's Jax/Flax based [t5x framework](https://github.com/google-research/t5x) with help
148
+ from [Stephenn Fernandes](https://huggingface.co/StephennFernandes) to get started writing task definitions that wrap
149
+ HF datasets.
150
+
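The schedule and token budget described above follow from the values in `config.gin` (`base_learning_rate = 1.0`, `factors = 'constant * rsqrt_decay'`, `warmup_steps = 10000`, batch size 64, 512 input tokens). The sketch below assumes that `rsqrt_decay` divides the base rate by `sqrt(max(step, warmup_steps))`, which is my reading of the t5x scheduler and not copied from its source; the function names are hypothetical.

```python
import math

BASE_LEARNING_RATE = 1.0
WARMUP_STEPS = 10_000
TRAIN_STEPS = 2_000_000
BATCH_SIZE = 64
SEQUENCE_LENGTH = 512


def learning_rate(step: int) -> float:
    """Constant 1e-2 for the first 10K steps, then proportional to 1/sqrt(step)."""
    return BASE_LEARNING_RATE / math.sqrt(max(step, WARMUP_STEPS))


def input_tokens_seen() -> int:
    """Upper bound on input tokens, assuming every packed sequence is full."""
    return TRAIN_STEPS * BATCH_SIZE * SEQUENCE_LENGTH


for step in (0, 10_000, 100_000, 1_000_000, 2_000_000):
    print(step, f"{learning_rate(step):.6f}")
print(f"{input_tokens_seen() / 1e9:.1f} B input tokens")
```

The token count counts encoder inputs only, which is how it arrives at the figure of about 65 B quoted above.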
151
+ The UL2 training objective code used with the [t5x framework](https://github.com/google-research/t5x) was copied and
152
+ slightly modified from the [UL2 paper](https://arxiv.org/pdf/2205.05131.pdf) appendix chapter 9.2 by the authors
153
+ of the Finnish UL2 models. The UL2 objective code used is available in the repository
154
+ [Finnish-NLP/ul2-base-nl36-finnish](https://huggingface.co/Finnish-NLP/ul2-base-nl36-finnish) in the files `ul2_objective.py` and `tasks.py`.
155
+ UL2's mixture-of-denoisers configuration otherwise follows the UL2 paper,
156
+ except for the denoiser mixing rates: 20% was used for S-denoising (as suggested in chapter 4.5 of the paper)
157
+ and the rest was divided equally between R-denoising and X-denoising (i.e. 40% each).
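
The mixing rates above can be illustrated by sampling a denoiser per training example. This is only a sketch of the idea: the actual objective lives in `ul2_objective.py` and `tasks.py`, which are not part of this commit, and the names `DENOISER_RATES` and `sample_denoiser` are hypothetical.

```python
import random

# Mixing rates stated above: 20% S-denoising, 40% each for R and X.
DENOISER_RATES = {"R": 0.4, "X": 0.4, "S": 0.2}


def sample_denoiser(rng: random.Random) -> str:
    """Pick one denoiser name with probability equal to its mixing rate."""
    names = list(DENOISER_RATES)
    weights = [DENOISER_RATES[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


# Draw many samples to show that the empirical mix approaches the rates.
rng = random.Random(0)
counts = {name: 0 for name in DENOISER_RATES}
for _ in range(10_000):
    counts[sample_denoiser(rng)] += 1
print(counts)
```

In a real pipeline the chosen denoiser would also determine the paradigm token and the span corruption settings applied to the example.
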
158
+ ### Model list
159
+
160
+ Models in this series:
161
+ | | ul2-base-dutch | ul2-base-nl36-dutch | ul2-large-dutch | ul2-small-dutch |
162
+ |:---------------------|:---------------------|:----------------------|:---------------------|:---------------------|
163
+ | model_type | t5 | t5 | t5 | t5 |
164
+ | _pipeline_tag | text2text-generation | text2text-generation | text2text-generation | text2text-generation |
165
+ | d_model | 768 | 768 | 1024 | 512 |
166
+ | d_ff | 2048 | 3072 | 2816 | 1024 |
167
+ | num_heads | 12 | 12 | 16 | 6 |
168
+ | d_kv | 64 | 64 | 64 | 64 |
169
+ | num_layers | 12 | 36 | 24 | 8 |
170
+ | num_decoder_layers | 12 | 36 | 24 | 8 |
171
+ | feed_forward_proj | gated-gelu | gated-gelu | gated-gelu | gated-gelu |
172
+ | dense_act_fn | gelu_new | gelu_new | gelu_new | gelu_new |
173
+ | vocab_size | 32128 | 32128 | 32128 | 32128 |
174
+ | tie_word_embeddings | 0 | 0 | 0 | 0 |
175
+ | torch_dtype | float32 | float32 | float32 | float32 |
176
+ | _gin_batch_size | 128 | 64 | 64 | 128 |
177
+ | _gin_z_loss | 0.0001 | 0.0001 | 0.0001 | 0.0001 |
178
+ | _gin_t5_config_dtype | 'bfloat16' | 'bfloat16' | 'bfloat16' | 'bfloat16' |
179
+
180
+
181
+
182
+ ## Evaluation results
183
+
184
+ See the evaluation section in the interactive [Pre-training Dutch T5 Models](https://huggingface.co/spaces/yhavinga/pre-training-dutch-t5-models) blog.
185
+
186
+ ## Acknowledgements
187
+
188
+ This project would not have been possible without compute generously provided by Google through the
189
+ [TPU Research Cloud](https://sites.research.google/trc/).
190
+ Thanks to the [Finnish-NLP](https://huggingface.co/Finnish-NLP) authors for releasing their code for the UL2 objective and associated task definitions.
191
+ Thanks to [Stephenn Fernandes](https://huggingface.co/StephennFernandes) for helping me get started with the t5x framework.
192
+
193
+ Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)
194
+
added_tokens.json ADDED
@@ -0,0 +1 @@
 
1
+ {"[new_id_17]": 32117, "[new_id_20]": 32120, "[new_id_13]": 32113, "[new_id_2]": 32102, "[new_id_16]": 32116, "[new_id_7]": 32107, "[new_id_5]": 32105, "[new_id_1]": 32101, "[new_id_15]": 32115, "[new_id_12]": 32112, "[new_id_0]": 32100, "[new_id_11]": 32111, "[new_id_25]": 32125, "[new_id_24]": 32124, "[new_id_10]": 32110, "[new_id_27]": 32127, "[new_id_23]": 32123, "[new_id_14]": 32114, "[new_id_22]": 32122, "[new_id_21]": 32121, "[new_id_19]": 32119, "[new_id_3]": 32103, "[new_id_4]": 32104, "[new_id_18]": 32118, "[new_id_9]": 32109, "[new_id_8]": 32108, "[new_id_26]": 32126, "[new_id_6]": 32106}
config.gin ADDED
@@ -0,0 +1,150 @@
1
+ from __gin__ import dynamic_registration
2
+ import __main__ as train_script
3
+ import seqio
4
+ import t5.data.mixtures
5
+ from t5x import adafactor
6
+ from t5x.examples.t5 import network
7
+ from t5x import gin_utils
8
+ from t5x import models
9
+ from t5x import partitioning
10
+ from t5x import trainer
11
+ from t5x import utils
12
+ import tasks.nedd_tasks
13
+ import tasks.ul2_tasks as tasks2
14
+
15
+ # Macros:
16
+ # ==============================================================================
17
+ BATCH_SIZE = 64
18
+ DROPOUT_RATE = 0.0
19
+ LABEL_SMOOTHING = 0.0
20
+ LOSS_NORMALIZING_FACTOR = None
21
+ MIXTURE_OR_TASK_MODULE = None
22
+ MIXTURE_OR_TASK_NAME = 'ul2_mc4_nedd_wiki_news_mix_1'
23
+ MODEL = @models.EncoderDecoderModel()
24
+ MODEL_DIR = 'ul2_base_nl36_mc4_nedd_wiki_news_nl'
25
+ OPTIMIZER = @adafactor.Adafactor()
26
+ RANDOM_SEED = None
27
+ SHUFFLE_TRAIN_EXAMPLES = True
28
+ TASK_FEATURE_LENGTHS = {'inputs': 512, 'targets': 512}
29
+ TRAIN_STEPS = 2000000
30
+ USE_CACHED_TASKS = False
31
+ USE_HARDWARE_RNG = False
32
+ VOCABULARY = @seqio.SentencePieceVocabulary()
33
+ Z_LOSS = 0.0001
34
+
35
+ # Parameters for adafactor.Adafactor:
36
+ # ==============================================================================
37
+ adafactor.Adafactor.decay_rate = 0.8
38
+ adafactor.Adafactor.logical_factor_rules = \
39
+ @adafactor.standard_logical_factor_rules()
40
+ adafactor.Adafactor.step_offset = 0
41
+
42
+ # Parameters for utils.CheckpointConfig:
43
+ # ==============================================================================
44
+ utils.CheckpointConfig.restore = @utils.RestoreCheckpointConfig()
45
+ utils.CheckpointConfig.save = @utils.SaveCheckpointConfig()
46
+
47
+ # Parameters for utils.create_learning_rate_scheduler:
48
+ # ==============================================================================
49
+ utils.create_learning_rate_scheduler.base_learning_rate = 1.0
50
+ utils.create_learning_rate_scheduler.factors = 'constant * rsqrt_decay'
51
+ utils.create_learning_rate_scheduler.warmup_steps = 10000
52
+
53
+ # Parameters for train/utils.DatasetConfig:
54
+ # ==============================================================================
55
+ train/utils.DatasetConfig.batch_size = %BATCH_SIZE
56
+ train/utils.DatasetConfig.mixture_or_task_name = %MIXTURE_OR_TASK_NAME
57
+ train/utils.DatasetConfig.module = %MIXTURE_OR_TASK_MODULE
58
+ train/utils.DatasetConfig.pack = True
59
+ train/utils.DatasetConfig.seed = None
60
+ train/utils.DatasetConfig.shuffle = %SHUFFLE_TRAIN_EXAMPLES
61
+ train/utils.DatasetConfig.split = 'train'
62
+ train/utils.DatasetConfig.task_feature_lengths = %TASK_FEATURE_LENGTHS
63
+ train/utils.DatasetConfig.use_cached = %USE_CACHED_TASKS
64
+
65
+ # Parameters for train_eval/utils.DatasetConfig:
66
+ # ==============================================================================
67
+ train_eval/utils.DatasetConfig.batch_size = %BATCH_SIZE
68
+ train_eval/utils.DatasetConfig.mixture_or_task_name = %MIXTURE_OR_TASK_NAME
69
+ train_eval/utils.DatasetConfig.module = %MIXTURE_OR_TASK_MODULE
70
+ train_eval/utils.DatasetConfig.pack = True
71
+ train_eval/utils.DatasetConfig.seed = 42
72
+ train_eval/utils.DatasetConfig.shuffle = False
73
+ train_eval/utils.DatasetConfig.split = 'validation'
74
+ train_eval/utils.DatasetConfig.task_feature_lengths = %TASK_FEATURE_LENGTHS
75
+ train_eval/utils.DatasetConfig.use_cached = %USE_CACHED_TASKS
76
+
77
+ # Parameters for models.EncoderDecoderModel:
78
+ # ==============================================================================
79
+ models.EncoderDecoderModel.input_vocabulary = %VOCABULARY
80
+ models.EncoderDecoderModel.label_smoothing = %LABEL_SMOOTHING
81
+ models.EncoderDecoderModel.loss_normalizing_factor = %LOSS_NORMALIZING_FACTOR
82
+ models.EncoderDecoderModel.module = @network.Transformer()
83
+ models.EncoderDecoderModel.optimizer_def = %OPTIMIZER
84
+ models.EncoderDecoderModel.output_vocabulary = %VOCABULARY
85
+ models.EncoderDecoderModel.z_loss = %Z_LOSS
86
+
87
+ # Parameters for partitioning.PjitPartitioner:
88
+ # ==============================================================================
89
+ partitioning.PjitPartitioner.logical_axis_rules = \
90
+ @partitioning.standard_logical_axis_rules()
91
+ partitioning.PjitPartitioner.model_parallel_submesh = None
92
+ partitioning.PjitPartitioner.num_partitions = 1
93
+
94
+ # Parameters for utils.RestoreCheckpointConfig:
95
+ # ==============================================================================
96
+ utils.RestoreCheckpointConfig.path = []
97
+
98
+ # Parameters for utils.SaveCheckpointConfig:
99
+ # ==============================================================================
100
+ utils.SaveCheckpointConfig.dtype = 'float32'
101
+ utils.SaveCheckpointConfig.keep = 4
102
+ utils.SaveCheckpointConfig.period = 50000
103
+ utils.SaveCheckpointConfig.save_dataset = False
104
+ utils.SaveCheckpointConfig.use_gda = False
105
+
106
+ # Parameters for seqio.SentencePieceVocabulary:
107
+ # ==============================================================================
108
+ seqio.SentencePieceVocabulary.sentencepiece_model_file = \
109
+ 'gs://t5-dutch-english/vocabs/nedd.32000.128extra/spiece.model'
110
+
111
+ # Parameters for network.T5Config:
112
+ # ==============================================================================
113
+ network.T5Config.dropout_rate = %DROPOUT_RATE
114
+ network.T5Config.dtype = 'bfloat16'
115
+ network.T5Config.emb_dim = 768
116
+ network.T5Config.head_dim = 64
117
+ network.T5Config.logits_via_embedding = False
118
+ network.T5Config.mlp_activations = ('gelu', 'linear')
119
+ network.T5Config.mlp_dim = 3072
120
+ network.T5Config.num_decoder_layers = 36
121
+ network.T5Config.num_encoder_layers = 36
122
+ network.T5Config.num_heads = 12
123
+ network.T5Config.vocab_size = 32128
124
+
125
+ # Parameters for train_script.train:
126
+ # ==============================================================================
127
+ train_script.train.checkpoint_cfg = @utils.CheckpointConfig()
128
+ train_script.train.eval_period = 2000
129
+ train_script.train.eval_steps = 20
130
+ train_script.train.infer_eval_dataset_cfg = None
131
+ train_script.train.model = %MODEL
132
+ train_script.train.model_dir = %MODEL_DIR
133
+ train_script.train.partitioner = @partitioning.PjitPartitioner()
134
+ train_script.train.random_seed = %RANDOM_SEED
135
+ train_script.train.stats_period = 100
136
+ train_script.train.summarize_config_fn = @gin_utils.summarize_gin_config
137
+ train_script.train.total_steps = %TRAIN_STEPS
138
+ train_script.train.train_dataset_cfg = @train/utils.DatasetConfig()
139
+ train_script.train.train_eval_dataset_cfg = @train_eval/utils.DatasetConfig()
140
+ train_script.train.trainer_cls = @trainer.Trainer
141
+ train_script.train.use_hardware_rng = %USE_HARDWARE_RNG
142
+
143
+ # Parameters for trainer.Trainer:
144
+ # ==============================================================================
145
+ trainer.Trainer.learning_rate_fn = @utils.create_learning_rate_scheduler()
146
+ trainer.Trainer.num_microbatches = None
147
+
148
+ # Parameters for network.Transformer:
149
+ # ==============================================================================
150
+ network.Transformer.config = @network.T5Config()
config.json ADDED
@@ -0,0 +1,32 @@
1
+ {
2
+ "_name_or_path": "./",
3
+ "architectures": [
4
+ "T5ForConditionalGeneration"
5
+ ],
6
+ "d_ff": 3072,
7
+ "d_kv": 64,
8
+ "d_model": 768,
9
+ "decoder_start_token_id": 0,
10
+ "dense_act_fn": "gelu_new",
11
+ "dropout_rate": 0.1,
12
+ "eos_token_id": 1,
13
+ "feed_forward_proj": "gated-gelu",
14
+ "initializer_factor": 1.0,
15
+ "is_encoder_decoder": true,
16
+ "is_gated_act": true,
17
+ "layer_norm_epsilon": 1e-06,
18
+ "model_type": "t5",
19
+ "n_positions": 512,
20
+ "num_decoder_layers": 36,
21
+ "num_heads": 12,
22
+ "num_layers": 36,
23
+ "output_past": true,
24
+ "pad_token_id": 0,
25
+ "relative_attention_max_distance": 128,
26
+ "relative_attention_num_buckets": 32,
27
+ "tie_word_embeddings": false,
28
+ "torch_dtype": "float32",
29
+ "transformers_version": "4.24.0",
30
+ "use_cache": true,
31
+ "vocab_size": 32128
32
+ }
flax_model.msgpack ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4822de0624cecbeafd7cf9e96b05da3a8710d5e8d95aa7bb2e64e49ba8bf5338
3
+ size 3255640203
model-info.txt ADDED
The diff for this file is too large to render. See raw diff
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:420cf64c59fef54e31120eb75addd9ac6675c62fb0169e5d85bb81dfd3861d05
3
+ size 3255881749
special_tokens_map.json ADDED
@@ -0,0 +1,107 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<extra_id_0>",
4
+ "<extra_id_1>",
5
+ "<extra_id_2>",
6
+ "<extra_id_3>",
7
+ "<extra_id_4>",
8
+ "<extra_id_5>",
9
+ "<extra_id_6>",
10
+ "<extra_id_7>",
11
+ "<extra_id_8>",
12
+ "<extra_id_9>",
13
+ "<extra_id_10>",
14
+ "<extra_id_11>",
15
+ "<extra_id_12>",
16
+ "<extra_id_13>",
17
+ "<extra_id_14>",
18
+ "<extra_id_15>",
19
+ "<extra_id_16>",
20
+ "<extra_id_17>",
21
+ "<extra_id_18>",
22
+ "<extra_id_19>",
23
+ "<extra_id_20>",
24
+ "<extra_id_21>",
25
+ "<extra_id_22>",
26
+ "<extra_id_23>",
27
+ "<extra_id_24>",
28
+ "<extra_id_25>",
29
+ "<extra_id_26>",
30
+ "<extra_id_27>",
31
+ "<extra_id_28>",
32
+ "<extra_id_29>",
33
+ "<extra_id_30>",
34
+ "<extra_id_31>",
35
+ "<extra_id_32>",
36
+ "<extra_id_33>",
37
+ "<extra_id_34>",
38
+ "<extra_id_35>",
39
+ "<extra_id_36>",
40
+ "<extra_id_37>",
41
+ "<extra_id_38>",
42
+ "<extra_id_39>",
43
+ "<extra_id_40>",
44
+ "<extra_id_41>",
45
+ "<extra_id_42>",
46
+ "<extra_id_43>",
47
+ "<extra_id_44>",
48
+ "<extra_id_45>",
49
+ "<extra_id_46>",
50
+ "<extra_id_47>",
51
+ "<extra_id_48>",
52
+ "<extra_id_49>",
53
+ "<extra_id_50>",
54
+ "<extra_id_51>",
55
+ "<extra_id_52>",
56
+ "<extra_id_53>",
57
+ "<extra_id_54>",
58
+ "<extra_id_55>",
59
+ "<extra_id_56>",
60
+ "<extra_id_57>",
61
+ "<extra_id_58>",
62
+ "<extra_id_59>",
63
+ "<extra_id_60>",
64
+ "<extra_id_61>",
65
+ "<extra_id_62>",
66
+ "<extra_id_63>",
67
+ "<extra_id_64>",
68
+ "<extra_id_65>",
69
+ "<extra_id_66>",
70
+ "<extra_id_67>",
71
+ "<extra_id_68>",
72
+ "<extra_id_69>",
73
+ "<extra_id_70>",
74
+ "<extra_id_71>",
75
+ "<extra_id_72>",
76
+ "<extra_id_73>",
77
+ "<extra_id_74>",
78
+ "<extra_id_75>",
79
+ "<extra_id_76>",
80
+ "<extra_id_77>",
81
+ "<extra_id_78>",
82
+ "<extra_id_79>",
83
+ "<extra_id_80>",
84
+ "<extra_id_81>",
85
+ "<extra_id_82>",
86
+ "<extra_id_83>",
87
+ "<extra_id_84>",
88
+ "<extra_id_85>",
89
+ "<extra_id_86>",
90
+ "<extra_id_87>",
91
+ "<extra_id_88>",
92
+ "<extra_id_89>",
93
+ "<extra_id_90>",
94
+ "<extra_id_91>",
95
+ "<extra_id_92>",
96
+ "<extra_id_93>",
97
+ "<extra_id_94>",
98
+ "<extra_id_95>",
99
+ "<extra_id_96>",
100
+ "<extra_id_97>",
101
+ "<extra_id_98>",
102
+ "<extra_id_99>"
103
+ ],
104
+ "eos_token": "</s>",
105
+ "pad_token": "<pad>",
106
+ "unk_token": "<unk>"
107
+ }
spiece.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:caa6e2f21aeec181276ab80273e3f869ce303ccb8602d68e0524783c3581092d
3
+ size 800223
spiece.vocab ADDED
The diff for this file is too large to render. See raw diff
tokenizer_config.json ADDED
@@ -0,0 +1,113 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<extra_id_0>",
4
+ "<extra_id_1>",
5
+ "<extra_id_2>",
6
+ "<extra_id_3>",
7
+ "<extra_id_4>",
8
+ "<extra_id_5>",
9
+ "<extra_id_6>",
10
+ "<extra_id_7>",
11
+ "<extra_id_8>",
12
+ "<extra_id_9>",
13
+ "<extra_id_10>",
14
+ "<extra_id_11>",
15
+ "<extra_id_12>",
16
+ "<extra_id_13>",
17
+ "<extra_id_14>",
18
+ "<extra_id_15>",
19
+ "<extra_id_16>",
20
+ "<extra_id_17>",
21
+ "<extra_id_18>",
22
+ "<extra_id_19>",
23
+ "<extra_id_20>",
24
+ "<extra_id_21>",
25
+ "<extra_id_22>",
26
+ "<extra_id_23>",
27
+ "<extra_id_24>",
28
+ "<extra_id_25>",
29
+ "<extra_id_26>",
30
+ "<extra_id_27>",
31
+ "<extra_id_28>",
32
+ "<extra_id_29>",
33
+ "<extra_id_30>",
34
+ "<extra_id_31>",
35
+ "<extra_id_32>",
36
+ "<extra_id_33>",
37
+ "<extra_id_34>",
38
+ "<extra_id_35>",
39
+ "<extra_id_36>",
40
+ "<extra_id_37>",
41
+ "<extra_id_38>",
42
+ "<extra_id_39>",
43
+ "<extra_id_40>",
44
+ "<extra_id_41>",
45
+ "<extra_id_42>",
46
+ "<extra_id_43>",
47
+ "<extra_id_44>",
48
+ "<extra_id_45>",
49
+ "<extra_id_46>",
50
+ "<extra_id_47>",
51
+ "<extra_id_48>",
52
+ "<extra_id_49>",
53
+ "<extra_id_50>",
54
+ "<extra_id_51>",
55
+ "<extra_id_52>",
56
+ "<extra_id_53>",
57
+ "<extra_id_54>",
58
+ "<extra_id_55>",
59
+ "<extra_id_56>",
60
+ "<extra_id_57>",
61
+ "<extra_id_58>",
62
+ "<extra_id_59>",
63
+ "<extra_id_60>",
64
+ "<extra_id_61>",
65
+ "<extra_id_62>",
66
+ "<extra_id_63>",
67
+ "<extra_id_64>",
68
+ "<extra_id_65>",
69
+ "<extra_id_66>",
70
+ "<extra_id_67>",
71
+ "<extra_id_68>",
72
+ "<extra_id_69>",
73
+ "<extra_id_70>",
74
+ "<extra_id_71>",
75
+ "<extra_id_72>",
76
+ "<extra_id_73>",
77
+ "<extra_id_74>",
78
+ "<extra_id_75>",
79
+ "<extra_id_76>",
80
+ "<extra_id_77>",
81
+ "<extra_id_78>",
82
+ "<extra_id_79>",
83
+ "<extra_id_80>",
84
+ "<extra_id_81>",
85
+ "<extra_id_82>",
86
+ "<extra_id_83>",
87
+ "<extra_id_84>",
88
+ "<extra_id_85>",
89
+ "<extra_id_86>",
90
+ "<extra_id_87>",
91
+ "<extra_id_88>",
92
+ "<extra_id_89>",
93
+ "<extra_id_90>",
94
+ "<extra_id_91>",
95
+ "<extra_id_92>",
96
+ "<extra_id_93>",
97
+ "<extra_id_94>",
98
+ "<extra_id_95>",
99
+ "<extra_id_96>",
100
+ "<extra_id_97>",
101
+ "<extra_id_98>",
102
+ "<extra_id_99>"
103
+ ],
104
+ "eos_token": "</s>",
105
+ "extra_ids": 100,
106
+ "name_or_path": "yhavinga/ul2-base-en-nl",
107
+ "pad_token": "<pad>",
108
+ "sp_model_kwargs": {},
109
+ "special_tokens_map_file": null,
110
+ "tokenizer_class": "T5Tokenizer",
111
+ "unk_token": "<unk>",
112
+ "use_fast_tokenizer": false
113
+ }
train/events.out.tfevents.1671009166.t1v-n-a765f9c4-w-0.1571998.0.v2 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d5378008b02090246f78dc1a97fd93f47e898c3c6be962008ff681861b25966c
3
+ size 3422391
train/events.out.tfevents.1671181223.t1v-n-a765f9c4-w-0.3622139.0.v2 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c206fba36171a4e490aa3a9f4848b5ab506c4be5e902f18bf4dab8e398ae67ce
3
+ size 32010
train/events.out.tfevents.1671182911.t1v-n-a765f9c4-w-0.3644916.0.v2 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0844c69dfaa941955ecd230e44ccb72bf655dac33c7b02b7d6cea8ab7827e611
3
+ size 57828
train/events.out.tfevents.1671186809.t1v-n-a765f9c4-w-0.15690.0.v2 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:600a31665f4763c4e656ae50df9ceee0e511fc6a3974e90cce43d4b94827b204
3
+ size 14807054
train/events.out.tfevents.1672218495.t1v-n-a765f9c4-w-0.1252081.0.v2 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a007441ac09e88b22680bbc6a2df94dfc78e874dbb0d2b2a4be694f68fca3322
3
+ size 2232668
train/events.out.tfevents.1672354765.t1v-n-a765f9c4-w-0.1407790.0.v2 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:bac9e76c755b3dd19c41fb197b77f4ae8327df5f2a2c75cbb88839aed4b653ba
3
+ size 16707826
train/events.out.tfevents.1673086500.t1v-n-a765f9c4-w-0.2293433.0.v2 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3e1c6abca75552ae22c04eecc11fa1b8fedf3ea12e081eaf78eb68c1d594c42e
3
+ size 4971617
training_eval/mc4_nl_ul2_denoising/events.out.tfevents.1671009166.t1v-n-a765f9c4-w-0.1571998.1.v2 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:dcaf186afe9e0782f5dfe67f4adaf8dae1f37c34ad6356fcb8e41ca1856b5f12
3
+ size 151231
training_eval/mc4_nl_ul2_denoising/events.out.tfevents.1671181223.t1v-n-a765f9c4-w-0.3622139.1.v2 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a24eb5730b3a266a582c1892a0ed545c22c8d02a881c41d0ab810c71317e6160
3
+ size 40
training_eval/mc4_nl_ul2_denoising/events.out.tfevents.1671182911.t1v-n-a765f9c4-w-0.3644916.1.v2 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:879d2112d2daca7f47319f4e2a5779fe209e3d0668f4361a46508a6511e6fbd7
3
+ size 1885
training_eval/mc4_nl_ul2_denoising/events.out.tfevents.1671186810.t1v-n-a765f9c4-w-0.15690.1.v2 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3219df7ab8f32e472a493561c6e5895aa8bf7013d8039b73fb1071f89fa1e808
3
+ size 654474
training_eval/mc4_nl_ul2_denoising/events.out.tfevents.1672218495.t1v-n-a765f9c4-w-0.1252081.1.v2 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4a08b1511e5acc2490e4b7ce1f42377cb6ef66998801838c294a410bee67fd28
3
+ size 98630
training_eval/mc4_nl_ul2_denoising/events.out.tfevents.1672354765.t1v-n-a765f9c4-w-0.1407790.1.v2 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:888d94bac6d13cdb1cb44c8cfda8cbad34c4e93b4450b575e893ac8d4fe4b667
3
+ size 738906
training_eval/mc4_nl_ul2_denoising/events.out.tfevents.1673086500.t1v-n-a765f9c4-w-0.2293433.1.v2 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a5a521cb113febe190f79ae98c963ef60130ca6d6e83093aff2b840107835d54
3
+ size 220001
training_eval/ul2_mc4_nedd_wiki_news_mix_1/events.out.tfevents.1671009166.t1v-n-a765f9c4-w-0.1571998.2.v2 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:37ab3f489be27c59e1bded09dbcc3625181cb4cfe5ffcbc2053c43ec860e7382
3
+ size 151231
training_eval/ul2_mc4_nedd_wiki_news_mix_1/events.out.tfevents.1671181223.t1v-n-a765f9c4-w-0.3622139.2.v2 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a24eb5730b3a266a582c1892a0ed545c22c8d02a881c41d0ab810c71317e6160
3
+ size 40
training_eval/ul2_mc4_nedd_wiki_news_mix_1/events.out.tfevents.1671182911.t1v-n-a765f9c4-w-0.3644916.2.v2 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:eaba7aaaa79ebfa21d036db95f4dd01ecdf6970afcc9611e5f8e779699194ef0
3
+ size 1885
training_eval/ul2_mc4_nedd_wiki_news_mix_1/events.out.tfevents.1671186810.t1v-n-a765f9c4-w-0.15690.2.v2 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:91a61296457305939c1bfc50cddf40febee35d37d3a2caafb6e1cf75f0c8a864
3
+ size 654474
training_eval/ul2_mc4_nedd_wiki_news_mix_1/events.out.tfevents.1672218495.t1v-n-a765f9c4-w-0.1252081.2.v2 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:13c0ca65df10216d1be4d9e1c245d4721d5f4869206f320281a03a4981d9ca6c
3
+ size 98630
training_eval/ul2_mc4_nedd_wiki_news_mix_1/events.out.tfevents.1672354765.t1v-n-a765f9c4-w-0.1407790.2.v2 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b9210a281bde0c07022c93b02624e710c30232fd5945e4fc741910d3de131640
3
+ size 738906
training_eval/ul2_mc4_nedd_wiki_news_mix_1/events.out.tfevents.1673086500.t1v-n-a765f9c4-w-0.2293433.2.v2 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:bc5ba73631b44acadb118e8794687a15333decb2decad1f31ffe8a6bc52f2b36
3
+ size 220001