Add files
- README.md +203 -0
- added_tokens.json +1 -0
- config.gin +150 -0
- config.json +34 -0
- events.out.tfevents.1673453219.t1v-n-c82e3785-w-0.4133.0.v2 +3 -0
- flax_model.msgpack +3 -0
- pytorch_model.bin +3 -0
- run_s2s_ul2-base-nl36-neddx2-en-nl.sh +75 -0
- special_tokens_map.json +107 -0
- spiece.model +3 -0
- spiece.vocab +0 -0
- tokenizer_config.json +113 -0
- training_state.json +1 -0
README.md
ADDED
@@ -0,0 +1,203 @@
1 |
+
|
2 |
+
---
|
3 |
+
language:
|
4 |
+
- nl
|
5 |
+
- en
|
6 |
+
- multilingual
|
7 |
+
license: apache-2.0
|
8 |
+
tags:
|
9 |
+
- dutch
|
10 |
+
- english
|
11 |
+
- t5
|
12 |
+
- t5x
|
13 |
+
- ul2
|
14 |
+
- seq2seq
|
15 |
+
- translation
|
16 |
+
datasets:
|
17 |
+
- yhavinga/mc4_nl_cleaned
|
18 |
+
- yhavinga/nedd_wiki_news
|
19 |
+
pipeline_tag: translation
|
20 |
+
widget:
|
21 |
+
- text: >-
|
22 |
+
Redistricting and West Virginia’s shrinking population forced the state’s
|
23 |
+
Republican Legislature to pit Mr. McKinley, a six-term Republican with a
|
24 |
+
pragmatic bent, against Mr. Mooney, who has served four terms marked more
|
25 |
+
by conservative rhetoric than legislative achievements.
|
26 |
+
- text: >-
|
27 |
+
It is a painful and tragic spectacle that rises before me: I have drawn
|
28 |
+
back the curtain from the rottenness of man. This word, in my mouth, is at
|
29 |
+
least free from one suspicion: that it involves a moral accusation against
|
30 |
+
humanity.
|
31 |
+
- text: >-
|
32 |
+
Young Wehling was hunched in his chair, his head in his hand. He was so
|
33 |
+
rumpled, so still and colorless as to be virtually invisible. His
|
34 |
+
camouflage was perfect, since the waiting room had a disorderly and
|
35 |
+
demoralized air, too. Chairs and ashtrays had been moved away from the
|
36 |
+
walls. The floor was paved with spattered dropcloths.
|
37 |
+
---
|
38 |
+
|
39 |
+
# ul2-base-nl36-en-nl for English to Dutch translation
|
40 |
+
|
41 |
+
A T5 model pretrained on Dutch with the UL2 (Mixture-of-Denoisers) objective and then fine-tuned for English-to-Dutch translation.
|
42 |
+
The T5 model was introduced in
|
43 |
+
[this paper](https://arxiv.org/abs/1910.10683)
|
44 |
+
and first released at [this page](https://github.com/google-research/text-to-text-transfer-transformer).
|
45 |
+
The UL2 objective was introduced in
|
46 |
+
[this paper](https://arxiv.org/abs/2205.05131)
|
47 |
+
and first released at [this page](https://github.com/google-research/google-research/tree/master/ul2).
|
48 |
+
|
49 |
+
|
50 |
+
|
51 |
+
## Model description
|
52 |
+
|
53 |
+
T5 is an encoder-decoder model and treats all NLP problems in a text-to-text format.
|
54 |
+
|
55 |
+
`ul2-base-nl36-en-nl` T5 is a transformers model fine-tuned on parallel sentence and paragraph pairs
|
56 |
+
sampled from books.
|
57 |
+
|
58 |
+
During pre-training, this model used the following [T5 v1.1](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) improvements over the original T5 model:
|
59 |
+
- GEGLU activation in the feed-forward hidden layer, rather than ReLU - see [here](https://arxiv.org/abs/2002.05202)
|
60 |
+
- Dropout was turned off during pre-training. Dropout should be re-enabled during fine-tuning
|
61 |
+
- Pre-trained on a self-supervised objective only, without mixing in downstream tasks
|
62 |
+
- No parameter sharing between embedding and classifier layer
|
63 |
+
|
64 |
+
The "efficient" T5 architecture findings presented in [this paper](https://arxiv.org/abs/2109.10686) were also applied,
|
65 |
+
which suggest that a Deep-Narrow model architecture is favorable for downstream performance compared to other model
|
66 |
+
architectures of similar parameter count. Specifically, the model depth is defined as the number of transformer blocks
|
67 |
+
that are stacked sequentially.
|
68 |
+
This model uses the [t5-efficient-base-nl36](https://huggingface.co/google/t5-efficient-base-nl36) architecture's
|
69 |
+
layer depth, which means both the encoder and the decoder have 36 transformer layers compared to the original T5 "base"
|
70 |
+
model's architecture of 12 transformer layers.
|
71 |
+
|
72 |
+
### UL2 pretraining objective
|
73 |
+
|
74 |
+
This model was pretrained with UL2's Mixture-of-Denoisers (MoD) objective, which combines diverse pre-training
|
75 |
+
paradigms together. UL2 frames different objective functions for training language models as denoising tasks, where
|
76 |
+
the model has to recover missing sub-sequences of a given input. During pre-training it uses a novel mixture-of-denoisers
|
77 |
+
that samples from a varied set of such objectives, each with different configurations. UL2 is trained using a mixture of
|
78 |
+
three denoising tasks:
|
79 |
+
|
80 |
+
1. R-denoising (or regular span corruption), which emulates the standard T5 span corruption objective;
|
81 |
+
2. X-denoising (or extreme span corruption); and
|
82 |
+
3. S-denoising (or sequential PrefixLM).
|
83 |
+
|
84 |
+
During pre-training, we sample from the available denoising tasks based on user-specified ratios.
|
85 |
+
UL2 introduces a notion of mode switching, wherein downstream fine-tuning is associated with a specific pre-training
|
86 |
+
denoising task. During pre-training, a paradigm token is inserted into the input
|
87 |
+
(`[NLU]` for R-denoising, `[NLG]` for X-denoising, or `[S2S]` for S-denoising) indicating the denoising task at hand.
|
88 |
+
Then, during fine-tuning the same input token should be inserted to get the best performance for different downstream
|
89 |
+
fine-tuning tasks.
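
As a rough illustration of mode switching, the sketch below prepends a paradigm token to an input string. The token names come from the list above; the helper `add_paradigm_token`, the single-space separator, and the one-letter denoiser keys are hypothetical choices for this example. Whether this fine-tuned translation checkpoint benefits from any prefix is not stated here and would need to be checked empirically.

```python
# Paradigm tokens as listed in this model card. The mapping keys and the
# helper below are illustrative assumptions, not part of the model's API.
PARADIGM_TOKENS = {
    "R": "[NLU]",  # regular span corruption
    "X": "[NLG]",  # extreme span corruption
    "S": "[S2S]",  # sequential PrefixLM
}


def add_paradigm_token(text: str, denoiser: str) -> str:
    """Prepend the UL2 paradigm token for the given denoiser to the input."""
    try:
        token = PARADIGM_TOKENS[denoiser]
    except KeyError:
        raise ValueError(
            f"unknown denoiser {denoiser!r}; expected one of R, X, S"
        ) from None
    # A single space as separator is an assumption for this sketch.
    return f"{token} {text}"


print(add_paradigm_token("The floor was paved with spattered dropcloths.", "S"))
```

Keeping the mapping in one dictionary makes it easy to try each prefix against a validation set and compare scores.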
|
90 |
+
|
91 |
+
## Intended uses & limitations
|
92 |
+
|
93 |
+
This model was fine-tuned on parallel sentence and paragraph pairs and can be used
|
94 |
+
for machine translation.
|
95 |
+
|
96 |
+
### How to use
|
97 |
+
|
98 |
+
Here is how to use this model in PyTorch:
|
99 |
+
|
100 |
+
```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

model_name = "yhavinga/ul2-base-nl36-en-nl"

# Use the first GPU when one is available, otherwise fall back to the CPU.
device_num = 0 if torch.cuda.is_available() else -1
device = "cpu" if device_num < 0 else f"cuda:{device_num}"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
# use_auth_token=True sends your stored Hugging Face token; it is only
# required when the repository is private or gated.
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, use_auth_token=True).to(device)

# Generation settings; max_length matches the fine-tuning sequence length.
params = {"max_length": 370, "num_beams": 4, "early_stopping": True}
translator = pipeline("translation", tokenizer=tokenizer, model=model, device=device_num)

text = (
    "Young Wehling was hunched in his chair, his head in his hand. "
    "He was so rumpled, so still and colorless as to be virtually invisible."
)
print(translator(text, **params)[0]["translation_text"])
```
|
118 |
+
|
119 |
+
|
120 |
+
### Limitations and bias
|
121 |
+
|
122 |
+
The training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral.
|
123 |
+
Therefore, the model can have biased predictions. This bias will also affect all fine-tuned versions of this model.
|
124 |
+
|
125 |
+
## Training data
|
126 |
+
|
127 |
+
The `ul2-base-nl36-en-nl` T5 model was pre-trained simultaneously on a combination of several datasets,
|
128 |
+
including the `full` config of the "mc4_nl_cleaned" dataset, which is a cleaned version of Common Crawl's web
|
129 |
+
crawl corpus, Dutch books, the Dutch subset of Wikipedia (2022-03-20), and a subset of "mc4_nl_cleaned"
|
130 |
+
containing only texts from Dutch and Belgian newspapers. This last dataset is oversampled to bias the model
|
131 |
+
towards descriptions of events in the Netherlands and Belgium.
|
132 |
+
|
133 |
+
After pre-training, the model was
|
134 |
+
fine-tuned on a translation dataset containing 13 million sentence and paragraph pairs
|
135 |
+
sampled from books.
|
136 |
+
|
137 |
+
|
138 |
+
|
139 |
+
## Training procedure
|
140 |
+
|
141 |
+
### Preprocessing
|
142 |
+
|
143 |
+
The ul2-base-nl36-en-nl T5 model uses a SentencePiece unigram tokenizer with a vocabulary of 32,000 tokens.
|
144 |
+
The tokenizer includes the special tokens `<pad>`, `</s>`, `<unk>`, known from the original T5 paper,
|
145 |
+
`[NLU]`, `[NLG]` and `[S2S]` for the MoD pre-training, and `<n>` for newline.
|
146 |
+
During pre-training with the UL2 objective, input and output sequences consist of 512 consecutive tokens.
|
147 |
+
The tokenizer does not lowercase texts and is therefore case-sensitive; it distinguishes
|
148 |
+
between `dutch` and `Dutch`.
|
149 |
+
Additionally, 100+28 extra tokens were added for pre-training tasks, resulting in a total of 32,128 tokens.
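
The vocabulary arithmetic can be checked with a few lines of Python. The counts come from this section, and the `[new_id_k]` to id mapping follows `added_tokens.json` in this repository; the constant names are only for this sketch.

```python
BASE_VOCAB = 32_000     # SentencePiece unigram pieces
SENTINEL_TOKENS = 100   # <extra_id_0> ... <extra_id_99>
NEW_ID_TOKENS = 28      # [new_id_0] ... [new_id_27]

total = BASE_VOCAB + SENTINEL_TOKENS + NEW_ID_TOKENS
print(total)

# added_tokens.json assigns [new_id_k] the id 32100 + k.
new_ids = {
    f"[new_id_{k}]": BASE_VOCAB + SENTINEL_TOKENS + k
    for k in range(NEW_ID_TOKENS)
}
print(new_ids["[new_id_0]"], new_ids["[new_id_27]"])
```

The total matches the `vocab_size` of 32128 in `config.json`.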
|
150 |
+
|
151 |
+
### Fine-tuning
|
152 |
+
|
153 |
+
This model was fine-tuned on a dataset containing 13M sentence and paragraph translation pairs sampled from books.
|
154 |
+
|
155 |
+
* Pre-trained model used as starting point: yhavinga/ul2-base-nl36-dutch
|
156 |
+
* Number of fine-tuning steps: 43415
|
157 |
+
* Batch size: 512 (gradient accumulation steps: 16)
|
158 |
+
* Sequence length: 370 tokens
|
159 |
+
* Model dtype: bfloat16
|
160 |
+
* z_loss: 0.0001
|
161 |
+
* Optimizer: adamw_hf (beta1: 0.9, beta2: 0.9969, eps: 1e-08)
|
162 |
+
* Dropout rate: 0.01
|
163 |
+
* Learning rate: 0.0009 with linear decay to 0 and warmup for 500 steps
|
164 |
+
* Label smoothing factor: 0.11
|
165 |
+
* BLEU score: 44.2
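
The learning-rate settings in the list above can be sketched as a plain function. This assumes the warmup rises linearly from 0 and that the decay reaches 0 exactly at the last fine-tuning step; the function name and constants are illustrative and are not taken from the training script.

```python
PEAK_LR = 0.0009
WARMUP_STEPS = 500
TOTAL_STEPS = 43_415


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to 0 at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(TOTAL_STEPS - step, 0)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 250, 500, 20_000, 43_415):
    print(step, learning_rate(step))
```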
|
166 |
+
|
167 |
+
### Model list
|
168 |
+
|
169 |
+
Models in this series:
|
170 |
+
|
171 |
+
|
172 |
+
| | ul2-base-en-nl | ul2-base-nl36-en-nl | ul2-large-en-nl |
|:---------------------|:-----------------|:----------------------|:------------------|
| model_type | t5 | t5 | t5 |
| _pipeline_tag | translation | translation | translation |
| d_model | 768 | 768 | 1024 |
| d_ff | 2048 | 3072 | 2816 |
| num_heads | 12 | 12 | 16 |
| d_kv | 64 | 64 | 64 |
| num_layers | 12 | 36 | 24 |
| num_decoder_layers | 12 | 36 | 24 |
| feed_forward_proj | gated-silu | gated-silu | gated-silu |
| dense_act_fn | silu | silu | silu |
| vocab_size | 32128 | 32128 | 32128 |
| tie_word_embeddings | 0 | 0 | 0 |
| torch_dtype | float32 | float32 | float32 |
| _gin_batch_size | 128 | 64 | 64 |
| _gin_z_loss | 0.0001 | 0.0001 | 0.0001 |
| _gin_t5_config_dtype | 'bfloat16' | 'bfloat16' | 'bfloat16' |
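
To make the shape differences in the table easier to compare, the sketch below copies a few columns into a small data structure and checks that, for each model, the attention width (`num_heads * d_kv`) equals `d_model`. The `Shape` class and `MODELS` dictionary are hypothetical names for this example; the numbers are taken from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    d_model: int
    d_ff: int
    num_heads: int
    d_kv: int
    num_layers: int  # encoder depth; the decoder depth is the same here


MODELS = {
    "ul2-base-en-nl": Shape(768, 2048, 12, 64, 12),
    "ul2-base-nl36-en-nl": Shape(768, 3072, 12, 64, 36),
    "ul2-large-en-nl": Shape(1024, 2816, 16, 64, 24),
}

for name, shape in MODELS.items():
    attention_width = shape.num_heads * shape.d_kv
    print(
        f"{name}: layers={shape.num_layers}, "
        f"attention width={attention_width}, "
        f"matches d_model={attention_width == shape.d_model}"
    )
```

The nl36 variant keeps the base width but has three times the depth of `ul2-base-en-nl`, which is the deep-narrow trade-off described in the model description.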
|
190 |
+
|
191 |
+
## Evaluation results
|
192 |
+
|
193 |
+
See the evaluation section in the interactive [Pre-training Dutch T5 Models](https://huggingface.co/spaces/yhavinga/pre-training-dutch-t5-models) blog.
|
194 |
+
|
195 |
+
## Acknowledgements
|
196 |
+
|
197 |
+
This project would not have been possible without compute generously provided by Google through the
|
198 |
+
[TPU Research Cloud](https://sites.research.google/trc/).
|
199 |
+
Thanks to the [Finnish-NLP](https://huggingface.co/Finnish-NLP) authors for releasing their code for the UL2 objective and associated task definitions.
|
200 |
+
Thanks to [Stephenn Fernandes](https://huggingface.co/StephennFernandes) for helping me get started with the t5x framework.
|
201 |
+
|
202 |
+
Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)
|
203 |
+
|
added_tokens.json
ADDED
@@ -0,0 +1 @@
1 |
+
{"[new_id_17]": 32117, "[new_id_20]": 32120, "[new_id_13]": 32113, "[new_id_2]": 32102, "[new_id_16]": 32116, "[new_id_7]": 32107, "[new_id_5]": 32105, "[new_id_1]": 32101, "[new_id_15]": 32115, "[new_id_12]": 32112, "[new_id_0]": 32100, "[new_id_11]": 32111, "[new_id_25]": 32125, "[new_id_24]": 32124, "[new_id_10]": 32110, "[new_id_27]": 32127, "[new_id_23]": 32123, "[new_id_14]": 32114, "[new_id_22]": 32122, "[new_id_21]": 32121, "[new_id_19]": 32119, "[new_id_3]": 32103, "[new_id_4]": 32104, "[new_id_18]": 32118, "[new_id_9]": 32109, "[new_id_8]": 32108, "[new_id_26]": 32126, "[new_id_6]": 32106}
|
config.gin
ADDED
@@ -0,0 +1,150 @@
1 |
+
from __gin__ import dynamic_registration
|
2 |
+
import __main__ as train_script
|
3 |
+
import seqio
|
4 |
+
import t5.data.mixtures
|
5 |
+
from t5x import adafactor
|
6 |
+
from t5x.examples.t5 import network
|
7 |
+
from t5x import gin_utils
|
8 |
+
from t5x import models
|
9 |
+
from t5x import partitioning
|
10 |
+
from t5x import trainer
|
11 |
+
from t5x import utils
|
12 |
+
import tasks.nedd_tasks
|
13 |
+
import tasks.ul2_tasks as tasks2
|
14 |
+
|
15 |
+
# Macros:
|
16 |
+
# ==============================================================================
|
17 |
+
BATCH_SIZE = 64
|
18 |
+
DROPOUT_RATE = 0.0
|
19 |
+
LABEL_SMOOTHING = 0.0
|
20 |
+
LOSS_NORMALIZING_FACTOR = None
|
21 |
+
MIXTURE_OR_TASK_MODULE = None
|
22 |
+
MIXTURE_OR_TASK_NAME = 'ul2_mc4_nedd_wiki_news_mix_1'
|
23 |
+
MODEL = @models.EncoderDecoderModel()
|
24 |
+
MODEL_DIR = 'ul2_base_nl36_mc4_nedd_wiki_news_nl'
|
25 |
+
OPTIMIZER = @adafactor.Adafactor()
|
26 |
+
RANDOM_SEED = None
|
27 |
+
SHUFFLE_TRAIN_EXAMPLES = True
|
28 |
+
TASK_FEATURE_LENGTHS = {'inputs': 512, 'targets': 512}
|
29 |
+
TRAIN_STEPS = 2000000
|
30 |
+
USE_CACHED_TASKS = False
|
31 |
+
USE_HARDWARE_RNG = False
|
32 |
+
VOCABULARY = @seqio.SentencePieceVocabulary()
|
33 |
+
Z_LOSS = 0.0001
|
34 |
+
|
35 |
+
# Parameters for adafactor.Adafactor:
|
36 |
+
# ==============================================================================
|
37 |
+
adafactor.Adafactor.decay_rate = 0.8
|
38 |
+
adafactor.Adafactor.logical_factor_rules = \
|
39 |
+
@adafactor.standard_logical_factor_rules()
|
40 |
+
adafactor.Adafactor.step_offset = 0
|
41 |
+
|
42 |
+
# Parameters for utils.CheckpointConfig:
|
43 |
+
# ==============================================================================
|
44 |
+
utils.CheckpointConfig.restore = @utils.RestoreCheckpointConfig()
|
45 |
+
utils.CheckpointConfig.save = @utils.SaveCheckpointConfig()
|
46 |
+
|
47 |
+
# Parameters for utils.create_learning_rate_scheduler:
|
48 |
+
# ==============================================================================
|
49 |
+
utils.create_learning_rate_scheduler.base_learning_rate = 1.0
|
50 |
+
utils.create_learning_rate_scheduler.factors = 'constant * rsqrt_decay'
|
51 |
+
utils.create_learning_rate_scheduler.warmup_steps = 10000
|
52 |
+
|
53 |
+
# Parameters for train/utils.DatasetConfig:
|
54 |
+
# ==============================================================================
|
55 |
+
train/utils.DatasetConfig.batch_size = %BATCH_SIZE
|
56 |
+
train/utils.DatasetConfig.mixture_or_task_name = %MIXTURE_OR_TASK_NAME
|
57 |
+
train/utils.DatasetConfig.module = %MIXTURE_OR_TASK_MODULE
|
58 |
+
train/utils.DatasetConfig.pack = True
|
59 |
+
train/utils.DatasetConfig.seed = None
|
60 |
+
train/utils.DatasetConfig.shuffle = %SHUFFLE_TRAIN_EXAMPLES
|
61 |
+
train/utils.DatasetConfig.split = 'train'
|
62 |
+
train/utils.DatasetConfig.task_feature_lengths = %TASK_FEATURE_LENGTHS
|
63 |
+
train/utils.DatasetConfig.use_cached = %USE_CACHED_TASKS
|
64 |
+
|
65 |
+
# Parameters for train_eval/utils.DatasetConfig:
|
66 |
+
# ==============================================================================
|
67 |
+
train_eval/utils.DatasetConfig.batch_size = %BATCH_SIZE
|
68 |
+
train_eval/utils.DatasetConfig.mixture_or_task_name = %MIXTURE_OR_TASK_NAME
|
69 |
+
train_eval/utils.DatasetConfig.module = %MIXTURE_OR_TASK_MODULE
|
70 |
+
train_eval/utils.DatasetConfig.pack = True
|
71 |
+
train_eval/utils.DatasetConfig.seed = 42
|
72 |
+
train_eval/utils.DatasetConfig.shuffle = False
|
73 |
+
train_eval/utils.DatasetConfig.split = 'validation'
|
74 |
+
train_eval/utils.DatasetConfig.task_feature_lengths = %TASK_FEATURE_LENGTHS
|
75 |
+
train_eval/utils.DatasetConfig.use_cached = %USE_CACHED_TASKS
|
76 |
+
|
77 |
+
# Parameters for models.EncoderDecoderModel:
|
78 |
+
# ==============================================================================
|
79 |
+
models.EncoderDecoderModel.input_vocabulary = %VOCABULARY
|
80 |
+
models.EncoderDecoderModel.label_smoothing = %LABEL_SMOOTHING
|
81 |
+
models.EncoderDecoderModel.loss_normalizing_factor = %LOSS_NORMALIZING_FACTOR
|
82 |
+
models.EncoderDecoderModel.module = @network.Transformer()
|
83 |
+
models.EncoderDecoderModel.optimizer_def = %OPTIMIZER
|
84 |
+
models.EncoderDecoderModel.output_vocabulary = %VOCABULARY
|
85 |
+
models.EncoderDecoderModel.z_loss = %Z_LOSS
|
86 |
+
|
87 |
+
# Parameters for partitioning.PjitPartitioner:
|
88 |
+
# ==============================================================================
|
89 |
+
partitioning.PjitPartitioner.logical_axis_rules = \
|
90 |
+
@partitioning.standard_logical_axis_rules()
|
91 |
+
partitioning.PjitPartitioner.model_parallel_submesh = None
|
92 |
+
partitioning.PjitPartitioner.num_partitions = 1
|
93 |
+
|
94 |
+
# Parameters for utils.RestoreCheckpointConfig:
|
95 |
+
# ==============================================================================
|
96 |
+
utils.RestoreCheckpointConfig.path = []
|
97 |
+
|
98 |
+
# Parameters for utils.SaveCheckpointConfig:
|
99 |
+
# ==============================================================================
|
100 |
+
utils.SaveCheckpointConfig.dtype = 'float32'
|
101 |
+
utils.SaveCheckpointConfig.keep = 4
|
102 |
+
utils.SaveCheckpointConfig.period = 50000
|
103 |
+
utils.SaveCheckpointConfig.save_dataset = False
|
104 |
+
utils.SaveCheckpointConfig.use_gda = False
|
105 |
+
|
106 |
+
# Parameters for seqio.SentencePieceVocabulary:
|
107 |
+
# ==============================================================================
|
108 |
+
seqio.SentencePieceVocabulary.sentencepiece_model_file = \
|
109 |
+
'gs://t5-dutch-english/vocabs/nedd.32000.128extra/spiece.model'
|
110 |
+
|
111 |
+
# Parameters for network.T5Config:
|
112 |
+
# ==============================================================================
|
113 |
+
network.T5Config.dropout_rate = %DROPOUT_RATE
|
114 |
+
network.T5Config.dtype = 'bfloat16'
|
115 |
+
network.T5Config.emb_dim = 768
|
116 |
+
network.T5Config.head_dim = 64
|
117 |
+
network.T5Config.logits_via_embedding = False
|
118 |
+
network.T5Config.mlp_activations = ('gelu', 'linear')
|
119 |
+
network.T5Config.mlp_dim = 3072
|
120 |
+
network.T5Config.num_decoder_layers = 36
|
121 |
+
network.T5Config.num_encoder_layers = 36
|
122 |
+
network.T5Config.num_heads = 12
|
123 |
+
network.T5Config.vocab_size = 32128
|
124 |
+
|
125 |
+
# Parameters for train_script.train:
|
126 |
+
# ==============================================================================
|
127 |
+
train_script.train.checkpoint_cfg = @utils.CheckpointConfig()
|
128 |
+
train_script.train.eval_period = 2000
|
129 |
+
train_script.train.eval_steps = 20
|
130 |
+
train_script.train.infer_eval_dataset_cfg = None
|
131 |
+
train_script.train.model = %MODEL
|
132 |
+
train_script.train.model_dir = %MODEL_DIR
|
133 |
+
train_script.train.partitioner = @partitioning.PjitPartitioner()
|
134 |
+
train_script.train.random_seed = %RANDOM_SEED
|
135 |
+
train_script.train.stats_period = 100
|
136 |
+
train_script.train.summarize_config_fn = @gin_utils.summarize_gin_config
|
137 |
+
train_script.train.total_steps = %TRAIN_STEPS
|
138 |
+
train_script.train.train_dataset_cfg = @train/utils.DatasetConfig()
|
139 |
+
train_script.train.train_eval_dataset_cfg = @train_eval/utils.DatasetConfig()
|
140 |
+
train_script.train.trainer_cls = @trainer.Trainer
|
141 |
+
train_script.train.use_hardware_rng = %USE_HARDWARE_RNG
|
142 |
+
|
143 |
+
# Parameters for trainer.Trainer:
|
144 |
+
# ==============================================================================
|
145 |
+
trainer.Trainer.learning_rate_fn = @utils.create_learning_rate_scheduler()
|
146 |
+
trainer.Trainer.num_microbatches = None
|
147 |
+
|
148 |
+
# Parameters for network.Transformer:
|
149 |
+
# ==============================================================================
|
150 |
+
network.Transformer.config = @network.T5Config()
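
A note on the pre-training schedule configured above: `factors = 'constant * rsqrt_decay'` with `base_learning_rate = 1.0` and `warmup_steps = 10000` can be sketched in plain Python as follows. This assumes the t5x `rsqrt_decay` factor is `1 / sqrt(max(step, warmup_steps))`; the function name is hypothetical and the sketch is not the t5x implementation itself.

```python
import math

BASE_LEARNING_RATE = 1.0
WARMUP_STEPS = 10_000


def pretraining_lr(step: int) -> float:
    """Flat at 1/sqrt(WARMUP_STEPS) until warmup ends, then 1/sqrt(step)."""
    return BASE_LEARNING_RATE / math.sqrt(max(step, WARMUP_STEPS))


for step in (0, 10_000, 250_000, 1_000_000):
    print(step, pretraining_lr(step))
```

Under this assumption the rate stays at a plateau for the first 10,000 steps and then decays with the inverse square root of the step count.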
|
config.json
ADDED
@@ -0,0 +1,34 @@
1 |
+
{
|
2 |
+
"_name_or_path": "./",
|
3 |
+
"architectures": [
|
4 |
+
"T5ForConditionalGeneration"
|
5 |
+
],
|
6 |
+
"d_ff": 3072,
|
7 |
+
"d_kv": 64,
|
8 |
+
"d_model": 768,
|
9 |
+
"decoder_start_token_id": 0,
|
10 |
+
"dense_act_fn": "silu",
|
11 |
+
"dropout_rate": 0.01,
|
12 |
+
"early_stopping": true,
|
13 |
+
"eos_token_id": 1,
|
14 |
+
"feed_forward_proj": "gated-silu",
|
15 |
+
"initializer_factor": 1.0,
|
16 |
+
"is_encoder_decoder": true,
|
17 |
+
"is_gated_act": true,
|
18 |
+
"layer_norm_epsilon": 1e-06,
|
19 |
+
"max_length": 370,
|
20 |
+
"model_type": "t5",
|
21 |
+
"num_beams": 4,
|
22 |
+
"num_decoder_layers": 36,
|
23 |
+
"num_heads": 12,
|
24 |
+
"num_layers": 36,
|
25 |
+
"output_past": true,
|
26 |
+
"pad_token_id": 0,
|
27 |
+
"relative_attention_max_distance": 128,
|
28 |
+
"relative_attention_num_buckets": 32,
|
29 |
+
"tie_word_embeddings": false,
|
30 |
+
"torch_dtype": "float32",
|
31 |
+
"transformers_version": "4.24.0",
|
32 |
+
"use_cache": true,
|
33 |
+
"vocab_size": 32128
|
34 |
+
}
|
events.out.tfevents.1673453219.t1v-n-c82e3785-w-0.4133.0.v2
ADDED
@@ -0,0 +1,3 @@
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:6b6e252bece32e07b67707a9eb56c2bd1599dfda1084432a03d0f9f0d746f74b
|
3 |
+
size 1941504
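
The large binary files in this commit are stored as Git LFS pointers: three `key value` lines giving the spec version, the content hash, and the size in bytes. A minimal sketch of reading such a pointer is shown below; `parse_lfs_pointer` is a hypothetical helper written for this example, not part of any Git LFS tooling.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer into its fields; `size` is converted to int."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    fields["size"] = int(fields["size"])
    return fields


pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:6b6e252bece32e07b67707a9eb56c2bd1599dfda1084432a03d0f9f0d746f74b
size 1941504
"""

info = parse_lfs_pointer(pointer)
print(info["oid"], f"{info['size'] / 2**20:.2f} MiB")
```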
|
flax_model.msgpack
ADDED
@@ -0,0 +1,3 @@
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:0db5c8d7d9b492a2d2fe68a2197442fe1f12709055f0da7b85d8fec2cb08a34e
|
3 |
+
size 1677466902
|
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:952bd79ab8ec1a7c8fc294ecb8cd05851a4fd570b000166ed5bc36331afe44ce
|
3 |
+
size 3255881749
|
run_s2s_ul2-base-nl36-neddx2-en-nl.sh
ADDED
@@ -0,0 +1,75 @@
1 |
+
export CORES=`grep -c ^processor /proc/cpuinfo`
|
2 |
+
export CORES=`echo "scale=0; ${CORES} * 0.8 / 1" | bc`
|
3 |
+
|
4 |
+
#export XLA_PYTHON_CLIENT_PREALLOCATE=false
|
5 |
+
export SOURCE_LANG="en"
|
6 |
+
export TARGET_LANG="nl"
|
7 |
+
export HF_PROJECT="ul2-base-nl36-neddx2-en-nl"
|
8 |
+
#
|
9 |
+
export DATASET="/home/yeb/data/nedd_x_dataset/nedd_x_dataset.py"
|
10 |
+
#export DATASET_CONFIG="dict"
|
11 |
+
export DATASET_CONFIG="voc8k_beta_3buf"
|
12 |
+
export MODEL_NAME_OR_PATH="yhavinga/ul2-base-nl36-dutch"
|
13 |
+
export TOKENIZER_NAME="yhavinga/ul2-base-nl36-dutch"
|
14 |
+
export MODEL_PATH="${HOME}/data/${HF_PROJECT}" # Path to the model
|
15 |
+
export HF_DATASETS_CACHE=/mnt/ramdisk
|
16 |
+
|
17 |
+
# 52k 8k 32ksp
|
18 |
+
#l 472 500
|
19 |
+
#b0 328 352
|
20 |
+
#b1 472 480 370
|
21 |
+
#b2 1920 1984
|
22 |
+
|
23 |
+
mkdir -p ${MODEL_PATH}
|
24 |
+
|
25 |
+
python ../run_s2s_flax_pmap_multiseq.py \
|
26 |
+
--output_dir="${MODEL_PATH}" \
|
27 |
+
--model_name_or_path ${MODEL_NAME_OR_PATH} \
|
28 |
+
--tokenizer_name ${TOKENIZER_NAME} \
|
29 |
+
--use_fast_tokenizer="False" \
|
30 |
+
--use_auth_token="True" \
|
31 |
+
--dataset_name_list ${DATASET}\
|
32 |
+
--dataset_config_name_list "${DATASET_CONFIG}"\
|
33 |
+
--id_filter_list "<not>-b2-" \
|
34 |
+
--max_train_samples_list "0" \
|
35 |
+
--max_eval_samples_list "2000" \
|
36 |
+
--max_predict_samples_list "128" \
|
37 |
+
--preprocessing_num_workers="${CORES}" \
|
38 |
+
--source_lang="${SOURCE_LANG}" \
|
39 |
+
--target_lang="${TARGET_LANG}" \
|
40 |
+
--metric_name="sacrebleu" \
|
41 |
+
--do_train --do_eval --do_predict \
|
42 |
+
--predict_with_generate \
|
43 |
+
--learning_rate="0.0009" \
|
44 |
+
--adam_beta1="0.9" \
|
45 |
+
--adam_beta2="0.9969" \
|
46 |
+
--adam_epsilon="1e-8" \
|
47 |
+
--weight_decay="0.001" \
|
48 |
+
--label_smoothing_factor="0.11" \
|
49 |
+
--length_penalty="1.3" \
|
50 |
+
--warmup_steps 500 \
|
51 |
+
--dropout_rate="0.01" \
|
52 |
+
--dtype "bfloat16" \
|
53 |
+
--z_loss "1e-4" \
|
54 |
+
--dynamic_loss_scaling="False" \
|
55 |
+
--per_device_train_batch_size 4 \
|
56 |
+
--per_device_eval_batch_size 4 \
|
57 |
+
--gradient_accumulation_steps 16 \
|
58 |
+
--overwrite_output_dir \
|
59 |
+
--max_source_length_list 370 \
|
60 |
+
--max_target_length_list 370 \
|
61 |
+
--num_beams 5 \
|
62 |
+
--overwrite_output_dir \
|
63 |
+
--logging_steps 5 \
|
64 |
+
--save_steps 800 \
|
65 |
+
--eval_steps 800 \
|
66 |
+
--num_train_epochs 2 \
|
67 |
+
--max_eval_samples 512 \
|
68 |
+
--validation_split_count 2000 \
|
69 |
+
--wandb_project="${HF_PROJECT}" \
|
70 |
+
--wandb_job_type="pmap"
|
71 |
+
|
72 |
+
# --resume_from_checkpoint="${MODEL_PATH}" \
|
73 |
+
# --max_train_samples="1_064_886" \
|
74 |
+
# --max_eval_samples 256 \
|
75 |
+
# --max_predict_samples 256 \
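
A note on the batch-size flags in the script above: the README reports a batch size of 512, while the script sets `--per_device_train_batch_size 4` and `--gradient_accumulation_steps 16`. The sketch below works out the device count this would imply, assuming the reported figure is the global effective batch (per-device batch × accumulation steps × devices). The device count is an inference, not something stated in this commit.

```python
PER_DEVICE_TRAIN_BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 16
REPORTED_BATCH_SIZE = 512  # from the README's fine-tuning section

# Examples contributed by one device to a single optimizer update.
per_device_effective = PER_DEVICE_TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS

# Device count implied by the assumption described above.
implied_devices, remainder = divmod(REPORTED_BATCH_SIZE, per_device_effective)
print(per_device_effective, implied_devices, remainder)
```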
|
special_tokens_map.json
ADDED
@@ -0,0 +1,107 @@
1 |
+
{
|
2 |
+
"additional_special_tokens": [
|
3 |
+
"<extra_id_0>",
|
4 |
+
"<extra_id_1>",
|
5 |
+
"<extra_id_2>",
|
6 |
+
"<extra_id_3>",
|
7 |
+
"<extra_id_4>",
|
8 |
+
"<extra_id_5>",
|
9 |
+
"<extra_id_6>",
|
10 |
+
"<extra_id_7>",
|
11 |
+
"<extra_id_8>",
|
12 |
+
"<extra_id_9>",
|
13 |
+
"<extra_id_10>",
|
14 |
+
"<extra_id_11>",
|
15 |
+
"<extra_id_12>",
|
16 |
+
"<extra_id_13>",
|
17 |
+
"<extra_id_14>",
|
18 |
+
"<extra_id_15>",
|
19 |
+
"<extra_id_16>",
|
20 |
+
"<extra_id_17>",
|
21 |
+
"<extra_id_18>",
|
22 |
+
"<extra_id_19>",
|
23 |
+
"<extra_id_20>",
|
24 |
+
"<extra_id_21>",
|
25 |
+
"<extra_id_22>",
|
26 |
+
"<extra_id_23>",
|
27 |
+
"<extra_id_24>",
|
28 |
+
"<extra_id_25>",
|
29 |
+
"<extra_id_26>",
|
30 |
+
"<extra_id_27>",
|
31 |
+
"<extra_id_28>",
|
32 |
+
"<extra_id_29>",
|
33 |
+
"<extra_id_30>",
|
34 |
+
"<extra_id_31>",
|
35 |
+
"<extra_id_32>",
|
36 |
+
"<extra_id_33>",
|
37 |
+
"<extra_id_34>",
|
38 |
+
"<extra_id_35>",
|
39 |
+
"<extra_id_36>",
|
40 |
+
"<extra_id_37>",
|
41 |
+
"<extra_id_38>",
|
42 |
+
"<extra_id_39>",
|
43 |
+
"<extra_id_40>",
|
44 |
+
"<extra_id_41>",
|
45 |
+
"<extra_id_42>",
|
46 |
+
"<extra_id_43>",
|
47 |
+
"<extra_id_44>",
|
48 |
+
"<extra_id_45>",
|
49 |
+
"<extra_id_46>",
|
50 |
+
"<extra_id_47>",
|
51 |
+
"<extra_id_48>",
|
52 |
+
"<extra_id_49>",
|
53 |
+
"<extra_id_50>",
|
54 |
+
"<extra_id_51>",
|
55 |
+
"<extra_id_52>",
|
56 |
+
"<extra_id_53>",
|
57 |
+
"<extra_id_54>",
|
58 |
+
"<extra_id_55>",
|
59 |
+
"<extra_id_56>",
|
60 |
+
"<extra_id_57>",
|
61 |
+
"<extra_id_58>",
|
62 |
+
"<extra_id_59>",
|
63 |
+
"<extra_id_60>",
|
64 |
+
"<extra_id_61>",
|
65 |
+
"<extra_id_62>",
|
66 |
+
"<extra_id_63>",
|
67 |
+
"<extra_id_64>",
|
68 |
+
"<extra_id_65>",
|
69 |
+
"<extra_id_66>",
|
70 |
+
"<extra_id_67>",
|
71 |
+
"<extra_id_68>",
|
72 |
+
"<extra_id_69>",
|
73 |
+
"<extra_id_70>",
|
74 |
+
"<extra_id_71>",
|
75 |
+
"<extra_id_72>",
|
76 |
+
"<extra_id_73>",
|
77 |
+
"<extra_id_74>",
|
78 |
+
"<extra_id_75>",
|
79 |
+
"<extra_id_76>",
|
80 |
+
"<extra_id_77>",
|
81 |
+
"<extra_id_78>",
|
82 |
+
"<extra_id_79>",
|
83 |
+
"<extra_id_80>",
|
84 |
+
"<extra_id_81>",
|
85 |
+
"<extra_id_82>",
|
86 |
+
"<extra_id_83>",
|
87 |
+
"<extra_id_84>",
|
88 |
+
"<extra_id_85>",
|
89 |
+
"<extra_id_86>",
|
90 |
+
"<extra_id_87>",
|
91 |
+
"<extra_id_88>",
|
92 |
+
"<extra_id_89>",
|
93 |
+
"<extra_id_90>",
|
94 |
+
"<extra_id_91>",
|
95 |
+
"<extra_id_92>",
|
96 |
+
"<extra_id_93>",
|
97 |
+
"<extra_id_94>",
|
98 |
+
"<extra_id_95>",
|
99 |
+
"<extra_id_96>",
|
100 |
+
"<extra_id_97>",
|
101 |
+
"<extra_id_98>",
|
102 |
+
"<extra_id_99>"
|
103 |
+
],
|
104 |
+
"eos_token": "</s>",
|
105 |
+
"pad_token": "<pad>",
|
106 |
+
"unk_token": "<unk>"
|
107 |
+
}
|
spiece.model
ADDED
@@ -0,0 +1,3 @@
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:caa6e2f21aeec181276ab80273e3f869ce303ccb8602d68e0524783c3581092d
|
3 |
+
size 800223
|
spiece.vocab
ADDED
The diff for this file is too large to render.
See raw diff
|
|
tokenizer_config.json
ADDED
@@ -0,0 +1,113 @@
1 |
+
{
|
2 |
+
"additional_special_tokens": [
|
3 |
+
"<extra_id_0>",
|
4 |
+
"<extra_id_1>",
|
5 |
+
"<extra_id_2>",
|
6 |
+
"<extra_id_3>",
|
7 |
+
"<extra_id_4>",
|
8 |
+
"<extra_id_5>",
|
9 |
+
"<extra_id_6>",
|
10 |
+
"<extra_id_7>",
|
11 |
+
"<extra_id_8>",
|
12 |
+
"<extra_id_9>",
|
13 |
+
"<extra_id_10>",
|
14 |
+
"<extra_id_11>",
|
15 |
+
"<extra_id_12>",
|
16 |
+
"<extra_id_13>",
|
17 |
+
"<extra_id_14>",
|
18 |
+
"<extra_id_15>",
|
19 |
+
"<extra_id_16>",
|
20 |
+
"<extra_id_17>",
|
21 |
+
"<extra_id_18>",
|
22 |
+
"<extra_id_19>",
|
23 |
+
"<extra_id_20>",
|
24 |
+
"<extra_id_21>",
|
25 |
+
"<extra_id_22>",
|
26 |
+
"<extra_id_23>",
|
27 |
+
"<extra_id_24>",
|
28 |
+
"<extra_id_25>",
|
29 |
+
"<extra_id_26>",
|
30 |
+
"<extra_id_27>",
|
31 |
+
"<extra_id_28>",
|
32 |
+
"<extra_id_29>",
|
33 |
+
"<extra_id_30>",
|
34 |
+
"<extra_id_31>",
|
35 |
+
"<extra_id_32>",
|
36 |
+
"<extra_id_33>",
|
37 |
+
"<extra_id_34>",
|
38 |
+
"<extra_id_35>",
|
39 |
+
"<extra_id_36>",
|
40 |
+
"<extra_id_37>",
|
41 |
+
"<extra_id_38>",
|
42 |
+
"<extra_id_39>",
|
43 |
+
"<extra_id_40>",
|
44 |
+
"<extra_id_41>",
|
45 |
+
"<extra_id_42>",
|
46 |
+
"<extra_id_43>",
|
47 |
+
"<extra_id_44>",
|
48 |
+
"<extra_id_45>",
|
49 |
+
"<extra_id_46>",
|
50 |
+
"<extra_id_47>",
|
51 |
+
"<extra_id_48>",
|
52 |
+
"<extra_id_49>",
|
53 |
+
"<extra_id_50>",
|
54 |
+
"<extra_id_51>",
|
55 |
+
"<extra_id_52>",
|
56 |
+
"<extra_id_53>",
|
57 |
+
"<extra_id_54>",
|
58 |
+
"<extra_id_55>",
|
59 |
+
"<extra_id_56>",
|
60 |
+
"<extra_id_57>",
|
61 |
+
"<extra_id_58>",
|
62 |
+
"<extra_id_59>",
|
63 |
+
"<extra_id_60>",
|
64 |
+
"<extra_id_61>",
|
65 |
+
"<extra_id_62>",
|
66 |
+
"<extra_id_63>",
|
67 |
+
"<extra_id_64>",
|
68 |
+
"<extra_id_65>",
|
69 |
+
"<extra_id_66>",
|
70 |
+
"<extra_id_67>",
|
71 |
+
"<extra_id_68>",
|
72 |
+
"<extra_id_69>",
|
73 |
+
"<extra_id_70>",
|
74 |
+
"<extra_id_71>",
|
75 |
+
"<extra_id_72>",
|
76 |
+
"<extra_id_73>",
|
77 |
+
"<extra_id_74>",
|
78 |
+
"<extra_id_75>",
|
79 |
+
"<extra_id_76>",
|
80 |
+
"<extra_id_77>",
|
81 |
+
"<extra_id_78>",
|
82 |
+
"<extra_id_79>",
|
83 |
+
"<extra_id_80>",
|
84 |
+
"<extra_id_81>",
|
85 |
+
"<extra_id_82>",
|
86 |
+
"<extra_id_83>",
|
87 |
+
"<extra_id_84>",
|
88 |
+
"<extra_id_85>",
|
89 |
+
"<extra_id_86>",
|
90 |
+
"<extra_id_87>",
|
91 |
+
"<extra_id_88>",
|
92 |
+
"<extra_id_89>",
|
93 |
+
"<extra_id_90>",
|
94 |
+
"<extra_id_91>",
|
95 |
+
"<extra_id_92>",
|
96 |
+
"<extra_id_93>",
|
97 |
+
"<extra_id_94>",
|
98 |
+
"<extra_id_95>",
|
99 |
+
"<extra_id_96>",
|
100 |
+
"<extra_id_97>",
|
101 |
+
"<extra_id_98>",
|
102 |
+
"<extra_id_99>"
|
103 |
+
],
|
104 |
+
"eos_token": "</s>",
|
105 |
+
"extra_ids": 100,
|
106 |
+
"name_or_path": "yhavinga/ul2-base-nl36-dutch",
|
107 |
+
"pad_token": "<pad>",
|
108 |
+
"sp_model_kwargs": {},
|
109 |
+
"special_tokens_map_file": null,
|
110 |
+
"tokenizer_class": "T5Tokenizer",
|
111 |
+
"unk_token": "<unk>",
|
112 |
+
"use_fast_tokenizer": false
|
113 |
+
}
|
training_state.json
ADDED
@@ -0,0 +1 @@
1 |
+
{"step": 691215}
|