yhavinga committed on
Commit
37e7b34
β€’
1 Parent(s): fa73be6

Small updates

Files changed (6)
  1. INTRO.md +1 -1
  2. PRETRAINING.md +1 -6
  3. README.md +1 -0
  4. REMARKS.md +6 -8
  5. app.py +133 -56
  6. requirements.txt +1 -1
INTRO.md CHANGED
@@ -1,6 +1,6 @@
1
  # Dutch T5 models : UL2, T5, ByT5 and Long-T5 πŸ‡³πŸ‡±πŸ‡§πŸ‡ͺ
2
 
3
- TL;DR: Dutch T5 and UL2 Models Trained with Google's TPU Research Cloud and mC4 Dataset Show Outstanding Performance in NLP Tasks.
4
  See below for model lists and comparison.
5
 
6
  During the [HuggingFace Flax/Jax community week](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104) in the summer of 2021,
 
1
  # Dutch T5 models : UL2, T5, ByT5 and Long-T5 πŸ‡³πŸ‡±πŸ‡§πŸ‡ͺ
2
 
3
+ TL;DR: Dutch NLP gets a boost with state-of-the-art T5 models pre-trained on the cleaned Dutch portion of mC4, the largest available Dutch corpus, and on additional datasets.
4
  See below for model lists and comparison.
5
 
6
  During the [HuggingFace Flax/Jax community week](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104) in the summer of 2021,
PRETRAINING.md CHANGED
@@ -8,15 +8,10 @@ It was made available by AllenNLP on the HuggingFace Dataset hub.
8
  Our team confirmed that the Dutch portion of the mC4 dataset was deduplicated,
9
  and we cleaned the Dutch portion of the mC4 dataset using [code adapted](https://gitlab.com/yhavinga/c4nlpreproc) from the TensorFlow C4 dataset.
10
  The resulting [mc4_nl_cleaned](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned) dataset on the HuggingFace hub
11
- has configs for several sizes, and also configs for mixed Dutch and English
12
  texts, e.g. [micro_en_nl](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned/viewer/micro_en_nl/train).
13
  The `_en_nl` configs were added to accommodate multi-language pre-training
14
with the Huggingface pre-training script, which accepts only a single dataset as input.
15
- Cleaned English C4 is roughly 5 times larger than its Dutch counterpart. Therefore,
16
- interleaving the datasets in a 1:1 ratio results in discarding approximately 80% of the English data.
17
- (When pre-training with T5X and SeqIO, it is possible to define task mixtures that include multiple datasets,
18
- so these `_en_nl` configs are not needed.)
19
-
20
  The full, cleaned Dutch mC4 dataset is 151GB and remains (as of June 2022) the largest available Dutch
21
  corpus on the HuggingFace Dataset hub.
22
 
 
8
  Our team confirmed that the Dutch portion of the mC4 dataset was deduplicated,
9
  and we cleaned the Dutch portion of the mC4 dataset using [code adapted](https://gitlab.com/yhavinga/c4nlpreproc) from the TensorFlow C4 dataset.
10
  The resulting [mc4_nl_cleaned](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned) dataset on the HuggingFace hub
11
+ has configs for several sizes, as well as configs with interleaved Dutch and English
12
  texts, e.g. [micro_en_nl](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned/viewer/micro_en_nl/train).
13
  The `_en_nl` configs were added to accommodate multi-language pre-training
14
with the Huggingface pre-training script, which accepts only a single dataset as input.
 
 
 
 
 
15
  The full, cleaned Dutch mC4 dataset is 151GB and remains (as of June 2022) the largest available Dutch
16
  corpus on the HuggingFace Dataset hub.
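As a quick illustration of how these configs are selected (a minimal sketch; `micro_en_nl` is the config linked above, while the Dutch-only config name and the use of streaming are assumptions):

```python
from datasets import load_dataset

# Interleaved Dutch+English config, usable as a single dataset by the
# HuggingFace run_t5_mlm_flax.py pre-training script; streaming avoids
# downloading the full corpus up front.
en_nl = load_dataset(
    "yhavinga/mc4_nl_cleaned", "micro_en_nl", split="train", streaming=True
)

# Dutch-only config of the same size tier (config name assumed, see the dataset card).
nl_only = load_dataset(
    "yhavinga/mc4_nl_cleaned", "micro", split="train", streaming=True
)

print(next(iter(en_nl)))  # one cleaned web-text record
```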
17
 
README.md CHANGED
@@ -4,6 +4,7 @@ emoji: πŸš€
4
  colorFrom: blue
5
  colorTo: pink
6
  sdk: streamlit
 
7
  pinned: false
8
  app_file: app.py
9
  license: afl-3.0
 
4
  colorFrom: blue
5
  colorTo: pink
6
  sdk: streamlit
7
+ sdk_version: 1.10.0
8
  pinned: false
9
  app_file: app.py
10
  license: afl-3.0
REMARKS.md CHANGED
@@ -1,7 +1,7 @@
1
  ## Miscellaneous remarks
2
 
3
- * Use loss regularization if you train with `bfloat16` (more info below)
4
- * Beware of the dropout rate in the config.json file.
5
  Check in a model's `config.json` what the dropout rate has been set to. Unless you
6
intend to run many epochs on the same data, it's worth trying a training run without dropout.
7
If you want to compare losses, be sure to use the same dropout rate.
@@ -9,14 +9,12 @@
9
  * Training with more layers is much slower than you'd expect from the increased model size.
10
  It is also more difficult to get batch size and learning rate right. Below is a section
11
  about finding the right hyperparameters for the base-36L training.
12
- * The 'larger' models are not only harder to pre-train, but also harder to fine-tune. The optimizer eats up a lot of
13
- space, and the amount of memory required also depends on the length of source and target sequences.
14
- * PyCharms remote debugging features are useful to inspect variables on either a TPU VM or your deep-learning rig.
15
  * When increasing the batch size, increase the learning rate. bs * 2 -> lr * sqrt(2) is a good heuristic but mileage may
16
  vary.
17
- * Translation evaluation: the low score of the 128 seq len models on opus books may be because of the brevity penaly...
18
- that books may have sentences longer than 128 tokens.
19
* Dataset quality is a key success factor. Do not expect a model to magically turn mediocre data into good results. This holds for
20
the pre-training data as well as the fine-tuning and evaluation data.
21
* A good Bleu score does not necessarily mean fluent text. Evaluation loss on a large translation dataset might be
22
- better suited for model comparison.
 
1
  ## Miscellaneous remarks
2
 
3
+ * Use loss regularization when training with `bfloat16` for better results (more info below).
4
+ * Be mindful of the dropout rate in the `config.json` file and consider training without dropout.
5
  Check in a model's `config.json` what the dropout rate has been set to. Unless you
6
intend to run many epochs on the same data, it's worth trying a training run without dropout.
7
If you want to compare losses, be sure to use the same dropout rate.
 
9
  * Training with more layers is much slower than you'd expect from the increased model size.
10
  It is also more difficult to get batch size and learning rate right. Below is a section
11
  about finding the right hyperparameters for the base-36L training.
12
+ * For the translation task, I am not sure that a 'deep-narrow' model (e.g. base-nl36) is better than a normal model
13
+ of comparable size (e.g. `large`).
14
+ * PyCharm's remote debugging features are useful for inspecting variables on either a TPU VM or your deep-learning rig.
15
* When increasing the batch size, increase the learning rate as well: bs * 2 -> lr * sqrt(2) is a good heuristic, but mileage may
16
vary (see the sketch at the end of this section).
 
 
17
* Dataset quality is a key success factor. Do not expect a model to magically turn mediocre data into good results. This holds for
18
the pre-training data as well as the fine-tuning and evaluation data.
19
* A good Bleu score does not necessarily mean fluent text. Evaluation loss on a large translation dataset might be
20
+ better suited for model comparison, provided the models have tokenizers of comparable size.
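The batch-size/learning-rate rule of thumb from the list above is easy to apply; a minimal sketch (the function name and example values are illustrative only):

```python
import math

def scale_learning_rate(base_lr: float, base_bs: int, new_bs: int) -> float:
    """Square-root scaling heuristic: doubling the batch size -> lr * sqrt(2)."""
    return base_lr * math.sqrt(new_bs / base_bs)

# Doubling the total batch size from 64 to 128:
print(scale_learning_rate(5e-3, 64, 128))  # ~0.00707
```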
app.py CHANGED
@@ -1,4 +1,5 @@
1
  from glob import glob
 
2
  import sqlite3
3
  import psutil
4
  import streamlit as st
@@ -11,7 +12,7 @@ IMAGE_WIDTHS = 900
11
  PRE_TRAINED_DB = "data/pretrained.sqlite"
12
 
13
 
14
- @st.cache_data
15
  def load_eval_data():
16
  conn = sqlite3.connect(PRE_TRAINED_DB)
17
  conn.row_factory = lambda c, r: {
@@ -35,17 +36,30 @@ def load_eval_data():
35
  columns={"summ_rouge1": "summ Rouge1", "trans_en_nl_score": "en->nl Bleu"},
36
  inplace=True,
37
  )
38
- # for each model, read the summary text
39
  for i, row in df.iterrows():
40
- dirs = glob(f"data/eval_summ_results/{row['id']}-{row['name']}/yhavinga_cnn_dailymail_dutch/eval_predictions*")
 
 
41
  try:
42
  file = dirs[-1] + "/generated.txt"
43
  with open(file, "r") as f:
44
- text = str(row["id"]) + " " + f.read().replace("<n>", " ")
45
  except Exception:
46
  text = "fine-tune failed, no data"
47
  df.at[i, "summary"] = text
48
 
49
  # order df by the name column desc
50
  df.sort_values(by="name", inplace=True, ascending=False)
51
 
@@ -105,12 +119,18 @@ mT5 green and the other models black.
105
  )
106
  col1, col2 = st.columns(2)
107
  with col1:
108
- ul2_enabled = st.checkbox("UL2 Dutch (and English) (trained with T5X)", value=True)
 
 
109
  t5_1_1_enabled = st.checkbox("t5_1_1 Dutch (trained with T5X)", value=True)
110
  flan_enabled = st.checkbox("Flan T5 (google/flan-t5-*)", value=True)
111
  mt5_enabled = st.checkbox("mt5 (google/mt5-*)", value=True)
112
- long_t5_enabled = st.checkbox("Long T5 Dutch+English (trained with HuggingFace script)")
113
- t5_v1_1_enabled = st.checkbox("T5 Dutch (and English) (trained with HuggingFace script)")
 
 
 
 
114
  with col2:
115
  small_enabled = st.checkbox("small model sizes")
116
  base_enabled = st.checkbox("base model sizes")
@@ -126,15 +146,51 @@ mT5 green and the other models black.
126
  | (plot_df["name"].str.contains("mt5") & mt5_enabled)
127
  | (plot_df["name"].str.contains("long-t5") & long_t5_enabled)
128
  | (plot_df["name"].str.contains("t5_1_1") & t5_1_1_enabled)
129
- | ((plot_df["name"].str.startswith("t5") & ~plot_df["name"].str.startswith("t5_1_1")) & t5_v1_1_enabled)
130
- | (plot_df["name"].str.contains("base") & base_enabled & ~plot_df["name"].str.contains("36"))
131
- | (plot_df["name"].str.contains("small") & small_enabled & ~plot_df["name"].str.contains("24"))
132
- | (plot_df["name"].str.contains("large") & large_enabled & ~plot_df["name"].str.contains("8"))
133
- | ((plot_df["name"].str.contains("-36L") | plot_df["name"].str.contains("nl36")) & _36_enabled)
134
- | ((plot_df["name"].str.contains("-24L") | plot_df["name"].str.contains("nl24")) & _24_enabled)
135
- | ((plot_df["name"].str.contains("-8l") | plot_df["name"].str.contains("nl8")) & _8l_enabled)
136
- | ((plot_df["name"].str.contains("-4L") | plot_df["name"].str.contains("nl4")) & _4xl_enabled)
137
- ]
138
 
139
  color_dict = {"flan": "red", "ul2": "blue", "mt5": "green", "t5_1_1": "orange"}
140
  colors = [
@@ -165,45 +221,73 @@ mT5 green and the other models black.
165
  )
166
  plt.tight_layout()
167
  st.pyplot(fig)
168
- st.markdown("""* The `UL2` pre-trained Dutch(English) models consistently outperform the `T5-*` Dutch(English) models.
 
169
* Flan models perform well almost immediately on the summarization task, with `flan-t5-small`
170
  showing performance comparable to Dutch T5 base models.
171
- * Fine-tuning of `t5-v1.1-large-dutch-cased` failed with the fixed hyperparameters across all models.
172
- Since the `UL2` models are better across the board, I've disabled this model on the hub.
173
- * I am surprised by the consistent bad scores for the `long-t5` runs. I've retried the fine-tuning of these models with
174
- `float32` instead of `bfloat16`, but the results were the same. Maybe this is normal behaviour for these models
175
- targeted at dealing with longer sequence lengths.
176
* For the translation task from English to Dutch, the Dutch+English pre-trained models perform well. Also,
177
the `UL2 Dutch` pre-trained models are consistently better than their `Flan`, `T5 Dutch` and
178
`mT5` counterparts of comparable size.
179
- * For the translation task, I am not sure that a 'deep-narrow' model (e.g. base-nl36) is better than a normal model
180
- or even a 'wide-deep' model.
181
* The `long-t5` models perform poorly on both tasks.
182
I cannot explain this for the translation task. With a sequence length of 128 input and output
183
  tokens, the sliding attention window with radius length 127 of the `long-t5` models should be able to handle this.
184
- """)
 
 
 
 
185
 
186
- st.markdown("### Compare generated summaries")
187
  col1, col2 = st.columns(2)
188
  with col1:
189
- model_left = st.selectbox("Choose left model", df["name"], index=6)
 
 
190
  with col2:
191
- model_right = st.selectbox("Choose right model", df["name"], index=33)
 
 
192
 
193
- @st.cache_resource
194
  def get_row(model):
195
  return df[df["name"] == model]
196
 
197
- row_left = get_row(model_left)
198
- row_right = get_row(model_right)
199
 
200
  contents1 = row_left["summary"].values[0].split("\n")
201
  contents2 = row_right["summary"].values[0].split("\n")
202
- contents = list(zip(contents1, contents2))[:5]
203
  st.table(
204
  pd.DataFrame(
205
  contents,
206
- columns=[model_left, model_right],
207
  )
208
  )
209
 
@@ -213,8 +297,8 @@ mT5 green and the other models black.
213
  st.markdown(
214
  """### Bfloat16 datatype requires loss regularization
215
 
216
- When training models with `bfloat16` and without loss regularization (default), the training losses would plateau or
217
- diverge. The graph below displays the results of different attempts
218
  to train [t5-small-24L-dutch-english](https://huggingface.co/yhavinga/t5-small-24L-dutch-english).
219
  The legend indicates the optimizer, data type, learning rate, total batch size, and learning rate schedule used.
220
  As you can see, all attempts to train with `bfloat16` failed.
@@ -230,7 +314,7 @@ and the `bfloat16` training runs did not exhibit the problems illustrated above
230
 
231
  The `z_loss` regularization term in the T5X loss function looks like L2 regularization.
232
  (See e.g. Andrej Karpathy [explaining regularization loss](https://youtu.be/PaCmpygFfXo?t=6720)).
233
- The Optax optimizer, used in the HuggingFace script, mentions weight decay for AdaFactor (and AdamW)
234
  but also mentions that L2 regularization does not work as expected with adaptive gradient
235
  algorithms. It might be the case that setting a non-zero `weight_decay_rate` in the Optax Adafactor call
236
  in the HuggingFace pre-training script is an alternative to adding the `z_loss` term, to solve the bfloat16 issues, but
@@ -292,7 +376,7 @@ models to converge during fine-tuning.
292
  """### Pre-training with sequence length 512 or 1024
293
 
294
  The models `t5-v1_1-base-dutch-english-cased` and `t5-v1_1-base-dutch-english-cased-1024` have the same model dimensions,
295
- but are pre-trained on different sequence lenghts, 512 and 1024 respectively.
296
  The evaluation loss and accuracy of the models do not look too different. Since training of the 1024 sequence length model was
297
  very slow and didn't converge, I stopped it early. The figure below shows the evaluation
298
  loss and accuracy.
@@ -310,10 +394,6 @@ summarization and translation.
310
  st.markdown(
311
  """## Model lists
312
 
313
- ### t5_1_1
314
-
315
- TODO
316
-
317
  ### UL2 Dutch English
318
 
319
  These models have been trained with T5X on mc4_nl_cleaned, books, Wikipedia and news.
@@ -390,24 +470,16 @@ the several dimensions of these models.
390
  | *eval loss* | 1,38 | 1,20 | 0,96 | 1,07 | 1,11 | 1,13 | 1,18 | 1,27 | 1,05 | 1,3019 | 1,15 |
391
  | *eval acc* | 0,70 | 0,73 | 0,78 | 0,76 | 0,75 | 0,74 | 0,74 | 0,72 | 0,76 | 0,71 | 0,74 |
392
 
393
- ### Long-T5 models
394
-
395
- These models have been trained with the HuggingFace πŸ€— run_t5_mlm_flax.py script on mc4_nl_cleaned.
396
-
397
- ### Byt5 small
398
-
399
- This model has been trained with the HuggingFace πŸ€— run_t5_mlm_flax.py script on mc4_nl_cleaned.
400
-
401
- TODO
402
-
403
  ### Fine-tuned translation models on ccmatrix
404
 
405
  The models `t5-small-24L-dutch-english` and `t5-base-36L-dutch-english` have been fine-tuned for both language
406
  directions on the first 25M samples from CCMatrix, giving a total of 50M training samples.
407
  Evaluation is performed on out-of-sample CCMatrix and also on Tatoeba and Opus Books.
408
- The `_bp` columns list the *brevity penalty*. The `avg_bleu` score is the bleu score
 
409
averaged over all three evaluation datasets. The best scores are displayed in bold for both translation directions.
410
 
 
411
  | | [t5-base-36L-ccmatrix-multi](https://huggingface.co/yhavinga/t5-base-36L-ccmatrix-multi) | [t5-base-36L-ccmatrix-multi](https://huggingface.co/yhavinga/t5-base-36L-ccmatrix-multi) | [t5-small-24L-ccmatrix-multi](https://huggingface.co/yhavinga/t5-small-24L-ccmatrix-multi) | [t5-small-24L-ccmatrix-multi](https://huggingface.co/yhavinga/t5-small-24L-ccmatrix-multi) |
412
  |:-----------------------|:-----------------------------|:-----------------------------|:------------------------------|:------------------------------|
413
  | *source_lang* | en | nl | en | nl |
@@ -436,12 +508,17 @@ averaged over all three evaluation datasets. The best scores displayed in bold f
436
 
437
  ## Acknowledgements
438
 
439
- This project would not have been possible without compute generously provided by Google through the
440
- [TPU Research Cloud](https://sites.research.google/trc/). The HuggingFace πŸ€— ecosystem was instrumental in all parts
441
- of the training. Weights & Biases made it possible to keep track of many training sessions
442
- and orchestrate hyperparameter sweeps with insightful visualizations.
 
 
 
 
443
 
444
  Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)
 
445
  """
446
  )
447
 
 
1
  from glob import glob
2
+ from itertools import zip_longest
3
  import sqlite3
4
  import psutil
5
  import streamlit as st
 
12
  PRE_TRAINED_DB = "data/pretrained.sqlite"
13
 
14
 
15
+ @st.cache
16
  def load_eval_data():
17
  conn = sqlite3.connect(PRE_TRAINED_DB)
18
  conn.row_factory = lambda c, r: {
 
36
  columns={"summ_rouge1": "summ Rouge1", "trans_en_nl_score": "en->nl Bleu"},
37
  inplace=True,
38
  )
 
39
  for i, row in df.iterrows():
40
+ dirs = glob(
41
+ f"data/eval_summ_results/{row['id']}-{row['name']}/yhavinga_cnn_dailymail_dutch/eval_predictions*"
42
+ )
43
  try:
44
  file = dirs[-1] + "/generated.txt"
45
  with open(file, "r") as f:
46
+ text = f.read().replace("<n>", " ")
47
  except Exception:
48
  text = "fine-tune failed, no data"
49
  df.at[i, "summary"] = text
50
 
51
+ for i, row in df.iterrows():
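+ # read the generated translation predictions for this model from the ccmatrix eval results, mirroring the summary loop above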
52
+ dirs = glob(
53
+ f"data/eval_transl_results/{row['id']}-{row['name']}/yhavinga_ccmatrix/eval_predictions*"
54
+ )
55
+ try:
56
+ file = dirs[-1] + "/generated.txt"
57
+ with open(file, "r") as f:
58
+ text = f.read().replace("<n>", " ")
59
+ except Exception:
60
+ text = "fine-tune failed, no data"
61
+ df.at[i, "translation"] = text
62
+
63
  # order df by the name column desc
64
  df.sort_values(by="name", inplace=True, ascending=False)
65
 
 
119
  )
120
  col1, col2 = st.columns(2)
121
  with col1:
122
+ ul2_enabled = st.checkbox(
123
+ "UL2 Dutch (and English) (trained with T5X)", value=True
124
+ )
125
  t5_1_1_enabled = st.checkbox("t5_1_1 Dutch (trained with T5X)", value=True)
126
  flan_enabled = st.checkbox("Flan T5 (google/flan-t5-*)", value=True)
127
  mt5_enabled = st.checkbox("mt5 (google/mt5-*)", value=True)
128
+ long_t5_enabled = st.checkbox(
129
+ "Long T5 Dutch+English (trained with HuggingFace script)"
130
+ )
131
+ t5_v1_1_enabled = st.checkbox(
132
+ "T5 Dutch (and English) (trained with HuggingFace script)"
133
+ )
134
  with col2:
135
  small_enabled = st.checkbox("small model sizes")
136
  base_enabled = st.checkbox("base model sizes")
 
146
  | (plot_df["name"].str.contains("mt5") & mt5_enabled)
147
  | (plot_df["name"].str.contains("long-t5") & long_t5_enabled)
148
  | (plot_df["name"].str.contains("t5_1_1") & t5_1_1_enabled)
149
+ | (
150
+ (
151
+ plot_df["name"].str.startswith("t5")
152
+ & ~plot_df["name"].str.startswith("t5_1_1")
153
+ )
154
+ & t5_v1_1_enabled
155
+ )
156
+ | (
157
+ plot_df["name"].str.contains("base")
158
+ & base_enabled
159
+ & ~plot_df["name"].str.contains("36")
160
+ )
161
+ | (
162
+ plot_df["name"].str.contains("small")
163
+ & small_enabled
164
+ & ~plot_df["name"].str.contains("24")
165
+ )
166
+ | (
167
+ plot_df["name"].str.contains("large")
168
+ & large_enabled
169
+ & ~plot_df["name"].str.contains("8")
170
+ )
171
+ | (
172
+ (
173
+ plot_df["name"].str.contains("-36L")
174
+ | plot_df["name"].str.contains("nl36")
175
+ )
176
+ & _36_enabled
177
+ )
178
+ | (
179
+ (
180
+ plot_df["name"].str.contains("-24L")
181
+ | plot_df["name"].str.contains("nl24")
182
+ )
183
+ & _24_enabled
184
+ )
185
+ | (
186
+ (plot_df["name"].str.contains("-8l") | plot_df["name"].str.contains("nl8"))
187
+ & _8l_enabled
188
+ )
189
+ | (
190
+ (plot_df["name"].str.contains("-4L") | plot_df["name"].str.contains("nl4"))
191
+ & _4xl_enabled
192
+ )
193
+ ]
194
 
195
  color_dict = {"flan": "red", "ul2": "blue", "mt5": "green", "t5_1_1": "orange"}
196
  colors = [
 
221
  )
222
  plt.tight_layout()
223
  st.pyplot(fig)
224
+ st.markdown(
225
+ """* The `UL2` pre-trained Dutch(English) models consistently outperform the `T5-*` Dutch(English) models.
226
* Flan models perform well almost immediately on the summarization task, with `flan-t5-small`
227
  showing performance comparable to Dutch T5 base models.
 
 
 
 
 
228
* For the translation task from English to Dutch, the Dutch+English pre-trained models perform well. Also,
229
the `UL2 Dutch` pre-trained models are consistently better than their `Flan`, `T5 Dutch` and
230
`mT5` counterparts of comparable size.
231
+ * Fine-tuning of `t5-v1.1-large-dutch-cased` failed with the hyperparameters that were kept fixed across all models.
232
+ Since the `UL2` models are better across the board, I've disabled this model on the hub.
233
* The `long-t5` models perform poorly on both tasks.
234
I cannot explain this for the translation task. With a sequence length of 128 input and output
235
  tokens, the sliding attention window with radius length 127 of the `long-t5` models should be able to handle this.
236
+ I've retried the fine-tuning of these models with
237
+ `float32` instead of `bfloat16`, but the results were the same. Maybe this is normal behaviour for models
238
+ targeted at longer sequence lengths.
239
+ """
240
+ )
241
 
242
+ st.markdown("### Compare generated texts")
243
  col1, col2 = st.columns(2)
244
  with col1:
245
+ summ_model_left = st.selectbox(
246
+ "Choose left summarization model", df["name"], index=6
247
+ )
248
  with col2:
249
+ summ_model_right = st.selectbox(
250
+ "Choose right summarization model", df["name"], index=33
251
+ )
252
 
253
+ @st.cache
254
  def get_row(model):
255
  return df[df["name"] == model]
256
 
257
+ row_left = get_row(summ_model_left)
258
+ row_right = get_row(summ_model_right)
259
 
260
  contents1 = row_left["summary"].values[0].split("\n")
261
  contents2 = row_right["summary"].values[0].split("\n")
262
+ contents = list(zip_longest(contents1, contents2))[:5]
263
+ st.table(
264
+ pd.DataFrame(
265
+ contents,
266
+ columns=[summ_model_left, summ_model_right],
267
+ )
268
+ )
269
+
270
+ st.markdown("### Compare generated translations")
271
+ col1, col2 = st.columns(2)
272
+ with col1:
273
+ trans_model_left = st.selectbox("Choose left model", df["name"], index=3)
274
+ with col2:
275
+ trans_model_right = st.selectbox("Choose right model", df["name"], index=32)
276
+
277
+ @st.cache
278
+ def get_row(model):
279
+ return df[df["name"] == model]
280
+
281
+ row_left = get_row(trans_model_left)
282
+ row_right = get_row(trans_model_right)
283
+
284
+ contents1 = row_left["translation"].values[0].split("\n")
285
+ contents2 = row_right["translation"].values[0].split("\n")
286
+ contents = list(zip_longest(contents1, contents2))[:15]
287
  st.table(
288
  pd.DataFrame(
289
  contents,
290
+ columns=[trans_model_left, trans_model_right],
291
  )
292
  )
293
 
 
297
  st.markdown(
298
  """### Bfloat16 datatype requires loss regularization
299
 
300
+ When training models with `bfloat16` and without loss regularization (default in the HuggingFace pre-training script),
301
+ the training losses would plateau or diverge. The graph below displays the results of different attempts
302
  to train [t5-small-24L-dutch-english](https://huggingface.co/yhavinga/t5-small-24L-dutch-english).
303
  The legend indicates the optimizer, data type, learning rate, total batch size, and learning rate schedule used.
304
  As you can see, all attempts to train with `bfloat16` failed.
 
314
 
315
  The `z_loss` regularization term in the T5X loss function looks like L2 regularization.
316
  (See e.g. Andrej Karpathy [explaining regularization loss](https://youtu.be/PaCmpygFfXo?t=6720)).
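To make that concrete, here is a minimal sketch of such a z_loss term (it follows the T5X formulation in spirit; the coefficient value and function name are illustrative, not taken from this project's configuration):

```python
import jax.numpy as jnp
from jax.nn import log_softmax
from jax.scipy.special import logsumexp

def cross_entropy_with_z_loss(logits, onehot_targets, z_loss_coeff=1e-4):
    """Token-level cross entropy plus z_loss = coeff * log(Z)**2, which keeps the
    softmax normalizer Z from drifting -- helpful when training in bfloat16."""
    ce = -jnp.sum(onehot_targets * log_softmax(logits, axis=-1), axis=-1)
    log_z = logsumexp(logits, axis=-1)
    return ce + z_loss_coeff * jnp.square(log_z)
```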
317
+ The Optax optimizer library (used in the HuggingFace script) mentions weight decay for Adafactor (and AdamW)
318
  but also mentions that L2 regularization does not work as expected with adaptive gradient
319
  algorithms. It might be the case that setting a non-zero `weight_decay_rate` in the Optax Adafactor call
320
  in the HuggingFace pre-training script is an alternative to adding the `z_loss` term, to solve the bfloat16 issues, but
 
376
  """### Pre-training with sequence length 512 or 1024
377
 
378
  The models `t5-v1_1-base-dutch-english-cased` and `t5-v1_1-base-dutch-english-cased-1024` have the same model dimensions,
379
+ but are pre-trained with span corruption on different sequence lengths, 512 and 1024 respectively.
380
  The evaluation loss and accuracy of the models do not look too different. Since training of the 1024 sequence length model was
381
  very slow and didn't converge, I stopped it early. The figure below shows the evaluation
382
  loss and accuracy.
 
394
  st.markdown(
395
  """## Model lists
396
 
 
 
 
 
397
  ### UL2 Dutch English
398
 
399
  These models have been trained with T5X on mc4_nl_cleaned, books, Wikipedia and news.
 
470
  | *eval loss* | 1,38 | 1,20 | 0,96 | 1,07 | 1,11 | 1,13 | 1,18 | 1,27 | 1,05 | 1,3019 | 1,15 |
471
  | *eval acc* | 0,70 | 0,73 | 0,78 | 0,76 | 0,75 | 0,74 | 0,74 | 0,72 | 0,76 | 0,71 | 0,74 |
472
 
473
  ### Fine-tuned translation models on ccmatrix
474
 
475
  The models `t5-small-24L-dutch-english` and `t5-base-36L-dutch-english` have been fine-tuned for both language
476
  directions on the first 25M samples from CCMatrix, giving a total of 50M training samples.
477
  Evaluation is performed on out-of-sample CCMatrix and also on Tatoeba and Opus Books.
478
+ The `_bp` columns list the *brevity penalty* (the low scores of the 128 seq-len models on Opus Books may be due to the brevity penalty, since
479
+ books tend to have sentences longer than 128 tokens). The `avg_bleu` score is the Bleu score
480
averaged over all three evaluation datasets. The best scores are displayed in bold for both translation directions.
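For reference, the `_bp` columns report the standard Bleu brevity penalty; a minimal sketch of how it is computed (candidate length c, reference length r; the example lengths are illustrative):

```python
import math

def brevity_penalty(candidate_len: int, reference_len: int) -> float:
    """Standard BLEU brevity penalty: 1 if the candidate is at least as long
    as the reference, exp(1 - r/c) otherwise."""
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)

# A hypothesis truncated to 128 tokens scored against a 160-token reference:
print(brevity_penalty(128, 160))  # ~0.78
```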
481
 
482
+
483
  | | [t5-base-36L-ccmatrix-multi](https://huggingface.co/yhavinga/t5-base-36L-ccmatrix-multi) | [t5-base-36L-ccmatrix-multi](https://huggingface.co/yhavinga/t5-base-36L-ccmatrix-multi) | [t5-small-24L-ccmatrix-multi](https://huggingface.co/yhavinga/t5-small-24L-ccmatrix-multi) | [t5-small-24L-ccmatrix-multi](https://huggingface.co/yhavinga/t5-small-24L-ccmatrix-multi) |
484
  |:-----------------------|:-----------------------------|:-----------------------------|:------------------------------|:------------------------------|
485
  | *source_lang* | en | nl | en | nl |
 
508
 
509
  ## Acknowledgements
510
 
511
+ This project was made possible by the exceptional computing resources provided by Google's
512
+ [TPU Research Cloud](https://sites.research.google/trc/).
513
+ The HuggingFace πŸ€— ecosystem of datasets, hub, model architectures
514
+ and example scripts was an integral part of the training process, while Weights & Biases provided the ability
515
+ to track multiple training sessions and execute hyperparameter optimization with insightful visualizations.
516
+ I am grateful to the [Finnish-NLP](https://huggingface.co/Finnish-NLP) authors for their generosity in releasing the UL2 objective code and task
517
+ definitions, and to [Stephenn Fernandes](https://huggingface.co/StephennFernandes) for his support in getting me started with the T5X framework.
518
+ Lastly, I want to express my gratitude to Google for their openness and generosity in releasing T5X and related repositories.
519
 
520
  Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)
521
+ Some of the sentences were reworded by ChatGPT.
522
  """
523
  )
524
 
requirements.txt CHANGED
@@ -14,4 +14,4 @@ flax>=0.5.3
14
  sentencepiece
15
  matplotlib
16
  seaborn
17
- streamlit>=1.17.0
 
14
  sentencepiece
15
  matplotlib
16
  seaborn
17
+ streamlit==1.10.0