Small updates
- INTRO.md +1 -1
- PRETRAINING.md +1 -6
- README.md +1 -0
- REMARKS.md +6 -8
- app.py +133 -56
- requirements.txt +1 -1
INTRO.md
CHANGED
@@ -1,6 +1,6 @@
 # Dutch T5 models : UL2, T5, ByT5 and Long-T5 🇳🇱🇧🇪
 
-TL;DR: Dutch
+TL;DR: Dutch NLP gets a boost with state-of-the-art T5 models trained on the largest Dutch corpus, mC4, and additional datasets.
 See below for model lists and comparison.
 
 During the [HuggingFace Flax/Jax community week](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104) in the summer of 2021,
PRETRAINING.md
CHANGED
@@ -8,15 +8,10 @@ It was made available by AllenNLP on the HuggingFace Dataset hub.
 Our team confirmed that the Dutch portion of the mC4 dataset was deduplicated,
 and we cleaned the Dutch portion of the mC4 dataset using [code adapted](https://gitlab.com/yhavinga/c4nlpreproc) from the TensorFlow C4 dataset.
 The resulting [mc4_nl_cleaned](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned) dataset on the HuggingFace hub
-has configs for several sizes, and also configs for mixed Dutch and English
+has configs for several sizes, and also configs for interleaved mixed Dutch and English
 texts, e.g. [micro_en_nl](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned/viewer/micro_en_nl/train).
 The `_en_nl` configs were added to accommodate multi-language pre-training
 with the Huggingface pre-training script, that accepts only a single dataset as input.
-Cleaned English C4 is roughly 5 times larger than its Dutch counterpart. Therefore,
-interleaving the datasets in a 1:1 ratio results in discarding approximately 80% of the English data.
-(When pre-training with T5X and SeqIO, it is possible to define task mixtures that include multiple datasets,
-so these `_en_nl` configs are not needed.)
-
 The full, cleaned Dutch mC4 dataset is 151GB and remains (as of June 2022) the largest available Dutch
 corpus on the HuggingFace Dataset hub.
 
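As background for the `_en_nl` configs mentioned above, a minimal sketch of loading one of them with the HuggingFace `datasets` library; the config name `micro_en_nl` comes from the text above, streaming is used only to avoid downloading the full corpus, and the mC4-style `text` field is assumed:

```python
# Sketch: peek at the interleaved Dutch+English config of the cleaned mC4 corpus.
# Assumes the public yhavinga/mc4_nl_cleaned dataset and the `datasets` package.
from itertools import islice

from datasets import load_dataset

ds = load_dataset(
    "yhavinga/mc4_nl_cleaned", "micro_en_nl", split="train", streaming=True
)
for example in islice(ds, 3):
    print(example["text"][:200])  # first 200 characters of each document
```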
README.md
CHANGED
@@ -4,6 +4,7 @@ emoji: π
 colorFrom: blue
 colorTo: pink
 sdk: streamlit
+sdk_version: 1.10.0
 pinned: false
 app_file: app.py
 license: afl-3.0
REMARKS.md
CHANGED
@@ -1,7 +1,7 @@
 ## Miscellaneous remarks
 
-* Use loss regularization
-*
+* Use loss regularization when training with `bfloat16` for better results (more info below).
+* Be cautious of the dropout rate in the config.json file and consider training without it.
 Check in a model's `config.json` what the dropout rate has been set to. Unless you
 intend to run many epochs on the same data, its worth to try a training run without dropout.
 If you want to compare losses, be sure to set the dropout rate equal.
@@ -9,14 +9,12 @@
 * Training with more layers is much slower than you'd expect from the increased model size.
 It is also more difficult to get batch size and learning rate right. Below is a section
 about finding the right hyperparameters for the base-36L training.
-*
-
-*
+* For the translation task, I am not sure that a 'deep-narrow' model (e.g. base-nl36) is better than a normal model
+of comparable size (e.g. `large`).
+* PyCharm's remote debugging features are useful to inspect variables on either a TPU VM or your deep-learning rig.
 * When increasing the batch size, increase the learning rate. bs * 2 -> lr * sqrt(2) is a good heuristic but mileage may
 vary.
-* Translation evaluation: the low score of the 128 seq len models on opus books may be because of the brevity penaly...
-that books may have sentences longer than 128 tokens.
 * Dataset quality is a key success factor. Do not expect a model to magically turn mediocre data into magic. This holds for
 the pre-training data, fine-tuning and also evaluating.
 * Good Bleu score does not necessarily mean fluent text. Evaluation loss on a large translation dataset might be
-better suited for model comparison.
+better suited for model comparison, if the models have a tokenizer of comparable size.
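The batch-size/learning-rate heuristic kept in the remarks above (bs * 2 -> lr * sqrt(2)) can be made concrete; a small sketch where the numbers are illustrative, not values used in these training runs:

```python
import math


def scaled_lr(base_lr: float, base_bs: int, new_bs: int) -> float:
    """Square-root scaling from the remark above: doubling the batch size
    multiplies the learning rate by sqrt(2). Mileage may vary."""
    return base_lr * math.sqrt(new_bs / base_bs)


# e.g. moving from batch size 128 at lr 5e-3 to batch size 512
print(scaled_lr(5e-3, 128, 512))  # -> 0.01
```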
app.py
CHANGED
@@ -1,4 +1,5 @@
 from glob import glob
+from itertools import zip_longest
 import sqlite3
 import psutil
 import streamlit as st
@@ -11,7 +12,7 @@ IMAGE_WIDTHS = 900
 PRE_TRAINED_DB = "data/pretrained.sqlite"
 
 
-@st.
+@st.cache
 def load_eval_data():
     conn = sqlite3.connect(PRE_TRAINED_DB)
     conn.row_factory = lambda c, r: {
@@ -35,17 +36,30 @@
         columns={"summ_rouge1": "summ Rouge1", "trans_en_nl_score": "en->nl Bleu"},
         inplace=True,
     )
-    # for each model, read the summary text
     for i, row in df.iterrows():
-        dirs = glob(
+        dirs = glob(
+            f"data/eval_summ_results/{row['id']}-{row['name']}/yhavinga_cnn_dailymail_dutch/eval_predictions*"
+        )
         try:
             file = dirs[-1] + "/generated.txt"
             with open(file, "r") as f:
-                text =
+                text = f.read().replace("<n>", " ")
         except Exception:
             text = "fine-tune failed, no data"
         df.at[i, "summary"] = text
 
+    for i, row in df.iterrows():
+        dirs = glob(
+            f"data/eval_transl_results/{row['id']}-{row['name']}/yhavinga_ccmatrix/eval_predictions*"
+        )
+        try:
+            file = dirs[-1] + "/generated.txt"
+            with open(file, "r") as f:
+                text = f.read().replace("<n>", " ")
+        except Exception:
+            text = "fine-tune failed, no data"
+        df.at[i, "translation"] = text
+
     # order df by the name column desc
     df.sort_values(by="name", inplace=True, ascending=False)
 
@@ -105,12 +119,18 @@ mT5 green and the other models black.
 )
 col1, col2 = st.columns(2)
 with col1:
-    ul2_enabled = st.checkbox(
+    ul2_enabled = st.checkbox(
+        "UL2 Dutch (and English) (trained with T5X)", value=True
+    )
     t5_1_1_enabled = st.checkbox("t5_1_1 Dutch (trained with T5X)", value=True)
     flan_enabled = st.checkbox("Flan T5 (google/flan-t5-*)", value=True)
     mt5_enabled = st.checkbox("mt5 (google/mt5-*)", value=True)
-    long_t5_enabled = st.checkbox(
-
+    long_t5_enabled = st.checkbox(
+        "Long T5 Dutch+English (trained with HuggingFace script)"
+    )
+    t5_v1_1_enabled = st.checkbox(
+        "T5 Dutch (and English) (trained with HuggingFace script)"
+    )
 with col2:
     small_enabled = st.checkbox("small model sizes")
     base_enabled = st.checkbox("base model sizes")
@@ -126,15 +146,51 @@ mT5 green and the other models black.
     | (plot_df["name"].str.contains("mt5") & mt5_enabled)
     | (plot_df["name"].str.contains("long-t5") & long_t5_enabled)
     | (plot_df["name"].str.contains("t5_1_1") & t5_1_1_enabled)
-    | (
-
-
-
-
-
-
-    | (
-
+    | (
+        (
+            plot_df["name"].str.startswith("t5")
+            & ~plot_df["name"].str.startswith("t5_1_1")
+        )
+        & t5_v1_1_enabled
+    )
+    | (
+        plot_df["name"].str.contains("base")
+        & base_enabled
+        & ~plot_df["name"].str.contains("36")
+    )
+    | (
+        plot_df["name"].str.contains("small")
+        & small_enabled
+        & ~plot_df["name"].str.contains("24")
+    )
+    | (
+        plot_df["name"].str.contains("large")
+        & large_enabled
+        & ~plot_df["name"].str.contains("8")
+    )
+    | (
+        (
+            plot_df["name"].str.contains("-36L")
+            | plot_df["name"].str.contains("nl36")
+        )
+        & _36_enabled
+    )
+    | (
+        (
+            plot_df["name"].str.contains("-24L")
+            | plot_df["name"].str.contains("nl24")
+        )
+        & _24_enabled
+    )
+    | (
+        (plot_df["name"].str.contains("-8l") | plot_df["name"].str.contains("nl8"))
+        & _8l_enabled
+    )
+    | (
+        (plot_df["name"].str.contains("-4L") | plot_df["name"].str.contains("nl4"))
+        & _4xl_enabled
+    )
+]
 
 color_dict = {"flan": "red", "ul2": "blue", "mt5": "green", "t5_1_1": "orange"}
 colors = [
@@ -165,45 +221,73 @@ mT5 green and the other models black.
 )
 plt.tight_layout()
 st.pyplot(fig)
-st.markdown(
+st.markdown(
+    """* The `UL2` pre-trained Dutch(English) models consistently outperform the `T5-*` Dutch(English) models.
 * Flan models perform almost instantly well on the summarization task, with `flan-t5-small`
 showing performance comparable to Dutch T5 base models.
-* Fine-tuning of `t5-v1.1-large-dutch-cased` failed with the fixed hyperparameters across all models.
-Since the `UL2` models are better across the board, I've disabled this model on the hub.
-* I am surprised by the consistent bad scores for the `long-t5` runs. I've retried the fine-tuning of these models with
-`float32` instead of `bfloat16`, but the results were the same. Maybe this is normal behaviour for these models
-targeted at dealing with longer sequence lengths.
 * For the translation task from English to Dutch, the Dutch+English pre-trained models perform well. Also
 `UL2 Dutch` pre-trained Dutch models are consistently better than their `Flan`, `T5 Dutch` and
 `mT5` counterparts of the comparable size.
-*
-
+* Fine-tuning of `t5-v1.1-large-dutch-cased` failed with the fixed hyperparameters across all models.
+Since the `UL2` models are better across the board, I've disabled this model on the hub.
 * The `long-t5` models show bad performance on both tasks.
 I cannot explain this the translation task. With a sequence length of 128 input and output
 tokens, the sliding attention window with radius length 127 of the `long-t5` models should be able to handle this.
-
+I've retried the fine-tuning of these models with
+`float32` instead of `bfloat16`, but the results were the same. Maybe this is normal behaviour for these models
+targeted at dealing with longer sequence lengths.
+"""
+)
 
-st.markdown("### Compare generated
+st.markdown("### Compare generated texts")
 col1, col2 = st.columns(2)
 with col1:
-
+    summ_model_left = st.selectbox(
+        "Choose left summarization model", df["name"], index=6
+    )
 with col2:
-
+    summ_model_right = st.selectbox(
+        "Choose right summarization model", df["name"], index=33
+    )
 
-@st.
+@st.cache
 def get_row(model):
     return df[df["name"] == model]
 
-row_left = get_row(
-row_right = get_row(
+row_left = get_row(summ_model_left)
+row_right = get_row(summ_model_right)
 
 contents1 = row_left["summary"].values[0].split("\n")
 contents2 = row_right["summary"].values[0].split("\n")
-contents = list(
+contents = list(zip_longest(contents1, contents2))[:5]
+st.table(
+    pd.DataFrame(
+        contents,
+        columns=[summ_model_left, summ_model_right],
+    )
+)
+
+st.markdown("### Compare generated translations")
+col1, col2 = st.columns(2)
+with col1:
+    trans_model_left = st.selectbox("Choose left model", df["name"], index=3)
+with col2:
+    trans_model_right = st.selectbox("Choose right model", df["name"], index=32)
+
+@st.cache
+def get_row(model):
+    return df[df["name"] == model]
+
+row_left = get_row(trans_model_left)
+row_right = get_row(trans_model_right)
+
+contents1 = row_left["translation"].values[0].split("\n")
+contents2 = row_right["translation"].values[0].split("\n")
+contents = list(zip_longest(contents1, contents2))[:15]
 st.table(
     pd.DataFrame(
         contents,
-        columns=[
+        columns=[trans_model_left, trans_model_right],
     )
 )
 
@@ -213,8 +297,8 @@ mT5 green and the other models black.
 st.markdown(
     """### Bfloat16 datatype requires loss regularization
 
-When training models with `bfloat16` and without loss regularization (default
-diverge. The graph below displays the results of different attempts
+When training models with `bfloat16` and without loss regularization (default in the HuggingFace pre-training script),
+the training losses would plateau or diverge. The graph below displays the results of different attempts
 to train [t5-small-24L-dutch-english](https://huggingface.co/yhavinga/t5-small-24L-dutch-english).
 The legend indicates the optimizer, data type, learning rate, total batch size, and learning rate schedule used.
 As you can see, all attempts to train with `bfloat16` failed.
@@ -230,7 +314,7 @@ and the `bfloat16` training runs did not exhibit the problems illustrated above
 
 The `z_loss` regularization term in the T5X loss function looks like L2 regularization.
 (See e.g. Andrej Karpathy [explaining regularization loss](https://youtu.be/PaCmpygFfXo?t=6720)).
-The Optax optimizer
+The Optax optimizer library (used in the HuggingFace script), mentions weight decay for AdaFactor (and AdamW)
 but also mentions that L2 regularization does not work as expected with adaptive gradient
 algorithms. It might be the case that setting a non-zero `weight_decay_rate` in the Optax Adafactor call
 in the HuggingFace pre-training script is an alternative to adding the `z_loss` term, to solve the bfloat16 issues, but
@@ -292,7 +376,7 @@ models to converge during fine-tuning.
 """### Pre-training with sequence length 512 or 1024
 
 The models `t5-v1_1-base-dutch-english-cased` and `t5-v1_1-base-dutch-english-cased-1024` have the same model dimensions,
-but are pre-trained on different sequence lenghts, 512 and 1024 respectively.
+but are pre-trained with span corruption on different sequence lenghts, 512 and 1024 respectively.
 The evaluation loss and accuracy of the models do not look too different. Since training of the 1024 sequence length model was
 very slow and didn't converge, I stopped it early. The figure below shows the evaluation
 loss and accuracy.
@@ -310,10 +394,6 @@ summarization and translation.
 st.markdown(
     """## Model lists
 
-### t5_1_1
-
-TODO
-
 ### UL2 Dutch English
 
 These models have been trained with T5X on mc4_nl_cleaned, books, Wikipedia and news.
@@ -390,24 +470,16 @@ the several dimensions of these models.
 | *eval loss* | 1,38 | 1,20 | 0,96 | 1,07 | 1,11 | 1,13 | 1,18 | 1,27 | 1,05 | 1,3019 | 1,15 |
 | *eval acc* | 0,70 | 0,73 | 0,78 | 0,76 | 0,75 | 0,74 | 0,74 | 0,72 | 0,76 | 0,71 | 0,74 |
 
-### Long-T5 models
-
-These models have been trained with the HuggingFace 🤗 run_t5_mlm_flax.py script on mc4_nl_cleaned.
-
-### Byt5 small
-
-This model has been trained with the HuggingFace 🤗 run_t5_mlm_flax.py script on mc4_nl_cleaned.
-
-TODO
-
 ### Fine-tuned translation models on ccmatrix
 
 The models `t5-small-24L-dutch-english` and `t5-base-36L-dutch-english` have been fine-tuned for both language
 directions on the first 25M samples from CCMatrix, giving a total of 50M training samples.
 Evaluation is performed on out-of-sample CCMatrix and also on Tatoeba and Opus Books.
-The `_bp` columns list the *brevity penalty
+The `_bp` columns list the *brevity penalty* (the low score of the 128 seq len models on opus books may be because of the brevity penalty;
+books tend to have longer sentences than 128 tokens). The `avg_bleu` score is the bleu score
 averaged over all three evaluation datasets. The best scores displayed in bold for both translation directions.
 
+
 | | [t5-base-36L-ccmatrix-multi](https://huggingface.co/yhavinga/t5-base-36L-ccmatrix-multi) | [t5-base-36L-ccmatrix-multi](https://huggingface.co/yhavinga/t5-base-36L-ccmatrix-multi) | [t5-small-24L-ccmatrix-multi](https://huggingface.co/yhavinga/t5-small-24L-ccmatrix-multi) | [t5-small-24L-ccmatrix-multi](https://huggingface.co/yhavinga/t5-small-24L-ccmatrix-multi) |
 |:-----------------------|:-----------------------------|:-----------------------------|:------------------------------|:------------------------------|
 | *source_lang* | en | nl | en | nl |
@@ -436,12 +508,17 @@ averaged over all three evaluation datasets. The best scores displayed in bold f
 
 ## Acknowledgements
 
-This project
-[TPU Research Cloud](https://sites.research.google/trc/).
-
-and
+This project was made possible by the exceptional computing resources provided by Google's
+[TPU Research Cloud](https://sites.research.google/trc/).
+The HuggingFace 🤗 ecosystem of datasets, hub, model architectures
+and example scripts were an integral part of the training process, while Weights & Biases provided the ability
+to track multiple training sessions and execute hyperparameter optimization with insightful visualizations.
+I am grateful to the [https://huggingface.co/Finnish-NLP](Finnish-NLP) authors for their generosity in releasing the UL2 objective code and task
+definitions, and to [Stephenn Fernandes](https://huggingface.co/StephennFernandes) for his support in getting me started with the T5X framework.
+Lastly, I want to express my gratitude to Google for their openness and generosity in releasing T5X and related repositories.
 
 Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)
+Some of the sentences were reworded by ChatGPT.
 """
 )
 
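The `z_loss` term referenced in app.py's bfloat16 section can be written in a few lines; a hedged sketch rather than the verbatim T5X implementation, with the 1e-4 coefficient only a commonly cited default:

```python
import jax.numpy as jnp
from jax.nn import log_softmax
from jax.scipy.special import logsumexp


def cross_entropy_with_z_loss(logits, onehot_targets, z_loss=1e-4):
    """Cross-entropy plus a z_loss penalty on the squared log-partition term
    log Z = logsumexp(logits); this discourages logits from drifting to large
    magnitudes where bfloat16 loses precision, the failure mode described above."""
    log_z = logsumexp(logits, axis=-1)
    ce = -jnp.sum(onehot_targets * log_softmax(logits, axis=-1), axis=-1)
    return ce + z_loss * log_z ** 2
```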
requirements.txt
CHANGED
@@ -14,4 +14,4 @@ flax>=0.5.3
 sentencepiece
 matplotlib
 seaborn
-streamlit
+streamlit==1.10.0