pere committed on
Commit bcae918
1 Parent(s): 9d59806

Updated README

Files changed (1)
  1. README.md +25 -20
README.md CHANGED
@@ -12,11 +12,16 @@ datasets:
  - mc4
  - wikipedia

  license: other
  ---

- # North-T5
- North-T5 is a set of Norwegian sequence-to-sequence models. It builds upon the flexible T5 text-to-text platform and can be used for a variety of NLP tasks, ranging from classification to translation. __**Click the icons to navigate**__.

  | |**Small** <br />_60M_|**Base** <br />_220M_|**Large** <br />_770M_|**XL** <br />_3B_|**XXL** <br />_11B_|
  |:-----------|:------------:|:------------:|:------------:|:------------:|:------------:|
@@ -33,7 +38,7 @@ The original T5X checkpoint is also available for this model in the [Google Clou

  ## Performance
- A thorough evaluation of the North-T5 models is planned. I strongly recommend that external researchers make their own evaluations. The main advantage of the T5 models is their flexibility. Traditionally, encoder-only models (like BERT) excel in classification tasks, while seq-2-seq models are easier to train for tasks like translation and Q&A. Despite this, here are the results from using North-T5 on the political classification task explained [here](https://arxiv.org/abs/2104.09617).

  |**Model** | **F1** |
  |:-----------|:------------|
@@ -47,50 +52,50 @@ A thorough evaluation of the North-T5 models is planned. I strongly recommend an
  |North-T5-xl|88.7 |
  |North-T5-xxl|91.8|

- These are preliminary results. The [results](https://arxiv.org/abs/2104.09617) for the BERT models are based on the test results of the best model after 10 runs with early stopping and a decaying learning rate. The T5 results are the average of five runs on the evaluation set. The small model was trained for 10,000 steps, while the rest were trained for 5,000 steps. A fixed learning rate was used (no decay), there was no early stopping, and the recommended rank classification was not used. We use a max sequence length of 512. This method simplifies the test setup and gives results that are easy to interpret. However, the results from the T5 models might actually be a bit sub-optimal.

  ## Sub-versions of North-T5
- The following sub-versions are available. Other versions will be available shortly.

  |**Model** | **Description** |
  |:-----------|:-------|
  |**North&#8209;T5&#8209;NCC** |This is the main version. It is trained for an additional 500,000 steps from the mT5 checkpoint. The training corpus is based on [the Norwegian Colossal Corpus (NCC)](https://huggingface.co/datasets/NbAiLab/NCC). In addition, data from MC4 and English Wikipedia is added.|
- |**North&#8209;T5&#8209;NCC&#8209;lm**|Pretrained for an additional 100k steps on the LM objective discussed in the [T5 paper](https://arxiv.org/pdf/1910.10683.pdf). In a way this turns a masked language model into an autoregressive model. It also prepares the model for some tasks. For instance, for translation and NLI it is well documented that there is a clear benefit to doing a step of unsupervised LM training before starting the finetuning.|
- |**North&#8209;T5&#8209;NCC&#8209;modern**| Pretrained for an additional 200k steps on a balanced Bokmål and Nynorsk corpus. While originally made for translation between Bokmål and Nynorsk, it might also give improved results on tasks where you know that the input/output is modern "standard" text. A significant part of the training corpus is newspapers and reports.|
- |**North&#8209;T5&#8209;NCC&#8209;modern&#8209;lm**| As above, but with the extra 100k steps of language-model pretraining.|
- |**North&#8209;T5&#8209;NCC&#8209;scand**|Pretrained for an additional 200k steps on a corpus covering the Scandinavian languages (Bokmål, Nynorsk, Danish, Swedish and Icelandic (+ a tiny bit of Faroese)). The model was trained to increase the understanding of what effect such training has on the various languages.|
  |**North&#8209;T5&#8209;scand**|Pretrained for 1,700,000 steps starting with the mT5 checkpoint. The purpose of the model is to study the effect of different training regimes for Scandinavian language models.|
- |**North&#8209;byT5&#8209;base**| A vocabulary-free version of T5. Trained exactly like North-T5, but instead of the 250,112-token vocabulary, this model operates directly on the raw text. The model architecture might be of particular interest for tasks such as spelling correction, OCR cleaning, handwriting recognition, etc. However, it will, by design, have a shorter maximum sequence length.|


  ## Fine-tuned versions
- As explained below, the models really need to be fine-tuned for specific tasks. This procedure is simple, and the models are not very sensitive to the hyper-parameters used. Usually a decent result can be obtained with a fixed learning rate of 1e-3. The smaller versions of the model typically need to be trained for a longer time. It is easy to train the base models in a Google Colab. I will provide an example notebook on this soon.

  Since some people really want to see what the models are capable of without going through the training procedure, I provide a couple of test models. These models are by no means optimised, and are just for demonstrating how the North-T5 models can be used.

- * Nynorsk Translator. Translates any text from Norwegian Bokmål to Norwegian Nynorsk.
- * De-Uncaser. Puts punctuation, spaces and capital letters back into the text. The input needs to be in Norwegian but does not have to be divided into sentences or have proper capitalisation of words. You can even remove the spaces from the text and make the model reconstruct it.

  ## Training details
- The models are built using the Flax-based T5X codebase, and all models are initiated with the mT5 pretrained weights. The models are trained using the T5.1.1 training regime, where they are only trained on an unsupervised masking task. This also means that the models (contrary to the original T5) need to be finetuned to solve specific tasks. This finetuning is however usually not very compute-intensive, and in most cases it can be performed even with free online training resources.

- All the main model versions are trained for 500,000 steps after the mT5 checkpoint (1,000,000 steps). They are all trained mainly on a 75GB corpus, consisting of NCC, Common Crawl and some additional high-quality English text (Wikipedia). The corpus is roughly 80% Norwegian text. Additional languages are added to retain some of the multilingual capabilities, making the model both more robust to new words/concepts and also more suited as a basis for translation tasks.

- While the huge models almost always will give the best results, they are also both difficult and expensive to finetune. It is strongly recommended to start by finetuning a base model. This can typically be finetuned easily on a standard graphics card or a free TPU through Google Colab. The sub-versions of the North-T5-base model were created with this in mind.

- All models were trained on TPUs. The largest XXL model was trained on a TPU v4-64, the XL model on a TPU v4-32, the Large model on a TPU v4-16 and the rest on a TPU v4-8.
  ## Formats
- All models are trained using the Flax-based T5X library. The original checkpoints are available in T5X format and can be used for both finetuning and inference. All models, except the XXL model, are also converted to Transformers/HuggingFace. In this framework, the models are available in Flax, PyTorch and TensorFlow formats.

  ## Future
- I will continue to train and release additional models in this set. Which models are added depends on the feedback.

  ## Thanks
- This release would not have been possible without the support of the [TPU Research Cloud](https://sites.research.google/trc/about/) at Google Research.

  Freddy Wetjen at the National Library of Norway has been of tremendous help in generating the original NCC corpus, and has also contributed to generating the collated corpus used for this training. In addition, he has been a discussion partner in the creation of these models.

 
  - mc4
  - wikipedia

+
+ widget:
+ - text: <extra_id_0> hver uke samles Regjeringens medlemmer til Statsråd på <extra_id_1>. Dette organet er øverste <extra_id_2> i Norge. For at møtet skal være <extra_id_3>, må over halvparten av regjeringens <extra_id_4> være til stede.
+ - text: På <extra_id_0> kan man <extra_id_1> en bok, og man kan også <extra_id_2> seg ned og lese den.
+
  license: other
  ---

+ # North-T5
+ The North-T5 models are a set of Norwegian sequence-to-sequence models. They build upon the flexible [T5](https://github.com/google-research/text-to-text-transfer-transformer) and [T5X](https://github.com/google-research/t5x) frameworks and can be used for a variety of NLP tasks ranging from classification to translation.

  | |**Small** <br />_60M_|**Base** <br />_220M_|**Large** <br />_770M_|**XL** <br />_3B_|**XXL** <br />_11B_|
  |:-----------|:------------:|:------------:|:------------:|:------------:|:------------:|


  ## Performance
+ A thorough evaluation of the North-T5 models is planned, and I strongly recommend that external researchers make their own evaluations. The main advantage of the T5 models is their flexibility. Traditionally, encoder-only models (like BERT) excel in classification tasks, while seq-2-seq models are easier to train for tasks like translation and Q&A. Despite this, here are the results from using North-T5 on the political classification task explained [here](https://arxiv.org/abs/2104.09617).

  |**Model** | **F1** |
  |:-----------|:------------|

  |North-T5-xl|88.7 |
  |North-T5-xxl|91.8|

+ These are preliminary results. The [results](https://arxiv.org/abs/2104.09617) for the BERT models are based on the test results of the best model after 10 runs with early stopping and a decaying learning rate. The T5 results are the average of five runs on the evaluation set. The small model was trained for 10,000 steps, while the rest were trained for 5,000 steps. A fixed learning rate was used (no decay), there was no early stopping, and the recommended rank classification was not used. We use a max sequence length of 512. This method simplifies the test setup and gives results that are easy to interpret. However, the results from the T5 models might actually be a bit sub-optimal.

  ## Sub-versions of North-T5
+ The following sub-versions are available. More versions will be available shortly.

  |**Model** | **Description** |
  |:-----------|:-------|
  |**North&#8209;T5&#8209;NCC** |This is the main version. It is trained for an additional 500,000 steps from the mT5 checkpoint. The training corpus is based on [the Norwegian Colossal Corpus (NCC)](https://huggingface.co/datasets/NbAiLab/NCC). In addition, data from MC4 and English Wikipedia is added.|
+ |**North&#8209;T5&#8209;NCC&#8209;lm**|The model is pretrained for an additional 100k steps on the LM objective discussed in the [T5 paper](https://arxiv.org/pdf/1910.10683.pdf). In a way this turns a masked language model into an autoregressive model. It also prepares the model for some tasks. For instance, for translation and NLI it is well documented that there is a clear benefit to doing a step of unsupervised LM training before starting the finetuning.|
+ |**North&#8209;T5&#8209;NCC&#8209;modern**| The model is pretrained for an additional 200k steps on a balanced Bokmål and Nynorsk corpus. While this was originally done for translation between Bokmål and Nynorsk, it might also give improved results on tasks where you know that the input/output is modern "standard" text. A significant part of the training corpus is newspapers and reports.|
+ |**North&#8209;T5&#8209;NCC&#8209;modern&#8209;lm**| Trained as above, but with an additional 100k steps of language-model pretraining.|
+ |**North&#8209;T5&#8209;NCC&#8209;scand**|The model is pretrained for an additional 200k steps on a Scandinavian corpus (Bokmål, Nynorsk, Danish, Swedish and Icelandic (+ a tiny bit of Faroese)). The model was trained to increase the understanding of what effect such training has on the various languages.|
  |**North&#8209;T5&#8209;scand**|Pretrained for 1,700,000 steps starting with the mT5 checkpoint. The purpose of the model is to study the effect of different training regimes for Scandinavian language models.|
+ |**North&#8209;byT5&#8209;base**| This is a vocabulary-free version of T5. It is trained exactly like North-T5, but instead of the 250,112-token vocabulary, this model operates directly on the raw text. The model architecture might be of particular interest for tasks such as spelling correction, OCR cleaning, handwriting recognition, etc. However, it will - by design - have a much shorter maximum sequence length (see the byte-level tokenization sketch below).|
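To make the sequence-length trade-off of the byte-level model concrete, here is a minimal sketch comparing a byte-level tokenizer with an mT5-style subword tokenizer. It assumes the Hugging Face `transformers` library; the public `google/byt5-small` and `google/mt5-base` tokenizers are used only as illustrative stand-ins, not as the North checkpoints themselves.

```python
# Minimal sketch: byte-level vs. subword tokenization.
# Assumption: "google/byt5-small" and "google/mt5-base" are used purely to
# illustrate the behaviour; North-byT5-base itself may ship its own config.
from transformers import AutoTokenizer

byte_tok = AutoTokenizer.from_pretrained("google/byt5-small")   # byte-level, vocabulary-free
subword_tok = AutoTokenizer.from_pretrained("google/mt5-base")  # 250,112-token subword vocabulary

text = "Nasjonalbiblioteket i Oslo"
print(len(byte_tok(text).input_ids))     # roughly one token per byte (+ end-of-sequence)
print(len(subword_tok(text).input_ids))  # far fewer subword tokens for the same text
```

The same 512-token budget therefore covers far less raw text for the byte-level model, which is why its effective maximum sequence length is shorter.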


  ## Fine-tuned versions
+ As explained below, the models really need to be fine-tuned for specific tasks. This procedure is relatively simple, and the models are not very sensitive to the hyper-parameters used. Usually a decent result can be obtained with a fixed learning rate of 1e-3. The smaller versions of the model typically need to be trained for a longer time. It is easy to train the base models in a Google Colab.
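As a rough illustration of such a fine-tuning run, here is a minimal sketch using the Hugging Face `transformers` Seq2SeqTrainer with the fixed 1e-3 learning rate mentioned above. The checkpoint id `north/t5_base_NCC`, the Adafactor choice and the tiny in-memory toy dataset are assumptions for the example, not part of this model card.

```python
# Minimal fine-tuning sketch (assumptions: Hugging Face transformers + datasets,
# a placeholder checkpoint id, and a toy Bokmål->Nynorsk dataset for illustration).
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

checkpoint = "north/t5_base_NCC"  # placeholder id; substitute the sub-version you want
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Toy seq2seq pairs: input text -> target text.
raw = Dataset.from_dict({
    "input":  ["Dette er en bok.", "Jeg liker fisk."],
    "target": ["Dette er ei bok.", "Eg likar fisk."],
})

def preprocess(batch):
    enc = tokenizer(batch["input"], truncation=True, max_length=512)
    enc["labels"] = tokenizer(text_target=batch["target"], truncation=True, max_length=512)["input_ids"]
    return enc

train_ds = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="north-t5-finetuned",
    learning_rate=1e-3,              # fixed learning rate suggested in the card
    lr_scheduler_type="constant",    # no decay
    optim="adafactor",               # commonly used for T5-style models (assumption here)
    per_device_train_batch_size=8,
    num_train_epochs=3,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```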

  Since some people really want to see what the models are capable of without going through the training procedure, I provide a couple of test models. These models are by no means optimised, and are just for demonstrating how the North-T5 models can be used; a minimal inference sketch follows the list below.

+ * Nynorsk Translator. Translates any text from Norwegian Bokmål to Norwegian Nynorsk. Please test the [Streamlit demo](https://huggingface.co/spaces/north/Nynorsk) and the [HuggingFace repo](https://huggingface.co/north/demo-nynorsk-base).
+ * DeUnCaser. The model adds punctuation, spaces and capitalisation back into the text. The input needs to be in Norwegian but does not have to be divided into sentences or have proper capitalisation of words. You can even remove the spaces from the text and make the model reconstruct it. It can be tested with the [Streamlit demo](https://huggingface.co/spaces/north/DeUnCaser) and directly on the [HuggingFace repo](https://huggingface.co/north/demo-deuncaser-base).
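Here is a minimal inference sketch for the demo checkpoints listed above, using the Hugging Face `transformers` API. The exact input format the demo models expect (for example whether a task prefix is needed) is not documented in this card, so the plain-text call below is an assumption.

```python
# Minimal inference sketch (assumption: the demo checkpoints load with
# AutoModelForSeq2SeqLM and accept plain Norwegian text without a task prefix).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "north/demo-nynorsk-base"  # or "north/demo-deuncaser-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = "Regjeringen la i dag fram sitt forslag til statsbudsjett."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```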


  ## Training details
+ All models are built using the Flax-based T5X codebase, and all models are initiated with the mT5 pretrained weights. The models are trained using the T5.1.1 training regime, where they are only trained on an unsupervised masking task. This also means that the models (contrary to the original T5) need to be finetuned to solve specific tasks. This finetuning is however usually not very compute-intensive, and in most cases it can be performed even with free online training resources.
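To illustrate what this unsupervised masking objective looks like at inference time (this is also what the widget examples in the card header exercise), here is a minimal span-filling sketch. The checkpoint id `north/t5_base_NCC` is a placeholder assumption; the example sentence is taken from the widget above.

```python
# Minimal span-filling sketch (placeholder checkpoint id).
# T5-style models mark masked spans with sentinel tokens <extra_id_0>, <extra_id_1>, ...
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "north/t5_base_NCC"  # placeholder; use the sub-version you downloaded
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

text = "På <extra_id_0> kan man <extra_id_1> en bok, og man kan også <extra_id_2> seg ned og lese den."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
# The output interleaves the same sentinel tokens with the model's proposed fill-ins.
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```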

+ All the main model versions are trained for 500,000 steps after the mT5 checkpoint (1,000,000 steps). They are trained mainly on a 75GB corpus, consisting of NCC, Common Crawl and some additional high-quality English text (Wikipedia). The corpus is roughly 80% Norwegian text. Additional languages are added to retain some of the multilingual capabilities, making the model both more robust to new words/concepts and also more suited as a basis for translation tasks.

+ While the huge models almost always will give the best results, they are also both more difficult and more expensive to finetune. I strongly recommend starting by finetuning a base model. The base models can easily be finetuned on a standard graphics card or a free TPU through Google Colab.

+ All models were trained on TPUs. The largest XXL model was trained on a TPU v4-64, the XL model on a TPU v4-32, the Large model on a TPU v4-16 and the rest on a TPU v4-8. Since it is possible to reduce the batch size during fine-tuning, it is also possible to finetune on slightly smaller hardware. The rule of thumb is that you can go "one step down" when finetuning. The largest models still require access to significant hardware, even for finetuning.

  ## Formats
+ All models are trained using the Flax-based T5X library. The original checkpoints are available in T5X format and can be used for both finetuning and inference. All models, except the XXL model, are also converted to Transformers/HuggingFace. In this framework, the models can be loaded for finetuning or inference in Flax, PyTorch and TensorFlow formats.
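As a quick illustration of the converted Transformers weights, the same checkpoint can be loaded with the framework-specific model classes. The checkpoint id is again a placeholder assumption, each class requires its backend (PyTorch, JAX or TensorFlow) to be installed, and the exact set of weight files per repo may differ.

```python
# Loading sketch for the Transformers/HuggingFace conversions (placeholder checkpoint id).
from transformers import (AutoTokenizer, T5ForConditionalGeneration,
                          FlaxT5ForConditionalGeneration, TFT5ForConditionalGeneration)

checkpoint = "north/t5_base_NCC"  # placeholder; any converted sub-version except the XXL model

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
pt_model = T5ForConditionalGeneration.from_pretrained(checkpoint)        # PyTorch
flax_model = FlaxT5ForConditionalGeneration.from_pretrained(checkpoint)  # Flax/JAX
tf_model = TFT5ForConditionalGeneration.from_pretrained(checkpoint)      # TensorFlow
# If a repo is missing weights for a given framework, from_pretrained accepts
# from_pt=True / from_flax=True to convert the available weights on the fly.
```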

  ## Future
+ I will continue to train and release additional models in this set. Which models are added depends on the feedback from users.

  ## Thanks
+ This release would not have been possible without the support and hardware from the [TPU Research Cloud](https://sites.research.google/trc/about/) at Google Research. Both the TPU Research Cloud Team and the T5X Team have provided extremely useful support for getting this running.

  Freddy Wetjen at the National Library of Norway has been of tremendous help in generating the original NCC corpus, and has also contributed to generating the collated corpus used for this training. In addition, he has been a discussion partner in the creation of these models.