pere committed on
Commit
87e7c6a
1 Parent(s): 565c508

Updated README

Files changed (1)
  1. README.md +31 -16
README.md CHANGED
@@ -1,15 +1,31 @@
1
  # North-T5
2
  North-T5 is a set of Norwegian sequence-to-sequence models. It builds upon the flexible T5 text-to-text platform and can be used for a variety of NLP tasks ranging from classification to translation.
3
 
4
 
5
- ## Main versions - download
6
- |**Model:** | **Parameters** |**Transformers** |**T5X** |
7
- |:-----------|:------------|:------------|:------------|
8
- |North-T5-small|60 million | HuggingFace | GCloud Bucket |
9
- |North-T5-base|220 million | HuggingFace | GCloud Bucket |
10
- |North-T5-large|770 million | HuggingFace | GCloud Bucket |
11
- |North-T5-xl|3 billion | HuggingFace | GCloud Bucket |
12
- |North-T5-xxl|11 billion| N/A | GCloud Bucket |
13
 
14
  ## Performance
15
  A thorough evaluation of the North-T5 models is planned. I strongly recommend any external researchers make their own evaluations. The main advantage of the T5 models is their flexibility. Traditionally, encoder-only models (like BERT) excel at classification tasks, while sequence-to-sequence models are easier to train for tasks like translation and Q&A. Despite this, here are the results from using North-T5 on the political classification task explained [here](https://arxiv.org/abs/2104.09617).
@@ -28,15 +44,14 @@ A thorough evaluation of the North-T5 models is planned. I strongly recommend an
28
 
29
  These are preliminary results. The [results](https://arxiv.org/abs/2104.09617) from the BERT models are based on the test results from the best model after 10 runs with early stopping and a decaying learning rate. The T5 results are the average of five runs on the evaluation set. The small model was trained for 10,000 steps, while the rest were trained for 5,000 steps. A fixed learning rate was used (no decay), with no early stopping. The recommended rank classification was not used either. We use a max sequence length of 512. This method simplifies the test setup and gives results that are easy to interpret. However, the results from the T5 models might be somewhat sub-optimal.
30
 
31
- ## Sub-versions of North-T5-Base
32
- To make it possible to run experiments on the T5 models, a range of sub-versions has been released. These models are currently only available as base models. However, other model sizes can be made available on request.
33
 
34
- |**Model:** | **Description** |
35
- |:-----------|:------------|
36
- |North-T5-base-LM |Pretrained for an additional 100k steps on the LM objective described in Raffel et al. In a way, this turns a masked language model into an autoregressive model. It also prepares the model for some tasks. For translation and NLI, for instance, it is well documented that there is a clear benefit to doing a step of unsupervised LM training before starting the fine-tuning.|
37
- |North-byT5-base | A vocabulary-free version of T5. Trained exactly like North-T5, but instead of using the vocabulary of 200,000 tokens, this model operates directly on the raw text. The model architecture might be of particular interest for tasks such as spelling correction, OCR cleaning and handwriting recognition. However, it will, by design, have a shorter maximum sequence length.|
38
- |North-T5-base-modern | Pretrained for an additional 200k steps on a balanced Bokmål and Nynorsk corpus. While originally made for translation between Bokmål and Nynorsk, it might also give improved results on tasks where you know that the input/output is modern "standard" text. A significant part of the training corpus is newspapers and reports.|
39
- |North-T5-base-scandinavian |Pretrained for an additional 200k steps on a corpus with the Scandinavian languages (Bokmål, Nynorsk, Danish, Swedish and Icelandic, plus a tiny bit of Faroese). The model was trained to increase the understanding of what effect such training has on various languages.|
40
 
41
  ## Fine-tuned versions
42
  As explained below, the model really needs to be fine-tuned for specific tasks. This procedure is simple, and the model is not very sensitive to the hyper-parameters used. Usually a decent result can be obtained by using a fixed learning rate of 1e-3. Smaller versions of the model typically need to be trained for a longer time. It is easy to train the base models in Google Colab. I will provide an example notebook on this soon.
 
1
+ ---
2
+ language:
3
+ - no
4
+ - nn
5
+ - sv
6
+ - da
7
+ - is
8
+ - en
9
+
10
+ datasets:
11
+ - nbailab/NCC
12
+ - mc4
13
+ - wikipedia
14
+
15
+ license: apache-2.0
16
+ ---
17
+
18
  # North-T5
19
  North-T5 is a set of Norwegian sequence-to-sequence models. It builds upon the flexible T5 text-to-text platform and can be used for a variety of NLP tasks ranging from classification to translation.
20
 
21
+ | |**Small** <br />_60M_|**Base** <br />_220M_|**Large** <br />_770M_|**XL** <br />_3B_|**XXL** <br />_11B_|
22
+ |:-----------|:------------:|:------------:|:------------:|:------------:|:------------:|
23
+ |North-T5&#8209;NCC|[🤗](https://huggingface.co/north/t5_small_NCC)|[🤗](https://huggingface.co/north/t5_base_NCC)|[🤗](https://huggingface.co/north/t5_large_NCC)|[🤗](https://huggingface.co/north/t5_xl_NCC)|[🤗](https://huggingface.co/north/t5_xxl_NCC)|
24
+ |North-T5&#8209;NCC&#8209;lm|[🤗](https://huggingface.co/north/t5_small_NCC_lm)|[🤗](https://huggingface.co/north/t5_base_NCC_lm)|✔|[🤗](https://huggingface.co/north/t5_xl_NCC_lm)|[🤗](https://huggingface.co/north/t5_xxl_NCC_lm)|
25
+
26
+ ## T5X Checkpoint
27
+ The original T5X checkpoint is also available for this model in the [Google Cloud Bucket](gs://north-t5x/pretrained_models/large/norwegian_NCC_plus_English_pluss100k_lm_t5x_large/).
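
As a rough illustration of how the Hugging Face links in the model table are named, the sketch below builds a repository id from a model size and variant and, optionally, loads it with the `transformers` library. The helper names `north_t5_repo_id` and `load_north_t5` are hypothetical and not part of any library, and the id for the Large `lm` variant (marked ✔ rather than linked in the table) is only assumed to follow the same pattern.

```python
# Sizes and naming pattern are read off the links in the model table.
# Both helpers are illustrative assumptions, not an official API.
SIZES = ("small", "base", "large", "xl", "xxl")


def north_t5_repo_id(size: str, lm: bool = False) -> str:
    """Build a Hugging Face repo id such as 'north/t5_base_NCC_lm'."""
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {SIZES}")
    suffix = "_lm" if lm else ""
    return f"north/t5_{size}_NCC{suffix}"


def load_north_t5(size: str, lm: bool = False):
    """Load tokenizer and model; needs `transformers` and network access."""
    # Imported lazily so the repo-id helper works without transformers.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    repo_id = north_t5_repo_id(size, lm)
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(repo_id)
    return tokenizer, model


print(north_t5_repo_id("base"))         # north/t5_base_NCC
print(north_t5_repo_id("xl", lm=True))  # north/t5_xl_NCC_lm
```

Calling `load_north_t5("base")` would then download the checkpoint on first use. The larger sizes need considerably more memory.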
28
 
29
 
30
  ## Performance
31
  A thorough evaluation of the North-T5 models is planned. I strongly recommend any external researchers make their own evaluations. The main advantage of the T5 models is their flexibility. Traditionally, encoder-only models (like BERT) excel at classification tasks, while sequence-to-sequence models are easier to train for tasks like translation and Q&A. Despite this, here are the results from using North-T5 on the political classification task explained [here](https://arxiv.org/abs/2104.09617).
 
44
 
45
  These are preliminary results. The [results](https://arxiv.org/abs/2104.09617) from the BERT models are based on the test results from the best model after 10 runs with early stopping and a decaying learning rate. The T5 results are the average of five runs on the evaluation set. The small model was trained for 10,000 steps, while the rest were trained for 5,000 steps. A fixed learning rate was used (no decay), with no early stopping. The recommended rank classification was not used either. We use a max sequence length of 512. This method simplifies the test setup and gives results that are easy to interpret. However, the results from the T5 models might be somewhat sub-optimal.
46
 
47
+ ## Sub-versions of North-T5
48
+ The following sub-versions are available. Other versions will be available shortly.
49
+
50
+ |**Model** | **Description** |
51
+ |:-----------|:-------|
52
+ |**North&#8209;T5&#8209;NCC** |This is the main version. It is trained for an additional 500,000 steps from the mT5 checkpoint. The training corpus is based on [the Norwegian Colossal Corpus (NCC)](https://huggingface.co/datasets/NbAiLab/NCC). In addition, data from mC4 and English Wikipedia has been added.|
53
+ |**North&#8209;T5&#8209;NCC&#8209;lm**|Pretrained for an additional 100k steps on the LM objective discussed in the [T5 paper](https://arxiv.org/pdf/1910.10683.pdf). In a way, this turns a masked language model into an autoregressive model. It also prepares the model for some tasks. For translation and NLI, for instance, it is well documented that there is a clear benefit to doing a step of unsupervised LM training before starting the fine-tuning.|
54
 
55
 
56
  ## Fine-tuned versions
57
  As explained below, the model really needs to be fine-tuned for specific tasks. This procedure is simple, and the model is not very sensitive to the hyper-parameters used. Usually a decent result can be obtained by using a fixed learning rate of 1e-3. Smaller versions of the model typically need to be trained for a longer time. It is easy to train the base models in Google Colab. I will provide an example notebook on this soon.
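
The settings quoted in the README text (a fixed learning rate of 1e-3, a max sequence length of 512, no early stopping, and 10,000 steps for the small model versus 5,000 for the rest) can be collected in one place. This is a minimal sketch and not the author's training code. The names `FineTuneSetup`, `setup_for` and `learning_rate_at` are hypothetical, and the step counts are the ones reported for the preliminary evaluation, so they should be read as a starting point only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneSetup:
    """Hypothetical container for the settings quoted in the README."""
    model_size: str
    train_steps: int
    learning_rate: float = 1e-3   # fixed rate, no decay
    max_seq_length: int = 512
    early_stopping: bool = False


def setup_for(model_size: str) -> FineTuneSetup:
    # The small model was trained for 10,000 steps, the rest for 5,000.
    steps = 10_000 if model_size == "small" else 5_000
    return FineTuneSetup(model_size=model_size, train_steps=steps)


def learning_rate_at(step: int, setup: FineTuneSetup) -> float:
    """Constant schedule: the same rate is returned for every step."""
    return setup.learning_rate


for size in ("small", "base"):
    print(setup_for(size))
```

Keeping the schedule as a function makes it easy to swap in a decaying rate later without touching the rest of a training script.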