patrickvonplaten committed on
Commit
0198c1c
1 Parent(s): e000016

Update README.md

Files changed (1)
  1. README.md +64 -17
README.md CHANGED
@@ -23,35 +23,82 @@ Paper: [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1
 
  Authors: *Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler*
 
- # PreTraining
 
- The model is pretrained on the C4 corpus. A batch size of 1024 is used for pretraining this model.
- The model is trained on a total of 1 trillion tokens on C4 (2 million steps). The sequence length is set to 512/512 for inputs and targets.
- Dropout is set to 0 during pretraining. Pre-training took approximately slight more than one month for about 1 trillion
- tokens. We use the same mixture of denoisers as earlier sections. The model has 32 encoder layers and
- 32 decoder layers, dmodel of 4096 and df f of 16384. The dimension of each head is 256 for a total
- of 16 heads. Our model uses a model parallelism of 8. We retain the [same sentencepiece tokenizer as T5 of 32k vocab size].
- Hence, UL20B can be interpreted as a model that is quite similar to T5 but trained with a different objective and slightly different scaling knobs.
- Similar to earlier experiments, **UL20B** is trained with Jax and T5X infrastructure.
 
  ## Fine-tuning
 
  The model was continuously fine-tuned after N pretraining steps where N is typically from 50k to 100k.
- In other words, after each Nk steps of pretraining, we finetune on each downstream task and record its results. This is generally done in a manual fashion.
- While some tasks were finetuned on earlier pretrained checkpoints as the model was still pretraining, many were finetuned on checkpoints nearer
- to convergence that we release.
- As we continiously finetune, we stop finetuning on a task once it has reached sota to save compute.
- In total, the model was trained for 2.65 million steps where as
 
- **Important**: For more details, please see sections 5.2.1 and 5.2.2 of the paper.
 
  ## Contribution
 
- This model was contributed by [Daniel Hesslow](https://huggingface.co/Seledorn)
 
  ## Examples
 
- Note that the model has been fine-tuned
 
  ```python
  from transformers import T5ForConditionalGeneration, AutoTokenizer
 
  Authors: *Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler*
 
+ # Training
 
+ The checkpoint was iteratively pre-trained on C4 and fine-tuned on a variety of datasets.
+
+ ## PreTraining
+
+ The model is pretrained on the C4 corpus. For pretraining, the model is trained on a total of 1 trillion tokens (2 million steps)
+ with a batch size of 1024. The sequence length is set to 512/512 for inputs and targets.
+ Dropout is set to 0 during pretraining. Pre-training took slightly more than one month for about 1 trillion tokens.
+ The model has 32 encoder layers and 32 decoder layers, a `d_model` of 4096 and a `d_ff` of 16384.
+ The dimension of each head is 256 for a total of 16 heads. The model uses a model parallelism of 8.
+ The same sentencepiece tokenizer as T5 with a vocab size of 32,000 is used (click [here](https://huggingface.co/docs/transformers/v4.20.0/en/model_doc/t5#transformers.T5Tokenizer) for more information about the T5 tokenizer).
+
+ UL-20B can be interpreted as a model that is quite similar to T5 but trained with a different objective and slightly different scaling knobs.
+ UL-20B was trained using the [Jax](https://github.com/google/jax) and [T5X](https://github.com/google-research/t5x) infrastructure.
+
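+ As a concrete reference for the numbers above, the following is a minimal sketch of how these hyperparameters map onto a `transformers` `T5Config`. It is illustrative only: the released checkpoint's own `config.json` is authoritative, and the exact vocabulary size (here `32128`, i.e., 32k plus T5's sentinel tokens) is an assumption.
+
+ ```python
+ from transformers import T5Config
+
+ # Sketch of the UL-20B shape described above (not the official config file).
+ config = T5Config(
+     vocab_size=32128,        # assumption: ~32k sentencepiece vocab plus T5 sentinel tokens
+     d_model=4096,            # hidden size
+     d_ff=16384,              # feed-forward size
+     num_layers=32,           # encoder layers
+     num_decoder_layers=32,   # decoder layers
+     num_heads=16,            # attention heads
+     d_kv=256,                # dimension of each head
+     dropout_rate=0.0,        # dropout is set to 0 during pretraining
+ )
+ print(config)
+ ```
+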
+ The training objective during pretraining is a mixture of different denoising strategies that are explained in the following:
+
+ ## Mixture of Denoisers
+
+ To quote the paper:
+ > We conjecture that a strong universal model has to be exposed to solving diverse set of problems
+ > during pre-training. Given that pre-training is done using self-supervision, we argue that such diversity
+ > should be injected to the objective of the model, otherwise the model might suffer from lack a certain
+ > ability, like long-coherent text generation.
+ > Motivated by this, as well as current class of objective functions, we define three main paradigms that
+ > are used during pre-training:
+
+ - **R-Denoiser**: The regular denoising is the standard span corruption introduced in [T5](https://huggingface.co/docs/transformers/v4.20.0/en/model_doc/t5)
+ that uses a range of 2 to 5 tokens as the span length, which masks about 15% of
+ input tokens. These spans are short and potentially useful to acquire knowledge instead of
+ learning to generate fluent text.
+
+ - **S-Denoiser**: A specific case of denoising where we observe a strict sequential order when
+ framing the inputs-to-targets task, i.e., prefix language modeling. To do so, we simply
+ partition the input sequence into two sub-sequences of tokens as context and target such that
+ the targets do not rely on future information. This is unlike standard span corruption where
+ there could be a target token with an earlier position than a context token. Note that, similar to
+ the Prefix-LM setup, the context (prefix) retains a bidirectional receptive field. We note that
+ S-Denoising with very short memory or no memory is in a similar spirit to standard causal
+ language modeling.
+
+ - **X-Denoiser**: An extreme version of denoising where the model must recover a large part
+ of the input, given a small to moderate part of it. This simulates a situation where a model
+ needs to generate a long target from a memory with relatively limited information. To do
+ so, we opt to include examples with aggressive denoising where approximately 50% of the
+ input sequence is masked. This is done by increasing the span length and/or the corruption rate. We
+ consider a pre-training task to be extreme if it has a long span (e.g., ≥ 12 tokens) or a
+ large corruption rate (e.g., ≥ 30%). X-denoising is motivated by being an interpolation
+ between regular span corruption and language-model-like objectives.
+
+ See the following diagram for a more visual explanation:
+
+ ![mixture-of-denoisers](https://raw.githubusercontent.com/google-research/google-research/master/ul2/figs/mod.png)
+
+ **Important**: For more details, please see section 3.1.2 of the [paper](https://arxiv.org/pdf/2205.05131v1.pdf).
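+
+ To make the three paradigms more tangible, here is a toy Python sketch of how each denoiser frames an input/target pair. It is **not** the actual UL2/T5X preprocessing code: the helper functions, the whitespace tokenization, and the T5-style `<extra_id_*>` sentinel strings are simplifying assumptions for illustration only.
+
+ ```python
+ import random
+
+ # T5-style sentinel tokens; the real pipeline operates on token ids, not strings.
+ SENTINELS = [f"<extra_id_{i}>" for i in range(100)]
+
+ def span_corrupt(tokens, corruption_rate, span_len, rng):
+     """Toy R-/X-style denoising: replace random spans with sentinel tokens."""
+     budget = max(1, int(len(tokens) * corruption_rate))  # number of tokens to mask
+     inputs, targets, i, s = [], [], 0, 0
+     while i < len(tokens):
+         if budget > 0 and rng.random() < corruption_rate:
+             span = min(span_len, budget, len(tokens) - i)
+             inputs.append(SENTINELS[s])
+             targets.extend([SENTINELS[s]] + tokens[i:i + span])
+             budget -= span
+             s += 1
+             i += span
+         else:
+             inputs.append(tokens[i])
+             i += 1
+     return " ".join(inputs), " ".join(targets)
+
+ def prefix_lm_split(tokens, prefix_fraction=0.75):
+     """Toy S-style denoising: bidirectional prefix as input, continuation as target."""
+     cut = max(1, int(len(tokens) * prefix_fraction))
+     return " ".join(tokens[:cut]), " ".join(tokens[cut:])
+
+ rng = random.Random(0)
+ tokens = "UL2 frames every pretraining objective as mapping an input to a target".split()
+
+ print(span_corrupt(tokens, corruption_rate=0.15, span_len=3, rng=rng))   # R-denoiser: short spans, ~15% masked
+ print(span_corrupt(tokens, corruption_rate=0.50, span_len=12, rng=rng))  # X-denoiser: long spans and/or ~50% masked
+ print(prefix_lm_split(tokens))                                           # S-denoiser: prefix language modeling
+ ```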
 
  ## Fine-tuning
 
  The model was continuously fine-tuned after N pretraining steps where N is typically from 50k to 100k.
+ In other words, after every N steps of pretraining, the model is finetuned on each downstream task. See section 5.2.2 of the [paper](https://arxiv.org/pdf/2205.05131v1.pdf) for an overview of all datasets that were used for fine-tuning.
+
+ As the model is continuously finetuned, finetuning is stopped on a task once it has reached state-of-the-art performance, to save compute.
+ In total, the model was trained for 2.65 million steps.
 
+ **Important**: For more details, please see sections 5.2.1 and 5.2.2 of the [paper](https://arxiv.org/pdf/2205.05131v1.pdf).
 
  ## Contribution
 
+ This model was contributed by [Daniel Hesslow](https://huggingface.co/Seledorn).
 
  ## Examples
 
+ The following shows how one can predict masked passages using the different denoising strategies.
+
 
  ```python
  from transformers import T5ForConditionalGeneration, AutoTokenizer