patrickvonplaten committed on
Commit
0198c1c
1 Parent(s): e000016

Update README.md

Files changed (1)
  1. README.md +64 -17
README.md CHANGED
@@ -23,35 +23,82 @@ Paper: [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1
 
  Authors: *Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler*
 
- # PreTraining
 
- The model is pretrained on the C4 corpus. A batch size of 1024 is used for pretraining this model.
- The model is trained on a total of 1 trillion tokens on C4 (2 million steps). The sequence length is set to 512/512 for inputs and targets.
- Dropout is set to 0 during pretraining. Pre-training took approximately slight more than one month for about 1 trillion
- tokens. We use the same mixture of denoisers as earlier sections. The model has 32 encoder layers and
- 32 decoder layers, dmodel of 4096 and df f of 16384. The dimension of each head is 256 for a total
- of 16 heads. Our model uses a model parallelism of 8. We retain the [same sentencepiece tokenizer as T5 of 32k vocab size].
- Hence, UL20B can be interpreted as a model that is quite similar to T5 but trained with a different objective and slightly different scaling knobs.
- Similar to earlier experiments, **UL20B** is trained with Jax and T5X infrastructure.
 
  ## Fine-tuning
 
  The model was continuously fine-tuned after N pretraining steps where N is typically from 50k to 100k.
- In other words, after each Nk steps of pretraining, we finetune on each downstream task and record its results. This is generally done in a manual fashion.
- While some tasks were finetuned on earlier pretrained checkpoints as the model was still pretraining, many were finetuned on checkpoints nearer
- to convergence that we release.
- As we continiously finetune, we stop finetuning on a task once it has reached sota to save compute.
- In total, the model was trained for 2.65 million steps where as
 
- **Important**: For more details, please see sections 5.2.1 and 5.2.2 of the paper.
 
  ## Contribution
 
- This model was contributed by [Daniel Hesslow](https://huggingface.co/Seledorn)
 
  ## Examples
 
- Note that the model has been fine-tuned
 
  ```python
  from transformers import T5ForConditionalGeneration, AutoTokenizer
 
  Authors: *Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler*
 
+ # Training
 
+ The checkpoint was iteratively pre-trained on C4 and fine-tuned on a variety of datasets.
+
+ ## PreTraining
+
+ The model is pretrained on the C4 corpus. For pretraining, the model is trained on a total of 1 trillion tokens (2 million steps)
+ with a batch size of 1024. The sequence length is set to 512/512 for inputs and targets.
+ Dropout is set to 0 during pretraining. Pre-training took slightly more than one month for about 1 trillion tokens.
+ The model has 32 encoder layers and 32 decoder layers, a `d_model` of 4096 and a `d_ff` of 16384.
+ The dimension of each head is 256 for a total of 16 heads. The model uses a model parallelism of 8.
+ The same sentencepiece tokenizer as T5 with a vocab size of 32,000 is used (click [here](https://huggingface.co/docs/transformers/v4.20.0/en/model_doc/t5#transformers.T5Tokenizer) for more information about the T5 tokenizer).
+
+ UL-20B can be interpreted as a model that is quite similar to T5 but trained with a different objective and slightly different scaling knobs.
+ UL-20B was trained using the [Jax](https://github.com/google/jax) and [T5X](https://github.com/google-research/t5x) infrastructure.
+
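+ As a concrete reference for the numbers above, the following is a minimal sketch of how these hyperparameters map onto a `transformers` `T5Config`. It is illustrative only: the released checkpoint's own `config.json` is authoritative, and the exact vocabulary size (here `32128`, i.e., 32k plus T5's sentinel tokens) is an assumption.
+
+ ```python
+ from transformers import T5Config
+
+ # Sketch of the UL-20B shape described above (not the official config file).
+ config = T5Config(
+     vocab_size=32128,        # assumption: ~32k sentencepiece vocab plus T5 sentinel tokens
+     d_model=4096,            # hidden size
+     d_ff=16384,              # feed-forward size
+     num_layers=32,           # encoder layers
+     num_decoder_layers=32,   # decoder layers
+     num_heads=16,            # attention heads
+     d_kv=256,                # dimension of each head
+     dropout_rate=0.0,        # dropout is set to 0 during pretraining
+ )
+ print(config)
+ ```
+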
+ The training objective during pretraining is a mixture of different denoising strategies that are explained in the following:
+
+ ## Mixture of Denoisers
+
+ To quote the paper:
+ > We conjecture that a strong universal model has to be exposed to solving diverse set of problems
+ > during pre-training. Given that pre-training is done using self-supervision, we argue that such diversity
+ > should be injected to the objective of the model, otherwise the model might suffer from lack a certain
+ > ability, like long-coherent text generation.
+ > Motivated by this, as well as current class of objective functions, we define three main paradigms that
+ > are used during pre-training:
+
+ - **R-Denoiser**: The regular denoising is the standard span corruption introduced in [T5](https://huggingface.co/docs/transformers/v4.20.0/en/model_doc/t5)
+ that uses a range of 2 to 5 tokens as the span length, which masks about 15% of
+ input tokens. These spans are short and potentially useful to acquire knowledge instead of
+ learning to generate fluent text.
+
+ - **S-Denoiser**: A specific case of denoising where we observe a strict sequential order when
+ framing the inputs-to-targets task, i.e., prefix language modeling. To do so, we simply
+ partition the input sequence into two sub-sequences of tokens as context and target such that
+ the targets do not rely on future information. This is unlike standard span corruption where
+ there could be a target token with an earlier position than a context token. Note that, similar to
+ the Prefix-LM setup, the context (prefix) retains a bidirectional receptive field. We note that
+ S-Denoising with very short memory or no memory is in a similar spirit to standard causal
+ language modeling.
+
+ - **X-Denoiser**: An extreme version of denoising where the model must recover a large part
+ of the input, given a small to moderate part of it. This simulates a situation where a model
+ needs to generate a long target from a memory with relatively limited information. To do
+ so, we opt to include examples with aggressive denoising where approximately 50% of the
+ input sequence is masked. This is done by increasing the span length and/or the corruption rate. We
+ consider a pre-training task to be extreme if it has a long span (e.g., ≥ 12 tokens) or a
+ large corruption rate (e.g., ≥ 30%). X-denoising is motivated by being an interpolation
+ between regular span corruption and language-model-like objectives.
+
+ See the following diagram for a more visual explanation:
+
+ ![mixture-of-denoisers](https://raw.githubusercontent.com/google-research/google-research/master/ul2/figs/mod.png)
+
+ **Important**: For more details, please see section 3.1.2 of the [paper](https://arxiv.org/pdf/2205.05131v1.pdf).
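+
+ To make the three paradigms more tangible, here is a toy Python sketch of how each denoiser frames an input/target pair. It is **not** the actual UL2/T5X preprocessing code: the helper functions, the whitespace tokenization, and the T5-style `<extra_id_*>` sentinel strings are simplifying assumptions for illustration only.
+
+ ```python
+ import random
+
+ # T5-style sentinel tokens; the real pipeline operates on token ids, not strings.
+ SENTINELS = [f"<extra_id_{i}>" for i in range(100)]
+
+ def span_corrupt(tokens, corruption_rate, span_len, rng):
+     """Toy R-/X-style denoising: replace random spans with sentinel tokens."""
+     budget = max(1, int(len(tokens) * corruption_rate))  # number of tokens to mask
+     inputs, targets, i, s = [], [], 0, 0
+     while i < len(tokens):
+         if budget > 0 and rng.random() < corruption_rate:
+             span = min(span_len, budget, len(tokens) - i)
+             inputs.append(SENTINELS[s])
+             targets.extend([SENTINELS[s]] + tokens[i:i + span])
+             budget -= span
+             s += 1
+             i += span
+         else:
+             inputs.append(tokens[i])
+             i += 1
+     return " ".join(inputs), " ".join(targets)
+
+ def prefix_lm_split(tokens, prefix_fraction=0.75):
+     """Toy S-style denoising: bidirectional prefix as input, continuation as target."""
+     cut = max(1, int(len(tokens) * prefix_fraction))
+     return " ".join(tokens[:cut]), " ".join(tokens[cut:])
+
+ rng = random.Random(0)
+ tokens = "UL2 frames every pretraining objective as mapping an input to a target".split()
+
+ print(span_corrupt(tokens, corruption_rate=0.15, span_len=3, rng=rng))   # R-denoiser: short spans, ~15% masked
+ print(span_corrupt(tokens, corruption_rate=0.50, span_len=12, rng=rng))  # X-denoiser: long spans and/or ~50% masked
+ print(prefix_lm_split(tokens))                                           # S-denoiser: prefix language modeling
+ ```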
 
  ## Fine-tuning
 
  The model was continuously fine-tuned after N pretraining steps where N is typically from 50k to 100k.
+ In other words, after every N steps of pretraining, the model is finetuned on each downstream task. See section 5.2.2 of the [paper](https://arxiv.org/pdf/2205.05131v1.pdf) for an overview of all datasets that were used for fine-tuning.
+
+ As the model is continuously finetuned, finetuning is stopped on a task once it has reached state-of-the-art performance, to save compute.
+ In total, the model was trained for 2.65 million steps.
 
+ **Important**: For more details, please see sections 5.2.1 and 5.2.2 of the [paper](https://arxiv.org/pdf/2205.05131v1.pdf).
 
  ## Contribution
 
+ This model was contributed by [Daniel Hesslow](https://huggingface.co/Seledorn).
 
  ## Examples
 
+ The following shows how one can predict masked passages using the different denoising strategies.
+
 
  ```python
  from transformers import T5ForConditionalGeneration, AutoTokenizer