yhavinga committed on
Commit bae580e
1 Parent(s): 4880af2

Update README.md

Files changed (1): README.md +168 -1

README.md:
---
language:
- nl
- en
- multilingual
license: apache-2.0
tags:
- dutch
- english
- t5
- t5x
- ul2
- seq2seq
- translation
datasets:
- yhavinga/mc4_nl_cleaned
- yhavinga/nedd_wiki_news
pipeline_tag: translation
widget:
- text: >-
    Redistricting and West Virginia’s shrinking population forced the state’s
    Republican Legislature to pit Mr. McKinley, a six-term Republican with a
    pragmatic bent, against Mr. Mooney, who has served four terms marked more
    by conservative rhetoric than legislative achievements.
- text: >-
    It is a painful and tragic spectacle that rises before me: I have drawn
    back the curtain from the rottenness of man. This word, in my mouth, is at
    least free from one suspicion: that it involves a moral accusation against
    humanity.
- text: >-
    Young Wehling was hunched in his chair, his head in his hand. He was so
    rumpled, so still and colorless as to be virtually invisible. His
    camouflage was perfect, since the waiting room had a disorderly and
    demoralized air, too. Chairs and ashtrays had been moved away from the
    walls. The floor was paved with spattered dropcloths.
---

# ul2-large-en-nl for English to Dutch translation

A T5 model pretrained on Dutch with the UL2 (Mixture-of-Denoisers) objective and fine-tuned for English to Dutch translation.
The T5 model was introduced in
[this paper](https://arxiv.org/abs/1910.10683)
and first released at [this page](https://github.com/google-research/text-to-text-transfer-transformer).
The UL2 objective was introduced in
[this paper](https://arxiv.org/abs/2205.05131)
and first released at [this page](https://github.com/google-research/google-research/tree/master/ul2).

## Model description

T5 is an encoder-decoder model that treats all NLP problems in a text-to-text format.

`ul2-large-en-nl-v2` is a transformers model fine-tuned on parallel sentence and paragraph pairs
sampled from books.

This model used the [T5 v1.1](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) improvements over the original T5 model during pretraining (a configuration check is sketched after this list):
- GEGLU activation in the feed-forward hidden layer, rather than ReLU - see [here](https://arxiv.org/abs/2002.05202)
- Dropout was turned off during pre-training. Dropout should be re-enabled during fine-tuning
- Pre-trained on the self-supervised objective only, without mixing in the downstream tasks
- No parameter sharing between the embedding and classifier layer

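
These v1.1 choices can be checked against the published configuration. A minimal sketch, assuming the checkpoint name from the usage example below and the standard `transformers` T5 config fields:

```python
from transformers import AutoConfig

# Inspect the released configuration; field names follow transformers' T5Config.
config = AutoConfig.from_pretrained("yhavinga/ul2-large-en-nl-v2")

print(config.feed_forward_proj)    # expected "gated-gelu" (GEGLU feed-forward)
print(config.tie_word_embeddings)  # expected False (no embedding/classifier sharing)
print(config.dropout_rate)         # dropout as configured for fine-tuning/inference
```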

### UL2 pretraining objective

This model was pretrained with UL2's Mixture-of-Denoisers (MoD) objective, which combines diverse pre-training
paradigms. UL2 frames different objective functions for training language models as denoising tasks, where
the model has to recover missing sub-sequences of a given input. During pre-training it uses a novel mixture-of-denoisers
that samples from a varied set of such objectives, each with different configurations. UL2 is trained using a mixture of
three denoising tasks:

1. R-denoising (or regular span corruption), which emulates the standard T5 span corruption objective;
2. X-denoising (or extreme span corruption); and
3. S-denoising (or sequential PrefixLM).

During pre-training, we sample from the available denoising tasks based on user-specified ratios.
UL2 introduces a notion of mode switching, wherein downstream fine-tuning is associated with a specific pre-training
denoising task. During pre-training, a paradigm token is prepended to the input
(`[NLU]` for R-denoising, `[NLG]` for X-denoising, or `[S2S]` for S-denoising) indicating the denoising task at hand.
Then, during fine-tuning, the same token should be prepended to the input to get the best performance on the different downstream
fine-tuning tasks.

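
A minimal sketch of mode switching at inference time, assuming the paradigm token is simply prepended to the source text before tokenization. This checkpoint was fine-tuned for translation, so treat the choice of token as an experiment rather than a requirement:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "yhavinga/ul2-large-en-nl-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Prepend a UL2 paradigm token; "[S2S]" corresponds to the sequential PrefixLM mode.
text = "[S2S] It is a painful and tragic spectacle that rises before me."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=128, num_beams=4, early_stopping=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```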

## Intended uses & limitations

This model was fine-tuned on parallel sentence and paragraph pairs and can be used
for machine translation.

### How to use

Here is how to use this model in PyTorch:
```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

model_name = "yhavinga/ul2-large-en-nl-v2"

# Use the first GPU when available, otherwise fall back to CPU.
device_num = 0 if torch.cuda.is_available() else -1
device = "cpu" if device_num < 0 else f"cuda:{device_num}"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

# Beam search, with the fine-tuning sequence length (370 tokens) as the generation cap.
params = {"max_length": 370, "num_beams": 4, "early_stopping": True}
translator = pipeline("translation", tokenizer=tokenizer, model=model, device=device_num)
print(translator(
    "Young Wehling was hunched in his chair, his head in his hand. He was so rumpled, so still and colorless as to be virtually invisible.",
    **params,
)[0]["translation_text"])
```

### Limitations and bias

The training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral.
Therefore, the model can have biased predictions. This bias will also affect all fine-tuned versions of this model.

## Training data

The `ul2-large-en-nl` T5 model was pre-trained simultaneously on a combination of several datasets,
including the `full` config of the "mc4_nl_cleaned" dataset, which is a cleaned version of Common Crawl's web
crawl corpus, Dutch books, the Dutch subset of Wikipedia (2022-03-20), and a subset of "mc4_nl_cleaned"
containing only texts from Dutch newspapers.

After pre-training, the model was
fine-tuned on a translation dataset containing 13 million sentence and paragraph pairs
sampled from books.

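
The pre-training corpora are on the Hub and can be inspected directly. A minimal sketch using the `datasets` library with the `full` config named above, streaming because the corpus is large and assuming the usual mC4-style `text` field:

```python
from datasets import load_dataset

# Stream the cleaned Dutch mC4 corpus instead of downloading it in full.
mc4_nl = load_dataset("yhavinga/mc4_nl_cleaned", "full", split="train", streaming=True)

# Peek at the first document.
print(next(iter(mc4_nl))["text"][:200])
```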

## Training procedure

### Preprocessing

The ul2-large-en-nl T5 model uses a SentencePiece unigram tokenizer with a vocabulary of 32,000 tokens.
The tokenizer includes the special tokens `<pad>`, `</s>` and `<unk>`, known from the original T5 paper,
`[NLU]`, `[NLG]` and `[S2S]` for the MoD pre-training, and `<n>` for newline.
During pre-training with the UL2 objective, input and output sequences consist of 512 consecutive tokens.
The tokenizer does not lowercase texts and is therefore case-sensitive; it distinguishes
between `dutch` and `Dutch`.
Additionally, 100+28 extra tokens were added for pre-training tasks, resulting in a total of 32,128 tokens.

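
The special tokens can be inspected from the released tokenizer. A minimal sketch, assuming the checkpoint name from the usage example above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yhavinga/ul2-large-en-nl-v2", use_fast=False)

# Total vocabulary size, including the extra tokens added for pre-training.
print(len(tokenizer))

# The UL2 paradigm tokens should map to single vocabulary entries.
for token in ["[NLU]", "[NLG]", "[S2S]"]:
    print(token, tokenizer.convert_tokens_to_ids(token))
```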

### Fine-tuning

This model was fine-tuned on a dataset containing 13M sentence and paragraph translation pairs sampled from books.
Wandb run: https://wandb.ai/yepster/ul2-large-de-neddx2-en-nl/runs/s3z13day?workspace=user-yepster

* Pre-trained model used as starting point: yhavinga/ul2-large-dutch-english (3150k checkpoint)

The first three epochs were trained with the T5x framework, using a batch size of 128 and a constant learning rate of 0.001; this spanned steps 3150k to 3440k.
For the concluding epoch, a Hugging Face Flax-based trainer was used with the following settings (an optimizer sketch follows the list):

- **Batch Size**: Total effective batch size of 512, achieved via per-device settings and gradient accumulation.
- **Learning Rate**: Set at 0.0002, utilizing cosine scheduling.
- **Optimizer**: AdamW with beta1=0.9, beta2=0.997, epsilon=1e-8.
- **Weight Decay**: Configured to 0.001 for regularization.
- **Additional Parameters**: Dropout rate of 0.01, label smoothing factor of 0.11, and sequence length of 370 tokens. Model datatype is bfloat16, z_loss at 0.0001.

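
A minimal sketch of the optimizer and schedule described above, written with `optax` as a Flax trainer typically would be. The total step count is an assumption, and the label smoothing and z_loss terms live in the loss function rather than the optimizer:

```python
import optax

# Assumed step budget for the final epoch; the real value depends on the dataset
# size and the effective batch size of 512.
total_steps = 25_000

# Cosine decay from the documented peak learning rate of 2e-4 down to zero.
schedule = optax.cosine_decay_schedule(init_value=2e-4, decay_steps=total_steps)

# AdamW with the documented betas, epsilon and weight decay.
optimizer = optax.adamw(
    learning_rate=schedule,
    b1=0.9,
    b2=0.997,
    eps=1e-8,
    weight_decay=0.001,
)
```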

## Evaluation results

TBD

## Acknowledgements

This project would not have been possible without compute generously provided by Google through the
[TPU Research Cloud](https://sites.research.google/trc/).
Thanks to the [Finnish-NLP](https://huggingface.co/Finnish-NLP) authors for releasing their code for the UL2 objective and associated task definitions.
Thanks to [Stephenn Fernandes](https://huggingface.co/StephennFernandes) for helping me get started with the t5x framework.

Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)