Pablogps committed on
Commit 3b28ee3
1 Parent(s): 6257c22

Update README.md

Files changed (1)
  1. README.md +27 -46
README.md CHANGED
@@ -132,6 +132,9 @@ control sample.</caption>
132
  <caption>Figure 6. Experimental perplexity distribution of the sampled `mc4-es` after applying `Random` sampling.</caption>
133
  </figure>
134
 
135
  We then used the same setup as Liu et al. (2019) but trained only for half the steps (250k) on a sequence length of 128. In particular, `Gaussian` trained for the 250k steps, while `Random` was stopped at 230k and `Stepwise` at 180k (this was a decision based on an analysis of training performance and the computational resources available at the time).
136
 
137
  Then, we continued training the most promising model for a few more steps (~25k) on sequence length 512. We tried two strategies for this, since it is not easy to find clear details about this change in the literature. It turns out this decision had a big impact on the final performance.
@@ -204,72 +207,50 @@ For simplicity, we will abbreviate the different models as follows:
204
  <figure>
205
 
206
  <caption>
207
- Table 3. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS nad NER is CoNLL 2002. POS, NER adn PAWS-X used max length 512 and batch size 8.
208
  </caption>
209
 
210
  | Model | POS (F1/Acc) | NER (F1/Acc) | PAWS-X (Acc) | XNLI-256 (Acc) | XNLI-512 (Acc) |
211
- |--------------|-------------------------|----------------------|--------------|--------------|--------------|
212
- | BERT-m | 0.9629 / 0.9687 | 0.8539 / 0.9779 | 0.5765 | | |
213
- | BERT-wwm | 0.9642 / 0.9700 | 0.8579 / 0.9783 | 0.8720 | | |
214
- | BSC-BNE | 0.9659 / 0.9707 | 0.8700 / 0.9807 | 0.5765 | | |
215
- | Beta | 0.9638 / 0.9690 | 0.8725 / 0.9812 | 0.5765 | | |
216
- | Random | 0.9656 / 0.9704 | 0.8704 / 0.9807 | 0.8800 | | |
217
- | Stepwise | 0.9656 / 0.9707 | 0.8705 / 0.9809 | 0.8825 | | |
218
- | Gaussian | 0.9662 / 0.9709 | **0.8792 / 0.9816** | 0.8875 | | |
219
- | Random-512 | 0.9660 / 0.9707 | 0.8616 / 0.9803 | 0.6735 | | |
220
- | Gaussian-512 | **0.9662 / 0.9714** | **0.8764 / 0.9819** | **0.8965** | | |
221
 
222
  </figure>
223
 
224
  In addition to the tasks above, we also trained the beta model on the SQuAD dataset, achieving an exact match score of 50.96 and an F1 of 68.74 (sequence length 128). A full evaluation of this task is still pending.
225
 
226
- To note: not intense tuning, epochs, etc. Still, good?? PAWS-X: weird (large differences and repeated base value). Repeated and same, with minor differences.
227
 
228
- ### XNLI
229
 
230
- <figure>
231
-
232
- <caption>Table 6. Results for XNLI with sequence length 256 and batch size 32.</caption>
233
-
234
- | Model | Accuracy |
235
- |----------------------------------------------------|----------|
236
- | bert-base-multilingual-cased | 0.7852 |
237
- | dccuchile/bert-base-spanish-wwm-cased | **0.8186** |
238
- | BSC-TeMU/roberta-base-bne | 0.8178 |
239
- | bertin-project/bertin-base-random | 0.7745 |
240
- | bertin-project/bertin-base-stepwise | 0.7820 |
241
- | bertin-project/bertin-base-gaussian | 0.7942 |
242
- | bertin-project/bertin-base-random-exp-512seqlen | 0.7723 |
243
- | bertin-project/bertin-base-gaussian-exp-512seqlen | 0.7878 |
244
 
245
- </figure>
246
 
 
247
 
248
- <figure>
249
-
250
- <caption>Table 7. Results for XNLI with sequence length 512 and batch size 16.</caption>
251
 
252
- | Model | Accuracy |
253
- |----------------------------------------------------|----------|
254
- | bert-base-multilingual-cased | WIP |
255
- | dccuchile/bert-base-spanish-wwm-cased | WIP |
256
- | BSC-TeMU/roberta-base-bne | WIP |
257
- | bertin-project/bertin-base-random | WIP |
258
- | bertin-project/bertin-base-stepwise | WIP |
259
- | bertin-project/bertin-base-gaussian | WIP |
260
- | bertin-project/bertin-base-random-exp-512seqlen | 0.7799 |
261
- | bertin-project/bertin-base-gaussian-exp-512seqlen | 0.7843 |
262
 
263
- </figure>
264
 
265
  # Conclusions
266
 
267
- With roughly 10 days worth of access to 3xTPUv3-8, we have achieved remarkable results surpassing previous state of the art in a few tasks, and even improving document classification on models trained in massive supercomputers with very large—private—and highly curated datasets.
 
 
268
 
269
- The experience has been incredible and we feel this kind of events provide an amazing opportunity for small teams on low or non-existent budgets to learn how the big players in the field pre-train their models. The trade-off between learning and experimenting, and being beta-testers of libraries (Flax/JAX) and infrastructure (TPU VMs) is a marginal cost to pay compared to the benefits such access has to offer.
270
 
271
- We hope our work will set the basis for more small teams playing and
272
- experimenting with language models training on smaller subsets of huge datasets with reduced training times, since the performance of our models is on par with those trained on big machines for longer times.
273
 
274
  ## Team members
275
 
132
  <caption>Figure 6. Experimental perplexity distribution of the sampled `mc4-es` after applying `Random` sampling.</caption>
133
  </figure>
134
 
135
+
136
+ ### Training details
137
+
138
  We then used the same setup as Liu et al. (2019) but trained only for half the steps (250k) on a sequence length of 128. In particular, `Gaussian` trained for the 250k steps, while `Random` was stopped at 230k and `Stepwise` at 180k (this was a decision based on an analysis of training performance and the computational resources available at the time).
139
 
140
  Then, we continued training the most promising model for a few more steps (~25k) on sequence length 512. We tried two strategies for this, since it is not easy to find clear details about this change in the literature. It turns out this decision had a big impact on the final performance.
207
  <figure>
208
 
209
  <caption>
210
+ Table 3. Metrics for different downstream tasks, comparing our models to other relevant BERT variations from the literature. The dataset for POS and NER is CoNLL 2002. POS, NER and PAWS-X used max sequence length 512 and batch size 8; XNLI with sequence length 256 used batch size 32, while XNLI with sequence length 512 required batch size 16.
211
  </caption>
212
 
213
  | Model | POS (F1/Acc) | NER (F1/Acc) | PAWS-X (Acc) | XNLI-256 (Acc) | XNLI-512 (Acc) |
214
+ |--------------|-------------------------|----------------------|--------------|-----------------|--------------|
215
+ | BERT-m | 0.9629 / 0.9687 | 0.8539 / 0.9779 | 0.5765 | 0.7852 | |
216
+ | BERT-wwm | 0.9642 / 0.9700 | 0.8579 / 0.9783 | 0.8720 | **0.8186** | |
217
+ | BSC-BNE | 0.9659 / 0.9707 | 0.8700 / 0.9807 | 0.5765 | 0.8178 | |
218
+ | Beta | 0.9638 / 0.9690 | 0.8725 / 0.9812 | 0.5765 | | 0.3333 |
219
+ | Random | 0.9656 / 0.9704 | 0.8704 / 0.9807 | 0.8800 | 0.7745 | 0.7795 |
220
+ | Stepwise | 0.9656 / 0.9707 | 0.8705 / 0.9809 | 0.8825 | 0.7820 | 0.7799 |
221
+ | Gaussian | 0.9662 / 0.9709 | **0.8792 / 0.9816** | 0.8875 | 0.7942 | 0.7843 |
222
+ | Random-512 | 0.9660 / 0.9707 | 0.8616 / 0.9803 | 0.6735 | 0.7723 | 0.7799 |
223
+ | Gaussian-512 | **0.9662 / 0.9714** | **0.8764 / 0.9819** | **0.8965** | 0.7878 | 0.7843 |
224
 
225
  </figure>
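
The fine-tuning runs summarised in Table 3 follow a standard `transformers` recipe. As a rough illustration (not our exact script), the sketch below fine-tunes one of our released checkpoints on Spanish XNLI with the sequence length (512) and batch size (16) given in the caption; the learning rate and number of epochs are illustrative placeholders rather than tuned values.

```python
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bertin-project/bertin-base-gaussian-exp-512seqlen"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Spanish XNLI: premise/hypothesis pairs with 3 labels
# (entailment, neutral, contradiction).
xnli = load_dataset("xnli", "es")

def tokenize(batch):
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, max_length=512)

xnli = xnli.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

args = TrainingArguments(
    output_dir="xnli-512",
    per_device_train_batch_size=16,  # batch size reported for XNLI at length 512
    learning_rate=2e-5,              # illustrative, not tuned
    num_train_epochs=3,              # illustrative, not tuned
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=xnli["train"],
    eval_dataset=xnli["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())
```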
226
 
227
  In addition to the tasks above, we also trained the beta model on the SQuAD dataset, achieving an exact match score of 50.96 and an F1 of 68.74 (sequence length 128). A full evaluation of this task is still pending.
228
 
229
+ Note that we did not perform intensive tuning (hyperparameters, number of epochs, etc.), yet results are still good. The PAWS-X results are puzzling, with large differences between models and a repeated baseline value; some results are repeated or nearly identical with only minor differences, which may indicate that training was sometimes too short. XNLI-512 runs took roughly 19 hours per model.
230
 
231
+ ## Bias and ethics
232
 
233
+ Bananas
 
234
 
235
+ ## Analysis
236
 
237
+ The performance of our models has been, in general, very good. Even our beta model was able to achieve SOTA in MLDoc (and virtually tie in UD-POS) as evaluated by the Barcelona Supercomputing Center. In the main masked-language task our models reach values between 0.65 and 0.69, which bodes well for downstream tasks.
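
As a quick qualitative check of the masked-language objective, any of the released checkpoints can be queried with the `fill-mask` pipeline from `transformers`; the example sentence below is ours and merely illustrative.

```python
from transformers import pipeline

# Query the Gaussian model trained on sequence length 512; these RoBERTa-style
# checkpoints use "<mask>" as the mask token.
fill_mask = pipeline("fill-mask",
                     model="bertin-project/bertin-base-gaussian-exp-512seqlen")

for prediction in fill_mask("Madrid es la <mask> de España."):
    print(f"{prediction['token_str']!r}: {prediction['score']:.3f}")
```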
238
 
239
+ Our analysis of downstream tasks is not yet complete. It should be stressed that we have continued this fine-tuning in the same spirit of the project, that is, with smaller practitioners and budgets in mind. Therefore, our goal is not to achieve the highest possible metrics for each task, but rather to train using sensible hyperparameters and training times, and to compare the different models under these conditions. It is certainly possible that any of the models (ours or otherwise) could be carefully tuned to achieve better results at a given task, and that the best tuning might produce a new "winner" for that category. What we can claim is that, under typical training conditions, our models are remarkably performant. In particular, Gaussian-512 is clearly superior, taking the lead in three of the four tasks analysed.
 
 
240
 
241
+ The differences in performance between models trained with different data-sampling techniques are consistent: Gaussian sampling always comes first, while Stepwise is only marginally better than Random. This strongly suggests that the sampling technique is indeed relevant.
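
For readers who want to experiment with these ideas, the sketch below shows one possible implementation of Gaussian perplexity sampling: each document is kept with a probability given by a Gaussian function of its perplexity, so that mid-perplexity documents are favoured over both extremes. The function names, the use of a precomputed `perplexity` field, and the mean/standard-deviation parameters are illustrative assumptions, not the exact code used to sample `mc4-es`.

```python
import numpy as np

def gaussian_keep_probability(perplexity, mean, std):
    # Keep-probability peaks at the corpus mean perplexity and decays for
    # documents with very low or very high perplexity.
    return np.exp(-((perplexity - mean) ** 2) / (2 * std ** 2))

def gaussian_subsample(docs, mean, std, seed=0):
    # `docs` is an iterable of dicts with a precomputed "perplexity" field
    # (scored with an external language model); mean/std describe the
    # perplexity distribution of the full corpus.
    rng = np.random.default_rng(seed)
    for doc in docs:
        if rng.random() < gaussian_keep_probability(doc["perplexity"], mean, std):
            yield doc
```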
 
242
 
243
+ As already mentioned in the Training details section, the methodology used to extend the sequence length during training is critical. The Random-sampling model took a significant performance hit in this process, while Gaussian-512 ended up with better metrics than Gaussian-128 on both the main masked-language task and the downstream datasets. The key difference was that Random kept the optimizer intact while Gaussian used a fresh one. It is possible that this difference is related to the timing of the change in sequence length, given that close to the end of training the optimizer will keep learning rates very low, perhaps too low for the adjustments needed after such a change. We believe this is an important topic of research, but our preliminary data suggest that using a new optimizer is a safe alternative when in doubt or if computational resources are scarce.
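
In Optax terms, the two strategies differ in whether the optimizer (and its learning-rate schedule) is carried over or re-created for the 512-length continuation. The sketch below is a minimal illustration under that assumption; the schedule type and hyperparameters are placeholders, not our exact training configuration.

```python
import optax

def make_optimizer(total_steps, peak_lr=1e-4, warmup_steps=1_000):
    # A fresh AdamW optimizer with its own warmup-and-decay schedule.
    schedule = optax.warmup_cosine_decay_schedule(
        init_value=0.0, peak_value=peak_lr,
        warmup_steps=warmup_steps, decay_steps=total_steps)
    return optax.adamw(learning_rate=schedule)

# Strategy used for Random-512: keep the old optimizer and its state, so the
# continuation at sequence length 512 starts from the very low learning rates
# reached near the end of the 128-length run.
#   optimizer, opt_state = old_optimizer, old_opt_state

# Strategy used for Gaussian-512: build a new optimizer sized for the ~25k-step
# continuation, so learning rates warm up again and can adapt to the longer
# sequences.
optimizer = make_optimizer(total_steps=25_000)
#   opt_state = optimizer.init(params)  # params restored from the 128-length checkpoint
```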
244
 
245
  # Conclusions
246
 
247
+ With roughly 10 days' worth of access to 3xTPUv3-8, we have achieved remarkable results, surpassing the previous state of the art in a few tasks and even improving document classification over models trained on massive supercomputers with very large, private, and highly curated datasets.
248
+
249
+ The sheer size of the available datasets looked enticing while formulating the project; however, it soon proved to be an important challenge given our time constraints. This led to a debate within the team that ended up reshaping our project and goals, which now focus on analysing this problem and how to improve the situation for smaller teams like ours in the future. The subsampling techniques analysed in this report have shown great promise in this regard, and we hope to see other groups use and improve them in the future.
250
 
251
+ At a personal level, we agree that the experience has been incredible, and we feel this kind of event provides an amazing opportunity for small teams on low or non-existent budgets to learn how the big players in the field pre-train their models, which certainly stirs the research community. The trade-off of learning and experimenting while beta-testing libraries (Flax/JAX) and infrastructure (TPU VMs) is a marginal cost to pay compared to the benefits such access has to offer.
252
 
253
+ Given our good results, on par with those of large corporations, we hope our work will inspire and set the basis for more small teams to play and experiment with language models on smaller subsets of huge datasets.
 
254
 
255
  ## Team members
256