Pablogps committed
Commit 49b6c59
1 Parent(s): 73f4dbe

Update README.md

Files changed (1): README.md (+4 -4)

README.md CHANGED
@@ -213,7 +213,7 @@ For simplicity, we will abbreviate the different models as follows:
 <figure>
 
 <caption>
-Table 3. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS and NER is CoNLL 2002. POS and NER used max length 128 and batch size 128. Batch size for XNLI (length 256) is 256. All models were fine-tuned for 5 epochs, with the exception of XNLI-256, which used 2 epochs.
+Table 3. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS and NER is CoNLL 2002. POS and NER used max length 128 and batch size 128. Batch size for XNLI (length 256) is 256. All models were fine-tuned for 5 epochs, with the exception of XNLI-256, which used 2 epochs. Stepwise used an older checkpoint with only 180,000 steps.
 </caption>
 
  | Model | POS (F1/Acc) | NER (F1/Acc) | XNLI-256 (Acc) |
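
For reference, a fine-tuning run with the XNLI-256 settings from the caption above (max length 256, batch size 256, 2 epochs) could be set up roughly as follows. This is a minimal sketch assuming the Hugging Face `Trainer` API; the checkpoint name and learning rate are placeholders, not the exact values behind the reported numbers.

```python
# Sketch only: XNLI-256 settings from Table 3 (max length 256, batch size 256,
# 2 epochs). The checkpoint name and learning rate are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bertin-project/bertin-roberta-base-spanish"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

xnli = load_dataset("xnli", "es")  # premise / hypothesis / label columns

def tokenize(batch):
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, padding="max_length", max_length=256)

xnli = xnli.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="xnli-256",
    per_device_train_batch_size=256,  # batch size 256, as in the caption
    num_train_epochs=2,               # XNLI-256 used 2 epochs
    learning_rate=5e-5,               # assumption: not reported in the README
)

Trainer(model=model, args=args,
        train_dataset=xnli["train"], eval_dataset=xnli["validation"]).train()
```
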
@@ -230,7 +230,7 @@ Table 3. Metrics for different downstream tasks, comparing our different models
 
 </figure>
 
-Table 4. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS and NER is CoNLL 2002. POS, NER and PAWS-X used max length 512 and batch size 128. Batch size for XNLI (length 512) is 128. All models were fine-tuned for 5 epochs. Results marked with * indicate a repetition.
+Table 4. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS and NER is CoNLL 2002. POS, NER and PAWS-X used max length 512 and batch size 128. Batch size for XNLI (length 512) is 128. All models were fine-tuned for 5 epochs. Results marked with * indicate a repetition. The Stepwise checkpoint had 204,000 steps during these tests.
 </caption>
 
  | Model | POS (F1/Acc) | NER (F1/Acc) | PAWS-X (Acc) | XNLI (Acc) |
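
Similarly, the token-classification runs in Table 4 (CoNLL 2002, max length 512, batch size 128, 5 epochs) could be sketched as below. Again, the checkpoint name and learning rate are illustrative assumptions, not the exact configuration used for the reported metrics.

```python
# Sketch only: NER fine-tuning roughly matching Table 4 (CoNLL 2002,
# max length 512, batch size 128, 5 epochs). Checkpoint and learning rate
# are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

model_name = "bertin-project/bertin-roberta-base-spanish"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)

conll = load_dataset("conll2002", "es")
labels = conll["train"].features["ner_tags"].feature.names
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

def tokenize_and_align(batch):
    enc = tokenizer(batch["tokens"], is_split_into_words=True,
                    truncation=True, max_length=512)
    enc["labels"] = []
    for i, tags in enumerate(batch["ner_tags"]):
        word_ids = enc.word_ids(batch_index=i)
        # Label only the first sub-token of each word; ignore the rest (-100).
        enc["labels"].append([
            tags[w] if w is not None and (j == 0 or word_ids[j - 1] != w) else -100
            for j, w in enumerate(word_ids)
        ])
    return enc

tokenized = conll.map(tokenize_and_align, batched=True,
                      remove_columns=conll["train"].column_names)

args = TrainingArguments(
    output_dir="ner-conll2002-512",
    per_device_train_batch_size=128,  # batch size 128, as in the caption
    num_train_epochs=5,               # 5 epochs, as in the caption
    learning_rate=5e-5,               # assumption: not reported in the README
)

Trainer(model=model, args=args, train_dataset=tokenized["train"],
        eval_dataset=tokenized["validation"],
        data_collator=DataCollatorForTokenClassification(tokenizer)).train()
```
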
@@ -394,9 +394,9 @@ Geographical bias
 
 The performance of our models has been, in general, very good. Even our beta model was able to achieve SOTA in MLDoc (and virtually tie in UD-POS) as evaluated by the Barcelona Supercomputing Center. In the main masked-language task our models reach values between 0.65 and 0.69, which foretells good results for downstream tasks.
 
-Our analysis of downstream tasks is not yet complete. It should be stressed that we have continued this fine-tuning in the same spirit as the project, that is, with smaller-scale practitioners and budgets in mind. Therefore, our goal is not to achieve the highest possible metrics for each task, but rather to train with sensible hyperparameters and training times, and to compare the different models under these conditions. It is certainly possible that any of the models—ours or otherwise—could be carefully tuned to achieve better results at a given task, and the best tuning might well produce a new "winner" for that category. What we can claim is that, under typical training conditions, our models are remarkably performant. In particular, Gaussian-512 is clearly superior, taking the lead in three of the four tasks analysed.
+Our analysis of downstream tasks is not yet complete. It should be stressed that we have continued this fine-tuning in the same spirit as the project, that is, with smaller-scale practitioners and budgets in mind. Therefore, our goal is not to achieve the highest possible metrics for each task, but rather to train with sensible hyperparameters and training times, and to compare the different models under these conditions. It is certainly possible that any of the models—ours or otherwise—could be carefully tuned to achieve better results at a given task, and the best tuning might well produce a new "winner" for that category. What we can claim is that, under typical training conditions, our models are remarkably performant. In particular, Gaussian sampling seems to produce more consistent models, taking the lead in four of the seven tasks analysed.
 
-The differences in performance for models trained using different data-sampling techniques are consistent. Gaussian sampling is always first, while Stepwise is only marginally better than Random. This shows that the sampling technique is indeed relevant.
+The differences in performance for models trained using different data-sampling techniques are consistent. Gaussian sampling is always first (with the exception of POS-512), while Stepwise is better than Random when trained for a similar number of steps. This shows that the sampling technique is indeed relevant.
 
 As already mentioned in the Training details section, the methodology used to extend the sequence length during training is critical. The Random-sampling model took a significant hit in performance in this process, while Gaussian-512 ended up with better metrics than Gaussian-128, in both the main masked-language task and the downstream datasets. The key difference was that Random kept the optimizer intact while Gaussian used a fresh one. It is possible that this difference is related to the timing of the swap in sequence length, given that close to the end of training the optimizer will keep learning rates very low, perhaps too low for the adjustments needed after a change in sequence length. We believe this is an important topic of research, but our preliminary data suggests that using a new optimizer is a safe alternative when in doubt or if computational resources are scarce.
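
To make the optimizer swap discussed above concrete, the sketch below shows the difference between carrying the optimizer state over and re-initializing it before the length-512 phase. It uses `optax` with a toy parameter tree and arbitrary AdamW settings purely for illustration; it is not the project's actual pre-training code.

```python
# Illustration only: the toy parameters and AdamW settings stand in for the
# real pre-training setup; the point is how the optimizer state is handled.
import jax.numpy as jnp
import optax

params = {"embedding": jnp.zeros((50_000, 768)), "head": jnp.zeros((768, 768))}

optimizer = optax.adamw(learning_rate=1e-4)
opt_state = optimizer.init(params)  # state accumulated during the seq-len-128 phase

# ... pre-train with max sequence length 128 ...

# Random-sampling run: keep `opt_state` unchanged and continue at length 512.
# Gaussian run: discard the accumulated Adam moment estimates by
# re-initializing the state before the length-512 phase.
opt_state = optimizer.init(params)

# ... continue pre-training with max sequence length 512 ...
```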
 