Update README.md
README.md CHANGED
@@ -213,7 +213,7 @@ For simplicity, we will abbreviate the different models as follows:
<figure>

<caption>
-Table 3. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. The dataset for POS and NER is CoNLL 2002. POS and NER used max length 128 and batch size 128. Batch size for XNLI (length 256) was 256. All models were fine-tuned for 5 epochs, with the exception of XNLI-256, which used 2 epochs.
+Table 3. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. The dataset for POS and NER is CoNLL 2002. POS and NER used max length 128 and batch size 128. Batch size for XNLI (length 256) was 256. All models were fine-tuned for 5 epochs, with the exception of XNLI-256, which used 2 epochs. Stepwise used an older checkpoint with only 180,000 steps.
</caption>

| Model | POS (F1/Acc) | NER (F1/Acc) | XNLI-256 (Acc) |

@@ -230,7 +230,7 @@ Table 3. Metrics for different downstream tasks, comparing our different models

</figure>

-Table 4. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. The dataset for POS and NER is CoNLL 2002. POS, NER and PAWS-X used max length 512 and batch size 128. Batch size for XNLI (length 512) was 128. All models were fine-tuned for 5 epochs. Results marked with * indicate a repeated run.
+Table 4. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. The dataset for POS and NER is CoNLL 2002. POS, NER and PAWS-X used max length 512 and batch size 128. Batch size for XNLI (length 512) was 128. All models were fine-tuned for 5 epochs. Results marked with * indicate a repeated run. The Stepwise checkpoint had 204,000 steps at the time of these tests.
</caption>

| Model | POS (F1/Acc) | NER (F1/Acc) | PAWS-X (Acc) | XNLI (Acc) |

@@ -394,9 +394,9 @@ Geographical bias

The performance of our models has been, in general, very good. Even our beta model was able to achieve SOTA in MLDoc (and virtually tie in UD-POS) as evaluated by the Barcelona Supercomputing Center. In the main masked-language task our models reach values between 0.65 and 0.69, which bodes well for the downstream tasks.

-Our analysis of downstream tasks is not yet complete. It should be stressed that we have continued this fine-tuning in the same spirit as the project, that is, with smaller-scale practitioners and budgets in mind. Therefore, our goal is not to achieve the highest possible metrics for each task, but rather to train using sensible hyperparameters and training times, and to compare the different models under these conditions. It is certainly possible that any of the models, ours or otherwise, could be carefully tuned to achieve better results at a given task, and the best tuning might well produce a new "winner" for that category. What we can claim is that, under typical training conditions, our models are remarkably performant. In particular, Gaussian
+Our analysis of downstream tasks is not yet complete. It should be stressed that we have continued this fine-tuning in the same spirit as the project, that is, with smaller-scale practitioners and budgets in mind. Therefore, our goal is not to achieve the highest possible metrics for each task, but rather to train using sensible hyperparameters and training times, and to compare the different models under these conditions. It is certainly possible that any of the models, ours or otherwise, could be carefully tuned to achieve better results at a given task, and the best tuning might well produce a new "winner" for that category. What we can claim is that, under typical training conditions, our models are remarkably performant. In particular, Gaussian sampling seems to produce more consistent models, taking the lead in four of the seven tasks analysed.

-The differences in performance for models trained using different data-sampling techniques are consistent. Gaussian sampling is always first, while Stepwise is
+The differences in performance for models trained using different data-sampling techniques are consistent. Gaussian sampling is always first (with the exception of POS-512), while Stepwise is better than Random when trained for a similar number of steps. This indicates that the sampling technique is indeed relevant.

As already mentioned in the Training details section, the methodology used to extend sequence length during training is critical. The Random-sampling model took a significant hit in performance in this process, while Gaussian-512 ended up with better metrics than Gaussian-128, in both the main masked-language task and the downstream datasets. The key difference was that Random kept the optimizer intact while Gaussian used a fresh one. It is possible that this difference is related to the timing of the swap in sequence length, given that close to the end of training the optimizer keeps learning rates very low, perhaps too low for the adjustments needed after a change in sequence length. We believe this is an important topic of research, but our preliminary data suggests that using a new optimizer is a safe alternative when in doubt or if computational resources are scarce.
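To make the recipe in the Table 3 and Table 4 captions concrete, here is a minimal fine-tuning sketch using the 🤗 `transformers` Trainer API. Only the sequence length, batch size, and epoch count come from the captions; the checkpoint path, output directory, and dataset column name are illustrative assumptions.

```python
from transformers import AutoTokenizer, TrainingArguments

# Placeholder path: substitute the pretrained checkpoint under evaluation.
tokenizer = AutoTokenizer.from_pretrained("./pretrained-checkpoint")

def tokenize(batch):
    # CoNLL 2002 examples arrive pre-split into words; cap them at the
    # 512-token budget used for POS, NER and PAWS-X in Table 4.
    return tokenizer(batch["tokens"], is_split_into_words=True,
                     truncation=True, padding="max_length", max_length=512)

training_args = TrainingArguments(
    output_dir="pos-conll2002",       # illustrative output path
    per_device_train_batch_size=128,  # batch size 128, per the caption
    num_train_epochs=5,               # 5 epochs for POS, NER and PAWS-X
)
```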
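As a rough illustration of what the sampling labels in the discussion mean, the sketch below implements a Gaussian-style sampler. It assumes each document carries a quality score, such as its perplexity under a reference language model; the function names and the exact weighting are this sketch's assumptions, not the project's code.

```python
import numpy as np

def gaussian_weight(scores, mean, std):
    # Bell-shaped weight: documents whose score sits near the corpus mean
    # get weight close to 1; outliers on either side are down-weighted.
    return np.exp(-((scores - mean) ** 2) / (2 * std**2))

def gaussian_sample(docs, scores, seed=0):
    # Keep each document with probability equal to its Gaussian weight.
    scores = np.asarray(scores, dtype=float)
    weights = gaussian_weight(scores, scores.mean(), scores.std())
    rng = np.random.default_rng(seed)
    keep = rng.random(len(docs)) < weights
    return [doc for doc, kept in zip(docs, keep) if kept]
```

A Stepwise variant would presumably replace the smooth curve with discrete per-bucket weights, while Random would ignore the scores entirely.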
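The last paragraph's practical takeaway, preferring a fresh optimizer when growing the sequence length late in training, is straightforward to apply. A minimal PyTorch sketch with hypothetical helper names:

```python
from torch.optim import AdamW

def switch_to_long_sequences(model, make_dataloader, lr=1e-4):
    # `make_dataloader` is a hypothetical factory that rebuilds batches at
    # the new maximum sequence length (e.g. 512 instead of 128).
    long_loader = make_dataloader(max_seq_len=512)
    # Fresh optimizer: moment estimates restart from zero and the learning
    # rate restarts at a useful value, rather than inheriting the near-zero
    # rates at the tail of the original schedule.
    optimizer = AdamW(model.parameters(), lr=lr)
    return long_loader, optimizer
```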