edugp committed
Commit 0af32aa
Parent: 4148005

Fix batch sizes

Files changed (1): README.md (+4 -4)
README.md CHANGED
@@ -154,7 +154,7 @@ For `Random` sampling we trained with seq len 512 during the last 20 steps of th
 
 For `Gaussian` sampling we started a new optimizer after 230 steps with 128 sequence length, using a short warmup interval. Results are much better using this procedure. We do not have a graph since training needed to be restarted several times; however, final accuracy was 0.6873, compared to 0.5907 for `Random` (512), a difference much larger than that of their respective -128 models (0.6520 for `Random`, 0.6608 for `Gaussian`).
 
-Batch size was 256 for training with 128 sequence length, and 48 for 512 sequence length, with no change in learning rate. Warmup for 512 was 500 steps.
+Batch size was 2048 for training with 128 sequence length, and 384 for 512 sequence length, with no change in learning rate. Warmup for 512 was 500 steps.
 
 ## Results
 
@@ -215,7 +215,7 @@ For simplicity, we will abbreviate the different models as follows:
 <figure>
 
 <caption>
-Table 3. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS and NER is CoNLL 2002. POS and NER used max length 128 and batch size 128. Batch size for XNLI (length 256) is 256. All models were fine-tuned for 5 epochs, with the exception of XNLI-256, which used 2 epochs. Stepwise used an older checkpoint with only 180,000 steps.
+Table 3. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS and NER is CoNLL 2002. POS and NER used max length 128 and batch size 16. Batch size for XNLI is 32 (max length 256). All models were fine-tuned for 5 epochs, with the exception of XNLI-256, which used 2 epochs. Stepwise used an older checkpoint with only 180,000 steps.
 </caption>
 
 | Model | POS (F1/Acc) | NER (F1/Acc) | XNLI-256 (Acc) |
@@ -232,7 +232,7 @@ Table 3. Metrics for different downstream tasks, comparing our different models
 
 </figure>
 
-Table 4. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS and NER is CoNLL 2002. POS, NER and PAWS-X used max length 512 and batch size 128. Batch size for XNLI 128 for XNLI (length 512) All models were fine-tuned for 5 epochs. Results marked with * indicate a repeated run. Stepwise checkpoint had 204,000 steps during these tests.
+Table 4. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS and NER is CoNLL 2002. POS, NER and PAWS-X used max length 512 and batch size 16. Batch size for XNLI is also 16 (max length 512). All models were fine-tuned for 5 epochs. Results marked with * indicate a repeated run. Stepwise checkpoint had 204,000 steps during these tests.
 </caption>
 
 | Model | POS (F1/Acc) | NER (F1/Acc) | PAWS-X (Acc) | XNLI (Acc) |
@@ -251,7 +251,7 @@ Table 4. Metrics for different downstream tasks, comparing our different models
 
 In addition to the tasks above, we also trained the beta model on the SQuAD dataset, achieving exact match 50.96 and F1 68.74 (sequence length 128). A full evaluation of this task is still pending.
 
-Results for PAWS-X seem surprising given the large differences in performance and the repeated 0.5765 baseline. However, this training was repeated and results seem consistent. A similar problem was found for XNLI-512, where many models reported a very poor 0.3333 accuracy on a first run (and even a second, in the case of BSC-BNE), i.e., chance level for the three-class task. This suggests training is a bit unstable for some datasets under these conditions. Increasing the number of epochs and batch size would be a natural attempt to fix this problem; however, this is not feasible within the project schedule. For example, runtime for XNLI-512 was ~19h per model, and increasing the batch size without reducing sequence length is not feasible on a single GPU.
+Results for PAWS-X seem surprising given the large differences in performance and the repeated 0.5765 baseline. However, this training was repeated and results seem consistent. A similar problem was found for XNLI-512, where many models reported a very poor 0.3333 accuracy on a first run (and even a second, in the case of BSC-BNE), i.e., chance level for the three-class task. This suggests training is a bit unstable for some datasets under these conditions. Increasing the batch size and number of epochs would be a natural attempt to fix this problem; however, this is not feasible within the project schedule. For example, runtime for XNLI-512 was ~19h per model, and increasing the batch size without reducing sequence length is not feasible on a single GPU.
 
 ## Bias and ethics
 
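As a rough illustration of the schedule change described in the first hunk, here is a minimal sketch of restarting the optimizer with a short warmup when switching from the 128 to the 512 sequence-length phase. It assumes an optax/Flax-style setup; only the 500 warmup steps, the 2048/384 batch sizes, and the unchanged learning rate come from the README, while AdamW, the 3e-4 peak rate, and the function names are illustrative placeholders.

```python
# Hedged sketch, not the project's actual training code.
import optax

def fresh_optimizer(warmup_steps: int, peak_lr: float) -> optax.GradientTransformation:
    """AdamW with a linear warmup followed by a constant learning rate."""
    schedule = optax.join_schedules(
        schedules=[
            optax.linear_schedule(init_value=0.0, end_value=peak_lr,
                                  transition_steps=warmup_steps),
            optax.constant_schedule(peak_lr),
        ],
        boundaries=[warmup_steps],
    )
    return optax.adamw(learning_rate=schedule)

# Phase 1: sequence length 128, batch size 2048 (warmup length assumed here).
opt_128 = fresh_optimizer(warmup_steps=500, peak_lr=3e-4)

# Phase 2: a *new* optimizer for sequence length 512, batch size 384, with the
# short 500-step warmup the README quotes; the peak rate stays the same.
opt_512 = fresh_optimizer(warmup_steps=500, peak_lr=3e-4)
```

Re-initializing rather than reusing the old optimizer state means the moment estimates adapt to the new length regime from scratch, which is one plausible reading of why the restart helped the `Gaussian` run.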
 
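The fine-tuning settings quoted in the corrected captions map onto configurations along these lines, assuming the Hugging Face `transformers` Trainer API; the output directories, the learning rate, and the gradient-accumulation figure are hypothetical, since the captions only fix batch sizes, max lengths, and epoch counts (max length itself is applied at tokenization time, not in these arguments).

```python
from transformers import TrainingArguments

# POS/NER on CoNLL 2002 at max length 128 (Table 3): batch size 16, 5 epochs.
args_pos_ner = TrainingArguments(
    output_dir="pos-ner-conll2002",     # hypothetical path
    per_device_train_batch_size=16,
    num_train_epochs=5,
    learning_rate=5e-5,                 # assumed; not stated in the README
)

# XNLI at max length 256 (Table 3): batch size 32, 2 epochs.
args_xnli_256 = TrainingArguments(
    output_dir="xnli-256",
    per_device_train_batch_size=32,
    num_train_epochs=2,
    learning_rate=5e-5,                 # assumed
)

# XNLI at max length 512 (Table 4): batch size 16, 5 epochs. Gradient
# accumulation (not used in the README's runs) would be the usual way to
# raise the effective batch size on a single GPU without shortening sequences.
args_xnli_512 = TrainingArguments(
    output_dir="xnli-512",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,      # hypothetical: effective batch of 64
    num_train_epochs=5,
)
```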