Pablogps commited on
Commit
33bc718
1 Parent(s): eef514b

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +2 -2
README.md CHANGED
@@ -207,7 +207,7 @@ For simplicity, we will abbreviate the different models as follows:
207
  <figure>
208
 
209
  <caption>
210
- Table 3. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS and NER is CoNLL 2002. POS, NER and PAWS-X used max length 512 and batch size 8. Batch size for XNLI (length 256) is 32, while we needed to use 16 for XNLI (length 512).
211
  </caption>
212
 
213
  | Model | POS (F1/Acc) | NER (F1/Acc) | PAWS-X (Acc) | XNLI-256 (Acc) | XNLI-512 (Acc) |
@@ -226,7 +226,7 @@ Table 3. Metrics for different downstream tasks, comparing our different models
226
 
227
  In addition to the tasks above, we also trained the beta model on the SQUAD dataset, achieving exact match 50.96 and F1 68.74 (sequence length 128). A full evaluation of this task is still pending.
228
 
229
- To note: not intense tuning, epochs, etc. Still, good?? PAWS-X: weird (large differences and repeated base value). Repeated and same, with minor differences.Sometimes too short training? XNLI-512, runtime ~19h per model.
230
 
231
  ## Bias and ethics
232
 
207
  <figure>
208
 
209
  <caption>
210
+ Table 3. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS and NER is CoNLL 2002. POS, NER and PAWS-X used max length 512 and batch size 8. Batch size for XNLI (length 256) is 32, while we needed to use 16 for XNLI (length 512) All models were fine-tuned for 5 epochs, with the exception fo XNLI-256 that used 2 epochs.
211
  </caption>
212
 
213
  | Model | POS (F1/Acc) | NER (F1/Acc) | PAWS-X (Acc) | XNLI-256 (Acc) | XNLI-512 (Acc) |
226
 
227
  In addition to the tasks above, we also trained the beta model on the SQUAD dataset, achieving exact match 50.96 and F1 68.74 (sequence length 128). A full evaluation of this task is still pending.
228
 
229
+ Results for PAWS-X seem surprising given the large differences in performance and the repeated 0.5765 baseline. However, this training was repeated and results seem consistent. Perhaps this (as well as the 0.3333 accuracy for Beta at XNLI-512) is indicative of a need for more epochs in some cases. However, this is not always feasible. For example, runtime for XNLI-512 was ~19h per model.
230
 
231
  ## Bias and ethics
232