Pablogps committed on
Commit
72f4884
1 Parent(s): aa5a58e

Update README.md

  1. README.md +14 -12
README.md CHANGED
@@ -134,6 +134,8 @@ for config in ("random", "stepwise", "gaussian"):
  <caption>Figure 6. Experimental perplexity distribution of the sampled mc4-es after applying Random sampling.</caption>
  </figure>


  ### Training details

@@ -211,26 +213,26 @@ For simplicity, we will abbreviate the different models as follows:
  <figure>

  <caption>
- Table 3. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS and NER is CoNLL 2002. POS, NER and PAWS-X used max length 512 and batch size 8. Batch size for XNLI (length 256) is 32, while we needed to use 16 for XNLI (length 512). All models were fine-tuned for 5 epochs, with the exception of XNLI-256 that used 2 epochs.
  </caption>

  | Model | POS (F1/Acc) | NER (F1/Acc) | PAWS-X (Acc) | XNLI-256 (Acc) | XNLI-512 (Acc) |
- |--------------|-------------------------|----------------------|--------------|-----------------|--------------|
- | BERT-m | 0.9629 / 0.9687 | 0.8539 / 0.9779 | 0.5765 | 0.7852 | WIP |
- | BERT-wwm | 0.9642 / 0.9700 | 0.8579 / 0.9783 | 0.8720 | **0.8186** | WIP |
- | BSC-BNE | 0.9659 / 0.9707 | 0.8700 / 0.9807 | 0.5765 | 0.8178 | WIP |
- | Beta | 0.9638 / 0.9690 | 0.8725 / 0.9812 | 0.5765 | — | 0.3333 |
- | Random | 0.9656 / 0.9704 | 0.8704 / 0.9807 | 0.8800 | 0.7745 | 0.7795 |
- | Stepwise | 0.9656 / 0.9707 | 0.8705 / 0.9809 | 0.8825 | 0.7820 | 0.7799 |
- | Gaussian | 0.9662 / 0.9709 | **0.8792 / 0.9816** | 0.8875 | 0.7942 | 0.7843 |
- | Random-512 | 0.9660 / 0.9707 | 0.8616 / 0.9803 | 0.6735 | 0.7723 | 0.7799 |
- | Gaussian-512 | **0.9662 / 0.9714** | **0.8764 / 0.9819** | **0.8965** | 0.7878 | 0.7843 |

  </figure>

  In addition to the tasks above, we also trained the beta model on the SQUAD dataset, achieving exact match 50.96 and F1 68.74 (sequence length 128). A full evaluation of this task is still pending.

- Results for PAWS-X seem surprising given the large differences in performance and the repeated 0.5765 baseline. However, this training was repeated and results seem consistent. Perhaps this (as well as the 0.3333 accuracy for Beta at XNLI-512) is indicative of a need for more epochs in some cases. However, this is not always feasible. For example, runtime for XNLI-512 was ~19h per model.

  ## Bias and ethics

 
  <caption>Figure 6. Experimental perplexity distribution of the sampled mc4-es after applying Random sampling.</caption>
  </figure>

+ Although this is not a comprehensive analysis, we looked into the distribution of perplexity for the training corpus. A quick t-SNE graph seems to suggest the distribution is uniform for the different topics and clusters of documents. The interactive plot (**perplexity_colored_embeddings.html**) is available in the **images** folder.
+

  ### Training details

 
  <figure>

  <caption>
+ Table 3. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS and NER is CoNLL 2002. POS, NER and PAWS-X used max length 512 and batch size 8. Batch size for XNLI (length 256) is 32, while we needed to use 16 for XNLI (length 512). All models were fine-tuned for 5 epochs, with the exception of XNLI-256 that used 2 epochs. Results marked with * indicate a repetition.
  </caption>

  | Model | POS (F1/Acc) | NER (F1/Acc) | PAWS-X (Acc) | XNLI-256 (Acc) | XNLI-512 (Acc) |
+ |--------------|-------------------------|----------------------|--------------|-----------------|--------------|
+ | BERT-m | 0.9629 / 0.9687 | 0.8539 / 0.9779 | 0.5765* | 0.7852 | 0.7606 |
+ | BERT-wwm | 0.9642 / 0.9700 | 0.8579 / 0.9783 | 0.8720* | **0.8186** | 0.8012* |
+ | BSC-BNE | 0.9659 / 0.9707 | 0.8700 / 0.9807 | 0.5765* | 0.8178 | 0.3333* |
+ | Beta | 0.9638 / 0.9690 | 0.8725 / 0.9812 | 0.5765* | — | 0.7751* |
+ | Random | 0.9656 / 0.9704 | 0.8704 / 0.9807 | 0.8800* | 0.7745 | 0.7795 |
+ | Stepwise | 0.9656 / 0.9707 | 0.8705 / 0.9809 | 0.8825* | 0.7820 | 0.7799 |
+ | Gaussian | 0.9662 / 0.9709 | **0.8792 / 0.9816** | 0.8875* | 0.7942 | 0.7843 |
+ | Random-512 | 0.9660 / 0.9707 | 0.8616 / 0.9803 | 0.6735* | 0.7723 | 0.7799 |
+ | Gaussian-512 | **0.9662 / 0.9714** | **0.8764 / 0.9819** | **0.8965** * | 0.7878 | 0.7843 |

  </figure>

  In addition to the tasks above, we also trained the beta model on the SQUAD dataset, achieving exact match 50.96 and F1 68.74 (sequence length 128). A full evaluation of this task is still pending.

+ Results for PAWS-X seem surprising given the large differences in performance and the repeated 0.5765 baseline. However, this training was repeated and results seem consistent. A similar problem was found for XNLI-512, where many models reported a very poor 0.3333 accuracy on a first run (and even a second, in the case of BSC-BNE). This suggests training is somewhat unstable for some datasets under these conditions. Increasing the number of epochs seems like a natural way to address this problem; however, this is not feasible within the project schedule. For example, runtime for XNLI-512 was ~19h per model.

  ## Bias and ethics