Pablogps committed on
Commit
c6528cc
1 Parent(s): 0750d4e

Update README.md

Files changed (1)
  1. README.md +4 -2
README.md CHANGED
@@ -139,9 +139,9 @@ Although this is not a comprehensive analysis, we looked into the distribution o
 
 ### Training details
 
- We then used the same setup and hyperparameters as [Liu et al. (2019)](https://arxiv.org/abs/1907.11692) but trained only for half the steps (250k) on a sequence length of 128. In particular, `Gaussian` trained for the full 250k steps, while `Random` was stopped at 230k. `Stepwise` had to be stopped at 180k initially to allow downstream tests (sequence length 128), but was later resumed. At the time of the tests with sequence length 512 it had reached 204k steps, which improved performance substantially.
+ We then used the same setup and hyperparameters as [Liu et al. (2019)](https://arxiv.org/abs/1907.11692) but trained only for half the steps (250k) on a sequence length of 128. In particular, `Gaussian` trained for the full 250k steps, while `Random` was stopped at 230k. `Stepwise` had to be stopped at 180k initially to allow downstream tests (sequence length 128), but was later resumed and finished the 250k steps. At the time of the tests with sequence length 512 it had reached 204k steps, which improved performance substantially.
 
- Then, we continued training the most promising model for a few more steps (~25k) on sequence length 512. We tried two strategies for this, since it is not easy to find clear details about this change in the literature. It turns out this decision had a big impact on the final performance.
+ Then, we continued training the most promising models for a few more steps (~25k) on sequence length 512. We tried two strategies for this, since it is not easy to find clear details about this change in the literature. It turns out this decision had a big impact on the final performance.
 
 For `Random` sampling we trained with sequence length 512 during the last 20k steps of the 250k training steps, keeping the optimizer state intact. Results for this are underwhelming, as seen in Figure 7:
 
@@ -154,6 +154,8 @@ For `Random` sampling we trained with sequence length 512 during the last 20k st
 
 For `Gaussian` sampling we started a new optimizer after 230k steps with sequence length 128, using a short warmup interval. Results are much better using this procedure. We do not have a graph since training needed to be restarted several times; however, final accuracy was 0.6873, compared to 0.5907 for `Random` (512), a difference much larger than that between their respective -128 models (0.6520 for `Random`, 0.6608 for `Gaussian`).
 
+ A 512 version of `Stepwise` is currently training.
+
 Batch size was 2048 for training with sequence length 128, and 384 for sequence length 512, with no change in learning rate. Warmup for the 512 runs was 500 steps.
 
 ## Results
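
For a concrete picture of the two sequence-length-512 continuation strategies the diff describes, here is a minimal sketch assuming an optax-style training setup. The commit does not include the project's training scripts, so the peak learning rate, the Adam hyperparameters (chosen in the spirit of Liu et al., 2019), and the function names below are illustrative placeholders, not the project's code. It contrasts carrying over the optimizer state from the 128-length phase (`Random`) with re-initializing a fresh optimizer plus the short 500-step warmup (`Gaussian`).

```python
# Illustrative sketch only: assumes an optax-based setup; peak LR and Adam
# hyperparameters are placeholders following Liu et al. (2019), not this repo.
import optax

PEAK_LR = 6e-4            # assumed placeholder, not taken from this commit
WARMUP_STEPS_512 = 500    # "Warmup for the 512 runs was 500 steps"
TOTAL_STEPS_512 = 25_000  # "~25k" continuation steps at sequence length 512

def fresh_optimizer_512(params):
    """`Gaussian` strategy: new optimizer state plus a short warmup at length 512."""
    warmup = optax.linear_schedule(0.0, PEAK_LR, transition_steps=WARMUP_STEPS_512)
    decay = optax.linear_schedule(PEAK_LR, 0.0,
                                  transition_steps=TOTAL_STEPS_512 - WARMUP_STEPS_512)
    schedule = optax.join_schedules([warmup, decay], boundaries=[WARMUP_STEPS_512])
    optimizer = optax.adamw(learning_rate=schedule, b1=0.9, b2=0.98,
                            eps=1e-6, weight_decay=0.01)
    # Optimizer state is rebuilt from scratch for the 512 phase.
    return optimizer, optimizer.init(params)

def continued_optimizer_512(optimizer, opt_state):
    """`Random` strategy: keep the optimizer and its state from the 128-length
    phase; only the data pipeline switches to length 512 (batch size 384)."""
    return optimizer, opt_state
```

The point the README stresses is simply the design choice: re-initializing the optimizer with its own warmup for the 512 phase worked much better than carrying the 128-phase optimizer state over unchanged.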