Update README.md
README.md
@@ -130,7 +130,7 @@ for config in ("random", "stepwise", "gaussian"):

### Training details

We then used the same setup and hyperparameters as [Liu et al. (2019)](https://arxiv.org/abs/1907.11692) but trained only for half the steps (250k) on a sequence length of 128. In particular, `Gaussian` trained for the full 250k steps, while `Random` was stopped at 230k and `Stepwise` at 180k (a decision based on an analysis of training performance and the computational resources available at the time).
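
For quick reference, the step budgets above can be summarised in a few lines of Python. The dictionary keys mirror the sampling names used in this README and are illustrative only, not actual configuration keys from the training scripts:

```python
# Step budgets for the first training phase (sequence length 128), as described
# above. The keys are illustrative labels, not real configuration names.
steps_at_128 = {
    "gaussian": 250_000,  # trained for the full 250k steps
    "random": 230_000,    # stopped early at 230k
    "stepwise": 180_000,  # stopped early at 180k
}

for config in ("random", "stepwise", "gaussian"):
    print(f"{config}: {steps_at_128[config]:,} steps at sequence length 128")
```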

Then, we continued training the most promising model for a few more steps (~25k) on sequence length 512. We tried two strategies for this, since it is not easy to find clear details about this change in the literature. It turns out this decision had a big impact on the final performance.

@@ -140,10 +140,12 @@ For `Random` sampling we trained with seq len 512 during the last 20 steps of th

![](./images/random_512.jpg)

<caption>Figure 7. Training profile for Random sampling. Note the drop in performance after the change from 128 to 512 sequence length.</caption>
</figure>

For `Gaussian` sampling we started a new optimizer after 230 steps with sequence length 128, using a short warmup interval. Results are much better using this procedure. We do not have a graph since training needed to be restarted several times; however, the final accuracy was 0.6873, compared to 0.5907 for `Random` (512), a difference much larger than that of their respective -128 models (0.6520 for `Random`, 0.6608 for `Gaussian`).
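
As a rough illustration of this restart, a minimal sketch assuming an optax-style optimizer (which may differ from the actual training stack) could look like the following. The peak learning rate and the dummy parameter tree are placeholders; only the 500 warmup steps come from the values reported below.

```python
import jax.numpy as jnp
import optax

# Sketch only: start a fresh optimizer with a short linear warmup for the
# sequence-length-512 phase. The peak learning rate below is a placeholder;
# the 500 warmup steps match the value reported in this README.
peak_lr = 6e-4
warmup_steps = 500

schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=peak_lr,
                              transition_steps=warmup_steps),
        optax.constant_schedule(peak_lr),
    ],
    boundaries=[warmup_steps],
)

optimizer = optax.adamw(learning_rate=schedule)

# `params` stands in for the weights restored from the 128-length checkpoint.
params = {"dense": {"kernel": jnp.zeros((768, 768))}}

# Initialising fresh optimizer state discards the Adam moments accumulated
# during the 128-length phase, which is what "a new optimizer" means here.
opt_state = optimizer.init(params)
```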

Batch size was 256 for training with sequence length 128, and 48 for sequence length 512, with no change in learning rate. The 512 run used 500 warmup steps.
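
In configuration terms, the two phases differ only in the values below (key names are illustrative; the shared learning rate is omitted since its value is not restated here):

```python
# Illustrative summary of the two training phases; the learning rate is the
# same in both and is therefore not listed.
phases = {
    "seq_len_128": {"sequence_length": 128, "batch_size": 256},
    "seq_len_512": {"sequence_length": 512, "batch_size": 48, "warmup_steps": 500},
}
```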
## Results

@@ -388,3 +390,5 @@ Given our good results, on par with those of large corporations, we hope our wor

- CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data, Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, Edouard Grave, Proceedings of the 12th Language Resources and Evaluation Conference (LREC), pp. 4003-4012, May 2020.
- Heafield, K. (2011). KenLM: faster and smaller language model queries. In Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation.
- Liu et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692.