Pablogps committed on
Commit 17ecec6
1 Parent(s): e1bf88a

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -89,7 +89,7 @@ We adjusted the `factor` parameter of the `Stepwise` function, and the `factor`
 
 <figure>
 
-![](./images/perp-resample-stepwise.png)
+![](./images/perp-resample.png)
 
 <caption>Figure 3. Expected perplexity distributions of the sample mc4-es after applying the Stepwise function.</caption>
 </figure>
@@ -139,7 +139,7 @@ Although this is not a comprehensive analysis, we looked into the distribution o
 
 ### Training details
 
-We then used the same setup and hyperparameters as [Liu et al. (2019)](https://arxiv.org/abs/1907.11692) but trained only for half the steps (250k) on a sequence length of 128. In particular, `Gaussian` trained for the 250k steps, while `Random` was stopped at 230k and `Stepwise` at 180k (this was a decision based on an analysis of training performance and the computational resources available at the time).
+We then used the same setup and hyperparameters as [Liu et al. (2019)](https://arxiv.org/abs/1907.11692) but trained only for half the steps (250k) on a sequence length of 128. In particular, `Gaussian` trained for the full 250k steps, while `Random` was stopped at 230k. `Stepwise` was initially stopped at 180k to allow the downstream tests at sequence length 128, but was later resumed; by the time of the sequence-length-512 tests it had reached 204k steps, which improved performance substantially.
 
 Then, we continued training the most promising model for a few steps (~25k) more on sequence length 512. We tried two strategies for this, since it is not easy to find clear details about this change in the literature. It turns out this decision had a big impact in the final performance.
 
 
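The first hunk refers to the `factor` parameter of the `Stepwise` function used for perplexity sampling of mc4-es (Figure 3), but the function itself is not shown in this diff. The following is only a minimal sketch of a quartile-based stepwise weighting; the quartile scheme, the weight values, and the exact meaning of `factor` are assumptions for illustration, not the project's actual implementation.

```python
import numpy as np

def stepwise_weights(perplexities, quartiles, factor=1.5):
    """Illustrative stepwise sampling weights over document perplexities.

    Documents in lower-perplexity buckets receive progressively larger
    weights, scaled by `factor`. Both the bucketing and the role of
    `factor` are assumptions made for this sketch.
    """
    perplexities = np.asarray(perplexities, dtype=float)
    # Assign each document to a quartile bucket (0 = lowest perplexity).
    buckets = np.digitize(perplexities, quartiles)
    # Higher `factor` -> stronger preference for low-perplexity buckets.
    weights = factor ** (len(quartiles) - buckets)
    return weights / weights.sum()

# Example: sample 2 documents, favouring lower perplexity.
rng = np.random.default_rng(0)
ppl = [23.1, 87.4, 310.0, 45.9, 150.2]
q = np.quantile(ppl, [0.25, 0.5, 0.75])
probs = stepwise_weights(ppl, q, factor=1.5)
picked = rng.choice(len(ppl), size=2, replace=False, p=probs)
```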
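The second hunk describes continuing pre-training of the most promising 128-token checkpoint for roughly 25k more steps at a sequence length of 512. The training script is not part of this diff, so the snippet below is only a hedged sketch of how such a continuation could be set up, written against the PyTorch `Trainer` API rather than the Flax scripts the project actually used; the checkpoint path, data file, batch size, and learning rate are placeholders, not the project's values.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    RobertaForMaskedLM,
    Trainer,
    TrainingArguments,
)

# Placeholder checkpoint and data paths; the real ones differ.
CHECKPOINT = "path/to/seq128-checkpoint"
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = RobertaForMaskedLM.from_pretrained(CHECKPOINT)

# Re-tokenize the corpus at the longer sequence length (512).
raw = load_dataset("text", data_files={"train": "mc4_es_sample.txt"})

def tokenize(batch):
    return tokenizer(
        batch["text"], truncation=True, max_length=512,
        return_special_tokens_mask=True,
    )

train = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="seq512-continuation",
    max_steps=25_000,               # "~25k" additional steps from the README
    per_device_train_batch_size=8,  # placeholder value
    learning_rate=1e-4,             # placeholder value
    save_steps=5_000,
)

Trainer(model=model, args=args, data_collator=collator,
        train_dataset=train, tokenizer=tokenizer).train()
```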