Update README.md
<caption>Figure 4. Expected perplexity distributions of the sample mc4-es after applying Gaussian function.</caption>
</figure>
Figure 5 shows the actual perplexity distributions of the generated 50M subsets for each of the executed subsampling procedures. For reproducibility, all subsets can be easily accessed via the `bertin-project/mc4-es-sampled` dataset. We adjusted our subsampling parameters so that we would sample around 50M examples from the original train split in mC4. However, when these parameters were applied to the validation split, they resulted in too few examples (~400k samples). Therefore, for validation purposes, we extracted 50k samples at each evaluation step from our own train dataset on the fly. Crucially, those examples are then excluded from training, so that we never validate on previously seen data. In the `bertin-project/mc4-es-sampled` dataset, the train split contains the full 50M samples, while the validation split is taken as-is from the original `mc4`.
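The exclusion logic described above can be sketched as follows. This is a minimal illustration in plain Python, not the project's actual implementation: it holds out a fixed number of examples for validation and guarantees they never appear in the training set (the function name and sizes are illustrative; the project uses 50k held-out samples from a 50M train split).

```python
import random

def split_on_the_fly(examples, n_valid, seed=0):
    """Hold out n_valid examples for validation; train on the rest.

    Hypothetical sketch of on-the-fly validation extraction with
    exclusion from training, not the project's actual code.
    """
    rng = random.Random(seed)
    valid_idx = set(rng.sample(range(len(examples)), n_valid))
    valid = [ex for i, ex in enumerate(examples) if i in valid_idx]
    train = [ex for i, ex in enumerate(examples) if i not in valid_idx]
    return train, valid

# Toy run: 1,000 examples, 50 held out for validation.
train, valid = split_on_the_fly(list(range(1000)), 50)
assert len(train) == 950 and len(valid) == 50
assert not set(train) & set(valid)  # no overlap: never validate on seen data
```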
```python
from datasets import load_dataset