Pablogps committed on
Commit
14b1ca9
1 Parent(s): f37d879

Update README.md

  1. README.md +1 -1
README.md CHANGED
@@ -94,7 +94,7 @@ We adjusted the `factor` parameter of the `Stepwise` function, and the `factor`
94
  <caption>Figure 4. Expected perplexity distributions of the sample mc4-es after applying Gaussian function.</caption>
95
  </figure>
96
 
97
- Figure 5 shows the actual perplexity distributions of the generated 50M subsets for each of the executed subsampling procedures. All subsets can be easily accessed for reproducibility purposes using the `bertin-project/mc4-es-sampled` dataset. We adjusted our subsampling parameters so that we would sample ~50M examples from the original train split in mC4. However, when these parameters were applied to the validation split they resulted in too few examples (~400k samples). Therefore, for validation purposes, we extracted 50k samples at each evaluation step from our own train dataset on the fly. In the `bertin-project/mc4-es-sampled` dataset, the train split contains the full 50M samples, while validation is retrieved as is from the original `mc4`.
98
 
99
  ```python
100
  from datasets import load_dataset
94
  <caption>Figure 4. Expected perplexity distributions of the sample mc4-es after applying Gaussian function.</caption>
95
  </figure>
96
 
97
+ Figure 5 shows the actual perplexity distributions of the generated 50M subsets for each of the executed subsampling procedures. All subsets can be easily accessed for reproducibility purposes using the `bertin-project/mc4-es-sampled` dataset. We adjusted our subsampling parameters so that we would sample ~50M examples from the original train split in mC4. However, when these parameters were applied to the validation split they resulted in too few examples (~400k samples). Therefore, for validation purposes, we extracted 50k samples at each evaluation step from our own train dataset on the fly. Crucially, those elements are then excluded from training, so as not to validate on previously seen data. In the `bertin-project/mc4-es-sampled` dataset, the train split contains the full 50M samples, while validation is retrieved as is from the original `mc4`.
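The paragraph above does not say how the held-out samples are kept out of training, so the following is only a minimal sketch of one way it could work, assuming the data arrives as a single stream. The function name `split_stream_with_holdout` and the parameters `eval_every` and `eval_size` are hypothetical and do not come from the project's code.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def split_stream_with_holdout(
    stream: Iterable[T], eval_every: int, eval_size: int
) -> Iterator[Tuple[str, List[T]]]:
    """Alternate between training chunks and held-out validation chunks.

    After every `eval_every` training examples, the next `eval_size`
    examples are taken from the same stream for validation. Because they
    are consumed from the shared iterator, they are never yielded as
    training data. (Hypothetical helper, for illustration only.)
    """
    it = iter(stream)
    while True:
        train_chunk = list(islice(it, eval_every))
        if not train_chunk:
            return  # stream exhausted
        yield "train", train_chunk
        eval_chunk = list(islice(it, eval_size))
        if eval_chunk:
            yield "eval", eval_chunk


# Toy usage: integers stand in for documents.
for kind, chunk in split_stream_with_holdout(range(10), eval_every=3, eval_size=2):
    print(kind, chunk)
```

In the setting described above, `eval_size` would be on the order of 50k, and the stream would be the sampled train split rather than a range of integers.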
98
 
99
  ```python
100
  from datasets import load_dataset