bertin-project/bertin-roberta-base-spanish


Pablogps committed on Jul 23, 2021

Commit bb5cf82 · 1 Parent(s): e04a2fb

Update README.md

Files changed (1)

README.md +1 -1


@@ -97,7 +97,7 @@ We adjusted the `factor` parameter of the `Stepwise` function, and the `factor`
 <caption>Figure 4. Expected perplexity distributions of the sample mc4-es after applying Gaussian function.</caption>
 </figure>
-Figure 5 shows the actual perplexity distributions of the generated 50M subsets for each of the executed subsampling procedures. All subsets can be easily accessed for reproducibility purposes using the `bertin-project/mc4-es-sampled` dataset. We adjusted our subsampling parameters so that we would sample ~50M examples from the original train split in mC4. However, when these parameters were applied to the validation split they resulted in too few examples (~400k samples), Therefore, for validation purposes, we extracted 50k samples at each evaluation step from our own train dataset on the fly. Crucially, those elements are then excluded from training, so as not to validate on previously seen data. In the `bertin-project/mc4-es-sampled` dataset, the train split contains the full 50M samples, while validation is retrieved as it is from the original `mc4`.
+Figure 5 shows the actual perplexity distributions of the generated 50M subsets for each of the executed subsampling procedures. All subsets can be easily accessed for reproducibility purposes using the `bertin-project/mc4-es-sampled` dataset. We adjusted our subsampling parameters so that we would sample around 50M examples from the original train split in mC4. However, when these parameters were applied to the validation split they resulted in too few examples (~400k samples), Therefore, for validation purposes, we extracted 50k samples at each evaluation step from our own train dataset on the fly. Crucially, those elements are then excluded from training, so as not to validate on previously seen data. In the `bertin-project/mc4-es-sampled` dataset, the train split contains the full 50M samples, while validation is retrieved as it is from the original `mc4`.
 ```python
 from datasets import load_dataset