bertin-project
/

bertin-roberta-base-spanish

Inference Endpoints

Model card Files Files and versions Metrics Training metrics Community

edugp commited on Jul 24, 2021

Commit

15b1a84

•

1 Parent(s): cdd23d1

Fix explanation on link to mUSE model

Files changed (1) hide show

README.md +1 -1

README.md CHANGED Viewed

@@ -134,7 +134,7 @@ for config in ("random", "stepwise", "gaussian"):
 <caption>Figure 6. Experimental perplexity distribution of the sampled mc4-es after applying Random sampling.</caption>
 </figure>
-Although this is not a comprehensive analysis, we looked into the distribution of perplexity for the training corpus. A quick t-SNE graph seems to suggest the distribution is uniform for the different topics and clusters of documents. The interactive plot (**perplexity_colored_embeddings.html**) is available in the **images** folder. This graph was generated used a distilled version of [mUSE](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1) to embed a random subset of 20,000 examples. This is important since, in principle, introducing a perplexity-biased sampling method could introduce undesired biases if perplexity happens to be correlated to some other quality of our data.
 ### Training details

 <caption>Figure 6. Experimental perplexity distribution of the sampled mc4-es after applying Random sampling.</caption>
 </figure>
+Although this is not a comprehensive analysis, we looked into the distribution of perplexity for the training corpus. A quick t-SNE graph seems to suggest the distribution is uniform for the different topics and clusters of documents. The interactive plot (**perplexity_colored_embeddings.html**) is available in the **images** folder. This graph was generated used [a distilled version of multilingual USE](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1) to embed a random subset of 20,000 examples. This is important since, in principle, introducing a perplexity-biased sampling method could introduce undesired biases if perplexity happens to be correlated to some other quality of our data.
 ### Training details