Pablogps committed on
Commit
2eccc55
1 Parent(s): 27b6c61

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -134,7 +134,7 @@ for config in ("random", "stepwise", "gaussian"):
  <caption>Figure 6. Experimental perplexity distribution of the sampled mc4-es after applying Random sampling.</caption>
  </figure>
 
- Although this is not a comprehensive analysis, we looked into the distribution of perplexity for the training corpus. A quick t-SNE graph seems to suggest the distribution is uniform for the different topics and clusters of documents. The [interactive plot](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/raw/main/images/perplexity_colored_embeddings.html) was generated using [a distilled version of multilingual USE](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1) to embed a random subset of 20,000 examples and each example is colored based on its perplexity. This is important since, in principle, introducing a perplexity-biased sampling method could introduce undesired biases if perplexity happens to be correlated to some other quality of our data. The code required to replicate this plote is available at `tsne_plot.py scrip`.
+ Although this is not a comprehensive analysis, we looked into the distribution of perplexity for the training corpus. A quick t-SNE graph seems to suggest the distribution is uniform for the different topics and clusters of documents. The [interactive plot](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/raw/main/images/perplexity_colored_embeddings.html) was generated using [a distilled version of multilingual USE](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1) to embed a random subset of 20,000 examples and each example is colored based on its perplexity. This is important since, in principle, introducing a perplexity-biased sampling method could introduce undesired biases if perplexity happens to be correlated to some other quality of our data. The code required to replicate this plote is available at `tsne_plot.py` scrip.
 
 
  ### Training details