Update README.md
README.md
<caption>Figure 6. Experimental perplexity distribution of the sampled mc4-es after applying Random sampling.</caption>
</figure>

Although this is not a comprehensive analysis, we looked into the distribution of perplexity for the training corpus. A quick t-SNE graph suggests that the distribution is uniform across the different topics and clusters of documents. The interactive plot (**perplexity_colored_embeddings.html**) is available in the **images** folder. This is important since, in principle, a perplexity-biased sampling method could introduce undesired biases if perplexity happened to be correlated with some other quality of our data.
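As a sketch of the kind of check described above (not the project's actual plotting code; the synthetic data and variable names are purely illustrative), one can project document embeddings to 2-D with t-SNE and color each point by its language-model perplexity:

```python
# Illustrative sketch: does document perplexity correlate with topic clusters?
# Stand-in data replaces real document embeddings and perplexity scores.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# 200 "documents" with 50-dim embeddings and one perplexity score each.
embeddings = rng.normal(size=(200, 50))
doc_perplexities = rng.lognormal(mean=6.0, sigma=0.5, size=200)

# Project to 2-D. Note: TSNE's own `perplexity` parameter is a t-SNE
# hyperparameter, unrelated to language-model perplexity.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
```

Coloring the resulting scatter plot by `doc_perplexities` (e.g. matplotlib's `scatter(coords[:, 0], coords[:, 1], c=doc_perplexities)`) makes any correlation between clusters and perplexity visible at a glance.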
### Training details
The results we present in this project are very promising, and we believe they hold great value for the community as a whole. However, to make the most of our work, some further steps would be desirable.

The most obvious next step is to replicate training on a "large" version of the model. This was not possible during the event because we needed faster iterations. We should also explore the impact of our proposed sampling methods in finer detail; in particular, further experimentation is needed on the effect of the Gaussian parameters. If perplexity-based sampling were to become a common technique, it would be important to look carefully into the possible biases it might introduce. Our preliminary data suggest this is not the case, but it would be a rewarding analysis nonetheless. Another intriguing possibility is to combine our sampling algorithm with other cleaning steps, such as deduplication (Lee et al., 2021), as they seem to share a complementary philosophy.
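As an illustration of what such a sampling scheme might look like (a minimal sketch under assumed parameter names, not the project's implementation; `mu` and `sigma` stand in for the Gaussian parameters discussed above), each document can be kept with probability given by a Gaussian weight over its perplexity, favoring mid-perplexity documents over both extremes:

```python
# Sketch of Gaussian perplexity sampling. Illustrative only: the exact
# weighting function and parameters used in the project may differ.
import math
import random

def gaussian_weight(perplexity: float, mu: float, sigma: float) -> float:
    """Unnormalized Gaussian weight, peaking at the target perplexity mu."""
    return math.exp(-((perplexity - mu) ** 2) / (2 * sigma ** 2))

def sample_documents(docs, perplexities, mu, sigma, seed=0):
    """Keep each document with probability equal to its Gaussian weight."""
    rng = random.Random(seed)
    return [
        doc
        for doc, ppl in zip(docs, perplexities)
        if rng.random() < gaussian_weight(ppl, mu, sigma)
    ]
```

Under this scheme, `mu` and `sigma` control which band of the perplexity distribution is favored, which is why their settings deserve the further experimentation mentioned above.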
# Conclusions