Paulo committed on
Commit
aefcc19
1 Parent(s): 47107c7

Update app.py

Files changed (1)
  1. app.py +6 -5
app.py CHANGED
@@ -63,14 +63,15 @@ st.markdown(
 [Flax/Jax Community Week](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104)
 organised by HuggingFace.
 
-All models are variations of **RoBERTa-base** trained from scratch in **Spanish** using the **mc4 dataset**.
+All models are variations of **RoBERTa-base** trained from scratch in **Spanish** using a sample from the **mc4 dataset**.
 We reduced the dataset size to 50 million documents to keep training times shorter, and also to be able to bias training examples based on their perplexity.
 
-The idea is to favour examples with perplexities that are neither too small (short, repetitive texts) or too long (potentially poor quality).
-* **Random** sampling simply takes documents at random to reduce the dataset size.
-* **Gaussian** rejects documents with a higher probability for lower and larger perplexities, based on a Gaussian function.
+The idea is to favour examples with perplexities that are neither too small (short, repetitive texts) nor too large (potentially poor quality). There are three versions of the sampling procedure (producing three different series of models):
+* **Random** sampling is the control baseline and simply takes documents at random with uniform probability to reduce the dataset size.
+* **Gaussian** rejects documents with higher probability for very low and very high perplexities, by weighting the perplexity distribution with a Gaussian function.
+* **Stepwise** applies four different sampling probabilities, one to each quartile of the perplexity distribution.
 
-The first models have been trained (250.000 steps) on sequence length 128, and training for Gaussian changed to sequence length 512 for the last 25.000 training steps.
+The first models were trained (250,000 steps) on sequence length 128; training for Gaussian then switched to sequence length 512 for the last 25,000 training steps to yield another version.
 
 Please read our [full report](https://huggingface.co/bertin-project/bertin-roberta-base-spanish) for more details on the methodology and metrics on downstream tasks.
 """