Pablogps committed
Commit
cd3cf59
1 Parent(s): 5ba7dcf

Update README.md

Files changed (1)
  1. README.md +7 -6
README.md CHANGED
@@ -120,11 +120,11 @@ for split in ("random", "stepwise", "gaussian"):
 
We then used the same setup as Liu et al. (2019) but trained only for half the steps (250k) on a sequence length of 128. Then, we continued training the most promising model for 25k more on sequence length 512.
 
- Our first test, tagged `beta` in this repository, refers to an initial experiment using `stepwise` but a small factor to oversample everything.
-
## Results
 
- Our first test, tagged `beta` in this repository, refers to an initial experiment using `stepwise` on 128 sequence lengths but a small `factor` to oversample everything. During the community event, the Barcelona Supercomputing Center in association with the National Library of Spain released RoBERTa base and large models trained on 200M documents (570GB) of high quality data clean using 100 nodes with 48 CPU cores of MareNostrum 4 during 96h. At the end of the process they were left with 2TB of clean data at the document level that further cleaned up to the final 570GB. In all our experiments and procedures, we had access to 3xTPUv3-8 for 10 days to do cleaning, sampling, taining, and evaluation. The BSC team evaluated our early release of the model `beta` and the results can be seen in Table 1. We are no waiting for the evaluation on the rest of our experiments to finish. The final models were trained on different number of steps and sequence lengths and achieve different masked word prediction accuracies. Some of the datasets used for evaluation are not freely available, therefore we are not in position to verify the figures.
+ Our first test, tagged `beta` in this repository, refers to an initial experiment using `stepwise` on a sequence length of 128, but with a small `factor` to oversample everything. During the community event, the Barcelona Supercomputing Center (BSC), in association with the National Library of Spain, released RoBERTa base and large models trained on 200M documents (570GB) of high-quality data, cleaned using 100 nodes with 48 CPU cores of MareNostrum 4 for 96 hours. At the end of the process they were left with 2TB of clean data at the document level, which was further cleaned up to the final 570GB. In all our experiments and procedures, we had access to 3xTPUv3-8 for 10 days to do cleaning, sampling, training, and evaluation. The BSC team evaluated our early release of the model `beta`, and the results can be seen in Table 1.
+
+ Our final models were trained for a different number of steps and on different sequence lengths, and they achieve different (higher) masked-word prediction accuracies. Despite these limitations, it is interesting to see the results the BSC team obtained using the early version of our model. Note that some of the datasets they used for evaluation are not freely available, so we are not in a position to verify the figures.
 
<figure>
 
@@ -145,11 +145,12 @@ Our first test, tagged `beta` in this repository, refers to an initial experimen
 
# Conclusions
 
- With roughly 10 days to access to TPUs, we have achieve remarkable results surpassing previous state of the art in a few tasks, and even improving document classification on models trained in massive supercomputers with humongous private and highly curated datasets.
+ With roughly 10 days' worth of access to 3xTPUv3-8, we have achieved remarkable results, surpassing the previous state of the art in a few tasks and even improving document classification over models trained on massive supercomputers with very large, private, and highly curated datasets.
 
- The expericence has been incredible and we feel this kind of events provide an amazing opportunity for small teams on low or non-existent budgets to learn how the big players in the field pre-train their models. The trade-off between learning and experimenting, and being beta-testers of libraries (Flax/JAX) and infrastructure (TPU VMs) is a marginal cost to pay compared to the benefits of access.
+ The experience has been incredible, and we feel that events like this provide an amazing opportunity for small teams on low or non-existent budgets to learn how the big players in the field pre-train their models. The trade-off of learning and experimenting while being beta testers of libraries (Flax/JAX) and infrastructure (TPU VMs) is a marginal cost to pay compared to the benefits such access has to offer.
 
- We hope our work set the basis for more small teams playing and experimenting with language models training on small subsets of data and for shorter times, since the performance of our models is on par with those trained on big machines for long times.
+ We hope our work will lay the groundwork for more small teams to play and experiment with training language models on small subsets of data and with reduced training times, since the performance of our models is on par with that of models trained on big machines for longer times.
+
## Team members
 
- Javier de la Rosa ([versae](https://huggingface.co/versae))