Pablogps committed
Commit 30b8bae
1 Parent(s): 25190f4

Update README.md

Files changed (1)
  1. README.md +15 -2
README.md CHANGED
@@ -363,6 +363,17 @@ The differences in performance for models trained using different data-sampling
 
 As already mentioned in the Training details section, the methodology used to extend sequence length during training is critical. The Random-sampling model took an important hit in performance in this process, while Gaussian-512 ended up with better metrics than Gaussian-128, in both the main masked-language task and the downstream datasets. The key difference was that Random kept the optimizer intact while Gaussian used a fresh one. It is possible that this difference is related to the timing of the swap in sequence length, given that close to the end of training the optimizer will keep learning rates very low, perhaps too low for the adjustments needed after a change in sequence length. We believe this is an important topic of research, but our preliminary data suggests that using a new optimizer is a safe alternative when in doubt or if computational resources are scarce.
 
+ # Lessons and next steps
+
+ The BERTIN project has been a challenge for many reasons. Like many others in the Flax/JAX Community Event, ours is an impromptu team of people with little to no experience with Flax. Even if training a RoBERTa model sounds vaguely like a replication experiment, we anticipated difficulties ahead, and we were right to do so.
+
+ New tools always require a period of adaptation in the workflow. For instance, lacking (to the best of our knowledge) a monitoring tool equivalent to nvidia-smi, simple procedures like optimizing batch sizes become troublesome. Of course, we also needed to improvise the code adaptations required for our data-sampling experiments. Moreover, this re-conceptualization of the project required that we run many training processes during the event, which is one more reason why saving and restoring checkpoints was a must for our success (another being our planned switch from 128 to 512 sequence length). However, such code was not available at the start of the Community Event. At some point, code to save checkpoints was released, but not code to restore them and continue training (at least, we are not aware of such an update). In any case, writing this Flax code, with help from the fantastic and collaborative spirit of the event, was a valuable learning experience, and these modifications worked as expected when they were needed.
+
+ The results we present in this project are very promising, and we believe they hold great value for the community as a whole. However, to make the most of our work, some next steps would be desirable.
+
+ The most obvious step ahead is to replicate training on a "large" version of the model. This was not possible during the event due to our need for faster iterations. We should also explore the impact of our proposed sampling methods in finer detail; in particular, further experimentation is needed on the effect of the Gaussian parameters. Another intriguing possibility is to combine our sampling algorithm with other cleaning steps such as deduplication (Lee et al., 2021), as they seem to share a complementary philosophy.
+
+
 # Conclusions
 
 With roughly 10 days' worth of access to 3xTPUv3-8, we have achieved remarkable results, surpassing the previous state of the art in a few tasks and even improving document classification over models trained on massive supercomputers with very large, private, and highly curated datasets.
@@ -394,8 +405,10 @@ Given our good results, on par with those of large corporations, we hope our wor
 
 ## References
 
- - CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data, Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, Edouard Grave, Proceedings of the 12th Language Resources and Evaluation Conference (LREC), p. 4003-4012, May 2020.
+ - Wenzek et al. (2020). CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data. Proceedings of the 12th Language Resources and Evaluation Conference (LREC), pp. 4003-4012, May 2020.
 
- - Heafield, K. (2011). KenLM: faster and smaller language model queries. In Proceedings of the EMNLP2011 Sixth Workshop on Statistical Machine Translation.
+ - Heafield, K. (2011). KenLM: faster and smaller language model queries. Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation.
 
+ - Lee et al. (2021). Deduplicating Training Data Makes Language Models Better.
+
 - Liu et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach.
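
The remark in the first hunk about swapping in a fresh optimizer when extending the sequence length comes down to very little Flax/optax code. The sketch below is illustrative only, not the project's actual training script: `model` and `params` stand for whatever module and 128-length weights are being carried over, and the AdamW hyperparameters are placeholders.

```python
# Minimal sketch: keep the weights trained at sequence length 128, but wrap
# them in a brand-new optimizer (and therefore fresh optimizer state) before
# continuing training at sequence length 512.

import optax
from flax.training import train_state


def fresh_state_for_longer_sequences(model, params, learning_rate=1e-4):
    """Re-wrap existing `params` with a newly initialized AdamW optimizer.

    `model` is any Flax module and `params` are the weights from the shorter
    sequence-length phase; the learning rate and betas are placeholders.
    """
    tx = optax.adamw(learning_rate=learning_rate, b1=0.9, b2=0.98,
                     weight_decay=0.01)
    # TrainState.create initializes the optimizer state from scratch, so the
    # Adam moments and the internal step count start at zero rather than
    # being carried over from the end of the previous phase.
    return train_state.TrainState.create(apply_fn=model.__call__,
                                         params=params, tx=tx)
```

Because any learning-rate schedule is driven by the step count stored inside the optimizer state, recreating that state also restarts the schedule, which is precisely the effect discussed above.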
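
The checkpointing work described in the Lessons and next steps section can be approximated with the utilities in `flax.training.checkpoints`. The snippet below is a generic save/restore pattern rather than the code written during the event; the directory name and the `keep` count are arbitrary choices.

```python
# Generic save/restore pattern; `state` is assumed to be a
# flax.training.train_state.TrainState holding both params and optimizer state.

from flax.training import checkpoints

CKPT_DIR = "./checkpoints"  # placeholder directory


def save(state, step):
    # Serializes the whole TrainState so a run can be resumed later, for
    # example after switching to a longer sequence length.
    checkpoints.save_checkpoint(CKPT_DIR, target=state, step=step, keep=3)


def restore(state_template):
    # Fills `state_template` (a freshly created TrainState with the same
    # structure) with the values from the most recent checkpoint on disk.
    return checkpoints.restore_checkpoint(CKPT_DIR, target=state_template)
```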
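
For readers who want to experiment with the Gaussian parameters mentioned as future work, a perplexity-based sampler of this family can be prototyped in a few lines. The center, width, and scaling factor below are placeholders rather than the values used for the released models, and the perplexities are assumed to come from a KenLM model scored in a CCNet-style pipeline.

```python
# Prototype of Gaussian perplexity sampling: documents whose KenLM perplexity
# is close to a chosen center are kept with high probability, while documents
# far from it (very low or very high perplexity) are downsampled.

import numpy as np


def gaussian_keep_probability(perplexity, center=800.0, width=200.0, factor=0.8):
    """Keep-probability in (0, factor] that peaks at perplexity == center."""
    return factor * np.exp(-((perplexity - center) ** 2) / (2 * width ** 2))


def sample_documents(docs_with_ppl, seed=0):
    """Yield texts from (text, perplexity) pairs, biased toward the center."""
    rng = np.random.default_rng(seed)
    for text, ppl in docs_with_ppl:
        if rng.random() < gaussian_keep_probability(ppl):
            yield text
```

A deduplication step in the spirit of Lee et al. (2021) could then be chained before or after this filter.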