versae committed
Commit 201f63a
Parent(s): 418d217

More small changes

Files changed (1):
  1. README.md +22 -16

README.md CHANGED
@@ -252,21 +252,28 @@ In addition to the tasks above, we also trained the [`beta`](https://huggingface
252
 
253
  Results for PAWS-X seem surprising given the large differences in performance. However, this training was repeated to rule out failed runs, and the results seem consistent. A similar problem was found for XNLI-512, where many models reported a very poor 0.3333 accuracy on a first run (and even a second, in the case of BSC-BNE). This suggests that training is somewhat unstable for some datasets under these conditions. Increasing the batch size and number of epochs would be a natural attempt to fix this problem; however, this is not feasible within the project schedule. For example, the runtime for XNLI-512 was ~19h per model, and increasing the batch size without reducing the sequence length is not feasible on a single GPU.
254

255
  ## Bias and ethics
256
 
257
- While a rigorous analysis of our models and datasets for bias was out of the scope of our project (given the very tight schedule and our lack of experience on JAX/FLAX), this issue has still played an important role in our motivation. Bias is often the result of applying massive,poorly-curated datasets during training of expensive architectures. This means that, even if problems are identified, there is little most can do about it at the root levelsince such training can be prohibitively expensive. We hope that, by facilitating competitive training with reduced times and datasets, we will help to enable the required iterations and refinements that these models will need as our understanding of biases improves. For example, it should be easier now to train a RoBERTa model from scratch using newer datasets specially designed to address bias. This is surely an exciting prospect, and we hope that this work will contribute in this challenge.
258
 
259
- Even if a rigorous analysis of bias is difficult, we should not use that excuse to disregard the issue in any project. Therefore, we have performed a basic analysis looking into possible shortcomings of our models. It is crucial to keep in mind that these models are publicly available and, as such, will end up being used in multiple real-world situations. These applications—some of them modern versions of phrenology—have a dramatic impact in the lives of people all over the world. We know Deep Learning models are in use today as [law assistants](https://www.wired.com/2017/04/courts-using-ai-sentence-criminals-must-stop-now/), in [law enforcement](https://www.washingtonpost.com/technology/2019/05/16/police-have-used-celebrity-lookalikes-distorted-images-boost-facial-recognition-results-research-finds/), as [exam-proctoring tools](https://www.wired.com/story/ai-college-exam-proctors-surveillance/) (also [this](https://www.eff.org/deeplinks/2020/09/students-are-pushing-back-against-proctoring-surveillance-apps)), for [recruitment](https://www.washingtonpost.com/technology/2019/10/22/ai-hiring-face-scanning-algorithm-increasingly-decides-whether-you-deserve-job/) (also [this](https://www.technologyreview.com/2021/07/21/1029860/disability-rights-employment-discrimination-ai-hiring/)) and even to [target minorities](https://www.insider.com/china-is-testing-ai-recognition-on-the-uighurs-bbc-2021-5). Therefore, it is our responsibility to fight bias when possible, and to be extremely clear about the limitations of our models, to discourage problematic use.
260
 
261
  ### Bias examples (Spanish)
262
 
263
- Note that this analysis is slightly more difficult to do in Spanish since gender concordance reveals hints beyond masks. Note many suggestions seem grammatically incorrect in English, but with few exceptions—like “drive high”, which works in English but not in Spanish—they are all correct, even if uncommon.
264
 
265
- Results show that bias is apparent even in a quick and shallow analysis like this. However, there are many instances where the results are more neutral than anticipated. For instance, the first option to do the dishes is the son, and pink is nowhere to be found in the colour recommendations for a girl. Women seem to drive “high”, fast, strong and well, but “not a lot”.
266
 
267
- But before we get complacent, the model reminds us that the place of the woman is at home or the bed (!), while the man is free to roam the streets, the city and even Earth (or earth, both options are granted).
268
 
269
- Similar conclusions are derived from examples focusing on race and religion. Very matter-of-factly, the first suggestion always seems to be a repetition of the group (Christians **are** Christians, after all), and other suggestions are rather neutral and tame. However, there are some worrisome proposals. For example, the fourth option for Jews is that they are racist. Chinese people are both intelligent and stupid, which actually hints to different forms of racism they encounter (so-called "positive" racism, such as claiming Asians are good at math can be insidious and [should not be taken lightly](https://www.health.harvard.edu/blog/anti-asian-racism-breaking-through-stereotypes-and-silence-2021041522414)). Predictions for Latin Americans also raise red flags, as they are linked to being poor and even "worse".
270
 
271
  The model also seems to suffer from geographical bias, producing words that are more common in Spain than in other countries. For example, when filling the mask in "My <mask> is a Hyundai Accent", the word "coche" scores higher than "carro" (the words for car in Spain and much of Latin America, respectively), while "auto", which is used in Argentina, doesn't appear in the top 5 choices. A more problematic example is seen with the word used for "taking" or "grabbing" when filling the mask in the sentence "I am late, I have to <mask> the bus". In Spain the word "coger" is used, while in most countries in Latin America the word "tomar" is used instead; there, "coger" means "to have sex". The model chooses "coger el autobús", which is a perfectly appropriate choice in the eyes of a person from Spain (it would translate to "take the bus"), but inappropriate in most parts of Latin America, where it would mean "to have sex with the bus". Another example of geographical bias can be observed in the model's preference for the Spanish word for "drive" over its Latin American counterparts. Even when prompted with the words "carro" and "auto" (used in Latin America for "car"), the model chooses "conducir" (Spain) over "manejar" (Latin America); however, "conducir" scores higher when prompted with "coche" (Spain) than with "carro" or "auto" (Latin America), suggesting that the model has at least some basic understanding of the different ways of speaking Spanish in different parts of the world.
272
 
@@ -413,30 +420,29 @@ Geographical bias
413
 
414
  The performance of our models has been, in general, very good. Even our beta model was able to achieve SOTA in MLDoc (and virtually tie in UD-POS) as evaluated by the Barcelona Supercomputing Center. In the main masked-language task our models reach values between 0.65 and 0.69, which foretells good results for downstream tasks.
415
 
416
- Our analysis of downstream tasks is not yet complete. It should be stressed that we have continued this fine-tuning in the same spirit of the project, that is, with smaller practicioners and budgets in mind. Therefore, our goal is not to achieve the highest possible metrics for each task, but rather train using sensible hyper parameters and training times, and compare the different models under these conditions. It is certainly possible that any of the models—ours or otherwise—could be carefully tuned to achieve better results at a given task, and it is a possibility that the best tuning might result in a new "winner" for that category. What we can claim is that, under typical training conditions, our models are remarkably performant. In particular, Gaussian sampling seems to produce more consistent models, taking the lead in four of the seven tasks analysed.
417
 
418
- The differences in performance for models trained using different data-sampling techniques are consistent. Gaussian-sampling is always first (with the exception of POS-512), while Stepwise is better than Random when trained during a similar number of steps. This proves that the sampling technique is, indeed, relevant.
419
 
420
- As already mentiond in the Training details section, the methodology used to extend sequence length during training is critical. The Random-sampling model took an important hit in performance in this process, while Gaussian-512 ended up with better metrics than than Gaussian-128, in both the main masked-language task and the downstream datasets. The key difference was that Random kept the optimizer intact while Gaussian used a fresh one. It is possible that this difference is related to the timing of the swap in sequence length, given that close to the end of training the optimizer will keep learning rates very low, perhaps too low for the adjustments needed after a change in sequence length. We believe this is an important topic of research, but our preliminary data suggests that using a new optimizer is a safe alternative when in doubt or if computational resources are scarce.
421
 
422
  # Lessons and next steps
423
 
424
- Bertin project has been a challenge for many reasons. Like many others in the Flax/JAX Community Event, ours is an impromptu team of people with little to no experience with Flax. Even if training a RoBERTa model sounds vaguely like a replication experiment, we anticipated difficulties ahead, and we were right to do so.
425
 
426
- New tools always require a period of adaptation in the working flow. For instance, lacking—to the best of our knowledge—a monitoring tool equivalent to Nvidia-smi, simple procedures like optimizing batch sizes become troublesome. Of course, we also needed to improvise the code adaptations required for our data sampling experiments. Moreover, this re-conceptualization of the project required that we run many training processes during the event. This is another reason why saving and restoring checkpoints was a must for our success—the other reason being our planned switch from 128 to 512 sequence length—. However, such code was not available at the start of the Community Event. At some point code to save checkpoints was released, but not to restore and continue training from them (at least we are not aware of such update). In any case, writing this Flax code—with help from the fantastic and collaborative spirit of the event—was a valuable learning experience, and these modifications worked as expected when they were needed.
427
 
428
  The results we present in this project are very promising, and we believe they hold great value for the community as a whole. However, to make the most of our work, some next steps would be desirable.
429
 
430
- The most obvious step ahead is to replicate training on a "large" version of the model. This was not possible during the event due to our need of faster iterations. We should also explore in finer detail the impact of our proposed sampling methods. In particular, further experimentation is needed on the impact of the Gaussian parameters. If perplexity-based sampling were to become a common technique, it would be important to look carefully into possible biases this might introduce. Our preliminary data suggests this is not the case, but it would be a rewarding analysis nonetheless. Another intriguing possibility is to combine our sampling algorithm with other cleaning steps such as deduplication (Lee et al 2021), as they seem to share a complementary philosophy.
431
-
432
 
433
  # Conclusions
434
 
435
- With roughly 10 days worth of access to 3xTPUv3-8, we have achieved remarkable results surpassing previous state of the art in a few tasks, and even improving document classification on models trained in massive supercomputers with very large—private—and highly-curated datasets.
436
 
437
- The very big size of the datasets available looked enticing while formulating the project, however, it soon proved to be an important challenge given time constraints. This lead to a debate within the team and ended up reshaping our project and goals, now focusing on analysing this problem and how we could improve this situation for smaller teams like ours in the future. The subsampling techniques analysed in this report have shown great promise in this regard, and we hope to see other groups use them and improve them in the future.
438
 
439
- At a personal level, the experience has been incredible. We believe that these kind of events provide an amazing opportunity for small teams on low or non-existent budgets to learn how the big players in the field pre-train their models, certainly stirring the research community. The trade-off between learning and experimenting, and being beta-testers of libraries (Flax/JAX) and infrastructure (TPU VMs) is a marginal cost to pay compared to the benefits such access has to offer.
440
 
441
  Given our good results, on par with those of large corporations, we hope our work will inspire, and serve as a basis for, more small teams to play and experiment with language models on smaller subsets of huge datasets.
442
 
252
 
253
  Results for PAWS-X seem surprising given the large differences in performance. However, this training was repeated to rule out failed runs, and the results seem consistent. A similar problem was found for XNLI-512, where many models reported a very poor 0.3333 accuracy on a first run (and even a second, in the case of BSC-BNE). This suggests that training is somewhat unstable for some datasets under these conditions. Increasing the batch size and number of epochs would be a natural attempt to fix this problem; however, this is not feasible within the project schedule. For example, the runtime for XNLI-512 was ~19h per model, and increasing the batch size without reducing the sequence length is not feasible on a single GPU.
254
 
255
+ We are also releasing the fine-tuned models for `Gaussian`-512, which we are making our version 1.0 (current); a usage sketch follows the list below.
256
+
257
+ - POS: [`bertin-project/bertin-base-pos-conll2002-es`](https://huggingface.co/bertin-project/bertin-base-pos-conll2002-es/)
258
+ - NER: [`bertin-project/bertin-base-ner-conll2002-es`](https://huggingface.co/bertin-project/bertin-base-ner-conll2002-es/)
259
+ - PAWS-X: [`bertin-project/bertin-base-paws-x-es`](https://huggingface.co/bertin-project/bertin-base-paws-x-es)
260
+ - XNLI: [`bertin-project/bertin-base-xnli-es`](https://huggingface.co/bertin-project/bertin-base-xnli-es)
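
As a quick sanity check, any of these checkpoints can be loaded with the standard Hugging Face Transformers `pipeline` API. The snippet below is a minimal sketch using the NER checkpoint (the example sentence is ours); the POS model works analogously with the same pipeline, and the PAWS-X and XNLI models with a `text-classification` pipeline.

```python
from transformers import pipeline

# Minimal usage sketch for the released checkpoints: the NER model is
# loaded with the standard token-classification pipeline (the POS model
# works the same way; PAWS-X and XNLI use "text-classification").
ner = pipeline(
    "token-classification",
    model="bertin-project/bertin-base-ner-conll2002-es",
    aggregation_strategy="simple",  # merge sub-word pieces into entities
)

print(ner("Mi nombre es Wolfgang y vivo en Berlín."))
```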
261
+
262
  ## Bias and ethics
263
 
264
+ While a rigorous analysis of our models and datasets for bias was out of the scope of our project (given the very tight schedule and our lack of experience with Flax/JAX), this issue has still played an important role in our motivation. Bias is often the result of using massive, poorly-curated datasets to train expensive architectures. This means that, even if problems are identified, there is little most practitioners can do about them at the root level, since such training can be prohibitively expensive. We hope that, by facilitating competitive training with reduced times and datasets, we will help to enable the iterations and refinements that these models will need as our understanding of biases improves. For example, it should now be easier to train a RoBERTa model from scratch using newer datasets specially designed to address bias. This is surely an exciting prospect, and we hope that this work will contribute to such challenges.
265
 
266
+ Even if a rigorous analysis of bias is difficult, we should not use that excuse to disregard the issue in any project. Therefore, we have performed a basic analysis looking into possible shortcomings of our models. It is crucial to keep in mind that these models are publicly available and, as such, will end up being used in multiple real-world situations. These applications —some of them modern versions of phrenology— have a dramatic impact on the lives of people all over the world. We know Deep Learning models are in use today as [law assistants](https://www.wired.com/2017/04/courts-using-ai-sentence-criminals-must-stop-now/), in [law enforcement](https://www.washingtonpost.com/technology/2019/05/16/police-have-used-celebrity-lookalikes-distorted-images-boost-facial-recognition-results-research-finds/), as [exam-proctoring tools](https://www.wired.com/story/ai-college-exam-proctors-surveillance/) (also [this](https://www.eff.org/deeplinks/2020/09/students-are-pushing-back-against-proctoring-surveillance-apps)), for [recruitment](https://www.washingtonpost.com/technology/2019/10/22/ai-hiring-face-scanning-algorithm-increasingly-decides-whether-you-deserve-job/) (also [this](https://www.technologyreview.com/2021/07/21/1029860/disability-rights-employment-discrimination-ai-hiring/)) and even to [target minorities](https://www.insider.com/china-is-testing-ai-recognition-on-the-uighurs-bbc-2021-5). Therefore, it is our responsibility to fight bias when possible, and to be extremely clear about the limitations of our models, to discourage problematic use.
267
 
268
  ### Bias examples (Spanish)
269
 
270
+ Note that this analysis is slightly more difficult to do in Spanish, since gender concordance reveals hints beyond the masks themselves. Note also that many suggestions seem grammatically incorrect in English but, with few exceptions —like “drive high”, which works in English but not in Spanish— they are all correct, even if uncommon.
271
 
272
+ Results show that bias is apparent even in a quick and shallow analysis like this one. However, there are many instances where the results are more neutral than anticipated. For instance, the first option to do the dishes is the “son”, and pink is nowhere to be found in the colour recommendations for a girl. Women seem to drive “high”, “fast”, “strong” and “well”, but “not a lot”.
273
 
274
+ But before we get complacent, the model reminds us that the place of the woman is at "home" or "the bed" (!), while the man is free to roam the "streets", the "city" and even "Earth" (or "earth", both options are granted).
275
 
276
+ Similar conclusions are derived from examples focusing on race and religion. Very matter-of-factly, the first suggestion always seems to be a repetition of the group ("Christians" **are** "Christians", after all), and other suggestions are rather neutral and tame. However, there are some worrisome proposals. For example, the fourth option for Jews is that they are "racist". Chinese people are both "intelligent" and "stupid", which actually hints at the different forms of racism they encounter (so-called "positive" racism, such as claiming that Asians are good at math, can be insidious and [should not be taken lightly](https://www.health.harvard.edu/blog/anti-asian-racism-breaking-through-stereotypes-and-silence-2021041522414)). Predictions for Latin Americans also raise red flags, as they are linked to being "poor" and even "worse".
277
 
278
  The model also seems to suffer from geographical bias, producing words that are more common in Spain than in other countries. For example, when filling the mask in "My <mask> is a Hyundai Accent", the word "coche" scores higher than "carro" (the words for car in Spain and much of Latin America, respectively), while "auto", which is used in Argentina, doesn't appear in the top 5 choices. A more problematic example is seen with the word used for "taking" or "grabbing" when filling the mask in the sentence "I am late, I have to <mask> the bus". In Spain the word "coger" is used, while in most countries in Latin America the word "tomar" is used instead; there, "coger" means "to have sex". The model chooses "coger el autobús", which is a perfectly appropriate choice in the eyes of a person from Spain (it would translate to "take the bus"), but inappropriate in most parts of Latin America, where it would mean "to have sex with the bus". Another example of geographical bias can be observed in the model's preference for the Spanish word for "drive" over its Latin American counterparts. Even when prompted with the words "carro" and "auto" (used in Latin America for "car"), the model chooses "conducir" (Spain) over "manejar" (Latin America); however, "conducir" scores higher when prompted with "coche" (Spain) than with "carro" or "auto" (Latin America), suggesting that the model has at least some basic understanding of the different ways of speaking Spanish in different parts of the world.
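
These probes are easy to reproduce with a fill-mask pipeline. The sketch below assumes the released base checkpoint at [`bertin-project/bertin-roberta-base-spanish`](https://huggingface.co/bertin-project/bertin-roberta-base-spanish); any of the prompts above can be substituted.

```python
from transformers import pipeline

# Sketch of the fill-mask probing used in this section, assuming the
# released base checkpoint; substitute any of the prompts above.
unmasker = pipeline(
    "fill-mask",
    model="bertin-project/bertin-roberta-base-spanish",
)

# "My <mask> is a Hyundai Accent": compare the scores that the model
# assigns to "coche", "carro" and "auto".
for prediction in unmasker("Mi <mask> es un Hyundai Accent."):
    print(prediction["token_str"], round(prediction["score"], 3))
```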
279
 
420
 
421
  The performance of our models has been, in general, very good. Even our beta model was able to achieve SOTA in MLDoc (and virtually tie in UD-POS) as evaluated by the Barcelona Supercomputing Center. In the main masked-language task our models reach values between 0.65 and 0.69, which foretells good results for downstream tasks.
422
 
423
+ Our analysis of downstream tasks is not yet complete. It should be stressed that we have continued this fine-tuning in the same spirit of the project, that is, with smaller practitioners and budgets in mind. Therefore, our goal is not to achieve the highest possible metrics for each task, but rather to train using sensible hyperparameters and training times, and to compare the different models under these conditions. It is certainly possible that any of the models —ours or otherwise— could be carefully tuned to achieve better results on a given task, and it is possible that the best tuning might result in a new "winner" for that category. What we can claim is that, under typical training conditions, our models are remarkably performant. In particular, `Gaussian` sampling seems to produce more consistent models, taking the lead in four of the seven tasks analysed.
424
 
425
+ The differences in performance for models trained using different data-sampling techniques are consistent. `Gaussian` sampling is always first (with the exception of POS-512), while `Stepwise` is better than `Random` when trained for a similar number of steps. This strongly suggests that the sampling technique is, indeed, relevant, although a more thorough statistical analysis is still required.
426
 
427
+ As already mentioned in the [Training details](#training-details) section, the methodology used to extend the sequence length during training is critical. The `Random`-sampling model took an important hit in performance in this process, while `Gaussian`-512 ended up with better metrics than `Gaussian`-128, in both the main masked-language task and the downstream datasets. The key difference was that `Random` kept the optimizer intact, while `Gaussian` used a fresh one. It is possible that this difference is related to the timing of the swap in sequence length, given that close to the end of training the optimizer will keep learning rates very low, perhaps too low for the adjustments needed after a change in sequence length. We believe this is an important topic of research, but our preliminary data suggest that using a new optimizer is a safe alternative when in doubt or if computational resources are scarce.
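
For reference, resetting the optimizer amounts to re-initializing its state from the current parameters. The sketch below uses optax and is purely illustrative: the learning-rate schedule and its values are placeholders, not the settings used in our runs.

```python
import optax

# Illustrative sketch: when extending the sequence length from 128 to 512,
# a fresh optimizer discards the stale statistics (and near-zero learning
# rates) accumulated at the end of the 128-length run. `params` is the
# parameter pytree reloaded from the 128-length checkpoint; the schedule
# values below are placeholders, not our actual settings.
schedule = optax.linear_schedule(init_value=1e-4, end_value=0.0,
                                 transition_steps=25_000)
tx = optax.adamw(learning_rate=schedule)
opt_state = tx.init(params)  # fresh optimizer state, as used for Gaussian-512
```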
428
 
429
  # Lessons and next steps
430
 
431
+ The BERTIN project has been a challenge for many reasons. Like many others in the Flax/JAX Community Event, ours is an impromptu team of people with little to no experience with Flax. Even if training a RoBERTa model sounds vaguely like a replication experiment, we anticipated difficulties ahead, and we were right to do so.
432
 
433
+ New tools always require a period of adaptation to one's workflow. For instance, the lack —to the best of our knowledge— of a monitoring tool equivalent to `nvidia-smi` makes simple procedures like optimizing batch sizes troublesome. Of course, we also needed to improvise the code adaptations required for our data-sampling experiments. Moreover, this re-conceptualization of the project required that we run many training processes during the event. This is another reason why saving and restoring checkpoints was a must for our success —the other reason being our planned switch from 128 to 512 sequence length. However, such code was not available at the start of the Community Event. At some point, code to save checkpoints was released, but not to restore them and continue training (at least, we are not aware of such an update). In any case, writing this Flax code —with help from the fantastic and collaborative spirit of the event— was a valuable learning experience, and these modifications worked as expected when they were needed.
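
For illustration, the pattern we needed corresponds roughly to what the `flax.training.checkpoints` utilities offer today; in the sketch below, `state` stands for a `TrainState`-like pytree and the directory name is arbitrary.

```python
from flax.training import checkpoints

# Rough sketch of the save/restore pattern described above; `state` is
# a TrainState-like pytree and "./ckpts" an arbitrary directory.
checkpoints.save_checkpoint(ckpt_dir="./ckpts", target=state,
                            step=step, keep=3)

# On restart: returns `state` unchanged if no checkpoint is found,
# otherwise the latest saved state, from which training can continue.
state = checkpoints.restore_checkpoint(ckpt_dir="./ckpts", target=state)
```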
434
 
435
  The results we present in this project are very promising, and we believe they hold great value for the community as a whole. However, to make the most of our work, some next steps would be desirable.
436
 
437
+ The most obvious step ahead is to replicate training on a "large" version of the model. This was not possible during the event due to our need for faster iterations. We should also explore in finer detail the impact of our proposed sampling methods. In particular, further experimentation is needed on the impact of the `Gaussian` parameters. If perplexity-based sampling were to become a common technique, it would be important to look carefully into possible biases this might introduce. Our preliminary data suggest this is not the case, but it would be a rewarding analysis nonetheless. Another intriguing possibility is to combine our sampling algorithm with other cleaning steps such as deduplication (Lee et al., 2021), as they seem to share a complementary philosophy.
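
For those wanting to experiment, the core of such a `Gaussian` subsampling scheme can be sketched in a few lines. This is an illustrative implementation only: the function names and the mean/width constants below are placeholders, not the exact values used in our training runs.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def gaussian_weight(perplexity, mu, sigma):
    """Weight a document by a Gaussian over its perplexity, oversampling
    mid-perplexity documents relative to both extremes."""
    return np.exp(-((perplexity - mu) ** 2) / (2 * sigma ** 2))

def subsample(docs, perplexities, mu=2000.0, sigma=1000.0):
    # Keep each document with probability given by its Gaussian weight
    # (mu and sigma here are placeholders, not our actual parameters).
    weights = gaussian_weight(np.asarray(perplexities), mu, sigma)
    return [doc for doc, w in zip(docs, weights) if rng.random() < w]
```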
 
438
 
439
  # Conclusions
440
 
441
+ With roughly 10 days' worth of access to 3 TPUv3-8 machines, we have achieved remarkable results, surpassing the previous state of the art in a few tasks and even improving document classification over models trained on massive supercomputers with very large, highly-curated, and in some cases private, datasets.
442
 
443
+ The sheer size of the available datasets looked enticing while formulating the project. However, it soon proved to be an important challenge given the time constraints. This led to a debate within the team and ended up reshaping our project and goals, which now focus on analysing this problem and on how we could improve this situation for smaller teams like ours in the future. The subsampling techniques analysed in this report have shown great promise in this regard, and we hope to see other groups use and improve them in the future.
444
 
445
+ At a personal level, the experience has been incredible for all of us. We believe that these kinds of events provide an amazing opportunity for small teams on low or non-existent budgets to learn how the big players in the field pre-train their models, certainly stirring the research community. The trade-off between learning and experimenting, and being beta-testers of libraries (Flax/JAX) and infrastructure (TPU VMs), is a marginal cost to pay compared to the benefits such access has to offer.
446
 
447
  Given our good results, on par with those of large corporations, we hope our work will inspire, and serve as a basis for, more small teams to play and experiment with language models on smaller subsets of huge datasets.
448