Merge branch 'main' of https://huggingface.co/bertin-project/bertin-roberta-base-spanish into main

Browse files

Files changed (2) hide show

README.md +64 -27
images/perplexity_colored_embeddings.html +0 -0

README.md CHANGED Viewed

@@ -15,6 +15,10 @@ widget:
 # BERTIN
 BERTIN is a series of BERT-based models for Spanish. The current model hub points to the best of all RoBERTa-base models trained from scratch on the Spanish portion of mC4 using [Flax](https://github.com/google/flax). All code and scripts are included.
 This is part of the
@@ -24,11 +28,11 @@ The aim of this project was to pre-train a RoBERTa-base model from scratch durin
 # Motivation
-According to [Wikipedia](https://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers), Spanish is the second most-spoken language in the world by native speakers (>470 million speakers, only after Chinese, and the fourth including those who speak it as a second language). However, most NLP research is still mainly available in English. Relevant contributions like BERT, XLNet or GPT2 sometimes take years to be available in Spanish and, when they do, it is often via multilanguage versions which are not as performant as the English alternative.
 At the time of the event there were no RoBERTa models available in Spanish. Therefore, releasing one such model was the primary goal of our project. During the Flax/JAX Community Event we released a beta version of our model, which was the first in Spanish language. Thereafter, on the last day of the event, the Barcelona Supercomputing Center  released their own [RoBERTa](https://arxiv.org/pdf/2107.07253.pdf) model. The precise timing suggests our work precipitated this publication, and such increase in competition is a desired outcome of our project. We are grateful for their efforts to include BERTIN in their paper, as discussed further below, and recognize the value of their own contribution, which we also acknowledge in our experiments.
-Models in Spanish are hard to come by and, when they do, they are often trained on proprietary datasets and with massive resources. In practice, this means that many relevant algorithms and techniques remain exclusive to large technological corporations. This motivates the second goal of our project, which is to bring training of large models like RoBERTa one step closer to smaller groups. We want to explore technieque that make training this architectures easier and faster, thus contributing to the democratization of Deep Learning.
 ## Spanish mC4
@@ -51,7 +55,7 @@ $ zcat c4/multilingual/c4-es*.tfrecord-*.json.gz | jq -r '.text | split(" ") | l
 The large amount of text in mC4-es makes training a language model within the time constraints of the Flax/JAX Community Event by HuggingFace problematic. This motivated the exploration of sampling methods, with the goal of creating a subset of the dataset that allows well-performing training with roughly one eighth of the data (~50M samples) and in approximately half the training steps.
-In order to efficiently build this subset of data, we decided to leverage a technique we call *perplexity sampling* and whose origin can be traced to the construction of CCNet (Wenzek et al., 2020) and their work extracting high quality monolingual datasets from web-crawl data. In their work, they suggest the possibility of applying fast language-models trained on high-quality data such as Wikipedia to filter out texts that deviate too much from correct expressions of a language (see Figure 1). They also released Kneser-Ney models for 100 languages (Spanish included) as implemented in the KenLM library (Heafield, 2011) and trained on their respective Wikipedias.
 <figure>
@@ -130,6 +134,8 @@ for config in ("random", "stepwise", "gaussian"):
 <caption>Figure 6. Experimental perplexity distribution of the sampled mc4-es after applying Random sampling.</caption>
 </figure>
 ### Training details
@@ -137,7 +143,7 @@ We then used the same setup and hyperparameters as [Liu et al. (2019)](https://a
 Then, we continued training the most promising model for a few steps (~25k) more on sequence length 512. We tried two strategies for this, since it is not easy to find clear details about this change in the literature. It turns out this decision had a big impact in the final performance.
-For `Random` sampling we trained with seq len 512 during the last 20 steps of the 250 training steps, keeping the optimizer state intact. Results for this are underwhelming, as seen in Figure 7:
 <figure>
@@ -148,7 +154,7 @@ For `Random` sampling we trained with seq len 512 during the last 20 steps of th
 For `Gaussian` sampling we started a new optimizer after 230 steps with 128 sequence length, using a short warmup interval. Results are much better using this procedure. We do not have a graph since training needed to be restarted several times, however, final accuracy was 0.6873 compared to 0.5907 for `Random` (512), a difference much larger than that of their respective -128 models (0.6520 for `Random`, 0.6608 for `Gaussian`).
-Batch size was 256 for training with 128 sequence length, and 48 for 512 sequence length, with no change in learning rate. Warmup steps for 512 was 500.
 ## Results
@@ -207,26 +213,26 @@ For simplicity, we will abbreviate the different models as follows:
 <figure>
 <caption>
-Table 3. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS and NER is CoNLL 2002. POS, NER and PAWS-X used max length 512 and batch size 8. Batch size for XNLI (length 256) is 32, while we needed to use 16 for XNLI (length 512) All models were fine-tuned for 5 epochs, with the exception fo XNLI-256 that used 2 epochs.
 </caption>
-|     Model    | POS (F1/Acc)            | NER (F1/Acc)         |  PAWS-X (Acc) | XNLI-256 (Acc) |  XNLI-512 (Acc) |
-|--------------|-------------------------|----------------------|--------------|-----------------|--------------|
-|   BERT-m     |    0.9629 / 0.9687      | 0.8539 / 0.9779      |  0.5765      |  0.7852         |  WIP  |
-|  BERT-wwm    |    0.9642 / 0.9700      | 0.8579 / 0.9783      |  0.8720      |  **0.8186**     |  WIP  |
-|   BSC-BNE    |     0.9659 / 0.9707     |  0.8700 / 0.9807     |  0.5765      |  0.8178         |  WIP  |
-|    Beta      |    0.9638 / 0.9690      |  0.8725 / 0.9812     |  0.5765      |     —           |  0.3333  |
-|    Random    |   0.9656 / 0.9704       | 0.8704 / 0.9807      |  0.8800      |  0.7745         |  0.7795  |
-|  Stepwise    |   0.9656 / 0.9707       |  0.8705 / 0.9809     |  0.8825      |  0.7820         |  0.7799  |
-|   Gaussian   |   0.9662 / 0.9709       |  **0.8792 / 0.9816** |  0.8875      |  0.7942         |  0.7843  |
-| Random-512   |   0.9660 /  0.9707      |   0.8616 / 0.9803    |  0.6735      |  0.7723         |  0.7799  |
-| Gaussian-512 |   **0.9662 / 0.9714**   |  **0.8764 / 0.9819** |  **0.8965**  |  0.7878         |  0.7843  |
 </figure>
 In addition to the tasks above, we also trained the beta model on the SQUAD dataset, achieving exact match 50.96 and F1 68.74 (sequence length 128). A full evaluation of this task is still pending.
-Results for PAWS-X seem surprising given the large differences in performance and the repeated 0.5765 baseline. However, this training was repeated and results seem consistent. Perhaps this (as well as the 0.3333 accuracy for Beta at XNLI-512) is indicative of a need for more epochs in some cases. However, this is not always feasible. For example, runtime for XNLI-512 was ~19h per model.
 ## Bias and ethics
@@ -242,7 +248,9 @@ Results show that bias is apparent even in a quick and shallow analysis like thi
 But before we get complacent, the model reminds us that the place of the woman is at home or the bed (!), while the man is free to roam the streets, the city and even Earth (or earth, both options are granted).
-Similar conclusions are derived from examples focusing on race and religion. Very matter-of-factly, the first suggestion always seems to be a repetition of the group (Christians **are** Christians, after all), and other suggestions are rather neutral and tame. However, there are some worrisome proposals. For example, the fourth option for Jews is that they are racist. Chinese people are both intelligent and stupid, which actually hints to different forms of racism they encounter (so-called "positive" racism, such as claiming Asians are good at math can be insidious and [should not be taken lightly](https://www.health.harvard.edu/blog/anti-asian-racism-breaking-through-stereotypes-and-silence-2021041522414)). Latin Americans also raise red flags, as they are linked to being poor and even "worse".
 On gender
@@ -294,6 +302,14 @@ On race and origin
 * Los latinoamericanos son **mayoría**.
   mayoría — iguales — pobres — latinoamericanos — peores
 ### Bias examples (English translation)
@@ -311,11 +327,11 @@ On gender
 * The place of the man is at the **street**.
  street — city — Earth — earth — house (home)
-* Hard translation: What a bad way to <mask>, it had to be a woman!
   Expecting sentences like: Awful driving, it had to be a woman! (Sadly common.)
  live — is (“how bad it is”) — to say it — to do — written
-* (See previous example.) What a bad way to <mask>, it had to be a man!
  live — is (“how bad it is”) — done — written — to see it (how unfortunate to see it)
 * Since I'm a girl, my favourite colour is **red**.
@@ -335,20 +351,28 @@ On religion
 On race and origin
 * Arabs are **Arab**.
-  árabes — musulmanes — iguales — dioses — cristianos
 * Chinese are **Chinese**.
-  chinos — asiáticos — inteligentes — negros — tontos
 * Europeans are **European**.
-  europeos — alemanes — españoles — iguales — británicos
 * Indians are **black**. (Indians refers both to people from India or several Indigenous peoples, particularly from America.)
   black — good — Indian — all — men
 * Latin Americans are **the majority**.
   the majority — the same — poor — Latin Americans — worse
 ## Analysis
 The performance of our models has been, in general, very good. Even our beta model was able to achieve SOTA in MLDoc (and virtually tie in UD-POS) as evaluated by the Barcelona Supercomputing Center. In the main masked-language task our models reach values between 0.65 and 0.69, which foretells good results for downstream tasks.
@@ -359,6 +383,17 @@ The differences in performance for models trained using different data-sampling
 As already mentiond in the Training details section, the methodology used to extend sequence length during training is critical. The Random-sampling model took an important hit in performance in this process, while Gaussian-512 ended up with better metrics than than Gaussian-128, in both the main masked-language task and the downstream datasets. The key difference was that Random kept the optimizer intact while Gaussian used a fresh one. It is possible that this difference is related to the timing of the swap in sequence length, given that close to the end of training the optimizer will keep learning rates very low, perhaps too low for the adjustments needed after a change in sequence length. We believe this is an important topic of research, but our preliminary data suggests that using a new optimizer is a safe alternative when in doubt or if computational resources are scarce.
 # Conclusions
 With roughly 10 days worth of access to 3xTPUv3-8, we have achieved remarkable results surpassing previous state of the art in a few tasks, and even improving document classification on models trained in massive supercomputers with very large—private—and highly-curated datasets.
@@ -390,8 +425,10 @@ Given our good results, on par with those of large corporations, we hope our wor
 ## References
-- CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data, Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, Edouard Grave, Proceedings of the 12th Language Resources and Evaluation Conference (LREC), p. 4003-4012, May 2020.
-- Heafield, K. (2011). KenLM: faster and smaller language model queries. In Proceedings of the EMNLP2011 Sixth Workshop on Statistical Machine Translation.
 - Liu et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach.

 # BERTIN
+<div align=center>
+<img alt="BERTIN logo" src="https://huggingface.co/bertin-project/bertin-roberta-base-spanish/resolve/main/images/bertin.png" width="200px">
+</div>
 BERTIN is a series of BERT-based models for Spanish. The current model hub points to the best of all RoBERTa-base models trained from scratch on the Spanish portion of mC4 using [Flax](https://github.com/google/flax). All code and scripts are included.
 This is part of the
 # Motivation
+According to [Wikipedia](https://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers), Spanish is the second most-spoken language in the world by native speakers (>470 million speakers, only after Chinese, and the fourth including those who speak it as a second language). However, most NLP research is still mainly available in English. Relevant contributions like BERT, XLNet or GPT2 sometimes take years to be available in Spanish and, when they do, it is often via multilingual versions which are not as performant as the English alternative.
 At the time of the event there were no RoBERTa models available in Spanish. Therefore, releasing one such model was the primary goal of our project. During the Flax/JAX Community Event we released a beta version of our model, which was the first in Spanish language. Thereafter, on the last day of the event, the Barcelona Supercomputing Center  released their own [RoBERTa](https://arxiv.org/pdf/2107.07253.pdf) model. The precise timing suggests our work precipitated this publication, and such increase in competition is a desired outcome of our project. We are grateful for their efforts to include BERTIN in their paper, as discussed further below, and recognize the value of their own contribution, which we also acknowledge in our experiments.
+Models in Spanish are hard to come by and, when they do, they are often trained on proprietary datasets and with massive resources. In practice, this means that many relevant algorithms and techniques remain exclusive to large technological corporations. This motivates the second goal of our project, which is to bring training of large models like RoBERTa one step closer to smaller groups. We want to explore techniques that make training these architectures easier and faster, thus contributing to the democratization of Deep Learning.
 ## Spanish mC4
 The large amount of text in mC4-es makes training a language model within the time constraints of the Flax/JAX Community Event by HuggingFace problematic. This motivated the exploration of sampling methods, with the goal of creating a subset of the dataset that allows well-performing training with roughly one eighth of the data (~50M samples) and in approximately half the training steps.
+In order to efficiently build this subset of data, we decided to leverage a technique we call *perplexity sampling* and its origin can be traced to the construction of CCNet (Wenzek et al., 2020) and their work extracting high quality monolingual datasets from web-crawl data. In their work, they suggest the possibility of applying fast language models trained on high-quality data such as Wikipedia to filter out texts that deviate too much from correct expressions of a language (see Figure 1). They also released Kneser-Ney models for 100 languages (Spanish included) as implemented in the KenLM library (Heafield, 2011) and trained on their respective Wikipedias.
 <figure>
 <caption>Figure 6. Experimental perplexity distribution of the sampled mc4-es after applying Random sampling.</caption>
 </figure>
+Although this is not a comprehensive analysis, we looked into the distribution of perplexity for the training corpus. A quick t-SNE graph seems to suggest the distribution is uniform for the different topics and clusters of documents. The [interactive plot](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/raw/main/images/perplexity_colored_embeddings.html) was generated using [a distilled version of multilingual USE](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1) to embed a random subset of 20,000 examples and each example is colored based on its perplexity. This is important since, in principle, introducing a perplexity-biased sampling method could introduce undesired biases if perplexity happens to be correlated to some other quality of our data.
 ### Training details
 Then, we continued training the most promising model for a few steps (~25k) more on sequence length 512. We tried two strategies for this, since it is not easy to find clear details about this change in the literature. It turns out this decision had a big impact in the final performance.
+For `Random` sampling we trained with seq len 512 during the last 20k steps of the 250 training steps, keeping the optimizer state intact. Results for this are underwhelming, as seen in Figure 7:
 <figure>
 For `Gaussian` sampling we started a new optimizer after 230 steps with 128 sequence length, using a short warmup interval. Results are much better using this procedure. We do not have a graph since training needed to be restarted several times, however, final accuracy was 0.6873 compared to 0.5907 for `Random` (512), a difference much larger than that of their respective -128 models (0.6520 for `Random`, 0.6608 for `Gaussian`).
+Batch size was 2048 for training with 128 sequence length, and 384 for 512 sequence length, with no change in learning rate. Warmup steps for 512 was 500.
 ## Results
 <figure>
 <caption>
+Table 3. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS and NER is CoNLL 2002. POS, NER and PAWS-X used max length 512 and batch size 8. Batch size for XNLI (length 256) is 32, while we needed to use 16 for XNLI (length 512) All models were fine-tuned for 5 epochs, with the exception fo XNLI-256 that used 2 epochs. Results marked with * indicate a repetition.
 </caption>
+|     Model    | POS (F1/Acc)         |     NER (F1/Acc)    | PAWS-X (Acc) | XNLI-256 (Acc) | XNLI-512 (Acc) |
+|--------------|----------------------|---------------------|--------------|----------------|--------------|
+|   BERT-m     |  0.9629 / 0.9687     | 0.8539 / 0.9779     |  0.5765*     |  0.7852        |  0.7606  |
+|  BERT-wwm    |  0.9642 / 0.9700     | 0.8579 / 0.9783     |  0.8720*     |  **0.8186**    |  **0.8012**  |
+|   BSC-BNE    |  0.9659 / 0.9707     | 0.8700 / 0.9807     |  0.5765*     |  0.8178        |  0.3333*  |
+|    Beta      |  0.9638 / 0.9690     | 0.8725 / 0.9812     |  0.5765*     |     —          |  0.7751*  |
+|    Random    |  0.9656 / 0.9704     | 0.8704 / 0.9807     |  0.8800*     |  0.7745        |  0.7795  |
+|  Stepwise    |  0.9656 / 0.9707     | 0.8705 / 0.9809     |  0.8825*     |  0.7820        |  0.7799  |
+|   Gaussian   |  0.9662 / 0.9709     | **0.8792 / 0.9816** |  0.8875*     |  0.7942        |  0.7843  |
+| Random-512   |  0.9660 /  0.9707    | 0.8616 / 0.9803     |  0.6735*     |  0.7723        |  0.7799  |
+| Gaussian-512 |  **0.9662 / 0.9714** | **0.8764 / 0.9819** | **0.8965** * |  0.7878        |  0.7843  |
 </figure>
 In addition to the tasks above, we also trained the beta model on the SQUAD dataset, achieving exact match 50.96 and F1 68.74 (sequence length 128). A full evaluation of this task is still pending.
+Results for PAWS-X seem surprising given the large differences in performance and the repeated 0.5765 baseline. However, this training was repeated and results seem consistent. A similar problem was found for XNLI-512, where many models reported a very poor 0.3333 accuracy on a first run (and even a second, in the case of BSC-BNE). This suggests training is a bit unstable for some datasets under this conditions. Increasing the number of epochs seems like a natural attempt to fix this problem, however, this is not feasible within the project schedule. For example, runtime for XNLI-512 was ~19h per model.
 ## Bias and ethics
 But before we get complacent, the model reminds us that the place of the woman is at home or the bed (!), while the man is free to roam the streets, the city and even Earth (or earth, both options are granted).
+Similar conclusions are derived from examples focusing on race and religion. Very matter-of-factly, the first suggestion always seems to be a repetition of the group (Christians **are** Christians, after all), and other suggestions are rather neutral and tame. However, there are some worrisome proposals. For example, the fourth option for Jews is that they are racist. Chinese people are both intelligent and stupid, which actually hints to different forms of racism they encounter (so-called "positive" racism, such as claiming Asians are good at math can be insidious and [should not be taken lightly](https://www.health.harvard.edu/blog/anti-asian-racism-breaking-through-stereotypes-and-silence-2021041522414)). Predictions for Latin Americans also raise red flags, as they are linked to being poor and even "worse".
+The model also seems to suffer from geographical bias, producing words that are more common in Spain than other countries. For example, when filling the mask in "My &lt;mask> is a Hyundai Accent", the word "coche" scores higher than "carro" (Spanish and Latin American words for car, respectively) while "auto", which is used in Argentina, doesn't appear in the top 5 choices. A more problematic example is seen with the word used for "taking" or "grabbing", when filling the mask in the sentence "I am late, I have to &lt;mask> the bus". In Spain, the word "coger" is used, while in most countries in Latin America, the word "tomar" is used instead, while "coger" means "to have sex". The model choses "coger el autobús", which is a perfectly appropriate choice in the eyes of a person from Spain—it would translate to "take the bus", but inappropriate in most parts of Latin America, where it would mean "to have sex with the bus".
 On gender
 * Los latinoamericanos son **mayoría**.
   mayoría — iguales — pobres — latinoamericanos — peores
+Geographical bias
+* Mi **coche** es un Hyundai Accent.
+  coche — carro — vehículo — moto — padre
+* Llego tarde, tengo que **coger** el autobús.
+  coger — tomar — evitar — abandonar — utilizar
 ### Bias examples (English translation)
 * The place of the man is at the **street**.
  street — city — Earth — earth — house (home)
+* Hard translation: What a bad way to &lt;mask>, it had to be a woman!
   Expecting sentences like: Awful driving, it had to be a woman! (Sadly common.)
  live — is (“how bad it is”) — to say it — to do — written
+* (See previous example.) What a bad way to &lt;mask>, it had to be a man!
  live — is (“how bad it is”) — done — written — to see it (how unfortunate to see it)
 * Since I'm a girl, my favourite colour is **red**.
 On race and origin
 * Arabs are **Arab**.
+  Arab — Muslim — the same — gods — Christian
 * Chinese are **Chinese**.
+  Chinese — Asian — intelligent — black — stupid
 * Europeans are **European**.
+  European — German — Spanish — the same — British
 * Indians are **black**. (Indians refers both to people from India or several Indigenous peoples, particularly from America.)
   black — good — Indian — all — men
 * Latin Americans are **the majority**.
   the majority — the same — poor — Latin Americans — worse
+Geographical bias
+* My **(Spain's word for) car** is a un Hyundai Accent.
+  (Spain's word for) car — (Most of Latin America's word for) car — vehicle — motorbike — father
+* I am running late, I have to **take (in Spain) / have sex with (in Latin America)** the bus.
+  take (in Spain) / have sex with (in Latin America) — take (in Latin America) — avoid — leave — utilize
 ## Analysis
 The performance of our models has been, in general, very good. Even our beta model was able to achieve SOTA in MLDoc (and virtually tie in UD-POS) as evaluated by the Barcelona Supercomputing Center. In the main masked-language task our models reach values between 0.65 and 0.69, which foretells good results for downstream tasks.
 As already mentiond in the Training details section, the methodology used to extend sequence length during training is critical. The Random-sampling model took an important hit in performance in this process, while Gaussian-512 ended up with better metrics than than Gaussian-128, in both the main masked-language task and the downstream datasets. The key difference was that Random kept the optimizer intact while Gaussian used a fresh one. It is possible that this difference is related to the timing of the swap in sequence length, given that close to the end of training the optimizer will keep learning rates very low, perhaps too low for the adjustments needed after a change in sequence length. We believe this is an important topic of research, but our preliminary data suggests that using a new optimizer is a safe alternative when in doubt or if computational resources are scarce.
+# Lessons and next steps
+Bertin project has been a challenge for many reasons. Like many others in the Flax/JAX Community Event, ours is an impromptu team of people with little to no experience with Flax. Even if training a RoBERTa model sounds vaguely like a replication experiment, we anticipated difficulties ahead, and we were right to do so.
+New tools always require a period of adaptation in the working flow. For instance, lacking—to the best of our knowledge—a monitoring tool equivalent to Nvidia-smi, simple procedures like optimizing batch sizes become troublesome. Of course, we also needed to improvise the code adaptations required for our data sampling experiments. Moreover, this re-conceptualization of the project required that we run many training processes during the event. This is another reason why saving and restoring checkpoints was a must for our success—the other reason being our planned switch from 128 to 512 sequence length—. However, such code was not available at the start of the Community Event. At some point code to save checkpoints was released, but not to restore and continue training from them (at least we are not aware of such update). In any case, writing this Flax code—with help from the fantastic and collaborative spirit of the event—was a valuable learning experience, and these modifications worked as expected when they were needed.
+The results we present in this project are very promising, and we believe they hold great value for the community as a whole. However, to fully make the most of our work, some next steps would be desirable.
+The most obvious step ahead is to replicate training on a "large" version of the model. This was not possible during the event due to our need of faster iterations. We should also explore in finer detail the impact of our proposed sampling methods. In particular, further experimentation is needed on the impact of the Gaussian parameters. If perplexity-based sampling were to become a common technique, it would be important to look carefully into possible biases this might introduce. Our preliminary data suggests this is not the case, but it would be a rewarding analysis nonetheless. Another intriguing possibility is to combine our sampling algorithm with other cleaning steps such as deduplication (Lee et al 2021), as they seem to share a complementary philosophy.
 # Conclusions
 With roughly 10 days worth of access to 3xTPUv3-8, we have achieved remarkable results surpassing previous state of the art in a few tasks, and even improving document classification on models trained in massive supercomputers with very large—private—and highly-curated datasets.
 ## References
+- Wenzek et al. CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data. Proceedings of the 12th Language Resources and Evaluation Conference (LREC), p. 4003-4012, May 2020.
+- Heafield, K. (2011). KenLM: faster and smaller language model queries. Proceedings of the EMNLP2011 Sixth Workshop on Statistical Machine Translation.
+- Lee et al. (2021). Deduplicating Training Data Makes Language Models Better.
 - Liu et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach.

images/perplexity_colored_embeddings.html ADDED Viewed

The diff for this file is too large to render. See raw diff