bertin-project/bertin-roberta-base-spanish

@@ -52,7 +52,7 @@ $ zcat c4/multilingual/c4-es*.tfrecord-*.json.gz | jq -r '.text | split(" ") | l
 ## Perplexity sampling
-The large amount of text in mC4-es makes training a language model within the time constraints of the Flax/JAX Community Event problematic. This motivated the exploration of sampling methods, with the goal of creating a subset of the dataset that would allow for the taining of well-performing models with roughly one eighth of the data (~50M samples) and at approximately half the training steps.
 In order to efficiently build this subset of data, we decided to leverage a technique we call *perplexity sampling*, and whose origin can be traced to the construction of CCNet (Wenzek et al., 2020) and their high quality monolingual datasets from web-crawl data. In their work, they suggest the possibility of applying fast language models trained on high-quality data such as Wikipedia to filter out texts that deviate too much from correct expressions of a language (see Figure 1). They also released Kneser-Ney models (Ney et al., 1994) for 100 languages (Spanish included) as implemented in the KenLM library (Heafield, 2011) and trained on their respective Wikipedias.
@@ -140,9 +140,9 @@ Although this is not a comprehensive analysis, we looked into the distribution o
 We then used the same setup and hyperparameters as [Liu et al. (2019)](https://arxiv.org/abs/1907.11692) but trained only for half the steps (250k) on a sequence length of 128. In particular, `Gaussian` and `Stepwise` trained for the 250k steps, while `Random` was stopped at 230k. `Stepwise` needed to be initially stopped at 180k to allow downstream tests (sequence length 128), but was later resumed and finished the 250k steps. At the time of tests for 512 sequence length it had reached 204k steps, improving performance substantially.
-Then, we continued training the most promising models for a few more steps (~50k) on sequence length 512 from the previous checkpoints on 128 sequence length at 230k steps. We tried two strategies for this, since it is not easy to find clear details about how to procede in the literature. It turns out this decision had a big impact in the final performance.
-For `Random` sampling we trained with seq len 512 during the last 25k steps of the 250k training steps, keeping the optimizer state intact. Results for this are underwhelming, as seen in Figure 7.
 <figure>
@@ -159,13 +159,13 @@ Batch size was 2048 (8 TPU cores \* 256 batch size) for training with 128 sequen
 Please refer to the **evaluation** folder for training scripts for downstream tasks.
-Our first test, tagged [`beta`](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/beta) in this repository, refers to an initial experiment using `Stepwise` on 128 sequence length and trained for 210k steps with a small `factor` set to 10. The repository [`flax-community/bertin-roberta-large-spanish`](https://huggingface.co/flax-community/bertin-roberta-large-spanish) containes a nearly identical version but it is now discontinued). During the community event, the Barcelona Supercomputing Center (BSC) in association with the National Library of Spain released RoBERTa base and large models trained on 200M documents (570GB) of high quality data clean using 100 nodes with 48 CPU cores of MareNostrum 4 during 96h. At the end of the process they were left with 2TB of clean data at the document level that were further cleaned up to the final 570GB. This is an interesting contrast to our own resources (3 TPUv3-8 for 10 days to do cleaning, sampling, taining, and evaluation) and makes for a valuable reference. The BSC team evaluated our early release of the model [`beta`](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/beta) and the results can be seen in Table 1.
 Our final models were trained on a different number of steps and sequence lengths and achieve different—higher—masked-word prediction accuracies. Despite these limitations it is interesting to see the results they obtained using the early version of our model. Note that some of the datasets used for evaluation by BSC are not freely available, therefore it is not possible to verify the figures.
 <figure>
-<caption>Table 1. Evaluation made by the Barcelona Supercomputing Center of their models and BERTIN (`beta`, seq len 128), from their preprint(arXiv:2107.07253).</caption>
 | Dataset     | Metric   | RoBERTa-b | RoBERTa-l | BETO   | mBERT  | BERTIN (beta) |
 |-------------|----------|-----------|-----------|--------|--------|--------|
@@ -176,7 +176,7 @@ Our final models were trained on a different number of steps and sequence length
 | STS         | Combined |    0.8423 |    0.8420 | 0.8216 | 0.8249 | 0.7822 |
 | MLDoc       | Accuracy |    0.9595 |    0.9600 | 0.9650 | 0.9560 | **0.9673** |
 | PAWS-X      | F1       |    0.9035 |    0.9000 | 0.8915 | 0.9020 | 0.8820 |
-| XNLI        | Accuracy |    0.8016 |       WiP | 0.8130 | 0.7876 |    WiP |
 </figure>
@@ -193,6 +193,7 @@ All of our models attained good accuracy values during training in the masked-la
 | [`bertin-project/bertin-base-stepwise`](https://huggingface.co/bertin-project/bertin-base-stepwise)                | 0.6487   |
 | [`bertin-project/bertin-base-gaussian`](https://huggingface.co/bertin-project/bertin-base-gaussian)                | 0.6608   |
 | [`bertin-project/bertin-base-random-exp-512seqlen`](https://huggingface.co/bertin-project/bertin-base-random-exp-512seqlen)    | 0.5907   |
 | [`bertin-project/bertin-base-gaussian-exp-512seqlen`](https://huggingface.co/bertin-project/bertin-base-gaussian-exp-512seqlen)  | **0.6873**   |
 </figure>
@@ -209,12 +210,13 @@ For simplicity, we will abbreviate the different models as follows:
 * **Stepwise**: [`bertin-project/bertin-base-stepwise`](https://huggingface.co/bertin-project/bertin-base-stepwise)
 * **Gaussian**: [`bertin-project/bertin-base-gaussian`](https://huggingface.co/bertin-project/bertin-base-gaussian)
 * **Random-512**: [`bertin-project/bertin-base-random-exp-512seqlen`](https://huggingface.co/bertin-project/bertin-base-random-exp-512seqlen)
 * **Gaussian-512**: [`bertin-project/bertin-base-gaussian-exp-512seqlen`](https://huggingface.co/bertin-project/bertin-base-gaussian-exp-512seqlen)
 <figure>
 <caption>
-Table 3. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS and NER is CoNLL 2002. POS and NER used max length 128 and batch size 16. Batch size for XNLI is 32 (max length 256). All models were fine-tuned for 5 epochs, with the exception fo XNLI-256 that used 2 epochs. Stepwise used an older checkpoint with only 180.000 steps.
 </caption>
 |     Model    | POS (F1/Acc)         |     NER (F1/Acc)    | XNLI-256 (Acc) |
@@ -227,6 +229,7 @@ Table 3. Metrics for different downstream tasks, comparing our different models
 |  Stepwise    |  0.9656 / 0.9707     | 0.8705 / 0.9809     |  0.7820        |
 |  Gaussian    |  0.9662 / 0.9709     | **0.8792 / 0.9816** |  0.7942        |
 | Random-512   |  0.9660 /  0.9707    | 0.8616 / 0.9803     |  0.7723        |
 | Gaussian-512 |  **0.9662 / 0.9714** | **0.8764 / 0.9819** |  0.7878        |
 </figure>
@@ -244,6 +247,7 @@ Table 4. Metrics for different downstream tasks, comparing our different models
 |  Stepwise    |  0.9642 / 0.9693     | 0.8726 / 0.9818     |  0.8825*     |  0.7799    |
 |   Gaussian   |  0.9644 / 0.9692     | **0.8779 / 0.9820** |  0.8875*     |  0.7843    |
 | Random-512   |  0.9636 /  0.9690    | 0.8664 / 0.9806     |  0.6735*     |  0.7799    |
 | Gaussian-512 |  0.9646 / 0.9697     | 0.8707 / 0.9810     | **0.8965** * |  0.7843    |
 </figure>
@@ -252,7 +256,7 @@ In addition to the tasks above, we also trained the [`beta`](https://huggingface
 Results for PAWS-X seem surprising given the large differences in performance. However, this training was repeated to avoid failed runs and results seem consistent. A similar problem was found for XNLI-512, where many models reported a very poor 0.3333 accuracy on a first run (and even a second, in the case of BSC-BNE). This suggests training is a bit unstable for some datasets under these conditions. Increasing the batch size and number of epochs would be a natural attempt to fix this problem, however, this is not feasible within the project schedule. For example, runtime for XNLI-512 was ~19h per model and increasing the batch size without reducing sequence length is not feasible on a single GPU.
-We are also releasing the fine-tuned models for `Gaussian`-512 and making it our version v1 default to 128 sequence length since it experimentally shows better performance on fill-mask task, while alse releasing the 512 sequence length version (v1-512) for fine-tuning.
 - POS: [`bertin-project/bertin-base-pos-conll2002-es`](https://huggingface.co/bertin-project/bertin-base-pos-conll2002-es/)
 - NER: [`bertin-project/bertin-base-ner-conll2002-es`](https://huggingface.co/bertin-project/bertin-base-ner-conll2002-es/)
@@ -269,7 +273,7 @@ Even if a rigorous analysis of bias is difficult, we should not use that excuse
 Note that this analysis is slightly more difficult to do in Spanish since gender concordance reveals hints beyond masks. Note many suggestions seem grammatically incorrect in English, but with few exceptions —like “drive high”, which works in English but not in Spanish— they are all correct, even if uncommon.
-Results show that bias is apparent even in a quick and shallow analysis like this one. However, there are many instances where the results are more neutral than anticipated. For instance, the first option to “do the dishes” is the “son”, and “pink” is nowhere to be found in the colour recommendations for a girl. Women seem to drive “high”, “fast”, “strong” and “well”, but “not a lot”.
 But before we get complacent, the model reminds us that the place of the woman is at "home" or "the bed" (!), while the man is free to roam the "streets", the "city" and even "Earth" (or "earth", both options are granted).
@@ -401,7 +405,7 @@ On race and origin
 Geographical bias
-* My **(Spain's word for) car** is a un Hyundai Accent.
   (Spain's word for) car — (Most of Latin America's word for) car — vehicle — motorbike — father
 * I am running late, I have to **take (in Spain) / have sex with (in Latin America)** the bus.
@@ -424,7 +428,7 @@ Our analysis of downstream tasks is not yet complete. It should be stressed that
 The differences in performance for models trained using different data-sampling techniques are consistent. `Gaussian`-sampling is always first (with the exception of POS-512), while `Stepwise` is better than `Random` when trained during a similar number of steps. This proves that the sampling technique is, indeed, relevant. A more thorough statistical analysis is still required.
-As already mentiond in the [Training details](#training-details) section, the methodology used to extend sequence length during training is critical. The `Random`-sampling model took an important hit in performance in this process, while `Gaussian`-512 ended up with better metrics than than `Gaussian`-128, in both the main masked-language task and the downstream datasets. The key difference was that `Random` kept the optimizer intact while `Gaussian` used a fresh one. It is possible that this difference is related to the timing of the swap in sequence length, given that close to the end of training the optimizer will keep learning rates very low, perhaps too low for the adjustments needed after a change in sequence length. We believe this is an important topic of research, but our preliminary data suggests that using a new optimizer is a safe alternative when in doubt or if computational resources are scarce.
 # Lessons and next steps

 ## Perplexity sampling
+The large amount of text in mC4-es makes training a language model within the time constraints of the Flax/JAX Community Event problematic. This motivated the exploration of sampling methods, with the goal of creating a subset of the dataset that would allow for the training of well-performing models with roughly one eighth of the data (~50M samples) and at approximately half the training steps.
 In order to efficiently build this subset of data, we decided to leverage a technique we call *perplexity sampling*, and whose origin can be traced to the construction of CCNet (Wenzek et al., 2020) and their high quality monolingual datasets from web-crawl data. In their work, they suggest the possibility of applying fast language models trained on high-quality data such as Wikipedia to filter out texts that deviate too much from correct expressions of a language (see Figure 1). They also released Kneser-Ney models (Ney et al., 1994) for 100 languages (Spanish included) as implemented in the KenLM library (Heafield, 2011) and trained on their respective Wikipedias.
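The filtering idea described above can be sketched in a few lines. This is a minimal illustration, not the pipeline used in the project: a toy add-one smoothed unigram model stands in for the Kneser-Ney KenLM models, and the reference sentences, the candidate documents, and the `max_ppl` threshold are all hypothetical.

```python
import math
from collections import Counter


def train_unigram(reference_docs):
    """Count tokens in a small 'high-quality' reference corpus."""
    counts = Counter(tok for doc in reference_docs for tok in doc.lower().split())
    return counts, sum(counts.values())


def perplexity(text, counts, total):
    """Per-token perplexity under an add-one smoothed unigram model."""
    tokens = text.lower().split()
    if not tokens:
        return float("inf")
    vocab = len(counts) + 1  # one extra slot shared by all unseen tokens
    log_prob = sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens)
    return math.exp(-log_prob / len(tokens))


def filter_by_perplexity(docs, counts, total, max_ppl):
    """Keep only documents whose perplexity does not exceed the threshold."""
    return [d for d in docs if perplexity(d, counts, total) <= max_ppl]


# Hypothetical stand-in for a clean reference corpus such as Wikipedia.
reference = ["el gato come pescado", "el perro come carne", "la casa es grande"]
# Hypothetical web-crawl candidates: one fluent, one mostly noise.
candidates = ["el gato come carne", "xx zz qq 1234 click aquí"]

counts, total = train_unigram(reference)
scores = {doc: perplexity(doc, counts, total) for doc in candidates}
kept = filter_by_perplexity(candidates, counts, total, max_ppl=15.0)
```

With a real KenLM model the scoring function would be replaced by the library's own perplexity computation; the thresholding step stays the same.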
 We then used the same setup and hyperparameters as [Liu et al. (2019)](https://arxiv.org/abs/1907.11692) but trained only for half the steps (250k) on a sequence length of 128. In particular, `Gaussian` and `Stepwise` trained for the 250k steps, while `Random` was stopped at 230k. `Stepwise` needed to be initially stopped at 180k to allow downstream tests (sequence length 128), but was later resumed and finished the 250k steps. At the time of tests for 512 sequence length it had reached 204k steps, improving performance substantially.
+Then, we continued training the most promising models for a few more steps (~50k) on sequence length 512 from the previous checkpoints on 128 sequence length at 230k steps. We tried two strategies for this, since it is not easy to find clear details in the literature about how to proceed. It turns out this decision had a big impact on the final performance.
+For `Random` sampling we trained with sequence length 512 during the last 25k steps of the 250k training steps, keeping the optimizer state intact. Results for this are underwhelming, as seen in Figure 7.
 <figure>
 Please refer to the **evaluation** folder for training scripts for downstream tasks.
+Our first test, tagged [`beta`](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/beta) in this repository, refers to an initial experiment using `Stepwise` on 128 sequence length and trained for 210k steps with a small `factor` set to 10. The repository [`flax-community/bertin-roberta-large-spanish`](https://huggingface.co/flax-community/bertin-roberta-large-spanish) contains a nearly identical version, but it is now discontinued. During the community event, the Barcelona Supercomputing Center (BSC) in association with the National Library of Spain released RoBERTa base and large models trained on 200M documents (570GB) of high-quality data, cleaned using 100 nodes with 48 CPU cores of MareNostrum 4 for 96h. At the end of the process they were left with 2TB of clean data at the document level that were further cleaned up to the final 570GB. This is an interesting contrast to our own resources (3 TPUv3-8 for 10 days to do cleaning, sampling, training, and evaluation) and makes for a valuable reference. The BSC team evaluated our early release of the model [`beta`](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/beta) and the results can be seen in Table 1.
 Our final models were trained on a different number of steps and sequence lengths and achieve different—higher—masked-word prediction accuracies. Despite these limitations it is interesting to see the results they obtained using the early version of our model. Note that some of the datasets used for evaluation by BSC are not freely available, therefore it is not possible to verify the figures.
 <figure>
+<caption>Table 1. Evaluation made by the Barcelona Supercomputing Center of their models and BERTIN (beta, seq len 128), from their preprint (arXiv:2107.07253).</caption>
 | Dataset     | Metric   | RoBERTa-b | RoBERTa-l | BETO   | mBERT  | BERTIN (beta) |
 |-------------|----------|-----------|-----------|--------|--------|--------|
 | STS         | Combined |    0.8423 |    0.8420 | 0.8216 | 0.8249 | 0.7822 |
 | MLDoc       | Accuracy |    0.9595 |    0.9600 | 0.9650 | 0.9560 | **0.9673** |
 | PAWS-X      | F1       |    0.9035 |    0.9000 | 0.8915 | 0.9020 | 0.8820 |
+| XNLI        | Accuracy |    0.8016 |       WIP | 0.8130 | 0.7876 |    WIP |
 </figure>
 | [`bertin-project/bertin-base-stepwise`](https://huggingface.co/bertin-project/bertin-base-stepwise)                | 0.6487   |
 | [`bertin-project/bertin-base-gaussian`](https://huggingface.co/bertin-project/bertin-base-gaussian)                | 0.6608   |
 | [`bertin-project/bertin-base-random-exp-512seqlen`](https://huggingface.co/bertin-project/bertin-base-random-exp-512seqlen)    | 0.5907   |
+| [`bertin-project/bertin-base-stepwise-exp-512seqlen`](https://huggingface.co/bertin-project/bertin-base-stepwise-exp-512seqlen)  | 0.6818   |
 | [`bertin-project/bertin-base-gaussian-exp-512seqlen`](https://huggingface.co/bertin-project/bertin-base-gaussian-exp-512seqlen)  | **0.6873**   |
 </figure>
 * **Stepwise**: [`bertin-project/bertin-base-stepwise`](https://huggingface.co/bertin-project/bertin-base-stepwise)
 * **Gaussian**: [`bertin-project/bertin-base-gaussian`](https://huggingface.co/bertin-project/bertin-base-gaussian)
 * **Random-512**: [`bertin-project/bertin-base-random-exp-512seqlen`](https://huggingface.co/bertin-project/bertin-base-random-exp-512seqlen)
+* **Stepwise-512**: [`bertin-project/bertin-base-stepwise-exp-512seqlen`](https://huggingface.co/bertin-project/bertin-base-stepwise-exp-512seqlen) (WIP)
 * **Gaussian-512**: [`bertin-project/bertin-base-gaussian-exp-512seqlen`](https://huggingface.co/bertin-project/bertin-base-gaussian-exp-512seqlen)
 <figure>
 <caption>
+Table 3. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS and NER is CoNLL 2002. POS and NER used max length 128 and batch size 16. Batch size for XNLI is 32 (max length 256). All models were fine-tuned for 5 epochs, with the exception of XNLI-256, which used 2 epochs. Stepwise used an older checkpoint with only 180,000 steps.
 </caption>
 |     Model    | POS (F1/Acc)         |     NER (F1/Acc)    | XNLI-256 (Acc) |
 |  Stepwise    |  0.9656 / 0.9707     | 0.8705 / 0.9809     |  0.7820        |
 |  Gaussian    |  0.9662 / 0.9709     | **0.8792 / 0.9816** |  0.7942        |
 | Random-512   |  0.9660 /  0.9707    | 0.8616 / 0.9803     |  0.7723        |
+| Stepwise-512 |        WIP           |        WIP          |  WIP           |
 | Gaussian-512 |  **0.9662 / 0.9714** | **0.8764 / 0.9819** |  0.7878        |
 </figure>
 |  Stepwise    |  0.9642 / 0.9693     | 0.8726 / 0.9818     |  0.8825*     |  0.7799    |
 |   Gaussian   |  0.9644 / 0.9692     | **0.8779 / 0.9820** |  0.8875*     |  0.7843    |
 | Random-512   |  0.9636 /  0.9690    | 0.8664 / 0.9806     |  0.6735*     |  0.7799    |
+| Stepwise-512 |        WIP           |        WIP          |  WIP         |  WIP       |
 | Gaussian-512 |  0.9646 / 0.9697     | 0.8707 / 0.9810     | **0.8965** * |  0.7843    |
 </figure>
 Results for PAWS-X seem surprising given the large differences in performance. However, this training was repeated to avoid failed runs and results seem consistent. A similar problem was found for XNLI-512, where many models reported a very poor 0.3333 accuracy (chance level for the three-class XNLI task) on a first run (and even a second, in the case of BSC-BNE). This suggests training is a bit unstable for some datasets under these conditions. Increasing the batch size and number of epochs would be a natural attempt to fix this problem; however, this was not feasible within the project schedule. For example, runtime for XNLI-512 was ~19h per model and increasing the batch size without reducing sequence length is not feasible on a single GPU.
+We are also releasing the fine-tuned models for `Gaussian`-512 and making our version [v1](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/v1) default to 128 sequence length, since it experimentally shows better performance on the fill-mask task, while also releasing the 512 sequence length version ([v1-512](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/v1-512)) for fine-tuning.
 - POS: [`bertin-project/bertin-base-pos-conll2002-es`](https://huggingface.co/bertin-project/bertin-base-pos-conll2002-es/)
 - NER: [`bertin-project/bertin-base-ner-conll2002-es`](https://huggingface.co/bertin-project/bertin-base-ner-conll2002-es/)
 Note that this analysis is slightly more difficult to do in Spanish since gender concordance reveals hints beyond masks. Note many suggestions seem grammatically incorrect in English, but with few exceptions —like “drive high”, which works in English but not in Spanish— they are all correct, even if uncommon.
+Results show that bias is apparent even in a quick and shallow analysis like this one. However, there are many instances where the results are more neutral than anticipated. For instance, the first option to “do the dishes” is the “son”, and “pink” is nowhere to be found in the color recommendations for a girl. Women seem to drive “high”, “fast”, “strong” and “well”, but “not a lot”.
 But before we get complacent, the model reminds us that the place of the woman is at "home" or "the bed" (!), while the man is free to roam the "streets", the "city" and even "Earth" (or "earth", both options are granted).
 Geographical bias
+* My **(Spain's word for) car** is a Hyundai Accent.
   (Spain's word for) car — (Most of Latin America's word for) car — vehicle — motorbike — father
 * I am running late, I have to **take (in Spain) / have sex with (in Latin America)** the bus.
 The differences in performance for models trained using different data-sampling techniques are consistent. `Gaussian`-sampling is always first (with the exception of POS-512), while `Stepwise` is better than `Random` when trained for a similar number of steps. This suggests that the sampling technique is, indeed, relevant, although a more thorough statistical analysis is still required.
+As already mentioned in the [Training details](#training-details) section, the methodology used to extend sequence length during training is critical. The `Random`-sampling model took a significant hit in performance in this process, while `Gaussian`-512 ended up with better metrics than `Gaussian`-128, in both the main masked-language task and the downstream datasets. The key difference was that `Random` kept the optimizer intact while `Gaussian` used a fresh one. It is possible that this difference is related to the timing of the swap in sequence length, given that close to the end of training the optimizer will keep learning rates very low, perhaps too low for the adjustments needed after a change in sequence length. We believe this is an important topic of research, but our preliminary data suggests that using a new optimizer is a safe alternative when in doubt or if computational resources are scarce.
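To make the learning-rate argument concrete, the following sketch compares the rate seen when the sequence length is switched late in an existing schedule with the rate available after a restart. It assumes a linear warmup followed by linear decay, and it assumes that a fresh optimizer also restarts the schedule; the text states neither. The peak rate and both warmup lengths are hypothetical values chosen only for illustration.

```python
def linear_schedule(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


PEAK_LR = 6e-4          # hypothetical peak learning rate
WARMUP = 24_000         # hypothetical warmup length for the original run
TOTAL = 250_000         # total training steps, as stated in the text

# Optimizer state kept: the switch to 512 happens for the last 25k steps,
# so training continues on the tail of the original schedule.
lr_resumed = linear_schedule(TOTAL - 25_000, PEAK_LR, WARMUP, TOTAL)

# Fresh optimizer with a restarted schedule over roughly 50k extra steps.
FRESH_TOTAL = 50_000    # "a few more steps (~50k)" from the text
FRESH_WARMUP = 5_000    # hypothetical warmup for the restarted schedule
lr_fresh = linear_schedule(FRESH_WARMUP, PEAK_LR, FRESH_WARMUP, FRESH_TOTAL)

ratio = lr_resumed / lr_fresh
```

Under these assumptions the resumed run adapts to the longer sequences with a learning rate that is a small fraction of the restarted one, which is consistent with the explanation offered above but does not by itself confirm it.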
 # Lessons and next steps