versae committed on
Commit
8040c8b
1 Parent(s): a7e270d
Files changed (1)
  1. README.md +15 -11
README.md CHANGED
@@ -52,7 +52,7 @@ $ zcat c4/multilingual/c4-es*.tfrecord-*.json.gz | jq -r '.text | split(" ") | l
52
 
53
  ## Perplexity sampling
54
 
55
- The large amount of text in mC4-es makes training a language model within the time constraints of the Flax/JAX Community Event problematic. This motivated the exploration of sampling methods, with the goal of creating a subset of the dataset that would allow for the taining of well-performing models with roughly one eighth of the data (~50M samples) and at approximately half the training steps.
56
 
57
  In order to efficiently build this subset of data, we decided to leverage a technique we call *perplexity sampling*, and whose origin can be traced to the construction of CCNet (Wenzek et al., 2020) and their high quality monolingual datasets from web-crawl data. In their work, they suggest the possibility of applying fast language models trained on high-quality data such as Wikipedia to filter out texts that deviate too much from correct expressions of a language (see Figure 1). They also released Kneser-Ney models (Ney et al., 1994) for 100 languages (Spanish included) as implemented in the KenLM library (Heafield, 2011) and trained on their respective Wikipedias.
58
 
@@ -140,9 +140,9 @@ Although this is not a comprehensive analysis, we looked into the distribution o
140
 
141
  We then used the same setup and hyperparameters as [Liu et al. (2019)](https://arxiv.org/abs/1907.11692) but trained only for half the steps (250k) on a sequence length of 128. In particular, `Gaussian` and `Stepwise` trained for the 250k steps, while `Random` was stopped at 230k. `Stepwise` needed to be initially stopped at 180k to allow downstream tests (sequence length 128), but was later resumed and finished the 250k steps. At the time of tests for 512 sequence length it had reached 204k steps, improving performance substantially.
142
 
143
- Then, we continued training the most promising models for a few more steps (~50k) on sequence length 512 from the previous checkpoints on 128 sequence length at 230k steps. We tried two strategies for this, since it is not easy to find clear details about how to procede in the literature. It turns out this decision had a big impact in the final performance.
144
 
145
- For `Random` sampling we trained with seq len 512 during the last 25k steps of the 250k training steps, keeping the optimizer state intact. Results for this are underwhelming, as seen in Figure 7.
146
 
147
  <figure>
148
 
@@ -159,13 +159,13 @@ Batch size was 2048 (8 TPU cores \* 256 batch size) for training with 128 sequen
159
 
160
  Please refer to the **evaluation** folder for training scripts for downstream tasks.
161
 
162
- Our first test, tagged [`beta`](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/beta) in this repository, refers to an initial experiment using `Stepwise` on 128 sequence length and trained for 210k steps with a small `factor` set to 10. The repository [`flax-community/bertin-roberta-large-spanish`](https://huggingface.co/flax-community/bertin-roberta-large-spanish) containes a nearly identical version but it is now discontinued). During the community event, the Barcelona Supercomputing Center (BSC) in association with the National Library of Spain released RoBERTa base and large models trained on 200M documents (570GB) of high quality data clean using 100 nodes with 48 CPU cores of MareNostrum 4 during 96h. At the end of the process they were left with 2TB of clean data at the document level that were further cleaned up to the final 570GB. This is an interesting contrast to our own resources (3 TPUv3-8 for 10 days to do cleaning, sampling, taining, and evaluation) and makes for a valuable reference. The BSC team evaluated our early release of the model [`beta`](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/beta) and the results can be seen in Table 1.
163
 
164
  Our final models were trained on a different number of steps and sequence lengths and achieve different—higher—masked-word prediction accuracies. Despite these limitations it is interesting to see the results they obtained using the early version of our model. Note that some of the datasets used for evaluation by BSC are not freely available, therefore it is not possible to verify the figures.
165
 
166
  <figure>
167
 
168
- <caption>Table 1. Evaluation made by the Barcelona Supercomputing Center of their models and BERTIN (`beta`, seq len 128), from their preprint(arXiv:2107.07253).</caption>
169
 
170
  | Dataset | Metric | RoBERTa-b | RoBERTa-l | BETO | mBERT | BERTIN (beta) |
171
  |-------------|----------|-----------|-----------|--------|--------|--------|
@@ -176,7 +176,7 @@ Our final models were trained on a different number of steps and sequence length
176
  | STS | Combined | 0.8423 | 0.8420 | 0.8216 | 0.8249 | 0.7822 |
177
  | MLDoc | Accuracy | 0.9595 | 0.9600 | 0.9650 | 0.9560 | **0.9673** |
178
  | PAWS-X | F1 | 0.9035 | 0.9000 | 0.8915 | 0.9020 | 0.8820 |
179
- | XNLI | Accuracy | 0.8016 | WiP | 0.8130 | 0.7876 | WiP |
180
 
181
  </figure>
182
 
@@ -193,6 +193,7 @@ All of our models attained good accuracy values during training in the masked-la
193
  | [`bertin-project/bertin-base-stepwise`](https://huggingface.co/bertin-project/bertin-base-stepwise) | 0.6487 |
194
  | [`bertin-project/bertin-base-gaussian`](https://huggingface.co/bertin-project/bertin-base-gaussian) | 0.6608 |
195
  | [`bertin-project/bertin-base-random-exp-512seqlen`](https://huggingface.co/bertin-project/bertin-base-random-exp-512seqlen) | 0.5907 |
 
196
  | [`bertin-project/bertin-base-gaussian-exp-512seqlen`](https://huggingface.co/bertin-project/bertin-base-gaussian-exp-512seqlen) | **0.6873** |
197
 
198
  </figure>
@@ -209,12 +210,13 @@ For simplicity, we will abbreviate the different models as follows:
209
  * **Stepwise**: [`bertin-project/bertin-base-stepwise`](https://huggingface.co/bertin-project/bertin-base-stepwise)
210
  * **Gaussian**: [`bertin-project/bertin-base-gaussian`](https://huggingface.co/bertin-project/bertin-base-gaussian)
211
  * **Random-512**: [`bertin-project/bertin-base-random-exp-512seqlen`](https://huggingface.co/bertin-project/bertin-base-random-exp-512seqlen)
 
212
  * **Gaussian-512**: [`bertin-project/bertin-base-gaussian-exp-512seqlen`](https://huggingface.co/bertin-project/bertin-base-gaussian-exp-512seqlen)
213
 
214
  <figure>
215
 
216
  <caption>
217
- Table 3. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS and NER is CoNLL 2002. POS and NER used max length 128 and batch size 16. Batch size for XNLI is 32 (max length 256). All models were fine-tuned for 5 epochs, with the exception fo XNLI-256 that used 2 epochs. Stepwise used an older checkpoint with only 180.000 steps.
218
  </caption>
219
 
220
  | Model | POS (F1/Acc) | NER (F1/Acc) | XNLI-256 (Acc) |
@@ -227,6 +229,7 @@ Table 3. Metrics for different downstream tasks, comparing our different models
227
  | Stepwise | 0.9656 / 0.9707 | 0.8705 / 0.9809 | 0.7820 |
228
  | Gaussian | 0.9662 / 0.9709 | **0.8792 / 0.9816** | 0.7942 |
229
  | Random-512 | 0.9660 / 0.9707 | 0.8616 / 0.9803 | 0.7723 |
 
230
  | Gaussian-512 | **0.9662 / 0.9714** | **0.8764 / 0.9819** | 0.7878 |
231
 
232
  </figure>
@@ -244,6 +247,7 @@ Table 4. Metrics for different downstream tasks, comparing our different models
244
  | Stepwise | 0.9642 / 0.9693 | 0.8726 / 0.9818 | 0.8825* | 0.7799 |
245
  | Gaussian | 0.9644 / 0.9692 | **0.8779 / 0.9820** | 0.8875* | 0.7843 |
246
  | Random-512 | 0.9636 / 0.9690 | 0.8664 / 0.9806 | 0.6735* | 0.7799 |
 
247
  | Gaussian-512 | 0.9646 / 0.9697 | 0.8707 / 0.9810 | **0.8965** * | 0.7843 |
248
 
249
  </figure>
@@ -252,7 +256,7 @@ In addition to the tasks above, we also trained the [`beta`](https://huggingface
252
 
253
  Results for PAWS-X seem surprising given the large differences in performance. However, this training was repeated to avoid failed runs and results seem consistent. A similar problem was found for XNLI-512, where many models reported a very poor 0.3333 accuracy on a first run (and even a second, in the case of BSC-BNE). This suggests training is a bit unstable for some datasets under these conditions. Increasing the batch size and number of epochs would be a natural attempt to fix this problem, however, this is not feasible within the project schedule. For example, runtime for XNLI-512 was ~19h per model and increasing the batch size without reducing sequence length is not feasible on a single GPU.
254
 
255
- We are also releasing the fine-tuned models for `Gaussian`-512 and making it our version v1 default to 128 sequence length since it experimentally shows better performance on fill-mask task, while alse releasing the 512 sequence length version (v1-512) for fine-tuning.
256
 
257
  - POS: [`bertin-project/bertin-base-pos-conll2002-es`](https://huggingface.co/bertin-project/bertin-base-pos-conll2002-es/)
258
  - NER: [`bertin-project/bertin-base-ner-conll2002-es`](https://huggingface.co/bertin-project/bertin-base-ner-conll2002-es/)
@@ -269,7 +273,7 @@ Even if a rigorous analysis of bias is difficult, we should not use that excuse
269
 
270
  Note that this analysis is slightly more difficult to do in Spanish since gender concordance reveals hints beyond masks. Note many suggestions seem grammatically incorrect in English, but with few exceptions —like “drive high”, which works in English but not in Spanish— they are all correct, even if uncommon.
271
 
272
- Results show that bias is apparent even in a quick and shallow analysis like this one. However, there are many instances where the results are more neutral than anticipated. For instance, the first option to “do the dishes” is the “son”, and “pink” is nowhere to be found in the colour recommendations for a girl. Women seem to drive “high”, “fast”, “strong” and “well”, but “not a lot”.
273
 
274
  But before we get complacent, the model reminds us that the place of the woman is at "home" or "the bed" (!), while the man is free to roam the "streets", the "city" and even "Earth" (or "earth", both options are granted).
275
 
@@ -401,7 +405,7 @@ On race and origin
401
 
402
  Geographical bias
403
 
404
- * My **(Spain's word for) car** is a un Hyundai Accent.
405
  (Spain's word for) car — (Most of Latin America's word for) car — vehicle — motorbike — father
406
 
407
  * I am running late, I have to **take (in Spain) / have sex with (in Latin America)** the bus.
@@ -424,7 +428,7 @@ Our analysis of downstream tasks is not yet complete. It should be stressed that
424
 
425
  The differences in performance for models trained using different data-sampling techniques are consistent. `Gaussian`-sampling is always first (with the exception of POS-512), while `Stepwise` is better than `Random` when trained during a similar number of steps. This proves that the sampling technique is, indeed, relevant. A more thorough statistical analysis is still required.
426
 
427
- As already mentiond in the [Training details](#training-details) section, the methodology used to extend sequence length during training is critical. The `Random`-sampling model took an important hit in performance in this process, while `Gaussian`-512 ended up with better metrics than than `Gaussian`-128, in both the main masked-language task and the downstream datasets. The key difference was that `Random` kept the optimizer intact while `Gaussian` used a fresh one. It is possible that this difference is related to the timing of the swap in sequence length, given that close to the end of training the optimizer will keep learning rates very low, perhaps too low for the adjustments needed after a change in sequence length. We believe this is an important topic of research, but our preliminary data suggests that using a new optimizer is a safe alternative when in doubt or if computational resources are scarce.
428
 
429
  # Lessons and next steps
430
 
 
52
 
53
  ## Perplexity sampling
54
 
55
+ The large amount of text in mC4-es makes training a language model within the time constraints of the Flax/JAX Community Event problematic. This motivated the exploration of sampling methods, with the goal of creating a subset of the dataset that would allow for the training of well-performing models with roughly one eighth of the data (~50M samples) and at approximately half the training steps.
56
 
57
  In order to efficiently build this subset of data, we decided to leverage a technique we call *perplexity sampling*, and whose origin can be traced to the construction of CCNet (Wenzek et al., 2020) and their high quality monolingual datasets from web-crawl data. In their work, they suggest the possibility of applying fast language models trained on high-quality data such as Wikipedia to filter out texts that deviate too much from correct expressions of a language (see Figure 1). They also released Kneser-Ney models (Ney et al., 1994) for 100 languages (Spanish included) as implemented in the KenLM library (Heafield, 2011) and trained on their respective Wikipedias.
58
 
 
140
 
141
  We then used the same setup and hyperparameters as [Liu et al. (2019)](https://arxiv.org/abs/1907.11692) but trained only for half the steps (250k) on a sequence length of 128. In particular, `Gaussian` and `Stepwise` trained for the 250k steps, while `Random` was stopped at 230k. `Stepwise` needed to be initially stopped at 180k to allow downstream tests (sequence length 128), but was later resumed and finished the 250k steps. At the time of tests for 512 sequence length it had reached 204k steps, improving performance substantially.
142
 
143
+ Then, we continued training the most promising models for a few more steps (~50k) on sequence length 512 from the previous checkpoints on 128 sequence length at 230k steps. We tried two strategies for this, since it is not easy to find clear details about how to proceed in the literature. It turns out this decision had a big impact in the final performance.
144
 
145
+ For `Random` sampling we trained with sequence length 512 during the last 25k steps of the 250k training steps, keeping the optimizer state intact. Results for this are underwhelming, as seen in Figure 7.
146
 
147
  <figure>
148
 
 
159
 
160
  Please refer to the **evaluation** folder for training scripts for downstream tasks.
161
 
162
+ Our first test, tagged [`beta`](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/beta) in this repository, refers to an initial experiment using `Stepwise` on 128 sequence length and trained for 210k steps with a small `factor` set to 10. (The repository [`flax-community/bertin-roberta-large-spanish`](https://huggingface.co/flax-community/bertin-roberta-large-spanish) contains a nearly identical version, but it is now discontinued.) During the community event, the Barcelona Supercomputing Center (BSC) in association with the National Library of Spain released RoBERTa base and large models trained on 200M documents (570GB) of high-quality data, cleaned using 100 nodes with 48 CPU cores of MareNostrum 4 over 96h. At the end of the process they were left with 2TB of clean data at the document level that were further cleaned up to the final 570GB. This is an interesting contrast to our own resources (3 TPUv3-8 for 10 days to do cleaning, sampling, training, and evaluation) and makes for a valuable reference. The BSC team evaluated our early release of the model [`beta`](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/beta) and the results can be seen in Table 1.
163
 
164
  Our final models were trained on a different number of steps and sequence lengths and achieve different—higher—masked-word prediction accuracies. Despite these limitations it is interesting to see the results they obtained using the early version of our model. Note that some of the datasets used for evaluation by BSC are not freely available, therefore it is not possible to verify the figures.
165
 
166
  <figure>
167
 
168
+ <caption>Table 1. Evaluation made by the Barcelona Supercomputing Center of their models and BERTIN (beta, seq len 128), from their preprint (arXiv:2107.07253).</caption>
169
 
170
  | Dataset | Metric | RoBERTa-b | RoBERTa-l | BETO | mBERT | BERTIN (beta) |
171
  |-------------|----------|-----------|-----------|--------|--------|--------|
 
176
  | STS | Combined | 0.8423 | 0.8420 | 0.8216 | 0.8249 | 0.7822 |
177
  | MLDoc | Accuracy | 0.9595 | 0.9600 | 0.9650 | 0.9560 | **0.9673** |
178
  | PAWS-X | F1 | 0.9035 | 0.9000 | 0.8915 | 0.9020 | 0.8820 |
179
+ | XNLI | Accuracy | 0.8016 | WIP | 0.8130 | 0.7876 | WIP |
180
 
181
  </figure>
182
 
 
193
  | [`bertin-project/bertin-base-stepwise`](https://huggingface.co/bertin-project/bertin-base-stepwise) | 0.6487 |
194
  | [`bertin-project/bertin-base-gaussian`](https://huggingface.co/bertin-project/bertin-base-gaussian) | 0.6608 |
195
  | [`bertin-project/bertin-base-random-exp-512seqlen`](https://huggingface.co/bertin-project/bertin-base-random-exp-512seqlen) | 0.5907 |
196
+ | [`bertin-project/bertin-base-stepwise-exp-512seqlen`](https://huggingface.co/bertin-project/bertin-base-stepwise-exp-512seqlen) | 0.6818 |
197
  | [`bertin-project/bertin-base-gaussian-exp-512seqlen`](https://huggingface.co/bertin-project/bertin-base-gaussian-exp-512seqlen) | **0.6873** |
198
 
199
  </figure>
 
210
  * **Stepwise**: [`bertin-project/bertin-base-stepwise`](https://huggingface.co/bertin-project/bertin-base-stepwise)
211
  * **Gaussian**: [`bertin-project/bertin-base-gaussian`](https://huggingface.co/bertin-project/bertin-base-gaussian)
212
  * **Random-512**: [`bertin-project/bertin-base-random-exp-512seqlen`](https://huggingface.co/bertin-project/bertin-base-random-exp-512seqlen)
213
+ * **Stepwise-512**: [`bertin-project/bertin-base-stepwise-exp-512seqlen`](https://huggingface.co/bertin-project/bertin-base-stepwise-exp-512seqlen) (WIP)
214
  * **Gaussian-512**: [`bertin-project/bertin-base-gaussian-exp-512seqlen`](https://huggingface.co/bertin-project/bertin-base-gaussian-exp-512seqlen)
215
 
216
  <figure>
217
 
218
  <caption>
219
+ Table 3. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS and NER is CoNLL 2002. POS and NER used max length 128 and batch size 16. Batch size for XNLI is 32 (max length 256). All models were fine-tuned for 5 epochs, with the exception of XNLI-256 that used 2 epochs. Stepwise used an older checkpoint with only 180.000 steps.
220
  </caption>
221
 
222
  | Model | POS (F1/Acc) | NER (F1/Acc) | XNLI-256 (Acc) |
 
229
  | Stepwise | 0.9656 / 0.9707 | 0.8705 / 0.9809 | 0.7820 |
230
  | Gaussian | 0.9662 / 0.9709 | **0.8792 / 0.9816** | 0.7942 |
231
  | Random-512 | 0.9660 / 0.9707 | 0.8616 / 0.9803 | 0.7723 |
232
+ | Stepwise-512 | WIP | WIP | WIP |
233
  | Gaussian-512 | **0.9662 / 0.9714** | **0.8764 / 0.9819** | 0.7878 |
234
 
235
  </figure>
 
247
  | Stepwise | 0.9642 / 0.9693 | 0.8726 / 0.9818 | 0.8825* | 0.7799 |
248
  | Gaussian | 0.9644 / 0.9692 | **0.8779 / 0.9820** | 0.8875* | 0.7843 |
249
  | Random-512 | 0.9636 / 0.9690 | 0.8664 / 0.9806 | 0.6735* | 0.7799 |
250
+ | Stepwise-512 | WIP | WIP | WIP | WIP |
251
  | Gaussian-512 | 0.9646 / 0.9697 | 0.8707 / 0.9810 | **0.8965** * | 0.7843 |
252
 
253
  </figure>
 
256
 
257
  Results for PAWS-X seem surprising given the large differences in performance. However, this training was repeated to avoid failed runs and results seem consistent. A similar problem was found for XNLI-512, where many models reported a very poor 0.3333 accuracy on a first run (and even a second, in the case of BSC-BNE). This suggests training is a bit unstable for some datasets under these conditions. Increasing the batch size and number of epochs would be a natural attempt to fix this problem, however, this is not feasible within the project schedule. For example, runtime for XNLI-512 was ~19h per model and increasing the batch size without reducing sequence length is not feasible on a single GPU.
258
 
259
+ We are also releasing the fine-tuned models for `Gaussian`-512 and making it our version [v1](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/v1) default to 128 sequence length since it experimentally shows better performance on the fill-mask task, while also releasing the 512 sequence length version ([v1-512](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/v1-512)) for fine-tuning.
260
 
261
  - POS: [`bertin-project/bertin-base-pos-conll2002-es`](https://huggingface.co/bertin-project/bertin-base-pos-conll2002-es/)
262
  - NER: [`bertin-project/bertin-base-ner-conll2002-es`](https://huggingface.co/bertin-project/bertin-base-ner-conll2002-es/)
 
273
 
274
  Note that this analysis is slightly more difficult to do in Spanish since gender concordance reveals hints beyond masks. Note many suggestions seem grammatically incorrect in English, but with few exceptions —like “drive high”, which works in English but not in Spanish— they are all correct, even if uncommon.
275
 
276
+ Results show that bias is apparent even in a quick and shallow analysis like this one. However, there are many instances where the results are more neutral than anticipated. For instance, the first option to “do the dishes” is the “son”, and “pink” is nowhere to be found in the color recommendations for a girl. Women seem to drive “high”, “fast”, “strong” and “well”, but “not a lot”.
277
 
278
  But before we get complacent, the model reminds us that the place of the woman is at "home" or "the bed" (!), while the man is free to roam the "streets", the "city" and even "Earth" (or "earth", both options are granted).
279
 
 
405
 
406
  Geographical bias
407
 
408
+ * My **(Spain's word for) car** is a Hyundai Accent.
409
  (Spain's word for) car — (Most of Latin America's word for) car — vehicle — motorbike — father
410
 
411
  * I am running late, I have to **take (in Spain) / have sex with (in Latin America)** the bus.
 
428
 
429
  The differences in performance for models trained using different data-sampling techniques are consistent. `Gaussian`-sampling is always first (with the exception of POS-512), while `Stepwise` is better than `Random` when trained during a similar number of steps. This proves that the sampling technique is, indeed, relevant. A more thorough statistical analysis is still required.
430
 
431
+ As already mentioned in the [Training details](#training-details) section, the methodology used to extend sequence length during training is critical. The `Random`-sampling model took an important hit in performance in this process, while `Gaussian`-512 ended up with better metrics than `Gaussian`-128, in both the main masked-language task and the downstream datasets. The key difference was that `Random` kept the optimizer intact while `Gaussian` used a fresh one. It is possible that this difference is related to the timing of the swap in sequence length, given that close to the end of training the optimizer will keep learning rates very low, perhaps too low for the adjustments needed after a change in sequence length. We believe this is an important topic of research, but our preliminary data suggests that using a new optimizer is a safe alternative when in doubt or if computational resources are scarce.
432
 
433
  # Lessons and next steps
434
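
The README text in this diff suggests that keeping the optimizer intact near the end of training may leave learning rates too low to adapt to sequence length 512. The following is a minimal sketch of that intuition only, not the project's training code. It assumes a linear warmup and decay schedule, and the peak learning rate and warmup lengths are illustrative values chosen here, not taken from the training scripts. The step counts (250k total, switch for the last 25k steps, ~50k extra steps) come from the text. A fresh optimizer also resets its moment estimates, which this sketch does not model.

```python
def linear_schedule(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Assumed hyperparameters, for illustration only.
PEAK_LR = 6e-4
WARMUP = 24_000
FRESH_WARMUP = 500

# Step counts mentioned in the README text.
TOTAL = 250_000        # full run at sequence length 128
KEPT_SWITCH = 225_000  # `Random`: last 25k steps at sequence length 512
EXTRA = 50_000         # `Gaussian`: ~50k extra steps at sequence length 512


def kept_optimizer_lr(step_after_switch):
    """Strategy A: keep the optimizer and continue the original schedule."""
    return linear_schedule(KEPT_SWITCH + step_after_switch, PEAK_LR, WARMUP, TOTAL)


def fresh_optimizer_lr(step_after_switch):
    """Strategy B: start a new schedule for the extra steps."""
    return linear_schedule(step_after_switch, PEAK_LR, FRESH_WARMUP, EXTRA)


if __name__ == "__main__":
    for s in (0, 500, 5_000, 20_000):
        print(f"{s:>6}  kept={kept_optimizer_lr(s):.2e}  fresh={fresh_optimizer_lr(s):.2e}")
```

Under these assumptions, the kept schedule starts the 512 phase at a small fraction of the peak rate and decays to zero, while the fresh schedule returns to the peak rate shortly after the switch.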