Paulo committed
Commit aab119d
2 Parent(s): 9caea08 1b4ae8e

Merge branch 'main' of https://huggingface.co/bertin-project/bertin-roberta-base-spanish into main

README.md CHANGED
@@ -143,7 +143,7 @@ We then used the same setup and hyperparameters as [Liu et al. (2019)](https://a
143
 
144
Then, we continued training the most promising model for ~25k more steps on sequence length 512. We tried two strategies for this, since it is not easy to find clear details about this change in the literature. It turns out this decision had a big impact on the final performance.
145
 
146
- For `Random` sampling we trained with seq len 512 during the last 20k steps of the 250 training steps, keeping the optimizer state intact. Results for this are underwhelming, as seen in Figure 7:
147
 
148
  <figure>
149
 
@@ -154,10 +154,12 @@ For `Random` sampling we trained with seq len 512 during the last 20k steps of t
154
 
155
For `Gaussian` sampling we started a new optimizer after 230k steps at 128 sequence length, using a short warmup interval. Results are much better using this procedure. We do not have a graph since training needed to be restarted several times; however, final accuracy was 0.6873, compared to 0.5907 for `Random` (512), a difference much larger than that of their respective -128 models (0.6520 for `Random`, 0.6608 for `Gaussian`).
156
 
157
- Batch size was 2048 for training with 128 sequence length, and 384 for 512 sequence length, with no change in learning rate. Warmup steps for 512 was 500.
158
 
159
  ## Results
160
 
 
 
161
Our first test, tagged `beta` in this repository, refers to an initial experiment using `Stepwise` on 128 sequence length, trained for 210k steps. Two nearly identical versions of this model can be found, one at **bertin-roberta-base-spanish** and the other at **flax-community/bertin-roberta-large-spanish** (do note this is **not our best model**!). During the community event, the Barcelona Supercomputing Center (BSC), in association with the National Library of Spain, released RoBERTa base and large models trained on 200M documents (570GB) of high-quality data cleaned using 100 nodes with 48 CPU cores of MareNostrum 4 during 96 hours. At the end of the process they were left with 2TB of clean data at the document level, which was further cleaned up to the final 570GB. This is an interesting contrast to our own resources (3xTPUv3-8 for 10 days to do cleaning, sampling, training, and evaluation) and makes for a valuable reference. The BSC team evaluated our early release of the model `beta`, and the results can be seen in Table 1.
162
 
163
Our final models were trained on a different number of steps and sequence lengths, and achieve different—higher—masked-word prediction accuracies. Despite these limitations, it is interesting to see the results they obtained using the early version of our model. Note that some of the datasets used for evaluation by BSC are not freely available, so it is not possible to verify the figures.
@@ -213,20 +215,37 @@ For simplicity, we will abbreviate the different models as follows:
213
  <figure>
214
 
215
  <caption>
216
- Table 3. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS and NER is CoNLL 2002. POS, NER and PAWS-X used max length 512 and batch size 8. Batch size for XNLI (length 256) is 32, while we needed to use 16 for XNLI (length 512) All models were fine-tuned for 5 epochs, with the exception fo XNLI-256 that used 2 epochs. Results marked with * indicate a repetition.
217
  </caption>
218
 
219
- | Model | POS (F1/Acc) | NER (F1/Acc) | PAWS-X (Acc) | XNLI-256 (Acc) | XNLI-512 (Acc) |
220
- |--------------|----------------------|---------------------|--------------|----------------|--------------|
221
- | BERT-m | 0.9629 / 0.9687 | 0.8539 / 0.9779 | 0.5765* | 0.7852 | 0.7606 |
222
- | BERT-wwm | 0.9642 / 0.9700 | 0.8579 / 0.9783 | 0.8720* | **0.8186** | **0.8012** |
223
- | BSC-BNE | 0.9659 / 0.9707 | 0.8700 / 0.9807 | 0.5765* | 0.8178 | 0.3333* |
224
- | Beta | 0.9638 / 0.9690 | 0.8725 / 0.9812 | 0.5765* | — | 0.7751* |
225
- | Random | 0.9656 / 0.9704 | 0.8704 / 0.9807 | 0.8800* | 0.7745 | 0.7795 |
226
- | Stepwise | 0.9656 / 0.9707 | 0.8705 / 0.9809 | 0.8825* | 0.7820 | 0.7799 |
227
- | Gaussian | 0.9662 / 0.9709 | **0.8792 / 0.9816** | 0.8875* | 0.7942 | 0.7843 |
228
- | Random-512 | 0.9660 / 0.9707 | 0.8616 / 0.9803 | 0.6735* | 0.7723 | 0.7799 |
229
- | Gaussian-512 | **0.9662 / 0.9714** | **0.8764 / 0.9819** | **0.8965** * | 0.7878 | 0.7843 |
230
 
231
  </figure>
232
 
@@ -377,9 +396,9 @@ Geographical bias
377
 
378
The performance of our models has been, in general, very good. Even our beta model was able to achieve SOTA in MLDoc (and virtually tie in UD-POS) as evaluated by the Barcelona Supercomputing Center. In the main masked-language task our models reach accuracies between 0.65 and 0.69, which bodes well for downstream tasks.
379
 
380
- Our analysis of downstream tasks is not yet complete. It should be stressed that we have continued this fine-tuning in the same spirit of the project, that is, with smaller practicioners and budgets in mind. Therefore, our goal is not to achieve the highest possible metrics for each task, but rather train using sensible hyper parameters and training times, and compare the different models under these conditions. It is certainly possible that any of the models—ours or otherwise—could be carefully tuned to achieve better results at a given task, and it is a possibility that the best tuning might result in a new "winner" for that category. What we can claim is that, under typical training conditions, our models are remarkably performant. In particular, Gaussian-512 is clearly superior, taking the lead in three of the four tasks analysed.
381
 
382
- The differences in performance for models trained using different data-sampling techniques are consistent. Gaussian-sampling is always first, while Stepwise is only marginally better than Random. This proves that the sampling technique is, indeed, relevant.
383
 
384
As already mentioned in the Training details section, the methodology used to extend sequence length during training is critical. The Random-sampling model took an important hit in performance in this process, while Gaussian-512 ended up with better metrics than Gaussian-128, in both the main masked-language task and the downstream datasets. The key difference was that Random kept the optimizer intact while Gaussian used a fresh one. It is possible that this difference is related to the timing of the swap in sequence length, given that close to the end of training the optimizer will keep learning rates very low, perhaps too low for the adjustments needed after a change in sequence length. We believe this is an important topic of research, but our preliminary data suggest that using a new optimizer is a safe alternative when in doubt or if computational resources are scarce.
385
 
 
143
 
144
Then, we continued training the most promising model for ~25k more steps on sequence length 512. We tried two strategies for this, since it is not easy to find clear details about this change in the literature. It turns out this decision had a big impact on the final performance.
145
 
146
+ For `Random` sampling we trained with sequence length 512 during the last ~20k of the ~250k training steps, keeping the optimizer state intact. Results for this are underwhelming, as seen in Figure 7:
147
 
148
  <figure>
149
 
 
154
 
155
For `Gaussian` sampling we started a new optimizer after 230k steps at 128 sequence length, using a short warmup interval. Results are much better using this procedure. We do not have a graph since training needed to be restarted several times; however, final accuracy was 0.6873, compared to 0.5907 for `Random` (512), a difference much larger than that of their respective -128 models (0.6520 for `Random`, 0.6608 for `Gaussian`).
156
 
157
+ Batch size was 256 for training with sequence length 128, and 48 for sequence length 512, with no change in learning rate. The 512-length run used 500 warmup steps.
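As an illustration only, restarting the optimizer while keeping the trained parameters can be sketched with optax along the following lines. It mirrors the AdamW and linear warmup/decay setup in `run_mlm_flax_stream.py`; the learning rate, Adam betas and weight decay below are placeholders rather than the exact values used.

```python
import optax
from flax.training import train_state
from transformers import FlaxRobertaForMaskedLM

# Parameters are reused from the 128-length checkpoint; the optimizer state is not.
model = FlaxRobertaForMaskedLM.from_pretrained("bertin-project/bertin-base-gaussian")

peak_lr = 6e-4        # placeholder; the text only states the learning rate was left unchanged
warmup_steps = 500    # short warmup used for the 512-length phase
total_steps = 25_000  # roughly the extra steps trained at sequence length 512

# Linear warmup followed by linear decay.
warmup_fn = optax.linear_schedule(init_value=0.0, end_value=peak_lr, transition_steps=warmup_steps)
decay_fn = optax.linear_schedule(init_value=peak_lr, end_value=0.0, transition_steps=total_steps - warmup_steps)
schedule_fn = optax.join_schedules(schedules=[warmup_fn, decay_fn], boundaries=[warmup_steps])

# A brand-new AdamW means brand-new first/second-moment estimates (the "fresh optimizer").
adamw = optax.adamw(learning_rate=schedule_fn, b1=0.9, b2=0.98, eps=1e-8, weight_decay=0.01)
state = train_state.TrainState.create(apply_fn=model.__call__, params=model.params, tx=adamw)
```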
158
 
159
  ## Results
160
 
161
+ Please refer to the **evaluation** folder for training scripts for downstream tasks.
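The configurations in that folder are Weights & Biases sweep files. A minimal sketch for registering one of them, assuming a logged-in `wandb` client:

```python
import yaml
import wandb

# Register the NER/POS sweep defined in evaluation/token.yaml; each agent started
# for the returned sweep id runs one grid point (model x task) via run_ner.py.
with open("evaluation/token.yaml") as f:
    sweep_config = yaml.safe_load(f)

sweep_id = wandb.sweep(sweep_config, project="bertin-eval")
# Agents are then started with: wandb agent <entity>/bertin-eval/<sweep_id>
print(sweep_id)
```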
162
+
163
Our first test, tagged `beta` in this repository, refers to an initial experiment using `Stepwise` on 128 sequence length, trained for 210k steps. Two nearly identical versions of this model can be found, one at **bertin-roberta-base-spanish** and the other at **flax-community/bertin-roberta-large-spanish** (do note this is **not our best model**!). During the community event, the Barcelona Supercomputing Center (BSC), in association with the National Library of Spain, released RoBERTa base and large models trained on 200M documents (570GB) of high-quality data cleaned using 100 nodes with 48 CPU cores of MareNostrum 4 during 96 hours. At the end of the process they were left with 2TB of clean data at the document level, which was further cleaned up to the final 570GB. This is an interesting contrast to our own resources (3xTPUv3-8 for 10 days to do cleaning, sampling, training, and evaluation) and makes for a valuable reference. The BSC team evaluated our early release of the model `beta`, and the results can be seen in Table 1.
164
 
165
Our final models were trained on a different number of steps and sequence lengths, and achieve different—higher—masked-word prediction accuracies. Despite these limitations, it is interesting to see the results they obtained using the early version of our model. Note that some of the datasets used for evaluation by BSC are not freely available, so it is not possible to verify the figures.
 
215
  <figure>
216
 
217
  <caption>
218
+ Table 3. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. The dataset for POS and NER is CoNLL 2002. POS and NER used max length 128 and batch size 128. Batch size for XNLI (max length 256) was 256. All models were fine-tuned for 5 epochs, with the exception of XNLI-256, which used 2 epochs. Stepwise used an older checkpoint with only 180,000 steps.
219
+ </caption>
220
+
221
+ | Model | POS (F1/Acc) | NER (F1/Acc) | XNLI-256 (Acc) |
222
+ |--------------|----------------------|---------------------|----------------|
223
+ | BERT-m | 0.9629 / 0.9687 | 0.8539 / 0.9779 | 0.7852 |
224
+ | BERT-wwm | 0.9642 / 0.9700 | 0.8579 / 0.9783 | **0.8186** |
225
+ | BSC-BNE | 0.9659 / 0.9707 | 0.8700 / 0.9807 | 0.8178 |
226
+ | Beta | 0.9638 / 0.9690 | 0.8725 / 0.9812 | — |
227
+ | Random | 0.9656 / 0.9704 | 0.8704 / 0.9807 | 0.7745 |
228
+ | Stepwise | 0.9656 / 0.9707 | 0.8705 / 0.9809 | 0.7820 |
229
+ | Gaussian | 0.9662 / 0.9709 | **0.8792 / 0.9816** | 0.7942 |
230
+ | Random-512 | 0.9660 / 0.9707 | 0.8616 / 0.9803 | 0.7723 |
231
+ | Gaussian-512 | **0.9662 / 0.9714** | **0.8764 / 0.9819** | 0.7878 |
232
+
233
+ </figure>
234
+
235
+ <figure>
+
+ <caption>
+ Table 4. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. The dataset for POS and NER is CoNLL 2002. POS, NER and PAWS-X used max length 512 and batch size 128. Batch size for XNLI (max length 512) was 128. All models were fine-tuned for 5 epochs. Results marked with * indicate a repetition. The Stepwise checkpoint had 204,000 steps during these tests.
236
  </caption>
237
 
238
+ | Model | POS (F1/Acc) | NER (F1/Acc) | PAWS-X (Acc) | XNLI (Acc) |
239
+ |--------------|----------------------|---------------------|--------------|------------|
240
+ | BERT-m | 0.9630 / 0.9689 | 0.8616 / 0.9790 | 0.5765* | 0.7606 |
241
+ | BERT-wwm | 0.9639 / 0.9693 | 0.8596 / 0.9790 | 0.8720* | **0.8012** |
242
+ | BSC-BNE | **0.9655 / 0.9706** | 0.8764 / 0.9818 | 0.5765* | 0.3333* |
243
+ | Beta | 0.9616 / 0.9669 | 0.8640 / 0.9799 | 0.5765* | 0.7751* |
244
+ | Random | 0.9651 / 0.9700 | 0.8638 / 0.9802 | 0.8800* | 0.7795 |
245
+ | Stepwise | 0.9642 / 0.9693 | 0.8726 / 0.9818 | 0.8825* | 0.7799 |
246
+ | Gaussian | 0.9644 / 0.9692 | **0.8779 / 0.9820** | 0.8875* | 0.7843 |
247
+ | Random-512 | 0.9636 / 0.9690 | 0.8664 / 0.9806 | 0.6735* | 0.7799 |
248
+ | Gaussian-512 | 0.9646 / 0.9697 | 0.8707 / 0.9810 | **0.8965** * | 0.7843 |
249
 
250
  </figure>
251
 
 
396
 
397
The performance of our models has been, in general, very good. Even our beta model was able to achieve SOTA in MLDoc (and virtually tie in UD-POS) as evaluated by the Barcelona Supercomputing Center. In the main masked-language task our models reach accuracies between 0.65 and 0.69, which bodes well for downstream tasks.
398
 
399
+ Our analysis of downstream tasks is not yet complete. It should be stressed that we have continued this fine-tuning in the same spirit as the project, that is, with smaller practitioners and budgets in mind. Therefore, our goal is not to achieve the highest possible metrics for each task, but rather to train using sensible hyperparameters and training times, and to compare the different models under these conditions. It is certainly possible that any of the models—ours or otherwise—could be carefully tuned to achieve better results at a given task, and the best tuning might well produce a new "winner" for that category. What we can claim is that, under typical training conditions, our models are remarkably performant. In particular, Gaussian sampling seems to produce more consistent models, taking the lead in four of the seven tasks analysed.
400
 
401
+ The differences in performance for models trained using different data-sampling techniques are consistent. Gaussian sampling is always first (with the exception of POS-512), while Stepwise is better than Random when trained for a similar number of steps. This shows that the sampling technique is, indeed, relevant.
402
 
403
As already mentioned in the Training details section, the methodology used to extend sequence length during training is critical. The Random-sampling model took an important hit in performance in this process, while Gaussian-512 ended up with better metrics than Gaussian-128, in both the main masked-language task and the downstream datasets. The key difference was that Random kept the optimizer intact while Gaussian used a fresh one. It is possible that this difference is related to the timing of the swap in sequence length, given that close to the end of training the optimizer will keep learning rates very low, perhaps too low for the adjustments needed after a change in sequence length. We believe this is an important topic of research, but our preliminary data suggest that using a new optimizer is a safe alternative when in doubt or if computational resources are scarce.
404
 
evaluation/paws.yaml CHANGED
@@ -36,8 +36,6 @@ parameters:
36
  value: ./outputs
37
  overwrite_output_dir:
38
  value: true
39
- resume_from_checkpoint:
40
- value: false
41
  max_seq_length:
42
  value: 512
43
  pad_to_max_length:
 
36
  value: ./outputs
37
  overwrite_output_dir:
38
  value: true
 
 
39
  max_seq_length:
40
  value: 512
41
  pad_to_max_length:
evaluation/token.yaml ADDED
@@ -0,0 +1,53 @@
1
+ name: BERTIN NER and POS es
2
+ project: bertin-eval
3
+ entity: versae
4
+ program: run_ner.py
5
+ command:
6
+ - ${env}
7
+ - ${interpreter}
8
+ - ${program}
9
+ - ${args}
10
+ method: grid
11
+ metric:
12
+ name: eval/accuracy
13
+ goal: maximize
14
+ parameters:
15
+ model_name_or_path:
16
+ values:
17
+ - bertin-project/bertin-base-gaussian-exp-512seqlen
18
+ - bertin-project/bertin-base-random-exp-512seqlen
19
+ - bertin-project/bertin-base-gaussian
20
+ - bertin-project/bertin-base-stepwise
21
+ - bertin-project/bertin-base-random
22
+ - bertin-project/bertin-roberta-base-spanish
23
+ - flax-community/bertin-roberta-large-spanish
24
+ - BSC-TeMU/roberta-base-bne
25
+ - dccuchile/bert-base-spanish-wwm-cased
26
+ - bert-base-multilingual-cased
27
+ num_train_epochs:
28
+ values: [5]
29
+ task_name:
30
+ values:
31
+ - ner
32
+ - pos
33
+ dataset_name:
34
+ value: conll2002
35
+ dataset_config_name:
36
+ value: es
37
+ output_dir:
38
+ value: ./outputs
39
+ overwrite_output_dir:
40
+ value: true
41
+ pad_to_max_length:
42
+ value: true
43
+ per_device_train_batch_size:
44
+ value: 16
45
+ per_device_eval_batch_size:
46
+ value: 16
47
+ save_total_limit:
48
+ value: 1
49
+ do_train:
50
+ value: true
51
+ do_eval:
52
+ value: true
53
+
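Outside of a sweep, a single run with the same hyperparameters can be launched directly. The following is a sketch that assumes `run_ner.py` (the transformers token-classification example the sweep points at) is available in the working directory; the model name is just one of the grid values above.

```python
import subprocess

# One NER fine-tuning run on CoNLL 2002 (Spanish), mirroring a single grid point
# of the sweep defined in evaluation/token.yaml.
subprocess.run(
    [
        "python", "run_ner.py",
        "--model_name_or_path", "bertin-project/bertin-base-gaussian",
        "--task_name", "ner",
        "--dataset_name", "conll2002",
        "--dataset_config_name", "es",
        "--num_train_epochs", "5",
        "--per_device_train_batch_size", "16",
        "--per_device_eval_batch_size", "16",
        "--pad_to_max_length", "true",
        "--output_dir", "./outputs",
        "--overwrite_output_dir", "true",
        "--do_train", "--do_eval",
    ],
    check=True,
)
```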
evaluation/xnli.yaml CHANGED
@@ -36,8 +36,6 @@ parameters:
36
  value: ./outputs
37
  overwrite_output_dir:
38
  value: true
39
- resume_from_checkpoint:
40
- value: false
41
  max_seq_length:
42
  value: 512
43
  pad_to_max_length:
 
36
  value: ./outputs
37
  overwrite_output_dir:
38
  value: true
 
 
39
  max_seq_length:
40
  value: 512
41
  pad_to_max_length:
images/bertin.png CHANGED
run_mlm_flax_stream.py CHANGED
@@ -384,8 +384,8 @@ def to_f32(t):
384
 
385
 
386
  def convert(output_dir, destination_dir="./"):
387
- shutil.copyfile(Path(output_dir) / "flax_model.msgpack", destination_dir)
388
- shutil.copyfile(Path(output_dir) / "config.json", destination_dir)
389
  # Saving extra files from config.json and tokenizer.json files
390
  tokenizer = AutoTokenizer.from_pretrained(destination_dir)
391
  tokenizer.save_pretrained(destination_dir)
@@ -611,8 +611,8 @@ if __name__ == "__main__":
611
 
612
  # Setup train state
613
  state = train_state.TrainState.create(apply_fn=model.__call__, params=model.params, tx=adamw)
614
- saved_step = 0
615
- if "checkpoint" in model_args.model_name_or_path:
616
  params, opt_state, saved_step, args, data_collator = restore_checkpoint(model_args.model_name_or_path, state)
617
  # Create learning rate schedule
618
  warmup_fn = optax.linear_schedule(
@@ -714,8 +714,9 @@ if __name__ == "__main__":
714
  max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length)
715
  eval_samples = advance_iter_and_group_samples(training_iter, data_args.num_eval_samples, max_seq_length)
716
 
 
717
  steps = tqdm(range(num_train_steps), desc="Training...", position=0)
718
- for step in range(saved_step, num_train_steps):
719
  if step < saved_step:
720
  steps.update(1)
721
  continue
@@ -827,5 +828,5 @@ if __name__ == "__main__":
827
  training_args.output_dir,
828
  params=params,
829
  push_to_hub=training_args.push_to_hub,
830
- commit_message=last_desc,
831
  )
 
384
 
385
 
386
  def convert(output_dir, destination_dir="./"):
387
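+ # shutil.copyfile needs an explicit destination file name rather than a directory, hence the full paths below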
+ shutil.copyfile(Path(output_dir) / "flax_model.msgpack", Path(destination_dir) / "flax_model.msgpack")
388
+ shutil.copyfile(Path(output_dir) / "config.json", Path(destination_dir) / "config.json")
389
  # Saving extra files from config.json and tokenizer.json files
390
  tokenizer = AutoTokenizer.from_pretrained(destination_dir)
391
  tokenizer.save_pretrained(destination_dir)
 
611
 
612
  # Setup train state
613
  state = train_state.TrainState.create(apply_fn=model.__call__, params=model.params, tx=adamw)
614
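+ # -1 marks "no checkpoint restored"; the resume loop below skips steps < saved_step, so a fresh run skips nothing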
+ saved_step = -1
615
+ if model_args.model_name_or_path and "checkpoint" in model_args.model_name_or_path:
616
  params, opt_state, saved_step, args, data_collator = restore_checkpoint(model_args.model_name_or_path, state)
617
  # Create learning rate schedule
618
  warmup_fn = optax.linear_schedule(
 
714
  max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length)
715
  eval_samples = advance_iter_and_group_samples(training_iter, data_args.num_eval_samples, max_seq_length)
716
 
717
+ last_desc = ""
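+ # defined up front so the final push_to_hub commit_message does not raise a NameError if no later step sets it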
718
  steps = tqdm(range(num_train_steps), desc="Training...", position=0)
719
+ for step in range(num_train_steps):
720
  if step < saved_step:
721
  steps.update(1)
722
  continue
 
828
  training_args.output_dir,
829
  params=params,
830
  push_to_hub=training_args.push_to_hub,
831
+ commit_message=last_desc or "Saving model after training",
832
  )
utils/download_mc4es_sampled.py ADDED
@@ -0,0 +1,32 @@
1
+ import io
2
+ import gzip
3
+ import json
4
+ import sys
5
+
6
+ import requests
7
+ from tqdm import tqdm
8
+
9
+ _DATA_URL_TRAIN = "https://huggingface.co/datasets/bertin-project/mc4-es-sampled/resolve/main/mc4-es-train-50M-{config}-shard-{index:04d}-of-{n_shards:04d}.json.gz"
10
+
11
+
12
+ def main(config="stepwise"):
13
+ data_urls = [
14
+ _DATA_URL_TRAIN.format(
15
+ config=config,
16
+ index=index + 1,
17
+ n_shards=1024,
18
+ )
19
+ for index in range(1024)
20
+ ]
21
+ with open(f"mc4-es-train-50M-{config}.jsonl", "w") as f:
22
+ for data_url in tqdm(data_urls):
23
+ response = requests.get(data_url)
24
+ bio = io.BytesIO(response.content)
25
+ with gzip.open(bio, "rt", encoding="utf8") as g:
26
+ for line in g:
27
+ json_line = json.loads(line.strip())
28
+ f.write(json.dumps(json_line) + "\n")
29
+
30
+
31
+ if __name__ == "__main__":
32
+ main(sys.argv[1] if len(sys.argv) > 1 else "stepwise")
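The helper can also be used programmatically. A sketch, assuming the repository root is on `PYTHONPATH` and that a `gaussian` variant of the sampled dataset exists alongside the default `stepwise`:

```python
# Equivalent to: python utils/download_mc4es_sampled.py gaussian
# Downloads all 1024 shards for the chosen sampling config and concatenates them
# into a single mc4-es-train-50M-gaussian.jsonl file in the working directory.
from utils.download_mc4es_sampled import main

main("gaussian")
```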