Paulo committed
Commit aab119d
2 Parent(s): 9caea08 1b4ae8e

Merge branch 'main' of https://huggingface.co/bertin-project/bertin-roberta-base-spanish into main

README.md CHANGED
@@ -143,7 +143,7 @@ We then used the same setup and hyperparameters as [Liu et al. (2019)](https://a
143
 
144
Then, we continued training the most promising model for ~25k more steps on sequence length 512. We tried two strategies for this, since it is not easy to find clear details about this change in the literature. It turns out this decision had a big impact on the final performance.
145
 
146
- For `Random` sampling we trained with seq len 512 during the last 20k steps of the 250 training steps, keeping the optimizer state intact. Results for this are underwhelming, as seen in Figure 7:
147
 
148
  <figure>
149
 
@@ -154,10 +154,12 @@ For `Random` sampling we trained with seq len 512 during the last 20k steps of t
154
 
155
For `Gaussian` sampling we started a new optimizer after 230k steps at 128 sequence length, using a short warmup interval. Results are much better using this procedure. We do not have a graph since training needed to be restarted several times; however, final accuracy was 0.6873, compared to 0.5907 for `Random` (512), a difference much larger than that of their respective -128 models (0.6520 for `Random`, 0.6608 for `Gaussian`).
156
 
157
- Batch size was 2048 for training with 128 sequence length, and 384 for 512 sequence length, with no change in learning rate. Warmup steps for 512 was 500.
158
 
159
  ## Results
160
 
 
 
161
Our first test, tagged `beta` in this repository, refers to an initial experiment using `Stepwise` on 128 sequence length, trained for 210k steps. Two nearly identical versions of this model can be found, one at **bertin-roberta-base-spanish** and the other at **flax-community/bertin-roberta-large-spanish** (do note this is **not our best model**!). During the community event, the Barcelona Supercomputing Center (BSC), in association with the National Library of Spain, released RoBERTa base and large models trained on 200M documents (570GB) of high-quality data cleaned using 100 nodes with 48 CPU cores of MareNostrum 4 during 96 hours. At the end of the process they were left with 2TB of clean data at the document level, which was further cleaned up to the final 570GB. This is an interesting contrast to our own resources (3xTPUv3-8 for 10 days to do cleaning, sampling, training, and evaluation) and makes for a valuable reference. The BSC team evaluated our early release of the model `beta`, and the results can be seen in Table 1.
162
 
163
Our final models were trained on a different number of steps and sequence lengths, and achieve different—higher—masked-word prediction accuracies. Despite these limitations, it is interesting to see the results they obtained using the early version of our model. Note that some of the datasets used for evaluation by BSC are not freely available, so it is not possible to verify the figures.
@@ -213,20 +215,37 @@ For simplicity, we will abbreviate the different models as follows:
213
  <figure>
214
 
215
  <caption>
216
- Table 3. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. Dataset for POS and NER is CoNLL 2002. POS, NER and PAWS-X used max length 512 and batch size 8. Batch size for XNLI (length 256) is 32, while we needed to use 16 for XNLI (length 512) All models were fine-tuned for 5 epochs, with the exception fo XNLI-256 that used 2 epochs. Results marked with * indicate a repetition.
217
  </caption>
218
 
219
- | Model | POS (F1/Acc) | NER (F1/Acc) | PAWS-X (Acc) | XNLI-256 (Acc) | XNLI-512 (Acc) |
220
- |--------------|----------------------|---------------------|--------------|----------------|--------------|
221
- | BERT-m | 0.9629 / 0.9687 | 0.8539 / 0.9779 | 0.5765* | 0.7852 | 0.7606 |
222
- | BERT-wwm | 0.9642 / 0.9700 | 0.8579 / 0.9783 | 0.8720* | **0.8186** | **0.8012** |
223
- | BSC-BNE | 0.9659 / 0.9707 | 0.8700 / 0.9807 | 0.5765* | 0.8178 | 0.3333* |
224
- | Beta | 0.9638 / 0.9690 | 0.8725 / 0.9812 | 0.5765* | — | 0.7751* |
225
- | Random | 0.9656 / 0.9704 | 0.8704 / 0.9807 | 0.8800* | 0.7745 | 0.7795 |
226
- | Stepwise | 0.9656 / 0.9707 | 0.8705 / 0.9809 | 0.8825* | 0.7820 | 0.7799 |
227
- | Gaussian | 0.9662 / 0.9709 | **0.8792 / 0.9816** | 0.8875* | 0.7942 | 0.7843 |
228
- | Random-512 | 0.9660 / 0.9707 | 0.8616 / 0.9803 | 0.6735* | 0.7723 | 0.7799 |
229
- | Gaussian-512 | **0.9662 / 0.9714** | **0.8764 / 0.9819** | **0.8965** * | 0.7878 | 0.7843 |
230
 
231
  </figure>
232
 
@@ -377,9 +396,9 @@ Geographical bias
377
 
378
The performance of our models has been, in general, very good. Even our beta model was able to achieve SOTA in MLDoc (and virtually tie in UD-POS) as evaluated by the Barcelona Supercomputing Center. In the main masked-language task our models reach accuracies between 0.65 and 0.69, which bodes well for downstream tasks.
379
 
380
- Our analysis of downstream tasks is not yet complete. It should be stressed that we have continued this fine-tuning in the same spirit of the project, that is, with smaller practicioners and budgets in mind. Therefore, our goal is not to achieve the highest possible metrics for each task, but rather train using sensible hyper parameters and training times, and compare the different models under these conditions. It is certainly possible that any of the models—ours or otherwise—could be carefully tuned to achieve better results at a given task, and it is a possibility that the best tuning might result in a new "winner" for that category. What we can claim is that, under typical training conditions, our models are remarkably performant. In particular, Gaussian-512 is clearly superior, taking the lead in three of the four tasks analysed.
381
 
382
- The differences in performance for models trained using different data-sampling techniques are consistent. Gaussian-sampling is always first, while Stepwise is only marginally better than Random. This proves that the sampling technique is, indeed, relevant.
383
 
384
As already mentioned in the Training details section, the methodology used to extend sequence length during training is critical. The Random-sampling model took an important hit in performance in this process, while Gaussian-512 ended up with better metrics than Gaussian-128, in both the main masked-language task and the downstream datasets. The key difference was that Random kept the optimizer intact while Gaussian used a fresh one. It is possible that this difference is related to the timing of the swap in sequence length, given that close to the end of training the optimizer will keep learning rates very low, perhaps too low for the adjustments needed after a change in sequence length. We believe this is an important topic of research, but our preliminary data suggest that using a new optimizer is a safe alternative when in doubt or if computational resources are scarce.
385
 
 
143
 
144
Then, we continued training the most promising model for ~25k more steps on sequence length 512. We tried two strategies for this, since it is not easy to find clear details about this change in the literature. It turns out this decision had a big impact on the final performance.
145
 
146
+ For `Random` sampling we trained with sequence length 512 during the last ~20k of the ~250k training steps, keeping the optimizer state intact. Results for this are underwhelming, as seen in Figure 7:
147
 
148
  <figure>
149
 
 
154
 
155
For `Gaussian` sampling we started a new optimizer after 230k steps at 128 sequence length, using a short warmup interval. Results are much better using this procedure. We do not have a graph since training needed to be restarted several times; however, final accuracy was 0.6873, compared to 0.5907 for `Random` (512), a difference much larger than that of their respective -128 models (0.6520 for `Random`, 0.6608 for `Gaussian`).
156
 
157
+ Batch size was 256 for training with sequence length 128, and 48 for sequence length 512, with no change in learning rate. The 512-length run used 500 warmup steps.
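As an illustration only, restarting the optimizer while keeping the trained parameters can be sketched with optax along the following lines. It mirrors the AdamW and linear warmup/decay setup in `run_mlm_flax_stream.py`; the learning rate, Adam betas and weight decay below are placeholders rather than the exact values used.

```python
import optax
from flax.training import train_state
from transformers import FlaxRobertaForMaskedLM

# Parameters are reused from the 128-length checkpoint; the optimizer state is not.
model = FlaxRobertaForMaskedLM.from_pretrained("bertin-project/bertin-base-gaussian")

peak_lr = 6e-4        # placeholder; the text only states the learning rate was left unchanged
warmup_steps = 500    # short warmup used for the 512-length phase
total_steps = 25_000  # roughly the extra steps trained at sequence length 512

# Linear warmup followed by linear decay.
warmup_fn = optax.linear_schedule(init_value=0.0, end_value=peak_lr, transition_steps=warmup_steps)
decay_fn = optax.linear_schedule(init_value=peak_lr, end_value=0.0, transition_steps=total_steps - warmup_steps)
schedule_fn = optax.join_schedules(schedules=[warmup_fn, decay_fn], boundaries=[warmup_steps])

# A brand-new AdamW means brand-new first/second-moment estimates (the "fresh optimizer").
adamw = optax.adamw(learning_rate=schedule_fn, b1=0.9, b2=0.98, eps=1e-8, weight_decay=0.01)
state = train_state.TrainState.create(apply_fn=model.__call__, params=model.params, tx=adamw)
```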
158
 
159
  ## Results
160
 
161
+ Please refer to the **evaluation** folder for training scripts for downstream tasks.
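The configurations in that folder are Weights & Biases sweep files. A minimal sketch for registering one of them, assuming a logged-in `wandb` client:

```python
import yaml
import wandb

# Register the NER/POS sweep defined in evaluation/token.yaml; each agent started
# for the returned sweep id runs one grid point (model x task) via run_ner.py.
with open("evaluation/token.yaml") as f:
    sweep_config = yaml.safe_load(f)

sweep_id = wandb.sweep(sweep_config, project="bertin-eval")
# Agents are then started with: wandb agent <entity>/bertin-eval/<sweep_id>
print(sweep_id)
```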
162
+
163
Our first test, tagged `beta` in this repository, refers to an initial experiment using `Stepwise` on 128 sequence length, trained for 210k steps. Two nearly identical versions of this model can be found, one at **bertin-roberta-base-spanish** and the other at **flax-community/bertin-roberta-large-spanish** (do note this is **not our best model**!). During the community event, the Barcelona Supercomputing Center (BSC), in association with the National Library of Spain, released RoBERTa base and large models trained on 200M documents (570GB) of high-quality data cleaned using 100 nodes with 48 CPU cores of MareNostrum 4 during 96 hours. At the end of the process they were left with 2TB of clean data at the document level, which was further cleaned up to the final 570GB. This is an interesting contrast to our own resources (3xTPUv3-8 for 10 days to do cleaning, sampling, training, and evaluation) and makes for a valuable reference. The BSC team evaluated our early release of the model `beta`, and the results can be seen in Table 1.
164
 
165
Our final models were trained on a different number of steps and sequence lengths, and achieve different—higher—masked-word prediction accuracies. Despite these limitations, it is interesting to see the results they obtained using the early version of our model. Note that some of the datasets used for evaluation by BSC are not freely available, so it is not possible to verify the figures.
 
215
  <figure>
216
 
217
  <caption>
218
+ Table 3. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. The dataset for POS and NER is CoNLL 2002. POS and NER used max length 128 and batch size 128. Batch size for XNLI (max length 256) was 256. All models were fine-tuned for 5 epochs, with the exception of XNLI-256, which used 2 epochs. Stepwise used an older checkpoint with only 180,000 steps.
219
+ </caption>
220
+
221
+ | Model | POS (F1/Acc) | NER (F1/Acc) | XNLI-256 (Acc) |
222
+ |--------------|----------------------|---------------------|----------------|
223
+ | BERT-m | 0.9629 / 0.9687 | 0.8539 / 0.9779 | 0.7852 |
224
+ | BERT-wwm | 0.9642 / 0.9700 | 0.8579 / 0.9783 | **0.8186** |
225
+ | BSC-BNE | 0.9659 / 0.9707 | 0.8700 / 0.9807 | 0.8178 |
226
+ | Beta | 0.9638 / 0.9690 | 0.8725 / 0.9812 | — |
227
+ | Random | 0.9656 / 0.9704 | 0.8704 / 0.9807 | 0.7745 |
228
+ | Stepwise | 0.9656 / 0.9707 | 0.8705 / 0.9809 | 0.7820 |
229
+ | Gaussian | 0.9662 / 0.9709 | **0.8792 / 0.9816** | 0.7942 |
230
+ | Random-512 | 0.9660 / 0.9707 | 0.8616 / 0.9803 | 0.7723 |
231
+ | Gaussian-512 | **0.9662 / 0.9714** | **0.8764 / 0.9819** | 0.7878 |
232
+
233
+ </figure>
234
+
235
+ <figure>
+
+ <caption>
+ Table 4. Metrics for different downstream tasks, comparing our different models as well as other relevant BERT variations from the literature. The dataset for POS and NER is CoNLL 2002. POS, NER and PAWS-X used max length 512 and batch size 128. Batch size for XNLI (max length 512) was 128. All models were fine-tuned for 5 epochs. Results marked with * indicate a repetition. The Stepwise checkpoint had 204,000 steps during these tests.
236
  </caption>
237
 
238
+ | Model | POS (F1/Acc) | NER (F1/Acc) | PAWS-X (Acc) | XNLI (Acc) |
239
+ |--------------|----------------------|---------------------|--------------|------------|
240
+ | BERT-m | 0.9630 / 0.9689 | 0.8616 / 0.9790 | 0.5765* | 0.7606 |
241
+ | BERT-wwm | 0.9639 / 0.9693 | 0.8596 / 0.9790 | 0.8720* | **0.8012** |
242
+ | BSC-BNE | **0.9655 / 0.9706** | 0.8764 / 0.9818 | 0.5765* | 0.3333* |
243
+ | Beta | 0.9616 / 0.9669 | 0.8640 / 0.9799 | 0.5765* | 0.7751* |
244
+ | Random | 0.9651 / 0.9700 | 0.8638 / 0.9802 | 0.8800* | 0.7795 |
245
+ | Stepwise | 0.9642 / 0.9693 | 0.8726 / 0.9818 | 0.8825* | 0.7799 |
246
+ | Gaussian | 0.9644 / 0.9692 | **0.8779 / 0.9820** | 0.8875* | 0.7843 |
247
+ | Random-512 | 0.9636 / 0.9690 | 0.8664 / 0.9806 | 0.6735* | 0.7799 |
248
+ | Gaussian-512 | 0.9646 / 0.9697 | 0.8707 / 0.9810 | **0.8965** * | 0.7843 |
249
 
250
  </figure>
251
 
 
396
 
397
The performance of our models has been, in general, very good. Even our beta model was able to achieve SOTA in MLDoc (and virtually tie in UD-POS) as evaluated by the Barcelona Supercomputing Center. In the main masked-language task our models reach accuracies between 0.65 and 0.69, which bodes well for downstream tasks.
398
 
399
+ Our analysis of downstream tasks is not yet complete. It should be stressed that we have continued this fine-tuning in the same spirit as the project, that is, with smaller practitioners and budgets in mind. Therefore, our goal is not to achieve the highest possible metrics for each task, but rather to train using sensible hyperparameters and training times, and to compare the different models under these conditions. It is certainly possible that any of the models—ours or otherwise—could be carefully tuned to achieve better results at a given task, and the best tuning might well produce a new "winner" for that category. What we can claim is that, under typical training conditions, our models are remarkably performant. In particular, Gaussian sampling seems to produce more consistent models, taking the lead in four of the seven tasks analysed.
400
 
401
+ The differences in performance for models trained using different data-sampling techniques are consistent. Gaussian sampling is always first (with the exception of POS-512), while Stepwise is better than Random when trained for a similar number of steps. This shows that the sampling technique is, indeed, relevant.
402
 
403
As already mentioned in the Training details section, the methodology used to extend sequence length during training is critical. The Random-sampling model took an important hit in performance in this process, while Gaussian-512 ended up with better metrics than Gaussian-128, in both the main masked-language task and the downstream datasets. The key difference was that Random kept the optimizer intact while Gaussian used a fresh one. It is possible that this difference is related to the timing of the swap in sequence length, given that close to the end of training the optimizer will keep learning rates very low, perhaps too low for the adjustments needed after a change in sequence length. We believe this is an important topic of research, but our preliminary data suggest that using a new optimizer is a safe alternative when in doubt or if computational resources are scarce.
404
 
evaluation/paws.yaml CHANGED
@@ -36,8 +36,6 @@ parameters:
36
  value: ./outputs
37
  overwrite_output_dir:
38
  value: true
39
- resume_from_checkpoint:
40
- value: false
41
  max_seq_length:
42
  value: 512
43
  pad_to_max_length:
 
36
  value: ./outputs
37
  overwrite_output_dir:
38
  value: true
 
 
39
  max_seq_length:
40
  value: 512
41
  pad_to_max_length:
evaluation/token.yaml ADDED
@@ -0,0 +1,53 @@
1
+ name: BERTIN NER and POS es
2
+ project: bertin-eval
3
+ entity: versae
4
+ program: run_ner.py
5
+ command:
6
+ - ${env}
7
+ - ${interpreter}
8
+ - ${program}
9
+ - ${args}
10
+ method: grid
11
+ metric:
12
+ name: eval/accuracy
13
+ goal: maximize
14
+ parameters:
15
+ model_name_or_path:
16
+ values:
17
+ - bertin-project/bertin-base-gaussian-exp-512seqlen
18
+ - bertin-project/bertin-base-random-exp-512seqlen
19
+ - bertin-project/bertin-base-gaussian
20
+ - bertin-project/bertin-base-stepwise
21
+ - bertin-project/bertin-base-random
22
+ - bertin-project/bertin-roberta-base-spanish
23
+ - flax-community/bertin-roberta-large-spanish
24
+ - BSC-TeMU/roberta-base-bne
25
+ - dccuchile/bert-base-spanish-wwm-cased
26
+ - bert-base-multilingual-cased
27
+ num_train_epochs:
28
+ values: [5]
29
+ task_name:
30
+ values:
31
+ - ner
32
+ - pos
33
+ dataset_name:
34
+ value: conll2002
35
+ dataset_config_name:
36
+ value: es
37
+ output_dir:
38
+ value: ./outputs
39
+ overwrite_output_dir:
40
+ value: true
41
+ pad_to_max_length:
42
+ value: true
43
+ per_device_train_batch_size:
44
+ value: 16
45
+ per_device_eval_batch_size:
46
+ value: 16
47
+ save_total_limit:
48
+ value: 1
49
+ do_train:
50
+ value: true
51
+ do_eval:
52
+ value: true
53
+
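Outside of a sweep, a single run with the same hyperparameters can be launched directly. The following is a sketch that assumes `run_ner.py` (the transformers token-classification example the sweep points at) is available in the working directory; the model name is just one of the grid values above.

```python
import subprocess

# One NER fine-tuning run on CoNLL 2002 (Spanish), mirroring a single grid point
# of the sweep defined in evaluation/token.yaml.
subprocess.run(
    [
        "python", "run_ner.py",
        "--model_name_or_path", "bertin-project/bertin-base-gaussian",
        "--task_name", "ner",
        "--dataset_name", "conll2002",
        "--dataset_config_name", "es",
        "--num_train_epochs", "5",
        "--per_device_train_batch_size", "16",
        "--per_device_eval_batch_size", "16",
        "--pad_to_max_length", "true",
        "--output_dir", "./outputs",
        "--overwrite_output_dir", "true",
        "--do_train", "--do_eval",
    ],
    check=True,
)
```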
evaluation/xnli.yaml CHANGED
@@ -36,8 +36,6 @@ parameters:
36
  value: ./outputs
37
  overwrite_output_dir:
38
  value: true
39
- resume_from_checkpoint:
40
- value: false
41
  max_seq_length:
42
  value: 512
43
  pad_to_max_length:
 
36
  value: ./outputs
37
  overwrite_output_dir:
38
  value: true
 
 
39
  max_seq_length:
40
  value: 512
41
  pad_to_max_length:
images/bertin.png CHANGED
run_mlm_flax_stream.py CHANGED
@@ -384,8 +384,8 @@ def to_f32(t):
384
 
385
 
386
  def convert(output_dir, destination_dir="./"):
387
- shutil.copyfile(Path(output_dir) / "flax_model.msgpack", destination_dir)
388
- shutil.copyfile(Path(output_dir) / "config.json", destination_dir)
389
  # Saving extra files from config.json and tokenizer.json files
390
  tokenizer = AutoTokenizer.from_pretrained(destination_dir)
391
  tokenizer.save_pretrained(destination_dir)
@@ -611,8 +611,8 @@ if __name__ == "__main__":
611
 
612
  # Setup train state
613
  state = train_state.TrainState.create(apply_fn=model.__call__, params=model.params, tx=adamw)
614
- saved_step = 0
615
- if "checkpoint" in model_args.model_name_or_path:
616
  params, opt_state, saved_step, args, data_collator = restore_checkpoint(model_args.model_name_or_path, state)
617
  # Create learning rate schedule
618
  warmup_fn = optax.linear_schedule(
@@ -714,8 +714,9 @@ if __name__ == "__main__":
714
  max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length)
715
  eval_samples = advance_iter_and_group_samples(training_iter, data_args.num_eval_samples, max_seq_length)
716
 
 
717
  steps = tqdm(range(num_train_steps), desc="Training...", position=0)
718
- for step in range(saved_step, num_train_steps):
719
  if step < saved_step:
720
  steps.update(1)
721
  continue
@@ -827,5 +828,5 @@ if __name__ == "__main__":
827
  training_args.output_dir,
828
  params=params,
829
  push_to_hub=training_args.push_to_hub,
830
- commit_message=last_desc,
831
  )
 
384
 
385
 
386
  def convert(output_dir, destination_dir="./"):
387
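+ # shutil.copyfile needs an explicit destination file name rather than a directory, hence the full paths below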
+ shutil.copyfile(Path(output_dir) / "flax_model.msgpack", Path(destination_dir) / "flax_model.msgpack")
388
+ shutil.copyfile(Path(output_dir) / "config.json", Path(destination_dir) / "config.json")
389
  # Saving extra files from config.json and tokenizer.json files
390
  tokenizer = AutoTokenizer.from_pretrained(destination_dir)
391
  tokenizer.save_pretrained(destination_dir)
 
611
 
612
  # Setup train state
613
  state = train_state.TrainState.create(apply_fn=model.__call__, params=model.params, tx=adamw)
614
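+ # -1 marks "no checkpoint restored"; the resume loop below skips steps < saved_step, so a fresh run skips nothing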
+ saved_step = -1
615
+ if model_args.model_name_or_path and "checkpoint" in model_args.model_name_or_path:
616
  params, opt_state, saved_step, args, data_collator = restore_checkpoint(model_args.model_name_or_path, state)
617
  # Create learning rate schedule
618
  warmup_fn = optax.linear_schedule(
 
714
  max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length)
715
  eval_samples = advance_iter_and_group_samples(training_iter, data_args.num_eval_samples, max_seq_length)
716
 
717
+ last_desc = ""
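+ # defined up front so the final push_to_hub commit_message does not raise a NameError if no later step sets it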
718
  steps = tqdm(range(num_train_steps), desc="Training...", position=0)
719
+ for step in range(num_train_steps):
720
  if step < saved_step:
721
  steps.update(1)
722
  continue
 
828
  training_args.output_dir,
829
  params=params,
830
  push_to_hub=training_args.push_to_hub,
831
+ commit_message=last_desc or "Saving model after training",
832
  )
utils/download_mc4es_sampled.py ADDED
@@ -0,0 +1,32 @@
1
+ import io
2
+ import gzip
3
+ import json
4
+ import sys
5
+
6
+ import requests
7
+ from tqdm import tqdm
8
+
9
+ _DATA_URL_TRAIN = "https://huggingface.co/datasets/bertin-project/mc4-es-sampled/resolve/main/mc4-es-train-50M-{config}-shard-{index:04d}-of-{n_shards:04d}.json.gz"
10
+
11
+
12
+ def main(config="stepwise"):
13
+ data_urls = [
14
+ _DATA_URL_TRAIN.format(
15
+ config=config,
16
+ index=index + 1,
17
+ n_shards=1024,
18
+ )
19
+ for index in range(1024)
20
+ ]
21
+ with open(f"mc4-es-train-50M-{config}.jsonl", "w") as f:
22
+ for data_url in tqdm(data_urls):
23
+ response = requests.get(data_url)
24
+ bio = io.BytesIO(response.content)
25
+ with gzip.open(bio, "rt", encoding="utf8") as g:
26
+ for line in g:
27
+ json_line = json.loads(line.strip())
28
+ f.write(json.dumps(json_line) + "\n")
29
+
30
+
31
+ if __name__ == "__main__":
32
+ main(sys.argv[1] if len(sys.argv) > 1 else "stepwise")
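The helper can also be used programmatically. A sketch, assuming the repository root is on `PYTHONPATH` and that a `gaussian` variant of the sampled dataset exists alongside the default `stepwise`:

```python
# Equivalent to: python utils/download_mc4es_sampled.py gaussian
# Downloads all 1024 shards for the chosen sampling config and concatenates them
# into a single mc4-es-train-50M-gaussian.jsonl file in the working directory.
from utils.download_mc4es_sampled import main

main("gaussian")
```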