jarodrigues committed
Commit 5bbdd69
1 Parent(s): 1c7bc3e

Update README.md

Files changed (1)
  1. README.md +7 -4
README.md CHANGED
@@ -58,16 +58,16 @@ We skipped the default filtering of stopwords since it would disrupt the syntactic
  As codebase, we resorted to the [DeBERTa V2 XLarge](https://huggingface.co/microsoft/deberta-v2-xlarge), for English.
 
  To train **Albertina-PT-BR**, the BrWac data set was tokenized with the original DeBERTa tokenizer with a 128-token sequence truncation and dynamic padding.
- The model was trained using the maximum available memory capacity -- it was trained for 1 day and 11 hours on a2-megagpu-16gb Google Cloud A2 VMs with 16 GPUs, 96 vCPUs and 1.360 GB of RAM -- resulting in a batch size of 896 samples (56 samples per GPU without gradient accumulation steps).
+ The model was trained using the maximum available memory capacity, resulting in a batch size of 896 samples (56 samples per GPU, without gradient accumulation steps).
  We chose a learning rate of 1e-5 with linear decay and 10k warm-up steps based on the results of exploratory experiments.
  In total, around 200k training steps were taken across 50 epochs.
- Additionally, we used the standard BERT masking procedure with a 15% masking probability for each example.
+ The model was trained for 1 day and 11 hours on a2-megagpu-16gb Google Cloud A2 VMs with 16 GPUs, 96 vCPUs and 1,360 GB of RAM.
 
  To train **Albertina-PT-PT**, the data set was tokenized with the original DeBERTa tokenizer with a 128-token sequence truncation and dynamic padding.
- The model was trained using the maximum available memory capacity -- it was trained for 3 days on a2-highgpu-8gb Google Cloud A2 VMs with 8 GPUs, 96 vCPUs and 680 GB of RAM -- resulting in a batch size of 832 samples (52 samples per GPU and applying gradient accumulation in order to approximate the batch size of the PT-BR model).
+ The model was trained using the maximum available memory capacity, resulting in a batch size of 832 samples (52 samples per GPU, applying gradient accumulation to approximate the batch size of the PT-BR model).
  Similarly to the PT-BR variant above, we opted for a learning rate of 1e-5 with linear decay and 10k warm-up steps.
  However, since the number of training examples is approximately twice that of the PT-BR variant, we reduced the number of training epochs to half and completed only 25 epochs, which resulted in approximately 245k steps.
-
+ The model was trained for 3 days on a2-highgpu-8gb Google Cloud A2 VMs with 8 GPUs, 96 vCPUs and 680 GB of RAM.
 
  # Evaluation
 
@@ -173,3 +173,6 @@ If Albertina proves useful for your work, we kindly ask that you cite the following
  }
  ```
 
+ # Acknowledgments
+
+ TODO
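
For readers who want a concrete picture of the preprocessing the updated paragraphs describe, the sketch below is a minimal, illustrative approximation (not the authors' actual pipeline): DeBERTa tokenization with truncation to 128 tokens, dynamic per-batch padding, and the 15% BERT-style masking mentioned in the line removed above. The `texts` list is a placeholder.

```python
# Illustrative sketch only; not the Albertina training code.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v2-xlarge")

texts = ["exemplo de frase em português"]  # placeholder corpus sample

# Tokenize with a 128-token truncation; padding is deferred to the collator.
encodings = [tokenizer(t, truncation=True, max_length=128) for t in texts]

# Pads each batch to its longest sequence (dynamic padding) and applies the
# standard BERT masking procedure with a 15% masking probability.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
batch = collator(encodings)
```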
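Similarly, the Albertina-PT-PT schedule described above (52 samples per GPU on 8 GPUs with gradient accumulation, a 1e-5 learning rate with linear decay, 10k warm-up steps, 25 epochs) could be expressed roughly with the Hugging Face `TrainingArguments`. This is a hedged approximation: the `output_dir` is a placeholder, and the accumulation value of 2 is inferred from 832 = 52 × 8 × 2 rather than stated in the README.

```python
# Rough approximation of the hyperparameters described above; not the
# authors' actual configuration.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="albertina-ptpt",      # placeholder path
    per_device_train_batch_size=52,   # 52 samples per GPU, 8 GPUs
    gradient_accumulation_steps=2,    # inferred: 52 * 8 * 2 = 832 samples per update
    learning_rate=1e-5,
    lr_scheduler_type="linear",       # linear decay
    warmup_steps=10_000,
    num_train_epochs=25,
)
```

The PT-BR variant would differ only in the per-device batch size (56 on 16 GPUs, no gradient accumulation) and in running for 50 epochs.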