jarodrigues committed
Commit: cb47993
Parent(s): cba1aa1
Update README.md
README.md CHANGED
@@ -107,7 +107,7 @@ As codebase, we resorted to the [DeBERTa V2 XLarge](https://huggingface.co/micro
 
 To train **Albertina-PT-PT**, the data set was tokenized with the original DeBERTa tokenizer with a 128 token sequence truncation and dynamic padding.
 The model was trained using the maximum available memory capacity resulting in a batch size of 832 samples (52 samples per GPU and applying gradient accumulation in order to approximate the batch size of the PT-BR model).
-Similarly to the PT-BR variant
+Similarly to the PT-BR variant, we opted for a learning rate of 1e-5 with linear decay and 10k warm-up steps.
 However, since the number of training examples is approximately twice of that in the PT-BR variant, we reduced the number of training epochs to half and completed only 25 epochs, which resulted in approximately 245k steps.
 The model was trained for 3 days on a2-highgpu-8gb Google Cloud A2 VMs with 8 GPUs, 96 vCPUs and 680 GB of RAM.
 
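The hunk above completes the training recipe stated in the README: 128-token truncation with dynamic padding, a learning rate of 1e-5 with linear decay and 10k warm-up steps, 52 samples per GPU on 8 GPUs with gradient accumulation for an effective batch of 832, and 25 epochs (~245k steps). Below is a minimal sketch of how that configuration could be expressed with the Hugging Face Transformers `Trainer`, assuming an MLM objective and placeholder model/dataset handling; the actual training script is not part of this commit, so everything beyond the stated hyperparameters is an assumption.

```python
# Hypothetical reconstruction of the hyperparameters described in the README diff.
# Model and dataset handling are illustrative; only the stated numbers come from the text.
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

CODEBASE = "microsoft/deberta-v2-xlarge"  # DeBERTa V2 XLarge codebase named in the README

tokenizer = AutoTokenizer.from_pretrained(CODEBASE)
model = AutoModelForMaskedLM.from_pretrained(CODEBASE)

def tokenize(batch):
    # 128-token sequence truncation; padding is deferred to the collator (dynamic padding).
    return tokenizer(batch["text"], truncation=True, max_length=128)

# Pads each batch to its longest example (dynamic padding) and applies MLM masking.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)

args = TrainingArguments(
    output_dir="albertina-ptpt",
    per_device_train_batch_size=52,   # 52 samples per GPU, 8 GPUs
    gradient_accumulation_steps=2,    # 52 * 8 * 2 = 832 effective batch size
    learning_rate=1e-5,               # same as the PT-BR variant
    lr_scheduler_type="linear",       # linear decay
    warmup_steps=10_000,              # 10k warm-up steps
    num_train_epochs=25,              # ~245k steps on this data set
)

# trainer = Trainer(model=model, args=args, data_collator=collator,
#                   train_dataset=tokenized_dataset)  # tokenized_dataset: assumed, built with `tokenize`
# trainer.train()
```

The `gradient_accumulation_steps=2` value is inferred from the stated numbers (832 / (52 × 8) = 2); the commit itself only says that gradient accumulation was used to approximate the batch size of the PT-BR model.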