eduagarcia committed on
Commit
9e761cf
1 Parent(s): fbad57e

Update README.md

Files changed (1)
  1. README.md +22 -5
README.md CHANGED
@@ -57,9 +57,8 @@ metrics:
57
  ---
58
  # RoBERTaLexPT-base
59
 
60
- <!-- Provide a quick summary of what the model is/does. -->
61
 
62
- This modelcard aims to be a base template for new models. It has been generated using [this raw template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md?plain=1).
63
 
64
  ## Model Details
65
 
@@ -86,7 +85,8 @@ This modelcard aims to be a base template for new models. It has been generated
86
 
87
  ### Training Procedure
88
 
89
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
 
90
 
91
  #### Preprocessing [optional]
92
 
@@ -95,8 +95,25 @@ This modelcard aims to be a base template for new models. It has been generated
95
 
96
  #### Training Hyperparameters
97
 
98
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
99
-
100
 
101
  ## Evaluation
102
 
 
57
  ---
58
  # RoBERTaLexPT-base
59
 
60
+ RoBERTaLexPT-base is pretrained using [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base), introduced by [Liu et al. (2019)](https://arxiv.org/abs/1907.11692).
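
As a quick usage sketch (assuming the checkpoint is published under the `eduagarcia/RoBERTaLexPT-base` repository id and exposes the standard RoBERTa masked-LM head), the model could be loaded with `transformers` as follows:

```python
# Minimal usage sketch; the repository id is an assumption based on this model card.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="eduagarcia/RoBERTaLexPT-base")

# Illustrative Portuguese legal-domain sentence; <mask> is the RoBERTa mask token.
print(fill_mask("O juiz proferiu a <mask> no processo."))
```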
61
 
 
62
 
63
  ## Model Details
64
 
 
85
 
86
  ### Training Procedure
87
 
88
+ The model was pretrained for 62,500 steps with a batch size of 2048 sequences, each containing at most 512 tokens.
89
+ This computational setup is similar to that of [BERTimbau](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28) and exposes the model to approximately 65 billion tokens during training.
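
As a back-of-the-envelope check of the stated token budget (a sketch that assumes every sequence is packed to the 512-token maximum):

```python
# Rough token-budget check: steps × sequences per batch × tokens per sequence.
steps = 62_500
batch_size = 2048   # sequences per step
seq_len = 512       # tokens per sequence (upper bound)

total_tokens = steps * batch_size * seq_len
print(f"{total_tokens:,}")  # 65,536,000,000 ≈ 65 billion tokens
```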
90
 
91
  #### Preprocessing [optional]
92
 
 
95
 
96
  #### Training Hyperparameters
97
 
98
+ | **Hyperparameter** | **RoBERTa-base** |
99
+ |------------------------|-----------------:|
100
+ | Number of layers | 12 |
101
+ | Hidden size | 768 |
102
+ | FFN inner hidden size | 3072 |
103
+ | Attention heads | 12 |
104
+ | Attention head size | 64 |
105
+ | Dropout | 0.1 |
106
+ | Attention dropout | 0.1 |
107
+ | Warmup steps | 6k |
108
+ | Peak learning rate | 4e-4 |
109
+ | Batch size | 2048 |
110
+ | Weight decay | 0.01 |
111
+ | Maximum training steps | 62.5k |
112
+ | Learning rate decay | Linear |
113
+ | AdamW $$\epsilon$$ | 1e-6 |
114
+ | AdamW $$\beta_1$$ | 0.9 |
115
+ | AdamW $$\beta_2$$ | 0.98 |
116
+ | Gradient clipping | 0.0 |
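
The optimizer-related rows of the table map onto a standard AdamW plus linear-warmup/decay setup. The sketch below is illustrative only (it is not the authors' training script and assumes PyTorch with the `transformers` scheduler helper):

```python
# Illustrative optimizer/scheduler configuration mirroring the table above;
# not the authors' actual pretraining code.
import torch
from transformers import AutoModelForMaskedLM, get_linear_schedule_with_warmup

model = AutoModelForMaskedLM.from_pretrained("FacebookAI/roberta-base")  # same architecture

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=4e-4,            # peak learning rate
    betas=(0.9, 0.98),  # AdamW beta_1, beta_2
    eps=1e-6,           # AdamW epsilon
    weight_decay=0.01,
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=6_000,     # 6k warmup steps
    num_training_steps=62_500,  # 62.5k maximum training steps
)
```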
117
 
118
  ## Evaluation
119