eduagarcia committed on
Commit 28ba6d5
1 Parent(s): 9e761cf

Update README.md

Files changed (1)
  1. README.md +19 -11
README.md CHANGED
@@ -57,7 +57,7 @@ metrics:
  ---
  # RoBERTaLexPT-base
 
- RoBERTaLexPT-base is pretrained from , using [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base), introduced by [Liu et al. (2019)](https://arxiv.org/abs/1907.11692).
 
 
  ## Model Details
@@ -78,23 +78,34 @@ RoBERTaLexPT-base is pretrained from , using [RoBERTa-base](https://huggingface.
  ## Training Details
 
  ### Training Data
 
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
 
- [More Information Needed]
 
- ### Training Procedure
 
- The pretraining process involved training the model for 62,500 steps, with a batch size of 2048 sequences, each containing a maximum of 512 tokens.
  This computational setup is similar to the work of [BERTimbau](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28), exposing the model to approximately 65 billion tokens during training.
 
- #### Preprocessing [optional]
 
- [More Information Needed]
 
 
  #### Training Hyperparameters
 
  | **Hyperparameter** | **RoBERTa-base** |
  |------------------------|-----------------:|
  | Number of layers | 12 |
@@ -123,9 +134,7 @@ This computational setup is similar to the work of [BERTimbau](https://dl.acm.or
 
  #### Testing Data
 
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
 
 
 
@@ -142,6 +151,5 @@ This computational setup is similar to the work of [BERTimbau](https://dl.acm.or
 
  ## Citation
 
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 
  [More Information Needed]
 
  ---
  # RoBERTaLexPT-base
 
+ RoBERTaLexPT-base is pretrained on the LegalPT and CrawlPT corpora, using the [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base) architecture introduced by [Liu et al. (2019)](https://arxiv.org/abs/1907.11692).
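
A minimal usage sketch with the 🤗 Transformers fill-mask pipeline follows; the Hub repository id `eduagarcia/RoBERTaLexPT-base` is assumed from the author and model name, and the example sentence is illustrative only:

```python
from transformers import pipeline

# Assumed Hub repository id (author + model name); adjust if the model is hosted elsewhere.
fill_mask = pipeline("fill-mask", model="eduagarcia/RoBERTaLexPT-base")

# RoBERTa-style models use "<mask>" as the mask token.
for prediction in fill_mask("O advogado apresentou um <mask> ao tribunal."):
    print(prediction["token_str"], round(prediction["score"], 3))
```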
 
 
  ## Model Details
 
  ## Training Details
 
  ### Training Data
+ RoBERTaLexPT-base is pretrained on both of the following corpora (a loading sketch follows the list):
+ - [LegalPT](https://huggingface.co/datasets/eduagarcia/LegalPT) is a Portuguese legal corpus aggregated from diverse sources, adding up to 125 GiB of data.
+ - CrawlPT is a deduplicated combination of three Portuguese general corpora: [brWaC](https://huggingface.co/datasets/eduagarcia/brwac_dedup), [CC100-PT](https://huggingface.co/datasets/eduagarcia/cc100-pt), and [OSCAR-2301](https://huggingface.co/datasets/eduagarcia/OSCAR-2301-pt_dedup).
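
A minimal sketch of streaming these corpora from the Hugging Face Hub with the `datasets` library; the `train` split names are assumptions, so check each dataset card for the exact configurations before use:

```python
from datasets import load_dataset

# LegalPT: Portuguese legal corpus (repo id from the link above).
legalpt = load_dataset("eduagarcia/LegalPT", split="train", streaming=True)

# CrawlPT components: general-domain Portuguese corpora (repo ids from the links above).
brwac = load_dataset("eduagarcia/brwac_dedup", split="train", streaming=True)
cc100_pt = load_dataset("eduagarcia/cc100-pt", split="train", streaming=True)
oscar_pt = load_dataset("eduagarcia/OSCAR-2301-pt_dedup", split="train", streaming=True)

# Inspect one document from the legal corpus.
print(next(iter(legalpt)))
```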
 
+ ### Training Procedure
 
+ Our pretraining was run with the [Fairseq library](https://arxiv.org/abs/1904.01038) on a DGX-A100 cluster, using a total of 2 NVIDIA A100 80 GB GPUs.
+ The complete training of a single configuration takes approximately three days.
 
 
  This computational setup is similar to the work of [BERTimbau](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28), exposing the model to approximately 65 billion tokens during training.
 
+ #### Preprocessing
 
+ Following the approach of [Lee et al. (2022)](http://arxiv.org/abs/2107.06499), we deduplicated all subsets of the LegalPT Corpus using the [MinHash algorithm](https://dl.acm.org/doi/abs/10.5555/647819.736184) and [Locality Sensitive Hashing](https://dspace.mit.edu/bitstream/handle/1721.1/134231/v008a014.pdf?sequence=2&isAllowed=y) to find clusters of duplicate documents.
+
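As an illustration of this step (a sketch, not the exact pipeline used here), near-duplicate clusters can be found with the `datasketch` implementation of MinHash and LSH; the shingle size, number of permutations, and similarity threshold below are arbitrary choices:

```python
from datasketch import MinHash, MinHashLSH

def minhash_signature(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from word 5-gram shingles (shingle size is an arbitrary choice)."""
    tokens = text.split()
    shingles = {" ".join(tokens[i:i + 5]) for i in range(max(1, len(tokens) - 4))}
    m = MinHash(num_perm=num_perm)
    for shingle in shingles:
        m.update(shingle.encode("utf-8"))
    return m

# Example corpus: a mapping from document id to raw text.
documents = {
    "doc-1": "acórdão do tribunal sobre o recurso interposto pela parte autora",
    "doc-2": "acórdão do tribunal sobre o recurso interposto pela parte ré",  # near-duplicate of doc-1
    "doc-3": "petição inicial com pedido de tutela de urgência em matéria distinta",
}

# The LSH index groups documents whose estimated Jaccard similarity exceeds the threshold.
lsh = MinHashLSH(threshold=0.7, num_perm=128)
signatures = {}
for doc_id, text in documents.items():
    signatures[doc_id] = minhash_signature(text)
    lsh.insert(doc_id, signatures[doc_id])

# Query each document to find its cluster of likely duplicates.
for doc_id, sig in signatures.items():
    print(doc_id, "->", lsh.query(sig))
```
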
+ To ensure that domain models are not constrained by a generic vocabulary, we used the BPE algorithm from [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers) to train a vocabulary for each pre-training corpus.
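
A minimal sketch of such vocabulary training with the `tokenizers` library; the byte-level BPE variant, the 50,265-entry vocabulary size, and the special tokens follow standard RoBERTa conventions and are assumptions rather than the exact settings used here:

```python
from tokenizers import ByteLevelBPETokenizer

# Train a corpus-specific vocabulary from plain-text files (paths are placeholders).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["legalpt_shard_00.txt", "legalpt_shard_01.txt"],
    vocab_size=50_265,          # standard RoBERTa vocabulary size (assumed here)
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Persist vocab.json and merges.txt for later use with the model.
tokenizer.save_model("tokenizer-legalpt")
```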
 
 
  #### Training Hyperparameters
 
+ The pretraining process involved training the model for 62,500 steps, with a batch size of 2048 sequences, each containing a maximum of 512 tokens (62,500 × 2,048 × 512 ≈ 65.5 billion tokens, consistent with the figure above).
+ We employed the masked language modeling objective, where 15% of the input tokens were randomly masked.
+ The optimization was performed using the AdamW optimizer with a linear warmup followed by a linear decay learning rate schedule.
+
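The actual pretraining was run with Fairseq; purely to illustrate the objective and optimizer choices above, here is a hedged 🤗 Transformers sketch in which the tokenizer path, text column, per-device batch size, gradient accumulation, peak learning rate, and warmup length are placeholder assumptions:

```python
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Tokenizer trained in the preprocessing step (placeholder directory).
tokenizer = RobertaTokenizerFast.from_pretrained("tokenizer-legalpt")
model = RobertaForMaskedLM(RobertaConfig(vocab_size=len(tokenizer)))

# Dynamic masking of 15% of input tokens, as described above.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# Placeholder corpus and text column; the real run used LegalPT and CrawlPT.
dataset = load_dataset("eduagarcia/LegalPT", split="train")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=dataset.column_names,
)

args = TrainingArguments(
    output_dir="roberta-lex-pt",
    max_steps=62_500,                 # 62,500 update steps
    per_device_train_batch_size=32,   # placeholder: batch size x accumulation x n_gpus should yield 2,048 sequences
    gradient_accumulation_steps=32,
    learning_rate=6e-4,               # assumed peak learning rate; see the hyperparameter table
    lr_scheduler_type="linear",       # linear decay after warmup
    warmup_steps=6_250,               # placeholder linear warmup
    weight_decay=0.01,
    optim="adamw_torch",              # AdamW optimizer
)

Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()
```
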
+ We adopted the standard [RoBERTa hyperparameters](https://arxiv.org/abs/1907.11692):
+
+
  | **Hyperparameter** | **RoBERTa-base** |
  |------------------------|-----------------:|
  | Number of layers | 12 |
 
  #### Testing Data
 
+ The model was evaluated on the ["PortuLex" benchmark](eduagarcia/portuguese_benchmark), a four-task benchmark designed to evaluate the quality and performance of language models in the Portuguese legal domain.
 
  #### Metrics
 
 
  ## Citation
 
  [More Information Needed]