File size: 9,769 Bytes
3171fcb
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
e15643e
3171fcb
e15643e
3171fcb
 
6ec58fd
3171fcb
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
2d39042
3171fcb
 
 
 
 
5ced76d
3171fcb
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
375c5bd
6ec58fd
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
3171fcb
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
---
datasets:
- eduagarcia/CrawlPT_dedup
language:
- pt
pipeline_tag: fill-mask
model-index:
- name: RoBERTaCrawlPT-base
  results:
  - task:
      type: token-classification
    dataset:
      type: lener_br
      name: lener_br
      split: test
    metrics:
    - type: seqeval
      value: 0.8924
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: UlyNER-PL Coarse
      config: UlyssesNER-Br-PL-coarse
      split: test
    metrics:
    - type: seqeval
      value: 0.8822
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: UlyNER-PL Fine
      config: UlyssesNER-Br-PL-fine
      split: test
    metrics:
    - type: seqeval
      value: 0.8658
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: FGV-STF
      config: fgv-coarse
      split: test
    metrics:
    - type: seqeval
      value: 0.7988
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: RRIP
      config: rrip
      split: test
    metrics:
    - type: seqeval
      value: 0.8280
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: PortuLex
      split: test
    metrics:
    - type: seqeval
      value: 0.8483
      name: Average F1
      args:
        scheme: IOB2
license: cc-by-4.0
metrics:
- seqeval
---
# RoBERTaCrawlPT-base

RoBERTaCrawlPT-base is a generic Portuguese Masked Language Model pretrained from scratch from the [CrawlPT](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup) corpora, using the same architecture as [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base).
This model is part of the [RoBERTaLexPT](https://huggingface.co/eduagarcia/RoBERTaLegalPT-base) [work](https://aclanthology.org/2024.propor-1.38/).

- **Language(s) (NLP):** Portuguese (pt-BR mainly)
- **License:** [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/deed.en)
- **Repository:** https://github.com/eduagarcia/roberta-legal-portuguese
- **Paper:** https://aclanthology.org/2024.propor-1.38/

## Generic Evaluation

TO-DO...

## Legal Evaluation

The model was evaluated on ["PortuLex" benchmark](https://huggingface.co/datasets/eduagarcia/PortuLex_benchmark), a four-task benchmark designed to evaluate the quality and performance of language models in the Portuguese legal domain.

Macro F1-Score (\%) for multiple models evaluated on PortuLex benchmark test splits:

| **Model**                                                                  | **LeNER** | **UlyNER-PL**   | **FGV-STF** |  **RRIP** | **Average (%)** |
|----------------------------------------------------------------------------|-----------|-----------------|-------------|:---------:|-----------------|
|                                                                            |           | Coarse/Fine     | Coarse      |           |                 |
| [BERTimbau-base](https://huggingface.co/neuralmind/bert-base-portuguese-cased)  | 88.34     | 86.39/83.83     | 79.34       |   82.34   | 83.78           |
| [BERTimbau-large](https://huggingface.co/neuralmind/bert-large-portuguese-cased) | 88.64     | 87.77/84.74     | 79.71       | **83.79** | 84.60           |
| [Albertina-PT-BR-base](https://huggingface.co/PORTULAN/albertina-ptbr-based)                   | 89.26     | 86.35/84.63     | 79.30       |   81.16   | 83.80           |
| [Albertina-PT-BR-xlarge](https://huggingface.co/PORTULAN/albertina-ptbr)                 | 90.09     | 88.36/**86.62** | 79.94       |   82.79   | 85.08           |
| [BERTikal-base](https://huggingface.co/felipemaiapolo/legalnlp-bert)                          | 83.68     | 79.21/75.70     | 77.73       |   81.11   | 79.99           |
| [JurisBERT-base](https://huggingface.co/alfaneo/jurisbert-base-portuguese-uncased)        | 81.74     | 81.67/77.97     | 76.04       |   80.85   | 79.61           |
| [BERTimbauLAW-base](https://huggingface.co/alfaneo/bertimbaulaw-base-portuguese-cased)     | 84.90     | 87.11/84.42     | 79.78       |   82.35   | 83.20           |
| [Legal-XLM-R-base](https://huggingface.co/joelniklaus/legal-xlm-roberta-base)                       | 87.48     | 83.49/83.16     | 79.79       |   82.35   | 83.24           |
| [Legal-XLM-R-large](https://huggingface.co/joelniklaus/legal-xlm-roberta-large)                      | 88.39     | 84.65/84.55     | 79.36       |   81.66   | 83.50           |
| [Legal-RoBERTa-PT-large](https://huggingface.co/joelniklaus/legal-portuguese-roberta-large)                 | 87.96     | 88.32/84.83     | 79.57       |   81.98   | 84.02           |
| **Ours**                                                                   |           |                 |             |           |                 |
| RoBERTaTimbau-base (Reproduction of BERTimbau)                             | 89.68     | 87.53/85.74     | 78.82       |   82.03   | 84.29           |
| RoBERTaLegalPT-base (Trained on LegalPT)    | 90.59     | 85.45/84.40     | 79.92       |   82.84   | 84.57           |
| **RoBERTaCrawlPT-base (this)**  (Trained on CrawlPT)    | 89.24     | 88.22/86.58     | 79.88       |   82.80   | 84.83           |
| [RoBERTaLexPT-base](https://huggingface.co/eduagarcia/RoBERTaLexPT-base) (Trained on CrawlPT + LegalPT)                       | **90.73** | **88.56**/86.03 | **80.40**   |   83.22   | **85.41**       |


## Training Details

RoBERTaCrawlPT is pretrained on:
- [CrawlPT](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup) is a composition of three Portuguese general corpora: [brWaC](https://huggingface.co/datasets/brwac_dedup), CC100 PT subset, [OSCAR-2301 PT subset](https://huggingface.co/datasets/eduagarcia/OSCAR-2301-pt_dedup).

### Training Procedure

Our pretraining process was executed using the [Fairseq library v0.10.2](https://github.com/facebookresearch/fairseq/tree/v0.10.2) on a DGX-A100 cluster, utilizing a total of 2 Nvidia A100 80 GB GPUs.
The complete training of a single configuration takes approximately three days.


This computational cost is similar to the work of [BERTimbau-base](https://huggingface.co/neuralmind/bert-base-portuguese-cased), exposing the model to approximately 65 billion tokens during training.

#### Preprocessing

We deduplicated all subsets of the CrawlPT Corpus using the a MinHash algorithm and Locality Sensitive Hashing implementation from the libary [text-dedup](https://github.com/ChenghaoMou/text-dedup) to find clusters of duplicate documents.

To ensure that domain models are not constrained by a generic vocabulary, we utilized the [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers) -- BPE algorithm to train a vocabulary for each pre-training corpus used.

#### Training Hyperparameters

The pretraining process involved training the model for 62,500 steps, with a batch size of 2048 and a learning rate of 4e-4, each sequence containing a maximum of 512 tokens.  
The weight initialization is random.  
We employed the masked language modeling objective, where 15\% of the input tokens were randomly masked.  
The optimization was performed using the AdamW optimizer with a linear warmup and a linear decay learning rate schedule.  

For other parameters we adopted the standard [RoBERTa-base hyperparameters](https://huggingface.co/FacebookAI/roberta-base):


| **Hyperparameter**     | **RoBERTa-base** |
|------------------------|-----------------:|
| Number of layers       |               12 |
| Hidden size            |              768 |
| FFN inner hidden size  |             3072 |
| Attention heads        |               12 |
| Attention head size    |               64 |
| Dropout                |              0.1 |
| Attention dropout      |              0.1 |
| Warmup steps           |               6k |
| Peak learning rate     |             4e-4 |
| Batch size             |             2048 |
| Weight decay           |             0.01 |
| Maximum training steps |            62.5k |
| Learning rate decay    |           Linear |
| AdamW $$\epsilon$$     |             1e-6 |
| AdamW $$\beta_1$$      |              0.9 |
| AdamW $$\beta_2$$      |             0.98 |
| Gradient clipping      |              0.0 |

## Citation

```bibtex
@inproceedings{garcia-etal-2024-robertalexpt,
    title = "{R}o{BERT}a{L}ex{PT}: A Legal {R}o{BERT}a Model pretrained with deduplication for {P}ortuguese",
    author = "Garcia, Eduardo A. S.  and
      Silva, Nadia F. F.  and
      Siqueira, Felipe  and
      Albuquerque, Hidelberg O.  and
      Gomes, Juliana R. S.  and
      Souza, Ellen  and
      Lima, Eliomar A.",
    editor = "Gamallo, Pablo  and
      Claro, Daniela  and
      Teixeira, Ant{\'o}nio  and
      Real, Livy  and
      Garcia, Marcos  and
      Oliveira, Hugo Gon{\c{c}}alo  and
      Amaro, Raquel",
    booktitle = "Proceedings of the 16th International Conference on Computational Processing of Portuguese",
    month = mar,
    year = "2024",
    address = "Santiago de Compostela, Galicia/Spain",
    publisher = "Association for Computational Lingustics",
    url = "https://aclanthology.org/2024.propor-1.38",
    pages = "374--383",
}
```

## Acknowledgment

This work has been supported by the AI Center of Excellence (Centro de Excelência em Inteligência Artificial – CEIA) of the Institute of Informatics at the Federal University of Goiás (INF-UFG).