|
--- |
|
datasets: |
|
- eduagarcia/LegalPT_dedup |
|
- eduagarcia/CrawlPT_dedup |
|
language: |
|
- pt |
|
pipeline_tag: fill-mask |
|
tags: |
|
- legal |
|
model-index: |
|
- name: RoBERTaLexPT-base |
|
results: |
|
- task: |
|
type: token-classification |
|
dataset: |
|
type: lener_br |
|
name: lener_br |
|
split: test |
|
metrics: |
|
- type: seqeval |
|
value: 0.9073 |
|
name: F1 |
|
args: |
|
scheme: IOB2 |
|
- task: |
|
type: token-classification |
|
dataset: |
|
type: eduagarcia/PortuLex_benchmark |
|
name: UlyNER-PL Coarse |
|
config: UlyssesNER-Br-PL-coarse |
|
split: test |
|
metrics: |
|
- type: seqeval |
|
value: 0.8856 |
|
name: F1 |
|
args: |
|
scheme: IOB2 |
|
- task: |
|
type: token-classification |
|
dataset: |
|
type: eduagarcia/PortuLex_benchmark |
|
name: UlyNER-PL Fine |
|
config: UlyssesNER-Br-PL-fine |
|
split: test |
|
metrics: |
|
- type: seqeval |
|
value: 0.8603 |
|
name: F1 |
|
args: |
|
scheme: IOB2 |
|
- task: |
|
type: token-classification |
|
dataset: |
|
type: eduagarcia/PortuLex_benchmark |
|
name: FGV-STF |
|
config: fgv-coarse |
|
split: test |
|
metrics: |
|
- type: seqeval |
|
value: 0.8040 |
|
name: F1 |
|
args: |
|
scheme: IOB2 |
|
- task: |
|
type: token-classification |
|
dataset: |
|
type: eduagarcia/PortuLex_benchmark |
|
name: RRIP |
|
config: rrip |
|
split: test |
|
metrics: |
|
- type: seqeval |
|
value: 0.8322 |
|
name: F1 |
|
args: |
|
scheme: IOB2 |
|
- task: |
|
type: token-classification |
|
dataset: |
|
type: eduagarcia/PortuLex_benchmark |
|
name: PortuLex |
|
split: test |
|
metrics: |
|
- type: seqeval |
|
value: 0.8541 |
|
name: Average F1 |
|
args: |
|
scheme: IOB2 |
|
license: cc-by-4.0 |
|
metrics: |
|
- seqeval |
|
--- |
|
# RoBERTaLexPT-base |
|
|
|
RoBERTaLexPT-base is a Portuguese masked language model pretrained from scratch on the [LegalPT](https://huggingface.co/datasets/eduagarcia/LegalPT_dedup) and [CrawlPT](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup) corpora, using the same architecture as [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base), introduced by Liu et al. (2019).
|
|
|
- **Language(s) (NLP):** Portuguese (pt-BR and pt-PT) |
|
- **License:** [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/deed.en) |
|
- **Repository:** https://github.com/eduagarcia/roberta-legal-portuguese |
|
- **Paper:** https://aclanthology.org/2024.propor-1.38/ |
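The checkpoint can be used directly with the `transformers` fill-mask pipeline. A minimal usage sketch, assuming the model is published on the Hub as `eduagarcia/RoBERTaLexPT-base` (adjust the identifier if the checkpoint lives elsewhere):

```python
from transformers import pipeline

# Assumed Hub identifier; replace with the actual checkpoint path if different.
fill_mask = pipeline("fill-mask", model="eduagarcia/RoBERTaLexPT-base")

# RoBERTa-style models use "<mask>" as the mask token.
print(fill_mask("O juiz julgou <mask> o pedido de indenização."))
```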
|
|
|
## Evaluation |
|
|
|
The model was evaluated on the [PortuLex benchmark](https://huggingface.co/datasets/eduagarcia/PortuLex_benchmark), a four-task benchmark designed to assess the quality and performance of language models in the Portuguese legal domain.
|
|
|
Macro F1-score (%) for multiple models evaluated on the PortuLex benchmark test splits:
|
|
|
| **Model**                                                                    | **LeNER** | **UlyNER-PL** (Coarse/Fine) | **FGV-STF** (Coarse) | **RRIP**  | **Average (%)** |
|------------------------------------------------------------------------------|-----------|-----------------------------|----------------------|:---------:|-----------------|
|
| [BERTimbau-base](https://huggingface.co/neuralmind/bert-base-portuguese-cased) | 88.34 | 86.39/83.83 | 79.34 | 82.34 | 83.78 | |
|
| [BERTimbau-large](https://huggingface.co/neuralmind/bert-large-portuguese-cased) | 88.64 | 87.77/84.74 | 79.71 | **83.79** | 84.60 | |
|
| [Albertina-PT-BR-base](https://huggingface.co/PORTULAN/albertina-ptbr-based) | 89.26 | 86.35/84.63 | 79.30 | 81.16 | 83.80 | |
|
| [Albertina-PT-BR-xlarge](https://huggingface.co/PORTULAN/albertina-ptbr) | 90.09 | 88.36/**86.62** | 79.94 | 82.79 | 85.08 | |
|
| [BERTikal-base](https://huggingface.co/felipemaiapolo/legalnlp-bert) | 83.68 | 79.21/75.70 | 77.73 | 81.11 | 79.99 | |
|
| [JurisBERT-base](https://huggingface.co/alfaneo/jurisbert-base-portuguese-uncased) | 81.74 | 81.67/77.97 | 76.04 | 80.85 | 79.61 | |
|
| [BERTimbauLAW-base](https://huggingface.co/alfaneo/bertimbaulaw-base-portuguese-cased) | 84.90 | 87.11/84.42 | 79.78 | 82.35 | 83.20 | |
|
| [Legal-XLM-R-base](https://huggingface.co/joelniklaus/legal-xlm-roberta-base) | 87.48 | 83.49/83.16 | 79.79 | 82.35 | 83.24 | |
|
| [Legal-XLM-R-large](https://huggingface.co/joelniklaus/legal-xlm-roberta-large) | 88.39 | 84.65/84.55 | 79.36 | 81.66 | 83.50 | |
|
| [Legal-RoBERTa-PT-large](https://huggingface.co/joelniklaus/legal-portuguese-roberta-large) | 87.96 | 88.32/84.83 | 79.57 | 81.98 | 84.02 | |
|
| **Ours** | | | | | | |
|
| RoBERTaTimbau-base (Reproduction of BERTimbau) | 89.68 | 87.53/85.74 | 78.82 | 82.03 | 84.29 | |
|
| RoBERTaLegalPT-base (Trained on LegalPT) | 90.59 | 85.45/84.40 | 79.92 | 82.84 | 84.57 | |
|
| [RoBERTaCrawlPT-base](https://huggingface.co/eduagarcia/RoBERTaCrawlPT-base) (Trained on CrawlPT) | 89.24 | 88.22/86.58 | 79.88 | 82.80 | 84.83 | |
|
| **RoBERTaLexPT-base (this)** (Trained on CrawlPT + LegalPT) | **90.73** | **88.56**/86.03 | **80.40** | 83.22 | **85.41** | |
|
|
|
In summary, RoBERTaLexPT-base consistently achieves top effectiveness on legal NLP tasks despite its base size; with sufficient pre-training data, it surpasses larger models. The results highlight the importance of domain-diverse training data over sheer model scale.
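The per-task scores above (and in the model-index metadata) are span-level F1 values computed with seqeval over IOB2-tagged sequences. A minimal sketch of this metric on toy data, assuming strict IOB2 evaluation rather than the benchmark's exact harness:

```python
from seqeval.metrics import f1_score
from seqeval.scheme import IOB2

# Toy IOB2-tagged sentences (illustrative only, not benchmark data).
y_true = [["B-PESSOA", "I-PESSOA", "O", "B-ORGANIZACAO", "O"]]
y_pred = [["B-PESSOA", "I-PESSOA", "O", "O", "O"]]

# Strict mode counts a span as correct only if its boundaries and type both match.
print(f1_score(y_true, y_pred, mode="strict", scheme=IOB2, average="macro"))
```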
|
|
|
## Training Details |
|
|
|
RoBERTaLexPT-base is pretrained on: |
|
- [LegalPT](https://huggingface.co/datasets/eduagarcia/LegalPT_dedup) is a Portuguese legal corpus built by aggregating diverse sources, totaling up to 125 GiB of data.
|
- [CrawlPT](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup) is a composition of three general Portuguese corpora: [brWaC](https://huggingface.co/datasets/brwac), the [CC100 PT subset](https://huggingface.co/datasets/eduagarcia/cc100-pt), and the [OSCAR-2301 PT subset](https://huggingface.co/datasets/eduagarcia/OSCAR-2301-pt_dedup).
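Both corpora are hosted on the Hugging Face Hub and can be inspected with the `datasets` library. A minimal sketch that streams the deduplicated LegalPT corpus; the split name, the field name, and the absence of a required config are assumptions, so check the dataset cards for the exact loading arguments:

```python
from datasets import load_dataset

# Stream the corpus to avoid downloading ~125 GiB at once.
# The default config and the "train" split are assumptions; see the dataset card.
legalpt = load_dataset("eduagarcia/LegalPT_dedup", split="train", streaming=True)

for doc in legalpt.take(1):
    print(doc["text"][:200])  # the field name "text" is an assumption
```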
|
|
|
### Training Procedure |
|
|
|
Our pretraining process was executed with the [Fairseq library v0.10.2](https://github.com/facebookresearch/fairseq/tree/v0.10.2) on a DGX-A100 cluster, using a total of two Nvidia A100 80 GB GPUs.
|
The complete training of a single configuration takes approximately three days. |
|
|
|
|
|
This computational budget is similar to that of [BERTimbau-base](https://huggingface.co/neuralmind/bert-base-portuguese-cased), exposing the model to approximately 65 billion tokens during training.
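The ~65 billion token figure follows directly from the training hyperparameters given below (62,500 steps × 2,048 sequences per batch × 512 tokens per sequence):

```python
# Token budget implied by the pretraining hyperparameters.
steps, batch_size, seq_len = 62_500, 2_048, 512
total_tokens = steps * batch_size * seq_len
print(f"{total_tokens / 1e9:.1f}B tokens")  # -> 65.5B, matching the ~65 billion figure
```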
|
|
|
#### Preprocessing |
|
|
|
We deduplicated all subsets of the LegalPT and CrawlPT corpora with the MinHash algorithm and Locality-Sensitive Hashing, using the implementation from the [text-dedup](https://github.com/ChenghaoMou/text-dedup) library to find clusters of duplicate documents.
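For reference, the general idea behind MinHash + LSH near-duplicate detection can be illustrated with the `datasketch` library; this is only a sketch of the technique on toy documents, not the exact `text-dedup` pipeline or its parameters:

```python
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from whitespace-tokenized text."""
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf-8"))
    return m

docs = {
    "doc1": "O tribunal negou provimento ao recurso interposto pela parte autora.",
    "doc2": "O tribunal negou provimento ao recurso interposto pela parte ré.",
    "doc3": "Relatório anual de atividades da biblioteca municipal.",
}

# The Jaccard-similarity threshold here is illustrative; the paper's setting may differ.
lsh = MinHashLSH(threshold=0.7, num_perm=128)
for doc_id, text in docs.items():
    lsh.insert(doc_id, minhash(text))

# Query near-duplicates of doc1 (the result includes doc1 itself).
print(lsh.query(minhash(docs["doc1"])))
```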
|
|
|
To ensure that the domain models are not constrained by a generic vocabulary, we used the BPE algorithm from the [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers) library to train a vocabulary for each pre-training corpus.
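A minimal sketch of training such a corpus-specific vocabulary with HuggingFace Tokenizers; the corpus file path is a placeholder and the 50k vocabulary size mirrors RoBERTa-base as an assumption:

```python
from tokenizers import ByteLevelBPETokenizer

# Byte-level BPE, as used by RoBERTa-style models.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["legalpt_corpus.txt"],          # placeholder path to the raw pre-training text
    vocab_size=50_265,                     # assumption: RoBERTa-base-sized vocabulary
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("tokenizer-legalpt")  # writes vocab.json and merges.txt
```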
|
|
|
#### Training Hyperparameters |
|
|
|
The model was pretrained for 62,500 steps with a batch size of 2,048 sequences, a peak learning rate of 4e-4, and a maximum sequence length of 512 tokens.
|
The weight initialization is random. |
|
We employed the masked language modeling objective, where 15% of the input tokens were randomly masked.
|
The optimization was performed using the AdamW optimizer with a linear warmup and a linear decay learning rate schedule. |
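In the `transformers` ecosystem, the same 15% dynamic masking behaviour can be reproduced with the standard data collator; a sketch assuming the tokenizer is available under the assumed Hub identifier `eduagarcia/RoBERTaLexPT-base`:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Assumed Hub identifier for the tokenizer; adjust if the checkpoint lives elsewhere.
tokenizer = AutoTokenizer.from_pretrained("eduagarcia/RoBERTaLexPT-base")

# Dynamic masking of 15% of input tokens, as in the pretraining objective described above.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoded = tokenizer("A petição inicial foi indeferida pelo juízo de primeiro grau.",
                    truncation=True, max_length=512)
batch = collator([encoded])
print(batch["input_ids"].shape, batch["labels"].shape)
```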
|
|
|
For the other parameters, we adopted the standard [RoBERTa-base hyperparameters](https://huggingface.co/FacebookAI/roberta-base):
|
|
|
|
|
| **Hyperparameter** | **RoBERTa-base** | |
|
|------------------------|-----------------:| |
|
| Number of layers | 12 | |
|
| Hidden size | 768 | |
|
| FFN inner hidden size | 3072 | |
|
| Attention heads | 12 | |
|
| Attention head size | 64 | |
|
| Dropout | 0.1 | |
|
| Attention dropout | 0.1 | |
|
| Warmup steps | 6k | |
|
| Peak learning rate | 4e-4 | |
|
| Batch size | 2048 | |
|
| Weight decay | 0.01 | |
|
| Maximum training steps | 62.5k | |
|
| Learning rate decay | Linear | |
|
| AdamW $$\epsilon$$ | 1e-6 | |
|
| AdamW $$\beta_1$$ | 0.9 | |
|
| AdamW $$\beta_2$$ | 0.98 | |
|
| Gradient clipping | 0.0 | |
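For reference, the architecture side of this table maps onto a `transformers` configuration roughly as follows; this is a sketch of an equivalent config, not the exported training setup, and the vocabulary size is an assumption:

```python
from transformers import RobertaConfig

# Architecture hyperparameters mirroring the RoBERTa-base column above.
config = RobertaConfig(
    num_hidden_layers=12,
    hidden_size=768,
    intermediate_size=3072,           # FFN inner hidden size
    num_attention_heads=12,           # 12 heads x 64 dims = 768
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=514,      # 512 tokens plus RoBERTa's two offset positions
    vocab_size=50_265,                # assumption: RoBERTa-base-sized BPE vocabulary
)
print(config)
```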
|
|
|
## Citation |
|
|
|
``` |
|
@inproceedings{garcia-etal-2024-robertalexpt, |
|
title = "{R}o{BERT}a{L}ex{PT}: A Legal {R}o{BERT}a Model pretrained with deduplication for {P}ortuguese", |
|
author = "Garcia, Eduardo A. S. and |
|
Silva, Nadia F. F. and |
|
Siqueira, Felipe and |
|
Albuquerque, Hidelberg O. and |
|
Gomes, Juliana R. S. and |
|
Souza, Ellen and |
|
Lima, Eliomar A.", |
|
editor = "Gamallo, Pablo and |
|
Claro, Daniela and |
|
Teixeira, Ant{\'o}nio and |
|
Real, Livy and |
|
Garcia, Marcos and |
|
Oliveira, Hugo Gon{\c{c}}alo and |
|
Amaro, Raquel", |
|
booktitle = "Proceedings of the 16th International Conference on Computational Processing of Portuguese", |
|
month = mar, |
|
year = "2024", |
|
address = "Santiago de Compostela, Galicia/Spain", |
|
publisher = "Association for Computational Lingustics", |
|
url = "https://aclanthology.org/2024.propor-1.38", |
|
pages = "374--383", |
|
} |
|
``` |
|
|
|
## Acknowledgment |
|
|
|
This work has been supported by the AI Center of Excellence (Centro de Excelência em Inteligência Artificial – CEIA) of the Institute of Informatics at the Federal University of Goiás (INF-UFG). |