Commit d789b2b by eduagarcia (parent: ecd611d): Update README.md

README.md CHANGED

Macro F1-Score (\%) for multiple models evaluated on PortuLex benchmark test splits:

| **Model** | **LeNER** | **UlyNER-PL** | **FGV-STF** | **RRIP** | **Average (%)** |
|----------------------------------------------------------------------------|-----------|-----------------|-------------|:---------:|-----------------|
| | | Coarse/Fine | Coarse | | |
| [BERTimbau-base](https://huggingface.co/neuralmind/bert-base-portuguese-cased) | 88.34 | 86.39/83.83 | 79.34 | 82.34 | 83.78 |
| [BERTimbau-large](https://huggingface.co/neuralmind/bert-large-portuguese-cased) | 88.64 | 87.77/84.74 | 79.71 | **83.79** | 84.60 |
| [Albertina-PT-BR-base](https://huggingface.co/PORTULAN/albertina-ptbr-based) | 89.26 | 86.35/84.63 | 79.30 | 81.16 | 83.80 |
| [Albertina-PT-BR-xlarge](https://huggingface.co/PORTULAN/albertina-ptbr) | 90.09 | 88.36/**86.62** | 79.94 | 82.79 | 85.08 |
| [BERTikal-base](https://huggingface.co/felipemaiapolo/legalnlp-bert) | 83.68 | 79.21/75.70 | 77.73 | 81.11 | 79.99 |
| [JurisBERT-base](https://huggingface.co/alfaneo/jurisbert-base-portuguese-uncased) | 81.74 | 81.67/77.97 | 76.04 | 80.85 | 79.61 |
| [BERTimbauLAW-base](https://huggingface.co/alfaneo/bertimbaulaw-base-portuguese-cased) | 84.90 | 87.11/84.42 | 79.78 | 82.35 | 83.20 |
| [Legal-XLM-R-base](https://huggingface.co/joelniklaus/legal-xlm-roberta-base) | 87.48 | 83.49/83.16 | 79.79 | 82.35 | 83.24 |
| [Legal-XLM-R-large](https://huggingface.co/joelniklaus/legal-xlm-roberta-large) | 88.39 | 84.65/84.55 | 79.36 | 81.66 | 83.50 |
| [Legal-RoBERTa-PT-large](https://huggingface.co/joelniklaus/legal-portuguese-roberta-large) | 87.96 | 88.32/84.83 | 79.57 | 81.98 | 84.02 |
| **Ours** | | | | | |
| RoBERTaTimbau-base (Reproduction of BERTimbau) | 89.68 | 87.53/85.74 | 78.82 | 82.03 | 84.29 |
| RoBERTaLegalPT-base (Trained on LegalPT) | 90.59 | 85.45/84.40 | 79.92 | 82.84 | 84.57 |

With sufficient pre-training data, it can surpass larger models.

RoBERTaLexPT-base is pretrained on both corpora:
- [LegalPT](https://huggingface.co/datasets/eduagarcia/LegalPT_dedup) is a Portuguese legal corpus built by aggregating diverse sources, totaling up to 125 GiB of data.
- [CrawlPT](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup) is a composition of three general Portuguese corpora: [brWaC](https://huggingface.co/datasets/brwac), the [CC100 PT subset](https://huggingface.co/datasets/eduagarcia/cc100-pt), and the [OSCAR-2301 PT subset](https://huggingface.co/datasets/eduagarcia/OSCAR-2301-pt_dedup).
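
Both corpora are published on the Hugging Face Hub, so they can be inspected directly with the `datasets` library. A minimal sketch follows; the split name and the use of streaming are assumptions, so check each dataset card for the exact configurations:

```python
# Sketch: peeking at the pre-training corpora from the Hugging Face Hub.
# The repository IDs come from the links above; the split name and streaming
# mode are assumptions -- consult the dataset cards before relying on them.
from datasets import load_dataset

# Stream LegalPT (deduplicated) so the ~125 GiB corpus is not fully downloaded.
legalpt = load_dataset("eduagarcia/LegalPT_dedup", split="train", streaming=True)

# Inspect the first document's fields.
first_doc = next(iter(legalpt))
print(first_doc.keys())
```
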
### Training Procedure

Our pretraining was executed with the [Fairseq library v0.10.2](https://github.com/facebookresearch/fairseq/tree/v0.10.2) on a DGX-A100 cluster, using a total of 2 Nvidia A100 80 GB GPUs.
The complete training of a single configuration takes approximately three days.

This computational cost is similar to that of [BERTimbau-base](https://huggingface.co/neuralmind/bert-base-portuguese-cased), exposing the model to approximately 65 billion tokens during training.
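
For reference, this figure is consistent with the hyperparameters reported below, assuming every position of every sequence in a batch counts as a training token: 62,500 steps × 2,048 sequences × 512 tokens ≈ 65.5 billion tokens.
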
#### Preprocessing

We deduplicated all subsets of the LegalPT and CrawlPT corpora using the MinHash algorithm and a Locality-Sensitive Hashing implementation from the [text-dedup](https://github.com/ChenghaoMou/text-dedup) library to find clusters of duplicate documents.
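
For illustration, the MinHash/LSH clustering step can be sketched as follows. This example uses the `datasketch` library rather than the exact `text-dedup` entry point, and the shingle size, number of permutations, and similarity threshold are assumptions, not the values used for LegalPT/CrawlPT:

```python
# Illustrative MinHash + LSH near-duplicate clustering (datasketch-based sketch).
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from word 5-gram shingles of a document."""
    tokens = text.lower().split()
    signature = MinHash(num_perm=num_perm)
    for i in range(max(len(tokens) - 4, 1)):
        shingle = " ".join(tokens[i:i + 5])
        signature.update(shingle.encode("utf-8"))
    return signature

docs = {
    "doc-1": "o tribunal julgou procedente o pedido de indenizacao por danos morais",
    "doc-2": "o tribunal julgou procedente o pedido de indenizacao por danos materiais",
    "doc-3": "relatorio anual de atividades da procuradoria",
}

# Index the signatures in an LSH table; documents whose estimated Jaccard
# similarity exceeds the threshold end up in the same bucket.
lsh = MinHashLSH(threshold=0.7, num_perm=128)
signatures = {doc_id: minhash_of(text) for doc_id, text in docs.items()}
for doc_id, signature in signatures.items():
    lsh.insert(doc_id, signature)

# Query each document for its near-duplicate cluster; in a real pipeline one
# representative per cluster is kept and the rest are dropped.
for doc_id, signature in signatures.items():
    print(doc_id, "->", sorted(lsh.query(signature)))
```
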

To ensure that the domain models are not constrained by a generic vocabulary, we used the BPE algorithm from [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers) to train a vocabulary for each pre-training corpus.
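
A minimal sketch of this step with the Tokenizers API; the corpus file name, vocabulary size, and special tokens below are assumptions chosen to mirror the usual RoBERTa setup, not the exact values used:

```python
# Sketch: training a byte-level BPE vocabulary on a plain-text dump of one corpus.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["legalpt_corpus.txt"],  # hypothetical plain-text export of the corpus
    vocab_size=50_265,             # RoBERTa-base vocabulary size (assumed here)
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("tokenizer-legalpt")  # writes vocab.json and merges.txt
```
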
#### Training Hyperparameters

The pretraining process involved training the model for 62,500 steps, with a batch size of 2,048 sequences and a learning rate of 4e-4, each sequence containing a maximum of 512 tokens.
The model weights were randomly initialized.
We employed the masked language modeling objective, where 15\% of the input tokens were randomly masked.
The optimization was performed using the AdamW optimizer with a linear warmup and a linear decay learning rate schedule.
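
The actual run used Fairseq, but purely as an illustration the same objective and schedule can be written with `torch`/`transformers` as below. The warmup length and weight decay are assumptions; the step count, batch size, peak learning rate, sequence length, and masking probability come from the description above:

```python
# Sketch of the MLM objective, AdamW optimizer, and linear warmup/decay schedule.
# (The data pipeline and training loop are omitted.)
import torch
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    get_linear_schedule_with_warmup,
)

TOTAL_STEPS = 62_500
BATCH_SIZE = 2_048            # sequences per optimizer step (effective batch size)
MAX_SEQ_LEN = 512
PEAK_LR = 4e-4
WARMUP_STEPS = 6_000          # assumption -- not reported in this section

# Randomly initialized RoBERTa-base sized model with the domain-specific vocabulary.
tokenizer = RobertaTokenizerFast.from_pretrained("tokenizer-legalpt")  # hypothetical path
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    max_position_embeddings=MAX_SEQ_LEN + 2,  # RoBERTa reserves two extra positions
)
model = RobertaForMaskedLM(config)

# Masked language modeling: 15% of input tokens are randomly masked per batch.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# AdamW with a linear warmup followed by a linear decay to zero over 62,500 steps.
optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=TOTAL_STEPS
)
```
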
For the other parameters, we adopted the standard [RoBERTa-base hyperparameters](https://huggingface.co/FacebookAI/roberta-base):

| **Hyperparameter** | **RoBERTa-base** |