eduagarcia committed on
Commit a197b92
1 Parent(s): fd8f5e0
Files changed (1)
  1. README.md +44 -30
README.md CHANGED
@@ -62,7 +62,33 @@ RoBERTaLexPT-base is pretrained from LegalPT and CrawlPT corpora, using [RoBERTa
  - **Language(s) (NLP):** Brazilian Portuguese (pt-BR)
  - **License:** [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/deed.en)
  - **Repository:** https://github.com/eduagarcia/roberta-legal-portuguese
- - **Paper:** [More Information Needed]
+ - **Paper:** [Coming soon]
+
+ ## Evaluation
+
+ The model was evaluated on ["PortuLex" benchmark](eduagarcia/portuguese_benchmark), a four-task benchmark designed to evaluate the quality and performance of language models in the Portuguese legal domain.
+
+ Macro F1-Score (\%) for multiple models evaluated on PortuLex benchmark test splits:
+
+ | **Model** | **LeNER** | **UlyNER-PL** | **FGV-STF** | **RRIP** | **Average (%)** |
+ |----------------------------------------------------------------------------|-----------|-----------------|-------------|:---------:|-----------------|
+ | | | Coarse/Fine | Coarse | | |
+ | [BERTimbau-base](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28) | 88.34 | 86.39/83.83 | 79.34 | 82.34 | 83.78 |
+ | [BERTimbau-large](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28) | 88.64 | 87.77/84.74 | 79.71 | **83.79** | 84.60 |
+ | [Albertina-PT-BR-base](https://arxiv.org/abs/2305.06721) | 89.26 | 86.35/84.63 | 79.30 | 81.16 | 83.80 |
+ | [Albertina-PT-BR-xlarge](https://arxiv.org/abs/2305.06721) | 90.09 | 88.36/**86.62** | 79.94 | 82.79 | 85.08 |
+ | [BERTikal-base](https://arxiv.org/abs/2110.15709) | 83.68 | 79.21/75.70 | 77.73 | 81.11 | 79.99 |
+ | [JurisBERT-base](https://repositorio.ufms.br/handle/123456789/5119) | 81.74 | 81.67/77.97 | 76.04 | 80.85 | 79.61 |
+ | [BERTimbauLAW-base](https://repositorio.ufms.br/handle/123456789/5119) | 84.90 | 87.11/84.42 | 79.78 | 82.35 | 83.20 |
+ | [Legal-XLM-R-base](https://arxiv.org/abs/2306.02069) | 87.48 | 83.49/83.16 | 79.79 | 82.35 | 83.24 |
+ | [Legal-XLM-R-large](https://arxiv.org/abs/2306.02069) | 88.39 | 84.65/84.55 | 79.36 | 81.66 | 83.50 |
+ | [Legal-RoBERTa-PT-large](https://arxiv.org/abs/2306.02069) | 87.96 | 88.32/84.83 | 79.57 | 81.98 | 84.02 |
+ | RoBERTaTimbau-base | 89.68 | 87.53/85.74 | 78.82 | 82.03 | 84.29 |
+ | RoBERTaLegalPT-base | 90.59 | 85.45/84.40 | 79.92 | 82.84 | 84.57 |
+ | RoBERTaLexPT-base | **90.73** | **88.56**/86.03 | **80.40** | 83.22 | **85.41** |
+
+ In summary, RoBERTaLexPT consistently achieves top legal NLP effectiveness despite its base size.
+ With sufficient pre-training data, it can surpass overparameterized models. The results highlight the importance of domain-diverse training data over sheer model scale.

  ## Training Details

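The PortuLex scores added in the hunk above are macro F1 values on token- and span-classification tasks such as LeNER. As a rough, hedged illustration of how such a score is computed, the sketch below loads the encoder with a token-classification head and scores BIO-tagged predictions with `seqeval`. The hub ID `eduagarcia/RoBERTaLexPT-base`, the label count, and the toy tag sequences are assumptions for illustration, not values taken from the diff.

```python
# Illustrative sketch only: entity-level macro F1 scoring in the style of the
# PortuLex results above. Model ID, label count, and toy BIO sequences are assumptions.
from transformers import AutoTokenizer, AutoModelForTokenClassification
from seqeval.metrics import f1_score

model_id = "eduagarcia/RoBERTaLexPT-base"  # assumed Hugging Face Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=7)
# The classification head above is randomly initialized; it would still need
# fine-tuning on LeNER / UlyNER-PL / FGV-STF / RRIP before matching the table.

# Macro F1 over BIO-tagged entity sequences (gold vs. predicted).
gold = [["B-ORGANIZACAO", "I-ORGANIZACAO", "O", "B-PESSOA"]]
pred = [["B-ORGANIZACAO", "I-ORGANIZACAO", "O", "O"]]
print(f1_score(gold, pred, average="macro"))
```
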
@@ -84,14 +110,13 @@ Following the approach of [Lee et al. (2022)](http://arxiv.org/abs/2107.06499),

  To ensure that domain models are not constrained by a generic vocabulary, we utilized the [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers) -- BPE algorithm to train a vocabulary for each pre-training corpus used.

-
  #### Training Hyperparameters

  The pretraining process involved training the model for 62,500 steps, with a batch size of 2048 sequences, each containing a maximum of 512 tokens.
  We employed the masked language modeling objective, where 15\% of the input tokens were randomly masked.
  The optimization was performed using the AdamW optimizer with a linear warmup and a linear decay learning rate schedule.

- We adopted the standard [RoBERTa hyperparameters](https://arxiv.org/abs/1907.11692):
+ For other hyperparameters we adopted the standard [RoBERTa hyperparameters](https://arxiv.org/abs/1907.11692):


  | **Hyperparameter** | **RoBERTa-base** |
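
The tokenizer paragraph in the hunk above trains a separate vocabulary for each pre-training corpus with the HuggingFace Tokenizers BPE implementation. The sketch below shows one plausible way to do that with a byte-level BPE tokenizer (the variant RoBERTa uses); the corpus file name, vocabulary size, and output directory are illustrative assumptions, since this section does not specify them.

```python
# Hedged sketch: training a corpus-specific byte-level BPE vocabulary with the
# HuggingFace `tokenizers` library. File paths and vocab_size are assumptions.
from tokenizers import ByteLevelBPETokenizer

bpe = ByteLevelBPETokenizer()
bpe.train(
    files=["legalpt_corpus.txt"],   # assumed: plain-text dump of one pre-training corpus
    vocab_size=50_265,              # assumed: RoBERTa-style vocabulary size
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
bpe.save_model("tokenizer_legalpt")  # writes vocab.json and merges.txt
```
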
@@ -114,35 +139,24 @@ We adopted the standard [RoBERTa hyperparameters](https://arxiv.org/abs/1907.116
  | AdamW $$\beta_2$$ | 0.98 |
  | Gradient clipping | 0.0 |

- ## Evaluation
-
- The model was evaluated on ["PortuLex" benchmark](eduagarcia/portuguese_benchmark), a four-task benchmark designed to evaluate the quality and performance of language models in the Portuguese legal domain.
-
- Macro F1-Score (\%) for multiple models evaluated on PortuLex benchmark test splits:
-
- | **Model** | **LeNER** | **UlyNER-PL** | **FGV-STF** | **RRIP** | **Average (%)** |
- |----------------------------------------------------------------------------|-----------|-----------------|-------------|:---------:|-----------------|
- | | | Coarse/Fine | Coarse | | |
- | [BERTimbau-base](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28) | 88.34 | 86.39/83.83 | 79.34 | 82.34 | 83.78 |
- | [BERTimbau-large](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28) | 88.64 | 87.77/84.74 | 79.71 | **83.79** | 84.60 |
- | [Albertina-PT-BR-base](https://arxiv.org/abs/2305.06721) | 89.26 | 86.35/84.63 | 79.30 | 81.16 | 83.80 |
- | [Albertina-PT-BR-xlarge](https://arxiv.org/abs/2305.06721) | 90.09 | 88.36/**86.62** | 79.94 | 82.79 | 85.08 |
- | [BERTikal-base](https://arxiv.org/abs/2110.15709) | 83.68 | 79.21/75.70 | 77.73 | 81.11 | 79.99 |
- | [JurisBERT-base](https://repositorio.ufms.br/handle/123456789/5119) | 81.74 | 81.67/77.97 | 76.04 | 80.85 | 79.61 |
- | [BERTimbauLAW-base](https://repositorio.ufms.br/handle/123456789/5119) | 84.90 | 87.11/84.42 | 79.78 | 82.35 | 83.20 |
- | [Legal-XLM-R-base](https://arxiv.org/abs/2306.02069) | 87.48 | 83.49/83.16 | 79.79 | 82.35 | 83.24 |
- | [Legal-XLM-R-large](https://arxiv.org/abs/2306.02069) | 88.39 | 84.65/84.55 | 79.36 | 81.66 | 83.50 |
- | [Legal-RoBERTa-PT-large](https://arxiv.org/abs/2306.02069) | 87.96 | 88.32/84.83 | 79.57 | 81.98 | 84.02 |
- | RoBERTaTimbau-base | 89.68 | 87.53/85.74 | 78.82 | 82.03 | 84.29 |
- | RoBERTaLegalPT-base | 90.59 | 85.45/84.40 | 79.92 | 82.84 | 84.57 |
- | RoBERTaLexPT-base | **90.73** | **88.56**/86.03 | **80.40** | 83.22 | **85.41** |
-
- In summary, RoBERTaLexPT consistently achieves top legal NLP effectiveness despite its base size.
- With sufficient pre-training data, it can surpass overparameterized models. The results highlight the importance of domain-diverse training data over sheer model scale.
-
  ## Citation

- [More Information Needed]
+ ```
+ @InProceedings{garcia2024_roberlexpt,
+ author="Garcia, Eduardo A. S.
+ and Silva, N{\'a}dia F. F.
+ and Siqueira, Felipe
+ and Gomes, Juliana R. S.
+ and Albuquerque, Hidelberg O.
+ and Souza, Ellen
+ and Lima, Eliomar
+ and De Carvalho, André",
+ title="RoBERTaLexPT: A Legal RoBERTa Model pretrained with deduplication for Portuguese",
+ booktitle="Computational Processing of the Portuguese Language",
+ year="2024",
+ publisher="Association for Computational Linguistics"
+ }
+ ```

  ## Acknowledgment

 
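For readers who want to see how the pretraining recipe in the diff above (masked language modeling with 15% masking, AdamW with β2 = 0.98, gradient clipping disabled, a linear warmup followed by linear decay, and 62,500 steps at 2048 sequences of up to 512 tokens) maps onto code, here is a hedged sketch using the HuggingFace `Trainer`. The peak learning rate, warmup length, per-device batch size, and tokenizer directory are not given in this excerpt and are placeholders only; this is not the authors' training script.

```python
# Hedged sketch of the stated pretraining setup, not the authors' training code.
# Values marked "placeholder" are NOT taken from the model card excerpt above.
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("tokenizer_legalpt")  # placeholder path
model = RobertaForMaskedLM(
    RobertaConfig(vocab_size=tokenizer.vocab_size, max_position_embeddings=514)
)

# Masked language modeling objective: 15% of input tokens are randomly masked.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="robertalexpt-pretraining",
    max_steps=62_500,                 # stated number of optimization steps
    per_device_train_batch_size=64,   # placeholder
    gradient_accumulation_steps=32,   # 64 * 32 = 2048 sequences per update on one device
    learning_rate=4e-4,               # placeholder peak LR (not given in this excerpt)
    warmup_steps=6_250,               # placeholder warmup length
    lr_scheduler_type="linear",       # linear decay after the linear warmup
    adam_beta2=0.98,                  # from the hyperparameter table above
    max_grad_norm=0.0,                # gradient clipping disabled, per the table
)

# trainer = Trainer(model=model, args=args, data_collator=collator,
#                   train_dataset=tokenized_dataset)  # tokenized_dataset: 512-token chunks
# trainer.train()
```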