eduagarcia committed on
Commit
d789b2b
1 Parent(s): ecd611d

Update README.md

  1. README.md +17 -16
README.md CHANGED
@@ -108,16 +108,16 @@ Macro F1-Score (\%) for multiple models evaluated on PortuLex benchmark test spl
108
  | **Model** | **LeNER** | **UlyNER-PL** | **FGV-STF** | **RRIP** | **Average (%)** |
109
  |----------------------------------------------------------------------------|-----------|-----------------|-------------|:---------:|-----------------|
110
  | | | Coarse/Fine | Coarse | | |
111
- | [BERTimbau-base](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28) | 88.34 | 86.39/83.83 | 79.34 | 82.34 | 83.78 |
112
- | [BERTimbau-large](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28) | 88.64 | 87.77/84.74 | 79.71 | **83.79** | 84.60 |
113
- | [Albertina-PT-BR-base](https://arxiv.org/abs/2305.06721) | 89.26 | 86.35/84.63 | 79.30 | 81.16 | 83.80 |
114
- | [Albertina-PT-BR-xlarge](https://arxiv.org/abs/2305.06721) | 90.09 | 88.36/**86.62** | 79.94 | 82.79 | 85.08 |
115
- | [BERTikal-base](https://arxiv.org/abs/2110.15709) | 83.68 | 79.21/75.70 | 77.73 | 81.11 | 79.99 |
116
- | [JurisBERT-base](https://repositorio.ufms.br/handle/123456789/5119) | 81.74 | 81.67/77.97 | 76.04 | 80.85 | 79.61 |
117
- | [BERTimbauLAW-base](https://repositorio.ufms.br/handle/123456789/5119) | 84.90 | 87.11/84.42 | 79.78 | 82.35 | 83.20 |
118
- | [Legal-XLM-R-base](https://arxiv.org/abs/2306.02069) | 87.48 | 83.49/83.16 | 79.79 | 82.35 | 83.24 |
119
- | [Legal-XLM-R-large](https://arxiv.org/abs/2306.02069) | 88.39 | 84.65/84.55 | 79.36 | 81.66 | 83.50 |
120
- | [Legal-RoBERTa-PT-large](https://arxiv.org/abs/2306.02069) | 87.96 | 88.32/84.83 | 79.57 | 81.98 | 84.02 |
121
  | **Ours** | | | | | |
122
  | RoBERTaTimbau-base (Reproduction of BERTimbau) | 89.68 | 87.53/85.74 | 78.82 | 82.03 | 84.29 |
123
  | RoBERTaLegalPT-base (Trained on LegalPT) | 90.59 | 85.45/84.40 | 79.92 | 82.84 | 84.57 |
@@ -131,29 +131,30 @@ With sufficient pre-training data, it can surpass larger models. The results hig
131
 
132
  RoBERTaLexPT-base is pretrained on both datasets:
133
  - [LegalPT](https://huggingface.co/datasets/eduagarcia/LegalPT_dedup) is a Portuguese legal corpus built by aggregating diverse sources, with up to 125 GiB of data.
134
- - [CrawlPT](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup) is a composition of three Portuguese general corpora: [brWaC](https://huggingface.co/datasets/eduagarcia/brwac_dedup), [CC100-PT](https://huggingface.co/datasets/eduagarcia/cc100-pt), [OSCAR-2301](https://huggingface.co/datasets/eduagarcia/OSCAR-2301-pt_dedup).
135
 
136
  ### Training Procedure
137
 
138
- Our pretraining process was executed using the [Fairseq library](https://arxiv.org/abs/1904.01038) on a DGX-A100 cluster, utilizing a total of 2 Nvidia A100 80 GB GPUs.
139
  The complete training of a single configuration takes approximately three days.
140
 
141
 
142
- This computational setup is similar to the work of [BERTimbau](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28), exposing the model to approximately 65 billion tokens during training.
143
 
144
  #### Preprocessing
145
 
146
- Following the approach of [Lee et al. (2022)](http://arxiv.org/abs/2107.06499), we deduplicated all subsets of the LegalPT Corpus using the [MinHash algorithm](https://dl.acm.org/doi/abs/10.5555/647819.736184) and [Locality Sensitive Hashing](https://dspace.mit.edu/bitstream/handle/1721.1/134231/v008a014.pdf?sequence=2&isAllowed=y) to find clusters of duplicate documents.
147
 
148
  To ensure that domain models are not constrained by a generic vocabulary, we used the BPE algorithm from [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers) to train a separate vocabulary for each pre-training corpus.
149
 
150
  #### Training Hyperparameters
151
 
152
- The pretraining process involved training the model for 62,500 steps, with a batch size of 2048 sequences, each containing a maximum of 512 tokens.
 
153
  We employed the masked language modeling objective, where 15\% of the input tokens were randomly masked.
154
  The optimization was performed using the AdamW optimizer with a linear warmup and a linear decay learning rate schedule.
155
 
156
- For other hyperparameters we adopted the standard [RoBERTa hyperparameters](https://arxiv.org/abs/1907.11692):
157
 
158
 
159
  | **Hyperparameter** | **RoBERTa-base** |
 
108
  | **Model** | **LeNER** | **UlyNER-PL** | **FGV-STF** | **RRIP** | **Average (%)** |
109
  |----------------------------------------------------------------------------|-----------|-----------------|-------------|:---------:|-----------------|
110
  | | | Coarse/Fine | Coarse | | |
111
+ | [BERTimbau-base](https://huggingface.co/neuralmind/bert-base-portuguese-cased) | 88.34 | 86.39/83.83 | 79.34 | 82.34 | 83.78 |
112
+ | [BERTimbau-large](https://huggingface.co/neuralmind/bert-large-portuguese-cased) | 88.64 | 87.77/84.74 | 79.71 | **83.79** | 84.60 |
113
+ | [Albertina-PT-BR-base](https://huggingface.co/PORTULAN/albertina-ptbr-based) | 89.26 | 86.35/84.63 | 79.30 | 81.16 | 83.80 |
114
+ | [Albertina-PT-BR-xlarge](https://huggingface.co/PORTULAN/albertina-ptbr) | 90.09 | 88.36/**86.62** | 79.94 | 82.79 | 85.08 |
115
+ | [BERTikal-base](https://huggingface.co/felipemaiapolo/legalnlp-bert) | 83.68 | 79.21/75.70 | 77.73 | 81.11 | 79.99 |
116
+ | [JurisBERT-base](https://huggingface.co/alfaneo/jurisbert-base-portuguese-uncased) | 81.74 | 81.67/77.97 | 76.04 | 80.85 | 79.61 |
117
+ | [BERTimbauLAW-base](https://huggingface.co/alfaneo/bertimbaulaw-base-portuguese-cased) | 84.90 | 87.11/84.42 | 79.78 | 82.35 | 83.20 |
118
+ | [Legal-XLM-R-base](https://huggingface.co/joelniklaus/legal-xlm-roberta-base) | 87.48 | 83.49/83.16 | 79.79 | 82.35 | 83.24 |
119
+ | [Legal-XLM-R-large](https://huggingface.co/joelniklaus/legal-xlm-roberta-large) | 88.39 | 84.65/84.55 | 79.36 | 81.66 | 83.50 |
120
+ | [Legal-RoBERTa-PT-large](https://huggingface.co/joelniklaus/legal-portuguese-roberta-large) | 87.96 | 88.32/84.83 | 79.57 | 81.98 | 84.02 |
121
  | **Ours** | | | | | |
122
  | RoBERTaTimbau-base (Reproduction of BERTimbau) | 89.68 | 87.53/85.74 | 78.82 | 82.03 | 84.29 |
123
  | RoBERTaLegalPT-base (Trained on LegalPT) | 90.59 | 85.45/84.40 | 79.92 | 82.84 | 84.57 |
 
131
 
132
  RoBERTaLexPT-base is pretrained on both datasets:
133
  - [LegalPT](https://huggingface.co/datasets/eduagarcia/LegalPT_dedup) is a Portuguese legal corpus built by aggregating diverse sources, with up to 125 GiB of data.
134
+ - [CrawlPT](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup) is a composition of three Portuguese general corpora: [brWaC](https://huggingface.co/datasets/brwac), [CC100 PT subset](https://huggingface.co/datasets/eduagarcia/cc100-pt), [OSCAR-2301 PT subset](https://huggingface.co/datasets/eduagarcia/OSCAR-2301-pt_dedup).
135
 
136
  ### Training Procedure
137
 
138
+ Our pretraining process was executed using the [Fairseq library v0.10.2](https://github.com/facebookresearch/fairseq/tree/v0.10.2) on a DGX-A100 cluster, utilizing a total of 2 Nvidia A100 80 GB GPUs.
139
  The complete training of a single configuration takes approximately three days.
140
 
141
 
142
+ This computational cost is similar to the work of [BERTimbau-base](https://huggingface.co/neuralmind/bert-base-portuguese-cased), exposing the model to approximately 65 billion tokens during training.
143
 
144
  #### Preprocessing
145
 
146
+ We deduplicated all subsets of the LegalPT and CrawlPT corpora using the MinHash algorithm and Locality Sensitive Hashing implementation from the library [text-dedup](https://github.com/ChenghaoMou/text-dedup) to find clusters of duplicate documents.
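
The deduplication step described above can be sketched as follows. This is a minimal standard-library illustration of MinHash signatures with LSH banding, not the text-dedup implementation; the shingle size, number of permutations, and band count are assumed values chosen only for the example.

```python
import hashlib
from collections import defaultdict


def shingles(text, n=5):
    """Character n-grams of the lowercased, whitespace-normalised text."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def _hash64(shingle, seed):
    """Seeded 64-bit hash; each seed stands in for one random permutation."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set, num_perm=128):
    """Keep the minimum hash value per seed over all shingles."""
    return [min(_hash64(s, seed) for s in shingle_set) for seed in range(num_perm)]


def duplicate_clusters(docs, num_perm=128, bands=32):
    """Group documents that share at least one identical LSH band."""
    rows = num_perm // bands
    parent = list(range(len(docs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Documents whose signatures agree on a whole band land in the same bucket.
    buckets = defaultdict(list)
    for idx, doc in enumerate(docs):
        sig = minhash_signature(shingles(doc), num_perm)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(idx)

    # Merge every bucket into one cluster.
    for members in buckets.values():
        for other in members[1:]:
            parent[find(other)] = find(members[0])

    clusters = defaultdict(list)
    for idx in range(len(docs)):
        clusters[find(idx)].append(idx)
    return [c for c in clusters.values() if len(c) > 1]


docs = [
    "Art. 1 Esta lei entra em vigor na data de sua publicação.",
    "art. 1  esta lei entra em vigor na data de sua publicação.",
    "Completely unrelated English sentence about weather.",
]
print(duplicate_clusters(docs))  # prints [[0, 1]]
```

The first two documents differ only in case and spacing, so they normalise to the same shingle set and always share a signature. Documents that are merely similar collide with a probability that grows with their Jaccard similarity, which is what the band and row counts tune.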
147
 
148
  To ensure that domain models are not constrained by a generic vocabulary, we used the BPE algorithm from [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers) to train a separate vocabulary for each pre-training corpus.
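
To illustrate why a vocabulary trained on a given corpus reflects that corpus, here is a toy byte-pair-encoding merge learner in plain Python. It is a simplified sketch of the general BPE idea, not the HuggingFace Tokenizers implementation: it splits on whitespace, works on characters rather than bytes, and breaks ties between equally frequent pairs by an arbitrary but deterministic rule.

```python
from collections import Counter


def learn_bpe_merges(corpus, num_merges):
    """Learn up to num_merges BPE merge rules from an iterable of text lines."""
    # Each word starts as a tuple of single characters, weighted by frequency.
    words = Counter(tuple(word) for line in corpus for word in line.split())
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by comparing the pairs.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        merged_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    return merges


print(learn_bpe_merges(["lei lei lei"], 5))  # prints [('l', 'e'), ('le', 'i')]
```

Because merges follow pair frequencies, running the learner on legal text and on general web text yields different merge lists, so frequent domain terms end up as single tokens only in the vocabulary trained on the domain corpus.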
149
 
150
  #### Training Hyperparameters
151
 
152
+ The model was pretrained for 62,500 steps with a batch size of 2048 sequences, each containing a maximum of 512 tokens, and a learning rate of 4e-4.
153
+ Weights were randomly initialized.
154
  We employed the masked language modeling objective, where 15\% of the input tokens were randomly masked.
155
  The optimization was performed using the AdamW optimizer with a linear warmup and a linear decay learning rate schedule.
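
As a quick check on the figures above, the sketch below computes the upper bound on tokens seen during pretraining and shows one way to express a linear warmup followed by linear decay. The step count, batch size, sequence length, and peak learning rate come from the README; the warmup length is not stated in this excerpt, so `warmup_steps` is left as a required argument and the value in the usage lines is purely hypothetical.

```python
STEPS = 62_500       # training steps (from the README)
BATCH_SIZE = 2_048   # sequences per step (from the README)
MAX_SEQ_LEN = 512    # maximum tokens per sequence (from the README)
PEAK_LR = 4e-4       # peak learning rate (from the README)


def max_tokens_seen(steps=STEPS, batch_size=BATCH_SIZE, seq_len=MAX_SEQ_LEN):
    """Upper bound on tokens processed: every sequence filled to seq_len."""
    return steps * batch_size * seq_len


def linear_warmup_decay(step, warmup_steps, peak_lr=PEAK_LR, total_steps=STEPS):
    """Rise linearly from 0 to peak_lr, then fall linearly to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return peak_lr * max(0.0, remaining)


print(max_tokens_seen())  # prints 65536000000

HYPOTHETICAL_WARMUP = 6_000  # assumption for illustration only
for step in (0, 3_000, 6_000, 62_500):
    print(step, linear_warmup_decay(step, HYPOTHETICAL_WARMUP))
```

The product 62,500 × 2048 × 512 = 65,536,000,000 matches the "approximately 65 billion tokens" quoted in the Training Procedure section, treating every sequence as full length.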
156
 
157
+ For other parameters we adopted the standard [RoBERTa-base hyperparameters](https://huggingface.co/FacebookAI/roberta-base):
158
 
159
 
160
  | **Hyperparameter** | **RoBERTa-base** |