Commit d789b2b by eduagarcia (parent: ecd611d): Update README.md

README.md CHANGED

Macro F1-Score (\%) for multiple models evaluated on PortuLex benchmark test splits:

| **Model** | **LeNER** | **UlyNER-PL** | **FGV-STF** | **RRIP** | **Average (%)** |
|----------------------------------------------------------------------------|-----------|-----------------|-------------|:---------:|-----------------|
| | | Coarse/Fine | Coarse | | |
| [BERTimbau-base](https://huggingface.co/neuralmind/bert-base-portuguese-cased) | 88.34 | 86.39/83.83 | 79.34 | 82.34 | 83.78 |
| [BERTimbau-large](https://huggingface.co/neuralmind/bert-large-portuguese-cased) | 88.64 | 87.77/84.74 | 79.71 | **83.79** | 84.60 |
| [Albertina-PT-BR-base](https://huggingface.co/PORTULAN/albertina-ptbr-based) | 89.26 | 86.35/84.63 | 79.30 | 81.16 | 83.80 |
| [Albertina-PT-BR-xlarge](https://huggingface.co/PORTULAN/albertina-ptbr) | 90.09 | 88.36/**86.62** | 79.94 | 82.79 | 85.08 |
| [BERTikal-base](https://huggingface.co/felipemaiapolo/legalnlp-bert) | 83.68 | 79.21/75.70 | 77.73 | 81.11 | 79.99 |
| [JurisBERT-base](https://huggingface.co/alfaneo/jurisbert-base-portuguese-uncased) | 81.74 | 81.67/77.97 | 76.04 | 80.85 | 79.61 |
| [BERTimbauLAW-base](https://huggingface.co/alfaneo/bertimbaulaw-base-portuguese-cased) | 84.90 | 87.11/84.42 | 79.78 | 82.35 | 83.20 |
| [Legal-XLM-R-base](https://huggingface.co/joelniklaus/legal-xlm-roberta-base) | 87.48 | 83.49/83.16 | 79.79 | 82.35 | 83.24 |
| [Legal-XLM-R-large](https://huggingface.co/joelniklaus/legal-xlm-roberta-large) | 88.39 | 84.65/84.55 | 79.36 | 81.66 | 83.50 |
| [Legal-RoBERTa-PT-large](https://huggingface.co/joelniklaus/legal-portuguese-roberta-large) | 87.96 | 88.32/84.83 | 79.57 | 81.98 | 84.02 |
| **Ours** | | | | | |
| RoBERTaTimbau-base (Reproduction of BERTimbau) | 89.68 | 87.53/85.74 | 78.82 | 82.03 | 84.29 |
| RoBERTaLegalPT-base (Trained on LegalPT) | 90.59 | 85.45/84.40 | 79.92 | 82.84 | 84.57 |

With sufficient pre-training data, it can surpass larger models.

RoBERTaLexPT-base is pretrained on both corpora:
- [LegalPT](https://huggingface.co/datasets/eduagarcia/LegalPT_dedup) is a Portuguese legal corpus built by aggregating diverse sources, totaling up to 125 GiB of data.
- [CrawlPT](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup) is a composition of three general Portuguese corpora: [brWaC](https://huggingface.co/datasets/brwac), the [CC100 PT subset](https://huggingface.co/datasets/eduagarcia/cc100-pt), and the [OSCAR-2301 PT subset](https://huggingface.co/datasets/eduagarcia/OSCAR-2301-pt_dedup).
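
Both corpora are published on the Hugging Face Hub, so they can be inspected directly with the `datasets` library. A minimal sketch follows; the split name and the use of streaming are assumptions, so check each dataset card for the exact configurations:

```python
# Sketch: peeking at the pre-training corpora from the Hugging Face Hub.
# The repository IDs come from the links above; the split name and streaming
# mode are assumptions -- consult the dataset cards before relying on them.
from datasets import load_dataset

# Stream LegalPT (deduplicated) so the ~125 GiB corpus is not fully downloaded.
legalpt = load_dataset("eduagarcia/LegalPT_dedup", split="train", streaming=True)

# Inspect the first document's fields.
first_doc = next(iter(legalpt))
print(first_doc.keys())
```
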
### Training Procedure

Our pretraining was executed with the [Fairseq library v0.10.2](https://github.com/facebookresearch/fairseq/tree/v0.10.2) on a DGX-A100 cluster, using a total of 2 Nvidia A100 80 GB GPUs.
The complete training of a single configuration takes approximately three days.

This computational cost is similar to that of [BERTimbau-base](https://huggingface.co/neuralmind/bert-base-portuguese-cased), exposing the model to approximately 65 billion tokens during training.
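
For reference, this figure is consistent with the hyperparameters reported below, assuming every position of every sequence in a batch counts as a training token: 62,500 steps × 2,048 sequences × 512 tokens ≈ 65.5 billion tokens.
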
#### Preprocessing

We deduplicated all subsets of the LegalPT and CrawlPT corpora using the MinHash algorithm and a Locality-Sensitive Hashing implementation from the [text-dedup](https://github.com/ChenghaoMou/text-dedup) library to find clusters of duplicate documents.
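
For illustration, the MinHash/LSH clustering step can be sketched as follows. This example uses the `datasketch` library rather than the exact `text-dedup` entry point, and the shingle size, number of permutations, and similarity threshold are assumptions, not the values used for LegalPT/CrawlPT:

```python
# Illustrative MinHash + LSH near-duplicate clustering (datasketch-based sketch).
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from word 5-gram shingles of a document."""
    tokens = text.lower().split()
    signature = MinHash(num_perm=num_perm)
    for i in range(max(len(tokens) - 4, 1)):
        shingle = " ".join(tokens[i:i + 5])
        signature.update(shingle.encode("utf-8"))
    return signature

docs = {
    "doc-1": "o tribunal julgou procedente o pedido de indenizacao por danos morais",
    "doc-2": "o tribunal julgou procedente o pedido de indenizacao por danos materiais",
    "doc-3": "relatorio anual de atividades da procuradoria",
}

# Index the signatures in an LSH table; documents whose estimated Jaccard
# similarity exceeds the threshold end up in the same bucket.
lsh = MinHashLSH(threshold=0.7, num_perm=128)
signatures = {doc_id: minhash_of(text) for doc_id, text in docs.items()}
for doc_id, signature in signatures.items():
    lsh.insert(doc_id, signature)

# Query each document for its near-duplicate cluster; in a real pipeline one
# representative per cluster is kept and the rest are dropped.
for doc_id, signature in signatures.items():
    print(doc_id, "->", sorted(lsh.query(signature)))
```
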

To ensure that the domain models are not constrained by a generic vocabulary, we used the BPE algorithm from [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers) to train a vocabulary for each pre-training corpus.
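
A minimal sketch of this step with the Tokenizers API; the corpus file name, vocabulary size, and special tokens below are assumptions chosen to mirror the usual RoBERTa setup, not the exact values used:

```python
# Sketch: training a byte-level BPE vocabulary on a plain-text dump of one corpus.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["legalpt_corpus.txt"],  # hypothetical plain-text export of the corpus
    vocab_size=50_265,             # RoBERTa-base vocabulary size (assumed here)
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("tokenizer-legalpt")  # writes vocab.json and merges.txt
```
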
#### Training Hyperparameters

The pretraining process involved training the model for 62,500 steps, with a batch size of 2,048 sequences and a learning rate of 4e-4, each sequence containing a maximum of 512 tokens.
The model weights were randomly initialized.
We employed the masked language modeling objective, where 15\% of the input tokens were randomly masked.
The optimization was performed using the AdamW optimizer with a linear warmup and a linear decay learning rate schedule.
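
The actual run used Fairseq, but purely as an illustration the same objective and schedule can be written with `torch`/`transformers` as below. The warmup length and weight decay are assumptions; the step count, batch size, peak learning rate, sequence length, and masking probability come from the description above:

```python
# Sketch of the MLM objective, AdamW optimizer, and linear warmup/decay schedule.
# (The data pipeline and training loop are omitted.)
import torch
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    get_linear_schedule_with_warmup,
)

TOTAL_STEPS = 62_500
BATCH_SIZE = 2_048            # sequences per optimizer step (effective batch size)
MAX_SEQ_LEN = 512
PEAK_LR = 4e-4
WARMUP_STEPS = 6_000          # assumption -- not reported in this section

# Randomly initialized RoBERTa-base sized model with the domain-specific vocabulary.
tokenizer = RobertaTokenizerFast.from_pretrained("tokenizer-legalpt")  # hypothetical path
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    max_position_embeddings=MAX_SEQ_LEN + 2,  # RoBERTa reserves two extra positions
)
model = RobertaForMaskedLM(config)

# Masked language modeling: 15% of input tokens are randomly masked per batch.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# AdamW with a linear warmup followed by a linear decay to zero over 62,500 steps.
optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=TOTAL_STEPS
)
```
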
For the other parameters, we adopted the standard [RoBERTa-base hyperparameters](https://huggingface.co/FacebookAI/roberta-base):

| **Hyperparameter** | **RoBERTa-base** |