eduagarcia committed
Commit a9838f1
Parent: f7d589f

Update README.md

Files changed (1)
  1. README.md +54 -17
README.md CHANGED
@@ -15,40 +15,77 @@ model-index:
  - task:
  type: token-classification
  dataset:
- type: eduagarcia/portuguese_benchmark
- name: LeNER
- config: LeNER-Br
+ type: lener_br
+ name: LeNER-Br
  split: test
  metrics:
  - type: seqeval
- value: 90.73
- name: Mean F1
+ value: 0.9073
+ name: F1
  args:
  scheme: IOB2
  - task:
  type: token-classification
  dataset:
- type: eduagarcia/portuguese_benchmark
+ type: eduagarcia/PortuLex_benchmark
  name: UlyNER-PL Coarse
  config: UlyssesNER-Br-PL-coarse
  split: test
  metrics:
  - type: seqeval
- value: 88.56
- name: Mean F1
+ value: 0.8856
+ name: F1
  args:
  scheme: IOB2
  - task:
  type: token-classification
  dataset:
- type: eduagarcia/portuguese_benchmark
+ type: eduagarcia/PortuLex_benchmark
  name: UlyNER-PL Fine
  config: UlyssesNER-Br-PL-fine
  split: test
  metrics:
  - type: seqeval
- value: 86.03
- name: Mean F1
+ value: 0.8603
+ name: F1
+ args:
+ scheme: IOB2
+ - task:
+ type: token-classification
+ dataset:
+ type: eduagarcia/PortuLex_benchmark
+ name: FGV-STF
+ config: fgv-coarse
+ split: test
+ metrics:
+ - type: seqeval
+ value: 0.8040
+ name: F1
+ args:
+ scheme: IOB2
+ - task:
+ type: token-classification
+ dataset:
+ type: eduagarcia/PortuLex_benchmark
+ name: RRIP
+ config: rrip
+ split: test
+ metrics:
+ - type: seqeval
+ value: 0.8322
+ name: F1
+ args:
+ scheme: IOB2
+ - task:
+ type: token-classification
+ dataset:
+ type: eduagarcia/PortuLex_benchmark
+ name: PortuLex
+ split: test
+ metrics:
+ - type: seqeval
+ value: 0.8541
+ name: Average F1
  args:
  scheme: IOB2
  license: cc-by-4.0
@@ -57,7 +94,7 @@ metrics:
  ---
  # RoBERTaLexPT-base

- RoBERTaLexPT-base is a Portuguese Masked Language Model pretrained from scratch from the [LegalPT](https://huggingface.co/datasets/eduagarcia/LegalPT) and CrawlPT corpora, using the same architecture as [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base), introduced by [Liu et al. (2019)](https://arxiv.org/abs/1907.11692).
+ RoBERTaLexPT-base is a Portuguese Masked Language Model pretrained from scratch on the [LegalPT](https://huggingface.co/datasets/eduagarcia/LegalPT_dedup) and [CrawlPT](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup) corpora, using the same architecture as [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base), introduced by [Liu et al. (2019)](https://arxiv.org/abs/1907.11692).

  - **Language(s) (NLP):** Brazilian Portuguese (pt-BR)
  - **License:** [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/deed.en)
@@ -66,7 +103,7 @@ RoBERTaLexPT-base is a Portuguese Masked Language Model pretrained from scratch

  ## Evaluation

- The model was evaluated on ["PortuLex" benchmark](eduagarcia/portuguese_benchmark), a four-task benchmark designed to evaluate the quality and performance of language models in the Portuguese legal domain.
+ The model was evaluated on the ["PortuLex" benchmark](https://huggingface.co/datasets/eduagarcia/PortuLex_benchmark), a four-task benchmark designed to evaluate the quality and performance of language models in the Portuguese legal domain.

  Macro F1-Score (%) for multiple models evaluated on the PortuLex benchmark test splits:

@@ -87,16 +124,16 @@ Macro F1-Score (%) for multiple models evaluated on the PortuLex benchmark test splits:
  | RoBERTaTimbau-base (Reproduction of BERTimbau) | 89.68 | 87.53/85.74 | 78.82 | 82.03 | 84.29 |
  | RoBERTaLegalPT-base (Trained on LegalPT) | 90.59 | 85.45/84.40 | 79.92 | 82.84 | 84.57 |
  | RoBERTaCrawlPT-base (Trained on CrawlPT) | 89.24 | 88.22/86.58 | 79.88 | 82.80 | 84.83 |
- | RoBERTaLexPT-base (this) (Trained on CrawlPT + LegalPT) | **90.73** | **88.56**/86.03 | **80.40** | 83.22 | **85.41** |
+ | **RoBERTaLexPT-base** (Trained on CrawlPT + LegalPT) | **90.73** | **88.56**/86.03 | **80.40** | 83.22 | **85.41** |

  In summary, RoBERTaLexPT consistently achieves top performance on legal NLP tasks despite its base size.
- With sufficient pre-training data, it can surpass overparameterized models. The results highlight the importance of domain-diverse training data over sheer model scale.
+ With sufficient pre-training data, it can surpass larger models. The results highlight the importance of domain-diverse training data over sheer model scale.

  ## Training Details

  RoBERTaLexPT-base is pretrained on two corpora:
- - [LegalPT](https://huggingface.co/datasets/eduagarcia/LegalPT) is a Portuguese legal corpus by aggregating diverse sources of up to 125GiB data.
- - CrawlPT is a duplication of three Portuguese general corpora: [brWaC](https://huggingface.co/datasets/eduagarcia/brwac_dedup), [CC100-PT](https://huggingface.co/datasets/eduagarcia/cc100-pt), [OSCAR-2301](https://huggingface.co/datasets/eduagarcia/OSCAR-2301-pt_dedup).
+ - [LegalPT](https://huggingface.co/datasets/eduagarcia/LegalPT_dedup) is a Portuguese legal corpus built by aggregating diverse sources, totaling up to 125 GiB of data.
+ - [CrawlPT](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup) is a composition of three general Portuguese corpora: [brWaC](https://huggingface.co/datasets/eduagarcia/brwac_dedup), [CC100-PT](https://huggingface.co/datasets/eduagarcia/cc100-pt), and [OSCAR-2301](https://huggingface.co/datasets/eduagarcia/OSCAR-2301-pt_dedup).

  ### Training Procedure

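The updated card describes RoBERTaLexPT-base as a RoBERTa-base-architecture masked language model for Brazilian Portuguese. A minimal fill-mask usage sketch follows; the Hub repo ID `eduagarcia/RoBERTaLexPT-base` and the example sentence are assumptions inferred from the commit author and model name, not stated in the diff itself.

```python
# Minimal fill-mask sketch for a RoBERTa-style Portuguese masked language model.
# The repo ID below is an assumption (inferred from the commit author and model name).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="eduagarcia/RoBERTaLexPT-base")

# RoBERTa tokenizers use "<mask>" as the mask token.
for pred in fill_mask("O juiz determinou a <mask> do processo."):
    print(f"{pred['token_str']!r:>15}  score={pred['score']:.3f}")
```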
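The model-index entries above report seqeval F1 scores under the IOB2 scheme. A small sketch of how such a span-level score is computed; the entity labels and tag sequences are illustrative placeholders, not taken from the benchmark data.

```python
# Span-level F1 with seqeval in strict mode, using the IOB2 scheme declared in the metric args.
from seqeval.metrics import f1_score
from seqeval.scheme import IOB2

# Illustrative gold and predicted tag sequences (one sentence each).
y_true = [["B-PESSOA", "I-PESSOA", "O", "B-ORGANIZACAO", "O"]]
y_pred = [["B-PESSOA", "I-PESSOA", "O", "O", "O"]]

# mode="strict" with an explicit scheme scores whole entity spans, not individual tags.
print(f1_score(y_true, y_pred, mode="strict", scheme=IOB2))  # ~0.667 here
```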