dominguesm committed
Commit 5dedfa1
1 Parent(s): 4a5a028

Update Readme
Files changed (1): README.md (+111 -1)
README.md CHANGED
@@ -44,11 +44,95 @@ widget:
 
 **ner-legal-bert-base-cased-ptbr** is a NER model (token classification) in the legal domain in Portuguese that was finetuned from the model [dominguesm/legal-bert-base-cased-ptbr](https://huggingface.co/dominguesm/legal-bert-base-cased-ptbr) by using a NER objective.
 
-The model is intended to assist NLP research in the legal field, computer law and legal technology applications. Several legal texts in Portuguese were used (more information below).
+The model is intended to assist NLP research in the legal field, computer law, and legal technology applications. Several legal texts in Portuguese were used (more information below), annotated with the following labels:
+
+* `PESSOA`
+* `ORGANIZACAO`
+* `LOCAL`
+* `TEMPO`
+* `LEGISLACAO`
+* `JURISPRUDENCIA`
+
+The labels were inspired by the [LeNER_br](https://huggingface.co/datasets/lener_br) dataset.
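+
+To see the exact tag set the checkpoint ships with (for instance, whether these labels carry `B-`/`I-` prefixes), the mapping can be read from the model config. A minimal sketch:
+
+```python
+from transformers import AutoConfig
+
+# Inspect the label mapping bundled with the checkpoint
+config = AutoConfig.from_pretrained("dominguesm/ner-legal-bert-base-cased-ptbr")
+print(config.id2label)
+```
+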
+## Training Dataset
+
+The training corpora of **ner-legal-bert-base-cased-ptbr** include:
+
+* 971932 examples of miscellaneous legal documents (train split)
+* 53996 examples of miscellaneous legal documents (valid split)
+* 53997 examples of miscellaneous legal documents (test split)
+
+The data was provided by the Brazilian Supreme Federal Tribunal under these terms of use: [LREC 2020](https://ailab.unb.br/victor/lrec2020).
+
+The results of this project do not in any way reflect the position of the Brazilian Supreme Federal Tribunal; they are the sole and exclusive responsibility of the author of the model.
+
+## Using the model for inference in production
+
+```python
+from transformers import AutoModelForTokenClassification, AutoTokenizer
+import torch
+
+# parameters
+model_name = "dominguesm/ner-legal-bert-base-cased-ptbr"
+model = AutoModelForTokenClassification.from_pretrained(model_name)
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+
+input_text = "Acrescento que não há de se falar em violação do artigo 114, § 3º, da Constituição Federal, posto que referido dispositivo revela-se impertinente, tratando da possibilidade de ajuizamento de dissídio coletivo pelo Ministério Público do Trabalho nos casos de greve em atividade essencial."
+
+# tokenization
+inputs = tokenizer(input_text, max_length=512, truncation=True, return_tensors="pt")
+tokens = inputs.tokens()
+
+# get predictions (no gradients needed at inference time)
+with torch.no_grad():
+    logits = model(**inputs).logits
+predictions = torch.argmax(logits, dim=2)
+
+# print the predicted label for each token
+for token, prediction in zip(tokens, predictions[0].numpy()):
+    print((token, model.config.id2label[prediction]))
+```
+
+You can also use a `pipeline`. However, it seems to have an issue regarding the max_length of the input sequence.
+
+```python
+from transformers import pipeline
+
+model_name = "dominguesm/ner-legal-bert-base-cased-ptbr"
+
+ner = pipeline(
+    "ner",
+    model=model_name
+)
+
+ner(input_text)
+```
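+
+As a possible workaround (an illustrative sketch, not part of the original card), a long document can be split into smaller chunks before calling the pipeline, reusing the `ner` pipeline and `input_text` from the blocks above; the naive split on periods stands in for a proper sentence segmenter:
+
+```python
+# Hypothetical workaround: run the pipeline on sentence-sized chunks so each
+# piece stays within the model's 512-token limit. Splitting on "." is a
+# simplification; a real segmenter (NLTK, spaCy, etc.) is safer.
+chunks = [c.strip() for c in input_text.split(".") if c.strip()]
+results = [ner(chunk) for chunk in chunks]
+```
+
+Depending on your `transformers` version, `aggregation_strategy="simple"` can also be passed to `pipeline(...)` to merge word pieces into whole entities.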
 
 ## Training procedure
 
+### Hyperparameters
+
+#### Batch, learning rate & optimizer settings
+- per_device_batch_size = 64
+- gradient_accumulation_steps = 2
+- learning_rate = 2e-5
+- num_train_epochs = 3
+- weight_decay = 0.01
+- optimizer = torch.optim.AdamW
+- epsilon = 1e-08
+- lr_scheduler_type = linear
+
+#### Saving checkpoints & loading the best model
+- save_total_limit = 3
+- logging_steps = 1000
+- eval_steps = logging_steps
+- evaluation_strategy = 'steps'
+- logging_strategy = 'steps'
+- save_strategy = 'steps'
+- save_steps = logging_steps
+- load_best_model_at_end = True
+- fp16 = True
+
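+For reference, the settings above map directly onto Hugging Face `TrainingArguments`. A minimal reconstruction (my sketch, not the author's original training script; `output_dir` is a placeholder):
+
+```python
+from transformers import TrainingArguments
+
+# Sketch of the configuration listed above; output_dir is a placeholder.
+# per_device_train_batch_size corresponds to "per_device_batch_size" above.
+training_args = TrainingArguments(
+    output_dir="ner-legal-bert-base-cased-ptbr",
+    per_device_train_batch_size=64,
+    gradient_accumulation_steps=2,
+    learning_rate=2e-5,
+    num_train_epochs=3,
+    weight_decay=0.01,
+    adam_epsilon=1e-08,
+    lr_scheduler_type="linear",
+    save_total_limit=3,
+    logging_steps=1000,
+    eval_steps=1000,  # eval_steps = logging_steps
+    evaluation_strategy="steps",
+    logging_strategy="steps",
+    save_strategy="steps",
+    save_steps=1000,  # save_steps = logging_steps
+    load_best_model_at_end=True,
+    fp16=True,
+)
+```
+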
 ### Training results
 
 ```
@@ -58,6 +142,9 @@ Instantaneous batch size per device = 64
 Total train batch size (w. parallel, distributed & accumulation) = 128
 Gradient Accumulation steps = 2
 Total optimization steps = 22779
+Evaluation Infos:
+Num examples = 53996
+Batch size = 128
 ```
 
 | Step | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
@@ -84,3 +171,26 @@ Total optimization steps = 22779
 | 20000 | 0.027400 | 0.030577 | 0.942462 | 0.961754 | 0.952010 | 0.989295 |
 | 21000 | 0.027000 | 0.030025 | 0.944483 | 0.960497 | 0.952422 | 0.989445 |
 | 22000 | 0.026800 | 0.030162 | 0.943868 | 0.961418 | 0.952562 | 0.989425 |
+
+### Validation metrics by Named Entity (Test Dataset)
+
+* **Num examples = 53997**
+* `overall_precision`: 0.9432396865925381
+* `overall_recall`: 0.9614334116769161
+* `overall_f1`: 0.9522496545298874
+* `overall_accuracy`: 0.9894741602608071
+
+| Label | Precision | Recall | F1 | Entity Examples |
+| ----- | --------- | ------ | -- | --------------- |
+| JURISPRUDENCIA | 0.8795197115548148 | 0.9037275221501844 | 0.8914593047810311 | 57223 |
+| LEGISLACAO | 0.9405395935529082 | 0.9514071028567378 | 0.9459421362370934 | 84642 |
+| LOCAL | 0.9011495452253004 | 0.9132358124779697 | 0.9071524233856495 | 56740 |
+| ORGANIZACAO | 0.9239028155165304 | 0.954964947845235 | 0.9391771163875446 | 183013 |
+| PESSOA | 0.9651685220572037 | 0.9738545198908279 | 0.9694920661875761 | 193456 |
+| TEMPO | 0.973704616066295 | 0.9918808401799004 | 0.9827086882453152 | 186103 |
+
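+Metric names like `overall_precision` are what a `seqeval`-based evaluation (e.g. via the Hugging Face `evaluate` library) reports for IOB-tagged sequences. A minimal sketch of such a computation, using made-up toy sequences rather than the real evaluation data:
+
+```python
+from seqeval.metrics import classification_report
+
+# Toy reference/prediction sequences, purely to illustrate the metric call
+y_true = [["B-PESSOA", "I-PESSOA", "O", "B-LOCAL", "O"]]
+y_pred = [["B-PESSOA", "I-PESSOA", "O", "O", "O"]]
+
+print(classification_report(y_true, y_pred, digits=4))
+```
+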
+## Notes
+
+* To produce this `README`, I used the `README` written by Pierre Guillou (available [here](https://huggingface.co/pierreguillou/ner-bert-large-cased-pt-lenerbr)) as a basis, reproducing some parts in full.