---
language:
- pt
tags:
- generated_from_trainer
datasets:
- lener_br
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: checkpoints
  results:
  - task:
      name: Token Classification
      type: token-classification
    dataset:
      name: lener_br
      type: lener_br
    metrics:
    - name: F1
      type: f1
      value: 0.9082022949426265
    - name: Precision
      type: precision
      value: 0.8975220495590088
    - name: Recall
      type: recall
      value: 0.9191397849462366
    - name: Accuracy
      type: accuracy
      value: 0.9808310603867311
    - name: Loss
      type: loss
      value: 0.1228889599442482
widget:
- text: "Ao Instituto Médico Legal da jurisdição do acidente ou da residência cumpre fornecer, no prazo de 90 dias, laudo à vítima (art. 5, § 5, Lei n. 6.194/74 de 19 de dezembro de 1974), função técnica que pode ser suprida por prova pericial realizada por ordem do juízo da causa, ou por prova técnica realizada no âmbito administrativo que se mostre coerente com os demais elementos de prova constante dos autos."
- text: "Acrescento que não há de se falar em violação do artigo 114, § 3º, da Constituição Federal, posto que referido dispositivo revela-se impertinente, tratando da possibilidade de ajuizamento de dissídio coletivo pelo Ministério Público do Trabalho nos casos de greve em atividade essencial."
- text: "Dispõe sobre o estágio de estudantes; altera a redação do art. 428 da Consolidação das Leis do Trabalho – CLT, aprovada pelo Decreto-Lei no 5.452, de 1o de maio de 1943, e a Lei no 9.394, de 20 de dezembro de 1996; revoga as Leis nos 6.494, de 7 de dezembro de 1977, e 8.859, de 23 de março de 1994, o parágrafo único do art. 82 da Lei no 9.394, de 20 de dezembro de 1996, e o art. 6o da Medida Provisória no 2.164-41, de 24 de agosto de 2001; e dá outras providências."
---

## (BERT large) NER model in the legal domain in Portuguese (LeNER-Br)

**ner-bert-large-portuguese-cased-lenerbr** is a NER model (token classification) in the legal domain in Portuguese that was finetuned on 20/12/2021 in Google Colab from the model [pierreguillou/bert-large-cased-pt-lenerbr](https://huggingface.co/pierreguillou/bert-large-cased-pt-lenerbr) on the dataset [LeNER_br](https://huggingface.co/datasets/lener_br) with a NER objective.

Due to the small size of the finetuning dataset, the model overfitted before reaching the end of training. Here are the overall final metrics on the validation dataset (*note: see the section "Validation metrics by Named Entity" for detailed metrics*):
- **f1**: 0.9082022949426265
- **precision**: 0.8975220495590088
- **recall**: 0.9191397849462366
- **accuracy**: 0.9808310603867311
- **loss**: 0.1228889599442482

**Note**: the model [pierreguillou/bert-large-cased-pt-lenerbr](https://huggingface.co/pierreguillou/bert-large-cased-pt-lenerbr) is a language model that was created by finetuning [BERTimbau large](https://huggingface.co/neuralmind/bert-large-portuguese-cased) on the dataset [LeNER-Br language modeling](https://huggingface.co/datasets/pierreguillou/lener_br_finetuning_language_model) with a MASK objective. Specializing the language model on the legal domain before finetuning it on the NER task yields a better NER model.

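For context, here is a minimal sketch (not from the original card) of how such a domain-specialized language model can serve as the starting point for NER finetuning; the label count of 13 is an assumption based on the LeNER-Br tag set (6 entity types in IOB2 format plus "O"):
````
from transformers import AutoModelForTokenClassification

# start the NER finetuning from the domain-specialized language model
# (13 labels = 6 entity types x B-/I- tags + "O"; assumption based on LeNER-Br)
model = AutoModelForTokenClassification.from_pretrained(
    "pierreguillou/bert-large-cased-pt-lenerbr",
    num_labels=13,
)
````
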
## Widget & APP

You can test this model in the widget on this page.

## Using the model for inference in production
````
# install PyTorch: check https://pytorch.org/
# !pip install transformers
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

# parameters
model_name = "pierreguillou/ner-bert-large-portuguese-cased-lenerbr"
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_text = "Acrescento que não há de se falar em violação do artigo 114, § 3º, da Constituição Federal, posto que referido dispositivo revela-se impertinente, tratando da possibilidade de ajuizamento de dissídio coletivo pelo Ministério Público do Trabalho nos casos de greve em atividade essencial."

# tokenization
inputs = tokenizer(input_text, max_length=512, truncation=True, return_tensors="pt")
tokens = inputs.tokens()

# get predictions
outputs = model(**inputs).logits
predictions = torch.argmax(outputs, dim=2)

# print predictions (word piece, predicted entity label)
for token, prediction in zip(tokens, predictions[0].numpy()):
    print((token, model.config.id2label[int(prediction)]))
````
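As a follow-up, here is a minimal sketch (not in the original card) that filters the output above down to the word pieces tagged with an entity label (i.e., anything other than "O"):
````
# keep only the word pieces predicted as part of a named entity
entities = [
    (token, model.config.id2label[int(prediction)])
    for token, prediction in zip(tokens, predictions[0].numpy())
    if model.config.id2label[int(prediction)] != "O"
]
print(entities)
````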
You can use the pipeline API, too. However, it seems to have an issue with the max_length of the input sequence.
````
!pip install transformers
import transformers
from transformers import pipeline

model_name = "pierreguillou/ner-bert-large-portuguese-cased-lenerbr"

input_text = "Acrescento que não há de se falar em violação do artigo 114, § 3º, da Constituição Federal, posto que referido dispositivo revela-se impertinente, tratando da possibilidade de ajuizamento de dissídio coletivo pelo Ministério Público do Trabalho nos casos de greve em atividade essencial."

ner = pipeline(
    "ner",
    model=model_name
)

ner(input_text)
````
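By default, the pipeline returns one prediction per word piece. If you want entities grouped into full spans, recent versions of transformers accept an `aggregation_strategy` argument; a minimal sketch (not from the original card):
````
from transformers import pipeline

# group word pieces into whole entity spans
ner = pipeline(
    "ner",
    model=model_name,
    aggregation_strategy="simple"
)

ner(input_text)
````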
## Training procedure

### Notebook

The finetuning notebook ([HuggingFace_Notebook_token_classification_NER_LeNER_Br.ipynb](https://github.com/piegu/language-models/blob/master/HuggingFace_Notebook_token_classification_NER_LeNER_Br.ipynb)) is on GitHub.

### Hyperparameters

**batch, learning rate...**
- per_device_batch_size = 2
- gradient_accumulation_steps = 2
- learning_rate = 2e-5
- num_train_epochs = 10
- weight_decay = 0.01
- optimizer = AdamW
- betas = (0.9,0.999)
- epsilon = 1e-08
- lr_scheduler_type = linear
- seed = 42

**save model & load best model**
- save_total_limit = 7
- logging_steps = 500
- eval_steps = logging_steps
- evaluation_strategy = 'steps'
- logging_strategy = 'steps'
- save_strategy = 'steps'
- save_steps = logging_steps
- load_best_model_at_end = True
- fp16 = True

**get best model through a metric**
- metric_for_best_model = 'eval_f1'
- greater_is_better = True

These settings map onto a `TrainingArguments` object as sketched below.

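Since the model was trained with the Trainer API (see the `generated_from_trainer` tag), the hyperparameters above translate roughly into the following `TrainingArguments`; this is a sketch, not the original notebook's code, and the output directory name is a placeholder:
````
from transformers import TrainingArguments

logging_steps = 500

training_args = TrainingArguments(
    output_dir="checkpoints",        # placeholder name
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    learning_rate=2e-5,
    num_train_epochs=10,
    weight_decay=0.01,
    adam_beta1=0.9,                  # AdamW betas
    adam_beta2=0.999,
    adam_epsilon=1e-08,
    lr_scheduler_type="linear",
    seed=42,
    save_total_limit=7,
    logging_steps=logging_steps,
    eval_steps=logging_steps,
    evaluation_strategy="steps",
    logging_strategy="steps",
    save_strategy="steps",
    save_steps=logging_steps,
    load_best_model_at_end=True,
    fp16=True,
    metric_for_best_model="eval_f1",
    greater_is_better=True,
)
````
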
### Training results

````
Num examples = 7828
Num Epochs = 20
Instantaneous batch size per device = 2
Total train batch size (w. parallel, distributed & accumulation) = 4
Gradient Accumulation steps = 2
Total optimization steps = 39140

Step   Training Loss  Validation Loss  Precision  Recall    F1        Accuracy
500    0.250000       0.140582         0.760833   0.770323  0.765548  0.963125
1000   0.076200       0.117882         0.829082   0.817849  0.823428  0.966569
1500   0.082400       0.150047         0.679610   0.914624  0.779795  0.957213
2000   0.047500       0.133443         0.817678   0.857419  0.837077  0.969190
2500   0.034200       0.230139         0.895672   0.845591  0.869912  0.964070
3000   0.033800       0.108022         0.859225   0.887312  0.873043  0.973700
3500   0.030100       0.113467         0.855747   0.885376  0.870310  0.975879
4000   0.029900       0.118619         0.850207   0.884946  0.867229  0.974477
4500   0.022500       0.124327         0.841048   0.890968  0.865288  0.975041
5000   0.020200       0.129294         0.801538   0.918925  0.856227  0.968077
5500   0.019700       0.128344         0.814222   0.908602  0.858827  0.969250
6000   0.024600       0.182563         0.908087   0.866882  0.887006  0.968565
6500   0.012600       0.159217         0.829883   0.913763  0.869806  0.969357
7000   0.020600       0.183726         0.854557   0.893333  0.873515  0.966447
7500   0.014400       0.141395         0.777716   0.905161  0.836613  0.966828
8000   0.013400       0.139378         0.873042   0.899140  0.885899  0.975772
8500   0.014700       0.142521         0.864152   0.901505  0.882433  0.976366
9000   0.010900       0.122889         0.897522   0.919140  0.908202  0.980831
9500   0.013500       0.143407         0.816580   0.906667  0.859268  0.973395
10000  0.010400       0.144946         0.835608   0.908387  0.870479  0.974629
10500  0.007800       0.143086         0.847587   0.910108  0.877735  0.975985
11000  0.008200       0.156379         0.873778   0.884301  0.879008  0.976321
11500  0.008200       0.133356         0.901193   0.910108  0.905628  0.980328
12000  0.006900       0.133476         0.892202   0.920215  0.905992  0.980572
12500  0.006900       0.129991         0.890159   0.904516  0.897280  0.978683
````

The best checkpoint was saved at step 9000, with a validation F1 of 0.908202: since `load_best_model_at_end = True` and `metric_for_best_model = 'eval_f1'`, this is the checkpoint that was loaded at the end of training, and its metrics are the overall metrics reported at the top of this card.

### Validation metrics by Named Entity

````
{'JURISPRUDENCIA': {'f1': 0.8135593220338984,
                    'number': 657,
                    'precision': 0.865979381443299,
                    'recall': 0.7671232876712328},
 'LEGISLACAO': {'f1': 0.8888888888888888,
                'number': 571,
                'precision': 0.8952042628774423,
                'recall': 0.882661996497373},
 'LOCAL': {'f1': 0.850467289719626,
           'number': 194,
           'precision': 0.7777777777777778,
           'recall': 0.9381443298969072},
 'ORGANIZACAO': {'f1': 0.8740635033892258,
                 'number': 1340,
                 'precision': 0.8373205741626795,
                 'recall': 0.914179104477612},
 'PESSOA': {'f1': 0.9836677554829678,
            'number': 1072,
            'precision': 0.9841269841269841,
            'recall': 0.9832089552238806},
 'TEMPO': {'f1': 0.9669669669669669,
           'number': 816,
           'precision': 0.9481743227326266,
           'recall': 0.9865196078431373},
 'overall_accuracy': 0.9808310603867311,
 'overall_f1': 0.9082022949426265,
 'overall_precision': 0.8975220495590088,
 'overall_recall': 0.9191397849462366}
````
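
A per-entity dictionary in this format can be computed with the seqeval metric from the datasets library; a minimal sketch (the tag sequences below are illustrative assumptions, not the real evaluation data):
````
# !pip install datasets seqeval
from datasets import load_metric

metric = load_metric("seqeval")

# illustrative gold and predicted IOB2 tag sequences (one sentence)
references  = [["B-PESSOA", "I-PESSOA", "O", "B-TEMPO", "O"]]
predictions = [["B-PESSOA", "I-PESSOA", "O", "B-TEMPO", "O"]]

results = metric.compute(predictions=predictions, references=references)
print(results)  # per-entity precision/recall/f1/number + overall metrics
````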