---
language:
- pt
tags:
- generated_from_trainer
datasets:
- lener_br
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: checkpoints
  results:
  - task:
      name: Token Classification
      type: token-classification
    dataset:
      name: lener_br
      type: lener_br
    metrics:
    - name: F1
      type: f1
      value: 0.8716487228203504
    - name: Precision
      type: precision
      value: 0.8559286898839138
    - name: Recall
      type: recall
      value: 0.8879569892473118
    - name: Accuracy
      type: accuracy
      value: 0.9755893153732458
    - name: Loss
      type: loss
      value: 0.1133928969502449
widget:
- text: "EMENTA: APELAÇÃO CÍVEL - AÇÃO DE INDENIZAÇÃO POR DANOS MORAIS - PRELIMINAR - ARGUIDA PELO MINISTÉRIO PÚBLICO EM GRAU RECURSAL - NULIDADE - AUSÊNCIA DE IN- TERVENÇÃO DO PARQUET NA INSTÂNCIA A QUO - PRESENÇA DE INCAPAZ - PREJUÍZO EXISTENTE - PRELIMINAR ACOLHIDA - NULIDADE RECONHECIDA."
---

## (BERT base) NER model in the legal domain in Portuguese (LeNER-Br)

**ner-bert-base-portuguese-cased-lenebr** is an NER model (token classification) for the legal domain in Portuguese. It was fine-tuned on 16/12/2021 in Google Colab from the model [BERTimbau base](https://huggingface.co/neuralmind/bert-base-portuguese-cased) on the dataset [LeNER_br](https://huggingface.co/datasets/lener_br) with an NER objective.

Note: due to the small size of BERTimbau base and of the fine-tuning dataset, the model overfitted before reaching the end of training. Here are the overall final metrics on the validation dataset (*note: see the paragraph "Validation metrics by Named Entity" for detailed metrics*):
- **f1**: 0.8716487228203504
- **precision**: 0.8559286898839138
- **recall**: 0.8879569892473118
- **accuracy**: 0.9755893153732458
- **loss**: 0.1133928969502449

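As a quick sanity check (not part of the original card), the overall F1 above is the harmonic mean of the reported precision and recall; the values below are copied from the metrics list:

```python
# Verify that the reported F1 is the harmonic mean of precision and recall
precision = 0.8559286898839138
recall = 0.8879569892473118
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # ≈ 0.8716487 — matches the reported overall F1
```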
53
+ ## Widget & APP
54
+
55
+ You can test this model into the widget of this page.
56
+
## Using the model for inference in production
````
# install pytorch: check https://pytorch.org/
# !pip install transformers
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

# parameters
model_name = "ner-bert-base-portuguese-cased-lenebr"
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_text = "EMENTA: APELAÇÃO CÍVEL - AÇÃO DE INDENIZAÇÃO POR DANOS MORAIS - PRELIMINAR - ARGUIDA PELO MINISTÉRIO PÚBLICO EM GRAU RECURSAL - NULIDADE - AUSÊNCIA DE IN- TERVENÇÃO DO PARQUET NA INSTÂNCIA A QUO - PRESENÇA DE INCAPAZ - PREJUÍZO EXISTENTE - PRELIMINAR ACOLHIDA - NULIDADE RECONHECIDA."

# tokenization
inputs = tokenizer(input_text, max_length=512, truncation=True, return_tensors="pt")
tokens = inputs.tokens()

# get predictions (no gradient tracking needed at inference time)
with torch.no_grad():
    outputs = model(**inputs).logits
predictions = torch.argmax(outputs, dim=2)

# print one (token, predicted label) pair per token
for token, prediction in zip(tokens, predictions[0].tolist()):
    print((token, model.config.id2label[prediction]))
````
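The loop above prints one label per WordPiece token. A minimal post-processing sketch (pure Python, no model needed; `group_entities` is a hypothetical helper, not part of the card) that merges `##`-prefixed subwords and BIO tags into entity spans:

```python
def group_entities(token_label_pairs):
    """Merge (token, BIO label) pairs into (entity text, entity type) spans.

    Assumes WordPiece subwords prefixed with "##" and labels such as
    B-PESSOA / I-PESSOA / O, as used by this model.
    """
    entities = []
    current_text, current_label = None, None
    for token, label in token_label_pairs:
        if token.startswith("##"):
            # subword piece: glue it onto the current word
            if current_text is not None:
                current_text += token[2:]
            continue
        if label.startswith("B-"):
            # a new entity begins; flush any open one
            if current_label is not None:
                entities.append((current_text, current_label))
            current_text, current_label = token, label[2:]
        elif label.startswith("I-") and current_label == label[2:]:
            # continuation of the open entity
            current_text += " " + token
        else:
            # O tag (or inconsistent I- tag): close any open entity
            if current_label is not None:
                entities.append((current_text, current_label))
            current_text, current_label = None, None
    if current_label is not None:
        entities.append((current_text, current_label))
    return entities

pairs = [("[CLS]", "O"), ("Min", "B-ORGANIZACAO"), ("##istério", "I-ORGANIZACAO"),
         ("Público", "I-ORGANIZACAO"), ("em", "O"), ("grau", "O"), ("[SEP]", "O")]
print(group_entities(pairs))  # [('Ministério Público', 'ORGANIZACAO')]
```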
You can use a pipeline, too. However, it seems to have an issue regarding the max_length of the input sequence.
````
!pip install transformers
from transformers import pipeline

model_name = "ner-bert-base-portuguese-cased-lenebr"

ner = pipeline(
    "ner",
    model=model_name
)

ner(input_text)  # input_text as defined in the previous example
````
## Training procedure

### Training results

````
Num examples = 7828
Num Epochs = 3
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 1
Total optimization steps = 2937

Step   Training Loss  Validation Loss  Precision  Recall    F1        Accuracy
290    0.315100       0.141881         0.764542   0.709462  0.735973  0.960550
580    0.089100       0.137700         0.729155   0.810538  0.767695  0.959940
870    0.071700       0.122069         0.780277   0.872903  0.823995  0.967955
1160   0.047500       0.125950         0.800312   0.881720  0.839046  0.968367
1450   0.034900       0.129228         0.763666   0.910323  0.830570  0.969068
1740   0.036100       0.113393         0.855929   0.887957  0.871649  0.975589
2030   0.037800       0.121275         0.817230   0.889462  0.851818  0.970393
2320   0.018700       0.115745         0.836066   0.877419  0.856243  0.973136
2610   0.017100       0.118826         0.822488   0.888817  0.854367  0.973471
````
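As a consistency check (not part of the original log), the reported total of 2937 optimization steps follows directly from the number of examples, the batch size, and the number of epochs listed above:

```python
import math

# Values from the training log above
num_examples, batch_size, epochs = 7828, 8, 3

steps_per_epoch = math.ceil(num_examples / batch_size)  # 979 (last batch is partial)
total_steps = steps_per_epoch * epochs
print(total_steps)  # 2937 — matches "Total optimization steps"
```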
121
+
122
+ ### Validation predictions
123
+ ````
124
+ Num examples = 1177
125
+
126
+ {'JURISPRUDENCIA': {'f1': 0.6641509433962263,
127
+ 'number': 657,
128
+ 'precision': 0.6586826347305389,
129
+ 'recall': 0.669710806697108},
130
+ 'LEGISLACAO': {'f1': 0.8489082969432314,
131
+ 'number': 571,
132
+ 'precision': 0.8466898954703833,
133
+ 'recall': 0.851138353765324},
134
+ 'LOCAL': {'f1': 0.8066037735849058,
135
+ 'number': 194,
136
+ 'precision': 0.7434782608695653,
137
+ 'recall': 0.8814432989690721},
138
+ 'ORGANIZACAO': {'f1': 0.8540462427745664,
139
+ 'number': 1340,
140
+ 'precision': 0.8277310924369747,
141
+ 'recall': 0.8820895522388059},
142
+ 'PESSOA': {'f1': 0.9845722300140253,
143
+ 'number': 1072,
144
+ 'precision': 0.9868791002811621,
145
+ 'recall': 0.9822761194029851},
146
+ 'TEMPO': {'f1': 0.9527794381350867,
147
+ 'number': 816,
148
+ 'precision': 0.9299883313885647,
149
+ 'recall': 0.9767156862745098},
150
+ 'overall_accuracy': 0.9755893153732458,
151
+ 'overall_f1': 0.8716487228203504,
152
+ 'overall_precision': 0.8559286898839138,
153
+ 'overall_recall': 0.8879569892473118}
154
+ ````
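As a sanity check (a sketch, not part of the original card), the overall precision and recall above can be reconstructed as micro-averages of the per-entity results: recover the true-positive and predicted-span counts from each entity's (precision, recall, support), then aggregate.

```python
# Per-entity (precision, recall, support) copied from the report above
per_entity = {
    "JURISPRUDENCIA": (0.6586826347305389, 0.669710806697108, 657),
    "LEGISLACAO": (0.8466898954703833, 0.851138353765324, 571),
    "LOCAL": (0.7434782608695653, 0.8814432989690721, 194),
    "ORGANIZACAO": (0.8277310924369747, 0.8820895522388059, 1340),
    "PESSOA": (0.9868791002811621, 0.9822761194029851, 1072),
    "TEMPO": (0.9299883313885647, 0.9767156862745098, 816),
}

tp = pred = support = 0
for p, r, n in per_entity.values():
    tp_i = round(r * n)        # true positives = recall * support
    tp += tp_i
    pred += round(tp_i / p)    # predicted spans = TP / precision
    support += n

print(tp / pred)     # micro precision ≈ 0.8559, the reported overall_precision
print(tp / support)  # micro recall    ≈ 0.8880, the reported overall_recall
```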