julien-c HF staff committed on
Commit
67656e1
1 Parent(s): bcac806

Migrate model card from transformers-repo

Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/monilouise/ner_pt_br/README.md

Files changed (1)
  1. README.md +103 -0
README.md ADDED
@@ -0,0 +1,103 @@
---
language:
- pt
tags:
- ner
metrics:
- f1
- accuracy
- precision
- recall
---

# RiskData Brazilian Portuguese NER

## Model description

This is a fine-tuned version of [Neuralmind BERTimbau](https://github.com/neuralmind-ai/portuguese-bert/blob/master/README.md) for named entity recognition (NER) in Portuguese.

For more details, see <https://github.com/SecexSaudeTCU/noticias_ner>.

## Intended uses & limitations

#### How to use

```python
from transformers import BertForTokenClassification, DistilBertTokenizerFast, pipeline

# Load the fine-tuned token-classification model from the Hub.
model = BertForTokenClassification.from_pretrained('monilouise/ner_pt_br')
# The tokenizer uses the cased BERTimbau vocabulary, so lowercasing stays disabled.
tokenizer = DistilBertTokenizerFast.from_pretrained(
    'neuralmind/bert-base-portuguese-cased',
    model_max_length=512,
    do_lower_case=False,
)
# grouped_entities=True merges sub-word tokens into whole entity spans.
nlp = pipeline('ner', model=model, tokenizer=tokenizer, grouped_entities=True)
result = nlp("O Tribunal de Contas da União é localizado em Brasília e foi fundado por Rui Barbosa.")
```
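
With `grouped_entities=True`, the pipeline returns a list of dictionaries, one per detected entity, each carrying an `entity_group`, a `word` and a `score`. The sketch below shows one possible way to collect entity texts by class. The `sample` list is a hand-written stand-in, not real model output: its label strings are borrowed from the class names listed in this card (the model's actual label ids may be spelled differently) and its scores are made up. The helper name `group_by_label` and the `min_score` threshold are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_label(entities: List[dict], min_score: float = 0.0) -> Dict[str, List[str]]:
    """Collect entity texts per label, dropping entries below min_score."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# Hand-written stand-in for `result`; labels and scores are illustrative only.
sample = [
    {"entity_group": "INSTITUIÇÃO PÚBLICA", "word": "Tribunal de Contas da União", "score": 0.99},
    {"entity_group": "LOCAL", "word": "Brasília", "score": 0.98},
    {"entity_group": "PESSOA", "word": "Rui Barbosa", "score": 0.40},
]
print(group_by_label(sample, min_score=0.5))
```

In practice the real `result` list would be passed in place of `sample`, and the threshold tuned (or left at `0.0`) depending on how many false positives are acceptable.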

#### Limitations and bias

- The fine-tuned model was trained on a corpus of around 180 news articles crawled from Google News. The original project's purpose was to recognize named entities in news related to fraud and corruption, classifying these entities into four classes: PERSON, ORGANIZATION, PUBLIC INSTITUTION and LOCATION (PESSOA, ORGANIZAÇÃO, INSTITUIÇÃO PÚBLICA and LOCAL).

## Training data

The training data can be found at <https://github.com/SecexSaudeTCU/noticias_ner/blob/master/dados/labeled_4_labels.jsonl>.


## Training procedure

## Eval results

- accuracy: 0.98
- precision: 0.86
- recall: 0.91
- f1: 0.88
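
As a quick sanity check on the figures above, the reported F1 can be compared against the harmonic mean of the reported precision and recall. This is a minimal sketch that assumes the F1 is the plain harmonic mean of those two numbers, as in a micro-averaged entity-level score. The helper name `f1_from` is hypothetical, and the inputs are the rounded values from this card, so only agreement to two decimals is meaningful.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values reported in this model card.
reported = {"precision": 0.86, "recall": 0.91, "f1": 0.88}

recomputed = f1_from(reported["precision"], reported["recall"])
consistent = round(recomputed, 2) == reported["f1"]
print(f"recomputed F1 = {recomputed:.4f}, consistent = {consistent}")
```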


The score was calculated using this code:

```python
from typing import Dict, List, Tuple

import numpy as np
# Assumption: the metric functions come from seqeval, which accepts lists of tag sequences.
from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score
from torch import nn
from transformers import EvalPrediction

# id2tag (label id -> tag) is assumed to be defined by the training script.

def align_predictions(predictions: np.ndarray, label_ids: np.ndarray) -> Tuple[List[int], List[int]]:
    # Pick the highest-scoring label id for every token.
    preds = np.argmax(predictions, axis=2)
    batch_size, seq_len = preds.shape
    out_label_list = [[] for _ in range(batch_size)]
    preds_list = [[] for _ in range(batch_size)]

    for i in range(batch_size):
        for j in range(seq_len):
            # Skip positions marked with the loss ignore index (-100 by default).
            if label_ids[i, j] != nn.CrossEntropyLoss().ignore_index:
                out_label_list[i].append(id2tag[label_ids[i][j]])
                preds_list[i].append(id2tag[preds[i][j]])

    return preds_list, out_label_list

def compute_metrics(p: EvalPrediction) -> Dict:
    preds_list, out_label_list = align_predictions(p.predictions, p.label_ids)
    return {
        "accuracy_score": accuracy_score(out_label_list, preds_list),
        "precision": precision_score(out_label_list, preds_list),
        "recall": recall_score(out_label_list, preds_list),
        "f1": f1_score(out_label_list, preds_list),
    }
```
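
To make the alignment step concrete, here is a small dependency-free sketch of the same idea on toy data. It assumes the usual convention that positions labelled -100 (for example special tokens or padding) are excluded from scoring. The `ID2TAG` mapping, the `IGNORE_INDEX` name, the `align` helper and the toy scores are all hypothetical and are not taken from the project.

```python
from typing import List, Sequence, Tuple

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss
ID2TAG = {0: "O", 1: "B-PESSOA", 2: "B-LOCAL"}  # hypothetical label map


def align(scores: Sequence[Sequence[Sequence[float]]],
          label_ids: Sequence[Sequence[int]]) -> Tuple[List[List[str]], List[List[str]]]:
    """Return (predicted tags, gold tags) per sentence, skipping ignored positions."""
    preds_out, gold_out = [], []
    for sent_scores, sent_labels in zip(scores, label_ids):
        sent_preds, sent_gold = [], []
        for token_scores, gold in zip(sent_scores, sent_labels):
            if gold == IGNORE_INDEX:
                continue
            # Argmax over the per-label scores of this token.
            best = max(range(len(token_scores)), key=token_scores.__getitem__)
            sent_preds.append(ID2TAG[best])
            sent_gold.append(ID2TAG[gold])
        preds_out.append(sent_preds)
        gold_out.append(sent_gold)
    return preds_out, gold_out


# One sentence, four positions; the first and last stand in for special tokens.
toy_scores = [[[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.2, 0.1, 0.7], [0.9, 0.05, 0.05]]]
toy_labels = [[IGNORE_INDEX, 1, 0, IGNORE_INDEX]]

preds, gold = align(toy_scores, toy_labels)
print(preds)  # [['B-PESSOA', 'B-LOCAL']]
print(gold)   # [['B-PESSOA', 'O']]
```

Only the two middle positions survive the filter, so the metric functions would compare one correct and one incorrect prediction for this sentence.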

### BibTeX entry and citation info

For further information about the BERTimbau language model:

```bibtex
@inproceedings{souza2020bertimbau,
  author    = {Souza, F{\'a}bio and Nogueira, Rodrigo and Lotufo, Roberto},
  title     = {{BERT}imbau: pretrained {BERT} models for {B}razilian {P}ortuguese},
  booktitle = {9th Brazilian Conference on Intelligent Systems, {BRACIS}, Rio Grande do Sul, Brazil, October 20-23 (to appear)},
  year      = {2020}
}

@article{souza2019portuguese,
  title   = {Portuguese Named Entity Recognition using BERT-CRF},
  author  = {Souza, F{\'a}bio and Nogueira, Rodrigo and Lotufo, Roberto},
  journal = {arXiv preprint arXiv:1909.10649},
  url     = {http://arxiv.org/abs/1909.10649},
  year    = {2019}
}
```