Upload 7 files
- README.md +161 -0
- config.json +69 -0
- flax_model.msgpack +3 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +1 -0
- tokenizer_config.json +1 -0
- vocab.txt +0 -0
README.md
ADDED
@@ -0,0 +1,161 @@
---
language: "en"
---

# SciBERT finetuned on JNLPBA for NER downstream task

## Language Model
[SciBERT](https://arxiv.org/pdf/1903.10676.pdf) is a pretrained language model based on BERT, trained by the
[Allen Institute for AI](https://allenai.org/) on papers from the corpus of
[Semantic Scholar](https://www.semanticscholar.org/).
The corpus size is 1.14M papers and 3.1B tokens. SciBERT has its own vocabulary (scivocab), built to best match
the training corpus.

## Downstream task
[`allenai/scibert_scivocab_cased`](https://huggingface.co/allenai/scibert_scivocab_cased#) has been finetuned for the Named Entity
Recognition (NER) downstream task. The code to train the NER model can be found [here](https://github.com/fran-martinez/bio_ner_bert).

### Data
The corpus used to fine-tune the NER model is the [BioNLP / JNLPBA shared task](http://www.geniaproject.org/shared-tasks/bionlp-jnlpba-shared-task-2004).

- Training data consist of 2,000 PubMed abstracts with term/word annotation. This corresponds to 18,546 samples (sentences).
- Evaluation data consist of 404 PubMed abstracts with term/word annotation. This corresponds to 3,856 samples (sentences).

The classes (at word level) and their distribution (number of examples per class) for the training and evaluation datasets are shown below:

| Class Label | # training examples | # evaluation examples |
|:------------|--------------------:|----------------------:|
| O           | 382,963             | 81,647                |
| B-protein   | 30,269              | 5,067                 |
| I-protein   | 24,848              | 4,774                 |
| B-cell_type | 6,718               | 1,921                 |
| I-cell_type | 8,748               | 2,991                 |
| B-DNA       | 9,533               | 1,056                 |
| I-DNA       | 15,774              | 1,789                 |
| B-cell_line | 3,830               | 500                   |
| I-cell_line | 7,387               | 989                   |
| B-RNA       | 951                 | 118                   |
| I-RNA       | 1,530               | 187                   |
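
The class labels follow the BIO scheme: `B-` marks the first word of an entity mention, `I-` marks its continuation words, and `O` marks words outside any entity. As a hypothetical word-level illustration (built from the example sentence used later in this card, not taken from the corpus):

````python
# Hypothetical BIO annotation at word level (illustration only).
sentence = [
    ("glucocorticoid", "B-protein"),    # first word of a protein mention
    ("receptor",       "I-protein"),    # continuation of the same mention
    ("from",           "O"),            # outside any entity
    ("normal",         "B-cell_type"),  # first word of a cell type mention
    ("CS",             "I-cell_type"),
    ("lymphocytes",    "I-cell_type"),
]
````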

### Model
An exhaustive hyperparameter search was performed.
The hyperparameters that provided the best results are:

- Max sequence length: 128
- Number of epochs: 6
- Batch size: 32
- Dropout: 0.3
- Optimizer: Adam

The learning rate was 5e-5 with a linearly decreasing schedule and a warmup over the first 0.1 of the total
training steps; a sketch of this setup is shown below.
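
For reference, the snippet below is a minimal sketch of that optimizer and schedule using `torch.optim.AdamW` and `get_linear_schedule_with_warmup` from `transformers`; it is not the exact training code (see the repository linked above), and `model` and `train_dataloader` are assumed to be defined elsewhere.

````python
import torch
from transformers import get_linear_schedule_with_warmup

# Assumed to exist: `model` (token classification model) and `train_dataloader`.
num_epochs = 6
num_training_steps = len(train_dataloader) * num_epochs
num_warmup_steps = int(0.1 * num_training_steps)  # warmup ratio of 0.1

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps)

for epoch in range(num_epochs):
    for batch in train_dataloader:
        loss = model(**batch)[0]  # first output is the loss when labels are passed
        loss.backward()
        optimizer.step()
        scheduler.step()  # linear decay after the warmup phase
        optimizer.zero_grad()
````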

The model from the epoch with the best F1-score was selected; in this case, the model from epoch 5.

### Evaluation
The following table shows the evaluation metrics calculated at span/entity level:

|               | precision | recall | f1-score |
|:--------------|----------:|-------:|---------:|
| cell_line     | 0.5205    | 0.7100 | 0.6007   |
| cell_type     | 0.7736    | 0.7422 | 0.7576   |
| protein       | 0.6953    | 0.8459 | 0.7633   |
| DNA           | 0.6997    | 0.7894 | 0.7419   |
| RNA           | 0.6985    | 0.8051 | 0.7480   |
| **micro avg** | 0.6984    | 0.8076 | 0.7490   |
| **macro avg** | 0.7032    | 0.8076 | 0.7498   |

The macro F1-score is 0.7498, compared to the value of 0.7728 reported by the Allen Institute for AI in their
[paper](https://arxiv.org/pdf/1903.10676.pdf). This drop in performance could be due to
several reasons; one hypothesis is that the authors used an additional conditional random field,
while this model uses a regular classification layer with softmax activation on top of the SciBERT model.

At word level, this model achieves a precision of 0.7742, a recall of 0.8536 and an F1-score of 0.8093.
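
Span/entity-level metrics of this kind can be computed, for example, with the `seqeval` package (named here as one option, not necessarily the tooling used for this card): an entity only counts as correct when both its full span and its type match the gold annotation. A minimal sketch with made-up tag sequences:

````python
from seqeval.metrics import classification_report, f1_score

# Made-up gold and predicted BIO tag sequences, one inner list per sentence.
y_true = [["O", "B-protein", "I-protein", "O", "B-cell_type", "I-cell_type"]]
y_pred = [["O", "B-protein", "I-protein", "O", "B-cell_type", "O"]]

# The protein span matches exactly; the cell_type span is truncated, so it counts as wrong.
print(classification_report(y_true, y_pred))
print("micro F1:", f1_score(y_true, y_pred))
````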

### Model usage in inference
Use the pipeline:
````python
from transformers import pipeline

text = "Mouse thymus was used as a source of glucocorticoid receptor from normal CS lymphocytes."

nlp_ner = pipeline("ner",
                   model='fran-martinez/scibert_scivocab_cased_ner_jnlpba',
                   tokenizer='fran-martinez/scibert_scivocab_cased_ner_jnlpba')

nlp_ner(text)

"""
Output:
---------------------------
[
 {'word': 'glucocorticoid',
  'score': 0.9894881248474121,
  'entity': 'B-protein'},

 {'word': 'receptor',
  'score': 0.989505410194397,
  'entity': 'I-protein'},

 {'word': 'normal',
  'score': 0.7680378556251526,
  'entity': 'B-cell_type'},

 {'word': 'cs',
  'score': 0.5176806449890137,
  'entity': 'I-cell_type'},

 {'word': 'lymphocytes',
  'score': 0.9898491501808167,
  'entity': 'I-cell_type'}
]
"""
````
115 |
+
Or load model and tokenizer as follows:
|
116 |
+
````python
|
117 |
+
import torch
|
118 |
+
from transformers import AutoTokenizer, AutoModelForTokenClassification
|
119 |
+
|
120 |
+
# Example
|
121 |
+
text = "Mouse thymus was used as a source of glucocorticoid receptor from normal CS lymphocytes."
|
122 |
+
|
123 |
+
# Load model
|
124 |
+
tokenizer = AutoTokenizer.from_pretrained("fran-martinez/scibert_scivocab_cased_ner_jnlpba")
|
125 |
+
model = AutoModelForTokenClassification.from_pretrained("fran-martinez/scibert_scivocab_cased_ner_jnlpba")
|
126 |
+
|
127 |
+
# Get input for BERT
|
128 |
+
input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)
|
129 |
+
|
130 |
+
# Predict
|
131 |
+
with torch.no_grad():
|
132 |
+
outputs = model(input_ids)
|
133 |
+
|
134 |
+
# From the output let's take the first element of the tuple.
|
135 |
+
# Then, let's get rid of [CLS] and [SEP] tokens (first and last)
|
136 |
+
predictions = outputs[0].argmax(axis=-1)[0][1:-1]
|
137 |
+
|
138 |
+
# Map label class indexes to string labels.
|
139 |
+
for token, pred in zip(tokenizer.tokenize(text), predictions):
|
140 |
+
print(token, '->', model.config.id2label[pred.numpy().item()])
|
141 |
+
|
142 |
+
"""
|
143 |
+
Output:
|
144 |
+
---------------------------
|
145 |
+
mouse -> O
|
146 |
+
thymus -> O
|
147 |
+
was -> O
|
148 |
+
used -> O
|
149 |
+
as -> O
|
150 |
+
a -> O
|
151 |
+
source -> O
|
152 |
+
of -> O
|
153 |
+
glucocorticoid -> B-protein
|
154 |
+
receptor -> I-protein
|
155 |
+
from -> O
|
156 |
+
normal -> B-cell_type
|
157 |
+
cs -> I-cell_type
|
158 |
+
lymphocytes -> I-cell_type
|
159 |
+
. -> O
|
160 |
+
"""
|
161 |
+
````
|
config.json
ADDED
@@ -0,0 +1,69 @@
{
  "_num_labels": 11,
  "architectures": [
    "BertForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.3,
  "bos_token_id": null,
  "do_sample": false,
  "early_stopping": false,
  "eos_token_id": null,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.3,
  "hidden_size": 768,
  "id2label": {
    "0": "I-cell_type",
    "1": "B-DNA",
    "10": "B-cell_type",
    "2": "O",
    "3": "I-cell_line",
    "4": "I-protein",
    "5": "I-RNA",
    "6": "B-cell_line",
    "7": "B-RNA",
    "8": "I-DNA",
    "9": "B-protein"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": false,
  "is_encoder_decoder": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_10": 10,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6,
    "LABEL_7": 7,
    "LABEL_8": 8,
    "LABEL_9": 9
  },
  "layer_norm_eps": 1e-12,
  "length_penalty": 1.0,
  "max_length": 20,
  "max_position_embeddings": 512,
  "min_length": 0,
  "model_type": "bert",
  "no_repeat_ngram_size": 0,
  "num_attention_heads": 12,
  "num_beams": 1,
  "num_hidden_layers": 12,
  "num_return_sequences": 1,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": true,
  "pad_token_id": 0,
  "pruned_heads": {},
  "repetition_penalty": 1.0,
  "temperature": 1.0,
  "top_k": 50,
  "top_p": 1.0,
  "torchscript": false,
  "type_vocab_size": 2,
  "use_bfloat16": false,
  "vocab_size": 31090
}
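
As a quick sanity check, the label mapping stored above can be inspected after loading the config; a minimal sketch (note that `transformers` converts the `id2label` keys to integers on load):

````python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("fran-martinez/scibert_scivocab_cased_ner_jnlpba")
print(config.num_labels)  # 11
print(config.id2label)    # {0: 'I-cell_type', 1: 'B-DNA', ..., 9: 'B-protein'}
````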
flax_model.msgpack
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:69666bb5a436690197ee7e3ff010891140b85cc3dab7013a205df9555cce00ea
size 437352466
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f318c0c9452000f211edc4bc5b7eb0fea906e55544af8004d3ab09cea02924eb
size 439757565
special_tokens_map.json
ADDED
@@ -0,0 +1 @@
{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer_config.json
ADDED
@@ -0,0 +1 @@
{}
vocab.txt
ADDED
The diff for this file is too large to render.