Upload 7 files
- README.md +161 -0
- config.json +69 -0
- flax_model.msgpack +3 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +1 -0
- tokenizer_config.json +1 -0
- vocab.txt +0 -0
README.md
ADDED
@@ -0,0 +1,161 @@
---
language: "en"
---

# SciBERT finetuned on JNLPBA for NER downstream task

## Language Model
[SciBERT](https://arxiv.org/pdf/1903.10676.pdf) is a pretrained language model based on BERT, trained by the
[Allen Institute for AI](https://allenai.org/) on papers from the corpus of
[Semantic Scholar](https://www.semanticscholar.org/).
The corpus size is 1.14M papers and 3.1B tokens. SciBERT has its own vocabulary (scivocab), built to best match
the training corpus.

## Downstream task
[`allenai/scibert_scivocab_cased`](https://huggingface.co/allenai/scibert_scivocab_cased#) has been finetuned for the Named Entity
Recognition (NER) downstream task. The code to train the NER model can be found [here](https://github.com/fran-martinez/bio_ner_bert).

### Data
The corpus used to fine-tune the NER model is the [BioNLP / JNLPBA shared task](http://www.geniaproject.org/shared-tasks/bionlp-jnlpba-shared-task-2004).

- Training data consist of 2,000 PubMed abstracts with term/word annotation. This corresponds to 18,546 samples (sentences).
- Evaluation data consist of 404 PubMed abstracts with term/word annotation. This corresponds to 3,856 samples (sentences).

The classes (at word level) and their distribution (number of examples per class) for the training and evaluation datasets are shown below:

| Class Label | # training examples | # evaluation examples |
|:------------|--------------------:|----------------------:|
| O           | 382,963             | 81,647                |
| B-protein   | 30,269              | 5,067                 |
| I-protein   | 24,848              | 4,774                 |
| B-cell_type | 6,718               | 1,921                 |
| I-cell_type | 8,748               | 2,991                 |
| B-DNA       | 9,533               | 1,056                 |
| I-DNA       | 15,774              | 1,789                 |
| B-cell_line | 3,830               | 500                   |
| I-cell_line | 7,387               | 989                   |
| B-RNA       | 951                 | 118                   |
| I-RNA       | 1,530               | 187                   |
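
The class labels follow the BIO scheme: `B-` marks the first word of an entity mention, `I-` marks its continuation words, and `O` marks words outside any entity. As a hypothetical word-level illustration (built from the example sentence used later in this card, not taken from the corpus):

````python
# Hypothetical BIO annotation at word level (illustration only).
sentence = [
    ("glucocorticoid", "B-protein"),    # first word of a protein mention
    ("receptor",       "I-protein"),    # continuation of the same mention
    ("from",           "O"),            # outside any entity
    ("normal",         "B-cell_type"),  # first word of a cell type mention
    ("CS",             "I-cell_type"),
    ("lymphocytes",    "I-cell_type"),
]
````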

### Model
An exhaustive hyperparameter search was performed.
The hyperparameters that provided the best results are:

- Max sequence length: 128
- Number of epochs: 6
- Batch size: 32
- Dropout: 0.3
- Optimizer: Adam

The learning rate was 5e-5 with a linearly decreasing schedule and a warmup over the first 0.1 of the total
training steps; a sketch of this setup is shown below.
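
For reference, the snippet below is a minimal sketch of that optimizer and schedule using `torch.optim.AdamW` and `get_linear_schedule_with_warmup` from `transformers`; it is not the exact training code (see the repository linked above), and `model` and `train_dataloader` are assumed to be defined elsewhere.

````python
import torch
from transformers import get_linear_schedule_with_warmup

# Assumed to exist: `model` (token classification model) and `train_dataloader`.
num_epochs = 6
num_training_steps = len(train_dataloader) * num_epochs
num_warmup_steps = int(0.1 * num_training_steps)  # warmup ratio of 0.1

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps)

for epoch in range(num_epochs):
    for batch in train_dataloader:
        loss = model(**batch)[0]  # first output is the loss when labels are passed
        loss.backward()
        optimizer.step()
        scheduler.step()  # linear decay after the warmup phase
        optimizer.zero_grad()
````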

The model from the epoch with the best F1-score was selected; in this case, the model from epoch 5.

### Evaluation
The following table shows the evaluation metrics calculated at span/entity level:

|               | precision | recall | f1-score |
|:--------------|----------:|-------:|---------:|
| cell_line     | 0.5205    | 0.7100 | 0.6007   |
| cell_type     | 0.7736    | 0.7422 | 0.7576   |
| protein       | 0.6953    | 0.8459 | 0.7633   |
| DNA           | 0.6997    | 0.7894 | 0.7419   |
| RNA           | 0.6985    | 0.8051 | 0.7480   |
| **micro avg** | 0.6984    | 0.8076 | 0.7490   |
| **macro avg** | 0.7032    | 0.8076 | 0.7498   |

The macro F1-score is 0.7498, compared to the value of 0.7728 reported by the Allen Institute for AI in their
[paper](https://arxiv.org/pdf/1903.10676.pdf). This drop in performance could be due to
several reasons; one hypothesis is that the authors used an additional conditional random field,
while this model uses a regular classification layer with softmax activation on top of the SciBERT model.

At word level, this model achieves a precision of 0.7742, a recall of 0.8536 and an F1-score of 0.8093.
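
Span/entity-level metrics of this kind can be computed, for example, with the `seqeval` package (named here as one option, not necessarily the tooling used for this card): an entity only counts as correct when both its full span and its type match the gold annotation. A minimal sketch with made-up tag sequences:

````python
from seqeval.metrics import classification_report, f1_score

# Made-up gold and predicted BIO tag sequences, one inner list per sentence.
y_true = [["O", "B-protein", "I-protein", "O", "B-cell_type", "I-cell_type"]]
y_pred = [["O", "B-protein", "I-protein", "O", "B-cell_type", "O"]]

# The protein span matches exactly; the cell_type span is truncated, so it counts as wrong.
print(classification_report(y_true, y_pred))
print("micro F1:", f1_score(y_true, y_pred))
````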

### Model usage in inference
Use the pipeline:
````python
from transformers import pipeline

text = "Mouse thymus was used as a source of glucocorticoid receptor from normal CS lymphocytes."

nlp_ner = pipeline("ner",
                   model='fran-martinez/scibert_scivocab_cased_ner_jnlpba',
                   tokenizer='fran-martinez/scibert_scivocab_cased_ner_jnlpba')

nlp_ner(text)

"""
Output:
---------------------------
[
 {'word': 'glucocorticoid',
  'score': 0.9894881248474121,
  'entity': 'B-protein'},

 {'word': 'receptor',
  'score': 0.989505410194397,
  'entity': 'I-protein'},

 {'word': 'normal',
  'score': 0.7680378556251526,
  'entity': 'B-cell_type'},

 {'word': 'cs',
  'score': 0.5176806449890137,
  'entity': 'I-cell_type'},

 {'word': 'lymphocytes',
  'score': 0.9898491501808167,
  'entity': 'I-cell_type'}
]
"""
````
115 |
+
Or load model and tokenizer as follows:
|
116 |
+
````python
|
117 |
+
import torch
|
118 |
+
from transformers import AutoTokenizer, AutoModelForTokenClassification
|
119 |
+
|
120 |
+
# Example
|
121 |
+
text = "Mouse thymus was used as a source of glucocorticoid receptor from normal CS lymphocytes."
|
122 |
+
|
123 |
+
# Load model
|
124 |
+
tokenizer = AutoTokenizer.from_pretrained("fran-martinez/scibert_scivocab_cased_ner_jnlpba")
|
125 |
+
model = AutoModelForTokenClassification.from_pretrained("fran-martinez/scibert_scivocab_cased_ner_jnlpba")
|
126 |
+
|
127 |
+
# Get input for BERT
|
128 |
+
input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)
|
129 |
+
|
130 |
+
# Predict
|
131 |
+
with torch.no_grad():
|
132 |
+
outputs = model(input_ids)
|
133 |
+
|
134 |
+
# From the output let's take the first element of the tuple.
|
135 |
+
# Then, let's get rid of [CLS] and [SEP] tokens (first and last)
|
136 |
+
predictions = outputs[0].argmax(axis=-1)[0][1:-1]
|
137 |
+
|
138 |
+
# Map label class indexes to string labels.
|
139 |
+
for token, pred in zip(tokenizer.tokenize(text), predictions):
|
140 |
+
print(token, '->', model.config.id2label[pred.numpy().item()])
|
141 |
+
|
142 |
+
"""
|
143 |
+
Output:
|
144 |
+
---------------------------
|
145 |
+
mouse -> O
|
146 |
+
thymus -> O
|
147 |
+
was -> O
|
148 |
+
used -> O
|
149 |
+
as -> O
|
150 |
+
a -> O
|
151 |
+
source -> O
|
152 |
+
of -> O
|
153 |
+
glucocorticoid -> B-protein
|
154 |
+
receptor -> I-protein
|
155 |
+
from -> O
|
156 |
+
normal -> B-cell_type
|
157 |
+
cs -> I-cell_type
|
158 |
+
lymphocytes -> I-cell_type
|
159 |
+
. -> O
|
160 |
+
"""
|
161 |
+
````
|
config.json
ADDED
@@ -0,0 +1,69 @@
{
  "_num_labels": 11,
  "architectures": [
    "BertForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.3,
  "bos_token_id": null,
  "do_sample": false,
  "early_stopping": false,
  "eos_token_id": null,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.3,
  "hidden_size": 768,
  "id2label": {
    "0": "I-cell_type",
    "1": "B-DNA",
    "10": "B-cell_type",
    "2": "O",
    "3": "I-cell_line",
    "4": "I-protein",
    "5": "I-RNA",
    "6": "B-cell_line",
    "7": "B-RNA",
    "8": "I-DNA",
    "9": "B-protein"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": false,
  "is_encoder_decoder": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_10": 10,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6,
    "LABEL_7": 7,
    "LABEL_8": 8,
    "LABEL_9": 9
  },
  "layer_norm_eps": 1e-12,
  "length_penalty": 1.0,
  "max_length": 20,
  "max_position_embeddings": 512,
  "min_length": 0,
  "model_type": "bert",
  "no_repeat_ngram_size": 0,
  "num_attention_heads": 12,
  "num_beams": 1,
  "num_hidden_layers": 12,
  "num_return_sequences": 1,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": true,
  "pad_token_id": 0,
  "pruned_heads": {},
  "repetition_penalty": 1.0,
  "temperature": 1.0,
  "top_k": 50,
  "top_p": 1.0,
  "torchscript": false,
  "type_vocab_size": 2,
  "use_bfloat16": false,
  "vocab_size": 31090
}
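
As a quick sanity check, the label mapping stored above can be inspected after loading the config; a minimal sketch (note that `transformers` converts the `id2label` keys to integers on load):

````python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("fran-martinez/scibert_scivocab_cased_ner_jnlpba")
print(config.num_labels)  # 11
print(config.id2label)    # {0: 'I-cell_type', 1: 'B-DNA', ..., 9: 'B-protein'}
````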
flax_model.msgpack
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:69666bb5a436690197ee7e3ff010891140b85cc3dab7013a205df9555cce00ea
size 437352466
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f318c0c9452000f211edc4bc5b7eb0fea906e55544af8004d3ab09cea02924eb
size 439757565
special_tokens_map.json
ADDED
@@ -0,0 +1 @@
{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer_config.json
ADDED
@@ -0,0 +1 @@
{}
vocab.txt
ADDED
The diff for this file is too large to render.