lbourdois committed
Commit 5a2f975
1 Parent(s): a0eface

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md (+84 -86)
README.md CHANGED
@@ -1,86 +1,84 @@
- ---
- language:
- - english
- thumbnail:
- tags:
- - token classification
- license: agpl-3.0
- datasets:
- - EMBO/sd-nlp
- metrics:
- -
- ---
-
- # sd-geneprod-roles
-
- ## Model description
-
- This model is a [RoBERTa base model](https://huggingface.co/roberta-base) that was further trained using a masked language modeling task on a compendium of English scientific textual examples from the life sciences using the [BioLang dataset](https://huggingface.co/datasets/EMBO/biolang). It was then fine-tuned for token classification on the SourceData [sd-nlp](https://huggingface.co/datasets/EMBO/sd-nlp) dataset with the `GENEPROD_ROLES` configuration to perform pure context-dependent semantic role classification of bioentities.
-
-
- ## Intended uses & limitations
-
- #### How to use
-
- The intended use of this model is to infer the semantic role of gene products (genes and proteins) with regard to the causal hypotheses tested in experiments reported in scientific papers.
-
- To have a quick check of the model:
-
- ```python
- from transformers import pipeline, RobertaTokenizerFast, RobertaForTokenClassification
- example = """<s>The <mask> overexpression in cells caused an increase in <mask> expression.</s>"""
- tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', max_len=512)
- model = RobertaForTokenClassification.from_pretrained('EMBO/sd-geneprod-roles')
- ner = pipeline('ner', model, tokenizer=tokenizer)
- res = ner(example)
- for r in res:
-     print(r['word'], r['entity'])
- ```
-
- #### Limitations and bias
-
- The model must be used with the `roberta-base` tokenizer.
-
- ## Training data
-
- The model was trained for token classification using the [EMBO/sd-nlp dataset](https://huggingface.co/datasets/EMBO/sd-nlp) which includes manually annotated examples.
-
- ## Training procedure
-
- The training was run on an NVIDIA DGX Station with 4XTesla V100 GPUs.
-
- Training code is available at https://github.com/source-data/soda-roberta
-
- - Model fine-tuned: EMBL/bio-lm
- - Tokenizer vocab size: 50265
- - Training data: EMBO/sd-nlp
- - Dataset configuration: GENEPROD_ROLES
- - Training with 48771 examples.
- - Evaluating on 13801 examples.
- - Training on 15 features: O, I-CONTROLLED_VAR, B-CONTROLLED_VAR, I-MEASURED_VAR, B-MEASURED_VAR
- - Epochs: 0.9
- - `per_device_train_batch_size`: 16
- - `per_device_eval_batch_size`: 16
- - `learning_rate`: 0.0001
- - `weight_decay`: 0.0
- - `adam_beta1`: 0.9
- - `adam_beta2`: 0.999
- - `adam_epsilon`: 1e-08
- - `max_grad_norm`: 1.0
-
- ## Eval results
-
- On 7178 example of test set with `sklearn.metrics`:
-
- ```
-                 precision    recall  f1-score   support
-
- CONTROLLED_VAR       0.81      0.86      0.83      7835
-   MEASURED_VAR       0.82      0.85      0.84      9330
-
-      micro avg       0.82      0.85      0.83     17165
-      macro avg       0.82      0.85      0.83     17165
-   weighted avg       0.82      0.85      0.83     17165
-
- {'test_loss': 0.03846803680062294, 'test_accuracy_score': 0.9854472664459946, 'test_precision': 0.8156312625250501, 'test_recall': 0.8535974366443344, 'test_f1': 0.8341825841897008, 'test_runtime': 58.7369, 'test_samples_per_second': 122.206, 'test_steps_per_second': 1.924}
- ```
 
+ ---
+ language: en
+ license: agpl-3.0
+ tags:
+ - token classification
+ datasets:
+ - EMBO/sd-nlp
+ metrics: []
+ ---
+
+
+ # sd-geneprod-roles
+
+ ## Model description
+
+ This model is a [RoBERTa base model](https://huggingface.co/roberta-base) that was further trained using a masked language modeling task on a compendium of English scientific textual examples from the life sciences using the [BioLang dataset](https://huggingface.co/datasets/EMBO/biolang). It was then fine-tuned for token classification on the SourceData [sd-nlp](https://huggingface.co/datasets/EMBO/sd-nlp) dataset with the `GENEPROD_ROLES` configuration to perform purely context-dependent semantic role classification of bioentities.
+
+
+ ## Intended uses & limitations
+
+ #### How to use
+
+ The intended use of this model is to infer the semantic role of gene products (genes and proteins) with regard to the causal hypotheses tested in experiments reported in scientific papers.
+
+ For a quick check of the model:
+
+ ```python
+ from transformers import pipeline, RobertaTokenizerFast, RobertaForTokenClassification
+
+ example = """<s>The <mask> overexpression in cells caused an increase in <mask> expression.</s>"""
+ tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', model_max_length=512)
+ model = RobertaForTokenClassification.from_pretrained('EMBO/sd-geneprod-roles')
+ ner = pipeline('ner', model=model, tokenizer=tokenizer)
+ res = ner(example)
+ for r in res:
+     print(r['word'], r['entity'])
+ ```
+
+ #### Limitations and bias
+
+ The model must be used with the `roberta-base` tokenizer.
+
+ ## Training data
+
+ The model was trained for token classification using the [EMBO/sd-nlp dataset](https://huggingface.co/datasets/EMBO/sd-nlp), which includes manually annotated examples.
+
+ ## Training procedure
+
+ Training was run on an NVIDIA DGX Station with 4× Tesla V100 GPUs.
+
+ Training code is available at https://github.com/source-data/soda-roberta
+
+ - Model fine-tuned: EMBL/bio-lm
+ - Tokenizer vocab size: 50265
+ - Training data: EMBO/sd-nlp
+ - Dataset configuration: GENEPROD_ROLES
+ - Training with 48771 examples.
+ - Evaluating on 13801 examples.
+ - Training on 15 features: O, I-CONTROLLED_VAR, B-CONTROLLED_VAR, I-MEASURED_VAR, B-MEASURED_VAR
+ - Epochs: 0.9
+ - `per_device_train_batch_size`: 16
+ - `per_device_eval_batch_size`: 16
+ - `learning_rate`: 0.0001
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1.0
+
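The hyperparameters listed above map directly onto Hugging Face `TrainingArguments`; a minimal sketch of the equivalent configuration (the output directory is a placeholder of ours, not taken from the card, and any argument not listed above keeps its library default):

```python
from transformers import TrainingArguments

# Values copied from the hyperparameter list above.
# "./sd-geneprod-roles" is a placeholder output directory.
training_args = TrainingArguments(
    output_dir="./sd-geneprod-roles",
    num_train_epochs=0.9,          # fractional epochs are allowed
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=1e-4,
    weight_decay=0.0,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    max_grad_norm=1.0,
)
```

This object would then be passed to a `Trainer` together with the model, tokenizer, and tokenized `GENEPROD_ROLES` splits.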
+ ## Eval results
+
+ On the 7178 examples of the test set, with `sklearn.metrics`:
+
+ ```
+                 precision    recall  f1-score   support
+
+ CONTROLLED_VAR       0.81      0.86      0.83      7835
+   MEASURED_VAR       0.82      0.85      0.84      9330
+
+      micro avg       0.82      0.85      0.83     17165
+      macro avg       0.82      0.85      0.83     17165
+   weighted avg       0.82      0.85      0.83     17165
+
+ {'test_loss': 0.03846803680062294, 'test_accuracy_score': 0.9854472664459946, 'test_precision': 0.8156312625250501, 'test_recall': 0.8535974366443344, 'test_f1': 0.8341825841897008, 'test_runtime': 58.7369, 'test_samples_per_second': 122.206, 'test_steps_per_second': 1.924}
+ ```
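As a quick sanity check on the dictionary above: the micro-averaged `test_f1` is, by definition, the harmonic mean of `test_precision` and `test_recall`, so it can be recomputed from the two reported values alone:

```python
# Micro-averaged precision and recall as reported in the eval dictionary
precision = 0.8156312625250501
recall = 0.8535974366443344

# Micro F1 is the harmonic mean of micro precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # → 0.8342, matching the reported test_f1
```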