tlemberger committed
Commit 6bbac4f
1 Parent(s): 20edabf

update training param and changed name

Files changed (1)
  1. README.md +17 -16
README.md CHANGED
@@ -7,16 +7,16 @@ tags:
 -
 license: agpl-3.0
 datasets:
-- EMBO/sd-nlp `PANELIZATION`
+- EMBO/sd-figures `PANELIZATION`
 metrics:
 -
 ---
 
-# sd-panels
+# sd-panelization
 
 ## Model description
 
-This model is a [RoBERTa base model](https://huggingface.co/roberta-base) that was further trained using a masked language modeling task on a compendium of English scientific texts from the life sciences, using the [BioLang dataset](https://huggingface.co/datasets/EMBO/biolang). It was then fine-tuned for token classification on the SourceData [sd-nlp](https://huggingface.co/datasets/EMBO/sd-nlp) dataset with the `PANELIZATION` task to 'parse' or 'segment' figure legends into fragments corresponding to sub-panels.
+This model is a [RoBERTa base model](https://huggingface.co/roberta-base) that was further trained using a masked language modeling task on a compendium of English scientific texts from the life sciences, using the [BioLang dataset](https://huggingface.co/datasets/EMBO/biolang). It was then fine-tuned for token classification on the SourceData [sd-figures](https://huggingface.co/datasets/EMBO/sd-figures) dataset with the `PANELIZATION` task to 'parse' or 'segment' figure legends into fragments corresponding to sub-panels.
 
 Figures are usually composite representations of results obtained with heterogeneous experimental approaches and systems. Breaking figures into panels makes it possible to identify more coherent descriptions of individual scientific experiments.
 
@@ -32,7 +32,7 @@ To have a quick check of the model:
 from transformers import pipeline, RobertaTokenizerFast, RobertaForTokenClassification
 example = """Fig 4. a, Volume density of early (Avi) and late (Avd) autophagic vacuoles from four independent cultures. Examples of Avi and Avd are shown in b and c, respectively. Bars represent 0.4 μm. d, Labelling density of cathepsin-D as estimated in two independent experiments. e, Labelling density of LAMP-1."""
 tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', max_len=512)
-model = RobertaForTokenClassification.from_pretrained('EMBO/sd-panels')
+model = RobertaForTokenClassification.from_pretrained('EMBO/sd-panelization')
 ner = pipeline('ner', model=model, tokenizer=tokenizer)
 res = ner(example)
 for r in res: print(r['word'], r['entity'])
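
The pipeline output above is a per-token tag stream. To recover the actual panel fragments, the legend can be cut at each token tagged `B-PANEL_START`. A minimal sketch, assuming the model's label map exposes `B-PANEL_START` and that the pipeline returns character offsets (which holds with a fast tokenizer); `split_panels` is an illustrative helper, not part of the model or `transformers`:

```python
# Sketch: cut the legend at each token tagged B-PANEL_START.
# Assumes each entity dict carries a character offset under 'start',
# which holds when a fast tokenizer (RobertaTokenizerFast) is used.
def split_panels(text: str, entities: list) -> list:
    cuts = sorted({e["start"] for e in entities if e["entity"] == "B-PANEL_START"})
    if not cuts or cuts[0] != 0:
        cuts = [0] + cuts
    cuts.append(len(text))
    return [text[a:b].strip() for a, b in zip(cuts, cuts[1:])]

for panel in split_panels(example, res):
    print(panel)
```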
@@ -44,7 +44,7 @@ The model must be used with the `roberta-base` tokenizer.
 
 ## Training data
 
-The model was trained for token classification using the [EMBO/sd-nlp `PANELIZATION`](https://huggingface.co/datasets/EMBO/sd-nlp) dataset, which includes manually annotated examples.
+The model was trained for token classification using the [EMBO/sd-figures `PANELIZATION`](https://huggingface.co/datasets/EMBO/sd-figures) dataset, which includes manually annotated examples.
 
 ## Training procedure
 
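
As a quick way to inspect these annotated examples, a minimal sketch with the `datasets` library; the `PANELIZATION` configuration name follows this card, while the column names are assumptions that may differ:

```python
# Sketch: load the PANELIZATION configuration of the training data.
# The configuration name follows this card; the exact column layout
# is an assumption and may differ.
from datasets import load_dataset

ds = load_dataset("EMBO/sd-figures", "PANELIZATION")
print(ds)                      # splits and sizes
print(ds["train"][0].keys())   # inspect the available columns
```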
 
@@ -52,15 +52,16 @@ The training was run on a NVIDIA DGX Station with 4XTesla V100 GPUs.
52
 
53
  Training code is available at https://github.com/source-data/soda-roberta
54
 
55
- - Command: `python -m tokcl.train PANELIZATION --num_train_epochs=10`
56
  - Tokenizer vocab size: 50265
57
- - Training data: EMBO/sd-nlp NER
 
58
  - TTraining with 2175 examples.
59
  - Evaluating on 622 examples.
60
  - Training on 2 features: `O`, `B-PANEL_START`
61
- - Epochs: 10.0
62
- - `per_device_train_batch_size`: 32
63
- - `per_device_eval_batch_size`: 32
64
  - `learning_rate`: 0.0001
65
  - `weight_decay`: 0.0
66
  - `adam_beta1`: 0.9
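
For reference, the hyperparameters above map directly onto `transformers.TrainingArguments`. A minimal sketch mirroring the values in this card, not the exact soda-roberta training entry point:

```python
# Sketch: the card's hyperparameters expressed as TrainingArguments.
# Mirrors the values listed above; not the actual soda-roberta script.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="sd-panelization",
    num_train_epochs=1.3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=1e-4,
    weight_decay=0.0,
    adam_beta1=0.9,
)
```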
@@ -70,14 +71,14 @@ Training code is available at https://github.com/source-data/soda-roberta
 
 ## Eval results
 
-Testing on 337 examples from the test set with `sklearn.metrics`:
+Testing on 1802 examples from the test set with `sklearn.metrics`:
 
 ```
               precision    recall  f1-score   support
 
-PANEL_START        0.88      0.97      0.92       785
+PANEL_START        0.89      0.95      0.92      5427
 
-   micro avg       0.88      0.97      0.92       785
-   macro avg       0.88      0.97      0.92       785
-weighted avg       0.88      0.97      0.92       785
+   micro avg       0.89      0.95      0.92      5427
+   macro avg       0.89      0.95      0.92      5427
+weighted avg       0.89      0.95      0.92      5427
 ```
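
The report above follows the `sklearn.metrics.classification_report` layout. A minimal sketch of how such a report is produced from flat per-token tag sequences; `y_true` and `y_pred` are illustrative placeholders, not the actual evaluation data:

```python
# Sketch: compute a report like the one above with scikit-learn.
# y_true / y_pred stand in for the flattened gold and predicted tags.
from sklearn.metrics import classification_report

y_true = ["O", "B-PANEL_START", "O", "O", "B-PANEL_START", "O"]
y_pred = ["O", "B-PANEL_START", "O", "B-PANEL_START", "B-PANEL_START", "O"]

print(classification_report(y_true, y_pred, labels=["B-PANEL_START"]))
```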
 