---
language: 
- english
thumbnail: 
tags:
- language model
license: 
datasets:
- EMBO/biolang
metrics:
-
---

# bio-lm

## Model description

This model is a [RoBERTa base pre-trained model](https://huggingface.co/roberta-base) that was further trained with a masked language modeling task on a compendium of English scientific texts from the life sciences, using the [BioLang dataset](https://huggingface.co/datasets/EMBO/biolang).

## Intended uses & limitations

#### How to use

The intended use of this model is to be fine-tuned for downstream tasks, token classification in particular.

To have a quick check of the model as-is in a fill-mask task:

```python
from transformers import pipeline, RobertaTokenizerFast
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', model_max_length=512)
text = "Let us try this model to see if it <mask>."
fill_mask = pipeline(
    "fill-mask",
    model='EMBO/bio-lm',
    tokenizer=tokenizer
)
fill_mask(text)
```

#### Limitations and bias

This model should be fine-tuned on a specific task, such as token classification, before use.
The model must be used with the `roberta-base` tokenizer.

## Training data

The model was trained with a masked language modeling task on the [BioLang dataset](https://huggingface.co/datasets/EMBO/biolang), which includes 12 million examples drawn from abstracts and figure legends of papers published in the life sciences.

## Training procedure

The training was run on an NVIDIA DGX Station with 4 × Tesla V100 GPUs.

Training code is available at https://github.com/source-data/soda-roberta

- Command: `python -m lm.train /data/json/oapmc_abstracts_figs/ MLM`
- Tokenizer vocab size: 50265
- Training data: EMBO/biolang MLM
- Training with: 12005390 examples
- Evaluating on: 36713 examples
- Epochs: 3.0
- `per_device_train_batch_size`: 16
- `per_device_eval_batch_size`: 16
- `learning_rate`: 5e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0
- tensorboard run: lm-MLM-2021-01-27T15-17-43.113766

End of training:
```
training set loss:     0.8653350830078125
validation set loss:   0.8192330598831177
validation set recall: 0.8154601116513597
```

## Eval results

Eval on test set:
```
recall: 0.814471959728645
```