---
language:
-
-
thumbnail:
tags:
-
-
-
license:
datasets:
-
-
metrics:
-
-
---

# bio-lm

## Model description

This model is a [RoBERTa base model](https://huggingface.co/roberta-base) further trained with a masked language modeling task on a compendium of English scientific texts from the life sciences, using the [BioLang dataset](https://huggingface.co/datasets/EMBO/biolang).

## Intended uses & limitations

#### How to use

The intended use of this model is to be fine-tuned for downstream tasks, token classification in particular.

To have a quick check of the model as-is in a fill-mask task:

```python
from transformers import pipeline, RobertaTokenizerFast

# the model was trained with the roberta-base tokenizer
# (model_max_length replaces the deprecated max_len argument)
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', model_max_length=512)
text = "Let us try this model to see if it <mask>."
fill_mask = pipeline(
    "fill-mask",
    model='EMBO/bio-lm',
    tokenizer=tokenizer,
)
fill_mask(text)
```

#### Limitations and bias

This model should be fine-tuned on a specific task like token classification.
The model must be used with the `roberta-base` tokenizer.

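A minimal sketch of loading the checkpoint for token-classification fine-tuning with the `roberta-base` tokenizer, as recommended above; the number of labels is an arbitrary placeholder, not part of the original card:

```python
# Minimal sketch: load EMBO/bio-lm for token-classification fine-tuning.
# num_labels=3 is an arbitrary placeholder; use the label set of your task.
from transformers import RobertaTokenizerFast, RobertaForTokenClassification

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForTokenClassification.from_pretrained("EMBO/bio-lm", num_labels=3)
# The model is now ready to be fine-tuned, e.g. with the Trainer API.
```
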
## Training data

The model was trained with a masked language modeling task on the [BioLang dataset](https://huggingface.co/datasets/EMBO/biolang), which includes 12 million examples drawn from abstracts and figure legends of papers published in the life sciences.

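As a loading sketch only: the dataset is available through the `datasets` library. The `"MLM"` configuration name is an assumption inferred from the training notes below; check the dataset card for the exact configuration names.

```python
# Minimal sketch: load the BioLang dataset with the Hugging Face datasets library.
# The configuration name "MLM" is an assumption inferred from the training notes
# below; it may differ from the configurations actually exposed by the dataset.
from datasets import load_dataset

biolang = load_dataset("EMBO/biolang", "MLM")
print(biolang)
```
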
## Training procedure

The training was run on an NVIDIA DGX Station with 4x Tesla V100 GPUs.

Training code is available at https://github.com/source-data/soda-roberta (see the `TrainingArguments` sketch after the list below).

- Command: `python -m lm.train /data/json/oapmc_abstracts_figs/ MLM`
- Tokenizer vocab size: 50265
- Training data: bio_lang/MLM
- Training with: 12005390 examples
- Evaluating on: 36713 examples
- Epochs: 3.0
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- tensorboard run: lm-MLM-2021-01-27T15-17-43.113766

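For reference, the hyperparameters above map onto the Hugging Face `TrainingArguments` shown below. This is a minimal sketch for readers who want to reproduce a comparable setup, not the actual configuration used by `lm.train`; the output directory is a placeholder.

```python
# Minimal sketch: TrainingArguments mirroring the hyperparameters listed above.
# This is not the authors' training script (lm.train in soda-roberta);
# output_dir is a placeholder.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./bio-lm-mlm",        # placeholder output path
    num_train_epochs=3.0,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=5e-05,
    weight_decay=0.0,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-08,
    max_grad_norm=1.0,
)
```
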
End of training eval on validation set:

```
{'loss': 0.8653350830078125, 'learning_rate': 6.708070119323685e-08, 'epoch': 2.995975157928406}
{'eval_loss': 0.8192330598831177, 'eval_recall': 0.8154601116513597, 'epoch': 2.995975157928406}
```

## Eval results

Eval on test set:
`{'test_loss': 0.8240728974342346, 'test_recall': 0.814471959728645}`

### BibTeX entry and citation info

```bibtex
@inproceedings{...,
  year={2020}
}
```