Model description

This model is a pre-trained RoBERTa base model that was further trained with a masked language modeling task on the BioLang dataset, a compendium of English scientific text from the life sciences.

Intended uses & limitations

How to use

The intended use of this model is to be fine-tuned for downstream tasks, token classification in particular.

To have a quick check of the model as-is in a fill-mask task:

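from transformers import pipeline, RobertaTokenizerFast
# The model must be paired with the roberta-base tokenizer.
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', model_max_length=512)
text = "Let us try this model to see if it <mask>."
# Build a fill-mask pipeline around the EMBO/bio-lm checkpoint and print its top predictions.
fill_mask = pipeline("fill-mask", model="EMBO/bio-lm", tokenizer=tokenizer)
print(fill_mask(text))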

Limitations and bias

This model should be fine-tuned on a specific task such as token classification. It must be used with the roberta-base tokenizer.
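One practical step when fine-tuning for token classification is aligning word-level labels with the subword tokens produced by the roberta-base tokenizer. The sketch below is a common convention, not necessarily the procedure used by the authors: only the first subword of each word keeps its label, and every other position receives -100, the default ignore index of PyTorch's cross-entropy loss. The function name `align_labels` and the example inputs are hypothetical; in practice `word_ids` would come from the `word_ids()` method of a fast tokenizer's encoding.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # positions with this label are skipped by PyTorch's cross-entropy loss


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Expand word-level labels to subword-token labels.

    word_ids maps each token to the index of the word it came from,
    with None for special tokens such as <s> and </s>.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special token: never contributes to the loss.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First subword of a word: keep the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Continuation subword: ignored in the loss.
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# Hypothetical example: three words, the second one split into two subwords.
example_word_ids = [None, 0, 1, 1, 2, None]
example_word_labels = [0, 1, 0]
print(align_labels(example_word_ids, example_word_labels))
# [-100, 0, 1, -100, 0, -100]
```

Labeling only the first subword keeps the number of scored positions equal to the number of words, which makes per-word evaluation metrics straightforward.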

Training data

The model was trained with a masked language modeling task on the BioLang dataset, which includes 12 million examples from abstracts and figure legends extracted from papers published in the life sciences.
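For readers unfamiliar with masked language modeling, the following is a minimal pure-Python sketch of how inputs and labels are typically prepared. The masking rate of 15% and the 80/10/10 split (mask token, random token, unchanged) are assumptions taken from the usual BERT/RoBERTa convention, since this card does not state the values actually used. The function `mask_tokens` and the example token ids are hypothetical; the vocabulary size of 50265 is the one reported for this model.

```python
import random
from typing import List, Optional, Set, Tuple

IGNORE_INDEX = -100  # label for positions that are not predicted


def mask_tokens(
    token_ids: List[int],
    mask_id: int,
    vocab_size: int,
    special_ids: Set[int],
    mlm_probability: float = 0.15,  # assumption: conventional masking rate
    rng: Optional[random.Random] = None,
) -> Tuple[List[int], List[int]]:
    """Return (corrupted inputs, labels) for one sequence."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    for i, tok in enumerate(token_ids):
        # Special tokens are never selected for prediction.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return inputs, labels


# Hypothetical sequence; 0 and 2 stand in for the start and end special tokens,
# and the mask id is assumed to be the last index of the vocabulary.
tokens = [0, 713, 16, 10, 1296, 3645, 4, 2]
inputs, labels = mask_tokens(
    tokens, mask_id=50264, vocab_size=50265, special_ids={0, 2}, rng=random.Random(0)
)
print(inputs)
print(labels)
```

The loss is computed only at positions whose label is not -100, so the model is scored on the selected tokens alone.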

Training procedure

The training was run on an NVIDIA DGX Station with 4x Tesla V100 GPUs.

Training code is available at https://github.com/source-data/soda-roberta

  • Command: python -m lm.train /data/json/oapmc_abstracts_figs/ MLM
  • Tokenizer vocab size: 50265
  • Training data: EMBO/biolang MLM
  • Training with: 12005390 examples
  • Evaluating on: 36713 examples
  • Epochs: 3.0
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • tensorboard run: lm-MLM-2021-01-27T15-17-43.113766
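As a rough sanity check on the settings above, the number of optimizer steps can be estimated from the example count and batch size. This is only a back-of-the-envelope sketch: it assumes all four GPUs were used in data parallel and that no gradient accumulation was applied, neither of which is stated in this card.

```python
import math

train_examples = 12_005_390   # "Training with" value listed above
per_device_batch = 16         # per_device_train_batch_size
epochs = 3
num_gpus = 4                  # assumption: all four V100s used in data parallel
grad_accum_steps = 1          # assumption: no gradient accumulation

# Examples consumed per optimizer step across all devices.
effective_batch = per_device_batch * num_gpus * grad_accum_steps
# The last, partial batch of each epoch still counts as a step.
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)
# 64 187585 562755
```

Under these assumptions the run corresponds to roughly 560,000 optimizer steps; a different device count or accumulation setting would scale that figure accordingly.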

End of training:

trainset: 'loss': 0.8653350830078125
validation set: 'eval_loss': 0.8192330598831177, 'eval_recall': 0.8154601116513597
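The reported losses can be easier to interpret as perplexities. Assuming they are mean cross-entropy values in nats over the masked tokens (the usual convention, though not confirmed in this card), the perplexity is simply the exponential of the loss:

```python
import math

train_loss = 0.8653350830078125   # trainset 'loss' reported above
eval_loss = 0.8192330598831177    # validation 'eval_loss' reported above

# Perplexity is exp(cross-entropy) when the loss is measured in nats.
train_ppl = math.exp(train_loss)
eval_ppl = math.exp(eval_loss)

print(f"train perplexity: {train_ppl:.3f}")
print(f"validation perplexity: {eval_ppl:.3f}")
```

This gives approximately 2.38 on the training set and 2.27 on the validation set. For a masked language model these are perplexities over masked positions only, so they are not comparable with perplexities of left-to-right language models.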

Eval results

Eval on test set:

recall: 0.814471959728645

Dataset used to train EMBO/bio-lm