BioBERT-NLI
This is BioBERT [1] fine-tuned on the SNLI and MultiNLI datasets using the sentence-transformers library to produce universal sentence embeddings [2].
The model uses the original BERT wordpiece vocabulary and was trained using the average pooling strategy and a softmax loss.
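For illustration, the snippet below is a minimal sketch of what that average pooling looks like at inference time with HuggingFace's transformers: it loads the checkpoint with AutoModel and mean-pools the token embeddings over non-padding positions. The Hub id gsarti/biobert-nli and the example sentences are assumptions, not part of the original setup.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed Hub id for this fine-tuned model; replace it if the repository name differs.
model_name = "gsarti/biobert-nli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

sentences = [
    "Coronaviruses are enveloped positive-sense RNA viruses.",
    "BioBERT was pre-trained on biomedical abstracts.",
]
encoded = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**encoded).last_hidden_state  # (batch, seq_len, hidden)

# Average pooling: mean of the token embeddings, ignoring padding positions.
mask = encoded["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
print(sentence_embeddings.shape)  # e.g. torch.Size([2, 768])
```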
Base model: monologg/biobert_v1.1_pubmed from HuggingFace's AutoModel.
Training time: ~6 hours on the NVIDIA Tesla P100 GPU provided in Kaggle Notebooks.
Parameters:
Parameter | Value |
---|---|
Batch size | 64 |
Training steps | 30000 |
Warmup steps | 1450 |
Lowercasing | False |
Max. Seq. Length | 128 |
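For reference, the hyperparameters above could be plugged into the sentence-transformers training API roughly as follows. This is a minimal sketch under stated assumptions, not the original training script: the two NLI examples are placeholders for the full SNLI and MultiNLI training sets, and the real run performs 30,000 steps.

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses, models

# Base checkpoint and pooling as described above: BioBERT weights + average pooling.
word_model = models.Transformer("monologg/biobert_v1.1_pubmed", max_seq_length=128)
pooling = models.Pooling(word_model.get_word_embedding_dimension(), pooling_mode_mean_tokens=True)
model = SentenceTransformer(modules=[word_model, pooling])

# Placeholder pairs; the actual run uses the full SNLI and MultiNLI training sets
# with labels 0 = contradiction, 1 = entailment, 2 = neutral.
train_examples = [
    InputExample(texts=["A man inspects a uniform.", "The man is sleeping."], label=0),
    InputExample(texts=["Two kids are playing soccer.", "Some children are playing a sport."], label=1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)

# Softmax classification loss over the three NLI labels, as in Sentence-BERT [3].
train_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)

# The actual run trains for 30,000 steps with 1,450 warmup steps (see the table above).
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=1450)
```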
Performance: The model was evaluated on the test portion of the STS benchmark using Spearman rank correlation, and compared with a general-domain BERT base model fine-tuned with the same procedure to verify that the two behave similarly.
Model | Score |
---|---|
biobert-nli (this) | 73.40 |
gsarti/scibert-nli | 74.50 |
bert-base-nli-mean-tokens [3] | 77.12 |
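A comparable evaluation can be sketched by correlating cosine similarities with the gold STS scores. The snippet below is an approximation under stated assumptions: it uses the stsb_multi_mt mirror of the STS benchmark test split and the hypothetical Hub id gsarti/biobert-nli, so the exact number may not match the table.

```python
from datasets import load_dataset
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

# Assumed Hub id for this model and an assumed Hub mirror of the STS benchmark test split.
model = SentenceTransformer("gsarti/biobert-nli")
sts_test = load_dataset("stsb_multi_mt", name="en", split="test")

emb1 = model.encode(sts_test["sentence1"], convert_to_tensor=True)
emb2 = model.encode(sts_test["sentence2"], convert_to_tensor=True)

# Cosine similarity of each sentence pair, compared to the gold scores via Spearman.
cosine_scores = util.cos_sim(emb1, emb2).diagonal().cpu().numpy()
spearman, _ = spearmanr(cosine_scores, sts_test["similarity_score"])
print(f"Spearman rank correlation: {spearman:.4f}")
```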
An example of similarity-based scientific paper retrieval with this model is provided in the Covid Papers Browser repository; a minimal standalone sketch is shown below.
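This sketch is independent of the Covid Papers Browser code: the abstracts, the query, and the Hub id gsarti/biobert-nli are assumptions used only to show the cosine-similarity ranking pattern.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed Hub id for this model; replace it if the repository name differs.
model = SentenceTransformer("gsarti/biobert-nli")

abstracts = [
    "We characterize the structure of the SARS-CoV-2 spike glycoprotein.",
    "A randomized controlled trial of an antiviral drug in hospitalized patients.",
    "Deep learning methods for protein structure prediction.",
]
query = "efficacy of antiviral treatments for COVID-19"

corpus_embeddings = model.encode(abstracts, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank the abstracts by cosine similarity to the query.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {abstracts[hit['corpus_id']]}")
```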
References:
[1] J. Lee et al., BioBERT: a pre-trained biomedical language representation model for biomedical text mining
[2] A. Conneau et al., Supervised Learning of Universal Sentence Representations from Natural Language Inference Data
[3] N. Reimers and I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks