CovidBERT-NLI
This is CovidBERT, the model trained by DeepSet on AllenAI's CORD-19 dataset of scientific articles about coronaviruses.
The model uses the original BERT wordpiece vocabulary and was subsequently fine-tuned on the SNLI and MultiNLI datasets with the sentence-transformers
library to produce universal sentence embeddings [1], using an average pooling strategy and a softmax loss.
Parameter details for the original training on CORD-19 are available on DeepSet's MLFlow
Base model: deepset/covid_bert_base from HuggingFace's AutoModel.
Training time: ~6 hours on the NVIDIA Tesla P100 GPU provided in Kaggle Notebooks.
Parameters:
| Parameter | Value |
|---|---|
| Batch size | 64 |
| Training steps | 23000 |
| Warmup steps | 1450 |
| Lowercasing | True |
| Max. Seq. Length | 128 |
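These hyperparameters map onto a sentence-transformers training setup roughly like the sketch below. This is not the exact training script: only the base model, batch size, warmup steps, lowercasing, and sequence length come from this card; the NLI examples and epoch count are illustrative placeholders.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses, InputExample

# Wrap the CovidBERT checkpoint with an average-pooling layer to obtain sentence embeddings.
word_embedding_model = models.Transformer(
    "deepset/covid_bert_base", max_seq_length=128, do_lower_case=True
)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Placeholder SNLI/MultiNLI pairs; labels are 0=contradiction, 1=entailment, 2=neutral.
train_examples = [
    InputExample(texts=["A doctor examines a scan.", "Nobody is looking at the scan."], label=0),
    InputExample(texts=["A doctor examines a scan.", "A person looks at an image."], label=1),
    InputExample(texts=["A doctor examines a scan.", "The doctor is in a hospital."], label=2),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)

# Softmax classification loss over the three NLI labels, as in Sentence-BERT.
train_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,          # illustrative; the card only reports ~23000 training steps
    warmup_steps=1450,
)
```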
Performances: The model was evaluated on the test portion of the STS benchmark using Spearman rank correlation and compared to similar models fine-tuned with the same procedure.
| Model | Score |
|---|---|
| covidbert-nli (this) | 67.52 |
| gsarti/biobert-nli | 73.40 |
| gsarti/scibert-nli | 74.50 |
| bert-base-nli-mean-tokens [2] | 77.12 |
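For reference, a comparable evaluation can be run with the library's built-in STS evaluator. The sketch below is an assumption-laden illustration: the model id gsarti/covidbert-nli is inferred from the sibling models in the table, and the sentence pairs are placeholders standing in for the STS benchmark test split.

```python
from sentence_transformers import SentenceTransformer, InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Model id assumed from the naming of the sibling models above.
model = SentenceTransformer("gsarti/covidbert-nli")

# Placeholder STS-style pairs; gold scores are normalised to [0, 1].
test_examples = [
    InputExample(texts=["A man is playing a guitar.", "A person plays an instrument."], label=0.8),
    InputExample(texts=["A dog runs in the park.", "Stock prices fell sharply today."], label=0.0),
]

evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_examples, name="sts-test")
# Reports the correlation between cosine similarities and gold scores
# (a single Spearman value in older library versions, a dict of metrics in recent ones).
print(evaluator(model))
```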
An example of using the model for similarity-based scientific paper retrieval is provided in the Covid-19 Semantic Browser repository.
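A minimal retrieval flow of that kind can be sketched as follows; the abstracts and query are placeholders, and the model id is again an assumption.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("gsarti/covidbert-nli")  # assumed model id

# Placeholder corpus of paper abstracts (the actual browser indexes CORD-19 abstracts).
abstracts = [
    "We characterise the SARS-CoV-2 spike protein and its receptor-binding domain.",
    "This study reviews non-pharmaceutical interventions during the 1918 influenza pandemic.",
    "We model the effect of school closures on respiratory virus transmission.",
]
corpus_embeddings = model.encode(abstracts, convert_to_tensor=True)

query = "How does the coronavirus spike protein bind to human cells?"
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank abstracts by cosine similarity to the query and print the top hits.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {abstracts[hit['corpus_id']]}")
```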
References:
[1] A. Conneau et al., Supervised Learning of Universal Sentence Representations from Natural Language Inference Data
[2] N. Reimers and I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks