CovidBERT-NLI
This is the CovidBERT model trained by DeepSet on AllenAI's CORD-19 dataset of scientific articles about coronaviruses.
The model uses the original BERT wordpiece vocabulary and was subsequently fine-tuned on the SNLI and MultiNLI datasets using the sentence-transformers
library to produce universal sentence embeddings [1] with an average pooling strategy and a softmax loss.
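A minimal encoding sketch with the sentence-transformers library is shown below. The Hub id gsarti/covidbert-nli is an assumption based on the naming of the similar models listed further down; substitute the actual model path if it differs.

```python
from sentence_transformers import SentenceTransformer

# Assumed Hub id -- replace with the actual path of this model if different.
model = SentenceTransformer("gsarti/covidbert-nli")

sentences = [
    "Coronaviruses are enveloped positive-sense RNA viruses.",
    "SARS-CoV-2 binds the ACE2 receptor to enter host cells.",
]

# encode() returns one fixed-size vector per sentence, obtained by
# average pooling over the BERT token embeddings.
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 768) for a BERT-base encoder
```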
Parameter details for the original training on CORD-19 are available on DeepSet's MLFlow.
Base model: deepset/covid_bert_base from HuggingFace's AutoModel (see the loading sketch below).
Training time: ~6 hours on the NVIDIA Tesla P100 GPU provided in Kaggle Notebooks.
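For reference, the base CORD-19 checkpoint can also be loaded on its own with HuggingFace's transformers library. This minimal sketch loads only the raw encoder, not the NLI-fine-tuned sentence-embedding model described in this card.

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepset/covid_bert_base")
model = AutoModel.from_pretrained("deepset/covid_bert_base")

inputs = tokenizer("Coronaviruses are enveloped RNA viruses.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```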
Parameters (a fine-tuning sketch using these values follows the table):
Parameter | Value |
---|---|
Batch size | 64 |
Training steps | 23000 |
Warmup steps | 1450 |
Lowercasing | True |
Max. Seq. Length | 128 |
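The fine-tuning script itself is not included in this card. As a rough reconstruction under the hyperparameters above, the sketch below wires a mean-pooling sentence encoder on top of the CORD-19 checkpoint and trains it with the softmax classification loss of sentence-transformers; the NLI examples are toy placeholders for the SNLI/MultiNLI data, and the original script may have differed in details.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses, InputExample

# BERT encoder + average pooling, as described above.
word_embedding_model = models.Transformer("deepset/covid_bert_base", max_seq_length=128)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Toy stand-in for SNLI/MultiNLI pairs; labels are integer NLI classes
# (e.g. 0 = contradiction, 1 = entailment, 2 = neutral).
train_examples = [
    InputExample(texts=["A man inspects a uniform.", "The man is sleeping."], label=0),
    InputExample(texts=["Men are playing soccer.", "Some men are playing a sport."], label=1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)

# Softmax loss over the concatenated sentence embeddings, as in [1].
train_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=1450,
)
```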
Performances: The model was evaluated on the test portion of the STS benchmark dataset using Spearman rank correlation and compared to similar models trained with the same procedure; a reproduction sketch follows the table.
Model | Score |
---|---|
covidbert-nli (this) | 67.52 |
gsarti/biobert-nli | 73.40 |
gsarti/scibert-nli | 74.50 |
bert-base-nli-mean-tokens [2] | 77.12 |
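Under assumptions about the exact data split and preprocessing, this evaluation can be approximated with the built-in EmbeddingSimilarityEvaluator of sentence-transformers, which reports the Spearman rank correlation between the cosine similarities of the embeddings and the gold scores. The sentence pairs below are toy placeholders rather than the STS data, and the Hub id is again assumed.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("gsarti/covidbert-nli")  # assumed Hub id

# Toy stand-in for the STS test split: sentence pairs with gold similarity
# scores normalised to [0, 1].
sentences1 = ["A man is playing a guitar.", "A virus is infecting a cell."]
sentences2 = ["Someone is playing an instrument.", "A cat is sitting on a mat."]
gold_scores = [0.9, 0.05]

evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, gold_scores)
result = evaluator(model)  # Spearman correlation (a dict of metrics in newer library versions)
print(result)
```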
An example usage for similarity-based scientific paper retrieval is provided in the Covid-19 Semantic Browser repository.
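As a rough illustration of that use case, the sketch below ranks a handful of placeholder abstracts against a query by cosine similarity of their embeddings, using the semantic_search utility of sentence-transformers; the Hub id is again assumed.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("gsarti/covidbert-nli")  # assumed Hub id

abstracts = [
    "We characterise the SARS-CoV-2 spike protein and its affinity for the ACE2 receptor.",
    "This study analyses the economic impact of lockdown policies.",
    "A review of antiviral drug candidates against coronaviruses.",
]
query = "How does the coronavirus spike protein bind to host cell receptors?"

corpus_embeddings = model.encode(abstracts, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank abstracts by cosine similarity to the query.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {abstracts[hit['corpus_id']]}")
```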
References:
[1] A. Conneau et al., Supervised Learning of Universal Sentence Representations from Natural Language Inference Data
[2] N. Reimers and I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks