Fine-tuned CovidBERT on Med-Marco Dataset for passage ranking

CovidBERT-MedNLI

This is the CovidBERT model, trained by DeepSet on AllenAI's CORD-19 dataset of scientific articles about coronaviruses.

The model uses the original BERT wordpiece vocabulary. It was subsequently fine-tuned on the SNLI and MultiNLI datasets using the sentence-transformers library to produce universal sentence embeddings [1], with an average pooling strategy and a softmax loss.
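To make the average pooling strategy concrete, here is a minimal sketch in plain Python. It only illustrates the idea (average the token vectors, skipping padding positions); it is not the sentence-transformers implementation, and the function name `mean_pool` and the toy vectors are assumptions made for this example.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1, ignoring padding positions."""
    if not token_embeddings:
        raise ValueError("at least one token vector is required")
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("one mask entry is required per token vector")

    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            kept += 1
            for i, value in enumerate(vector):
                summed[i] += value

    if kept == 0:
        raise ValueError("attention mask selects no tokens")
    return [total / kept for total in summed]


# Toy 2-dimensional "token embeddings" (made up): two real tokens, one padding slot.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
```

In a real pipeline the token vectors would come from the BERT encoder's last hidden layer and the mask from the tokenizer; the padding vector here is deliberately large to show that it does not affect the result.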

It is further fine-tuned on the Med-Marco dataset. MacAvaney et al., in their paper “SLEDGE-Z: A Zero-Shot Baseline for COVID-19 Literature Search”, used MedSyn, a lexicon of layperson and expert terminology for various medical conditions, to filter MS MARCO for medical questions. UMLS ontologies could be used instead, but the advantage of MedSyn is that its terms reflect everyday conversational language rather than terminology drawn from the scientific literature.
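The lexicon-based filtering can be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure from the paper: the function names, the whole-word matching rule, and the three toy terms are all hypothetical, and the terms are not taken from MedSyn.

```python
import re
from typing import Iterable, List


def normalize(text: str) -> str:
    """Lowercase the text and keep only alphanumeric tokens, joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def filter_medical_queries(queries: Iterable[str], lexicon: Iterable[str]) -> List[str]:
    """Keep the queries that contain at least one lexicon term as a whole word or phrase."""
    # Pad with spaces so that "flu" does not match inside "flue".
    padded_terms = [f" {normalize(term)} " for term in lexicon if normalize(term)]
    kept = []
    for query in queries:
        padded_query = f" {normalize(query)} "
        if any(term in padded_query for term in padded_terms):
            kept.append(query)
    return kept


# Hypothetical stand-ins for lexicon entries and search queries.
toy_lexicon = ["heart attack", "flu", "high blood pressure"]
toy_queries = [
    "what causes a heart attack",
    "how long does the flu last?",
    "how to clean a chimney flue",
    "best pizza in chicago",
]
medical_queries = filter_medical_queries(toy_queries, toy_lexicon)
```

Swapping in a different vocabulary, such as terms exported from UMLS, would only change the `lexicon` argument; the matching logic stays the same.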

Parameter details for the original training on CORD-19 are available on DeepSet's MLflow.

Base model: deepset/covid_bert_base, loaded through HuggingFace's AutoModel.
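A minimal sketch of loading the base model, assuming the `transformers` library is installed and the HuggingFace Hub is reachable. The helper name `load_base_model` is hypothetical, and this loads the base checkpoint named above, not the fine-tuned ranking model.

```python
MODEL_ID = "deepset/covid_bert_base"


def load_base_model(model_id: str = MODEL_ID):
    """Download (or read from the local cache) the tokenizer and encoder for model_id."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model
```

Calling `tokenizer, model = load_base_model()` triggers the download on first use; nothing is fetched when the function is merely defined.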