pritamdeka/S-PubMedBert-MS-MARCO · general vs specific dataset

May 15, 2023

Hi @pritamdeka,
What is the advantage of using MS-MARCO, which contains general-domain text, rather than a dataset of scientific text?
Thanks in advance.

Nico

pritamdeka

Owner May 15, 2023

Hi @nickprock, thanks for your interest in the model. To answer your question: training a model on domain-specific scientific text would result in better embeddings. However, to date there is no scientific text dataset comparable to MS-MARCO, which is the dataset predominantly used for training information retrieval models. That is why I trained this model on MS-MARCO. If a similar dataset for scientific or medical information retrieval becomes available, this model can be trained further on that data to get better embeddings. Research has shown that when little in-domain data is available, intermediate training works better. So, since the model has already learnt from the MS-MARCO dataset, it can be further trained on a smaller in-domain dataset to enhance performance. I have experimented with the SCIFACT dataset myself and have published those results. In subsequent research of mine, I have found that this approach works with NLI datasets as well. Below is a link to one research work on intermediate training. Also, feel free to reach out if you have any questions; always happy to help.

https://arxiv.org/abs/1811.01088
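
The further-training step described in the reply above could be sketched roughly as follows. This is only an illustration, not the procedure used to train the published model: it assumes the `sentence-transformers` library with its `fit` API, in-domain data in the form of (query, relevant passage) pairs, and `MultipleNegativesRankingLoss` as the objective. The helper names `clean_pairs` and `continue_training`, the output directory, and the hyperparameters are all hypothetical.

```python
from typing import Iterable, List, Tuple

# Model id taken from the title of this discussion.
BASE_MODEL = "pritamdeka/S-PubMedBert-MS-MARCO"


def clean_pairs(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Strip whitespace and drop pairs with an empty query or passage."""
    cleaned = []
    for query, passage in pairs:
        query, passage = query.strip(), passage.strip()
        if query and passage:
            cleaned.append((query, passage))
    return cleaned


def continue_training(pairs, output_dir="s-pubmedbert-msmarco-domain",
                      epochs=1, batch_size=16):
    """Further fine-tune the MS-MARCO checkpoint on in-domain pairs.

    Assumption: `pairs` holds (query, relevant passage) tuples.
    """
    # Heavy dependencies are imported here so clean_pairs works without them.
    from sentence_transformers import InputExample, SentenceTransformer, losses
    from torch.utils.data import DataLoader

    model = SentenceTransformer(BASE_MODEL)
    examples = [InputExample(texts=[q, p]) for q, p in clean_pairs(pairs)]
    loader = DataLoader(examples, shuffle=True, batch_size=batch_size)
    # In-batch negatives: the other passages in a batch act as negatives.
    loss = losses.MultipleNegativesRankingLoss(model)
    model.fit(
        train_objectives=[(loader, loss)],
        epochs=epochs,
        warmup_steps=int(0.1 * len(loader) * epochs),
        output_path=output_dir,
    )
    return model
```

Calling `continue_training(my_pairs)` with a list of in-domain pairs would download the checkpoint and start training; the starting point is the model that has already seen MS-MARCO, which is the intermediate-training idea in the reply.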

nickprock

May 16, 2023

Thank you for your reply, @pritamdeka.
I am doing some experiments with sentence encoders; I fine-tuned one for Italian on the STSB dataset. Now I would like to try microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext, but using the QQP_triplets format.
Your answer was very helpful.
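
One way the triplet experiment mentioned above might look, as a rough sketch only. It assumes each record follows a QQP_triplets-like layout with a `query` string plus `pos` and `neg` lists (optionally wrapped in a `set` field), and it uses the `sentence-transformers` `fit` API with `MultipleNegativesRankingLoss`. The helper names `flatten_triplets` and `train_on_triplets` and all hyperparameters are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Triplet = Tuple[str, str, str]

# Base model id as named in the post above.
BASE_MODEL = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"


def flatten_triplets(records: Iterable[Dict]) -> List[Triplet]:
    """Expand query/pos/neg records into (anchor, positive, negative) triplets."""
    triplets = []
    for record in records:
        # Accept rows with or without the "set" wrapper (assumed layout).
        item = record.get("set", record)
        for pos in item["pos"]:
            for neg in item["neg"]:
                triplets.append((item["query"], pos, neg))
    return triplets


def train_on_triplets(triplets, epochs=1, batch_size=16):
    """Train a sentence encoder on top of PubMedBERT from triplets."""
    # Heavy dependencies are imported here so flatten_triplets works without them.
    from sentence_transformers import (InputExample, SentenceTransformer,
                                       losses, models)
    from torch.utils.data import DataLoader

    word_model = models.Transformer(BASE_MODEL, max_seq_length=128)
    # Pooling defaults to mean pooling over token embeddings.
    pooling = models.Pooling(word_model.get_word_embedding_dimension())
    model = SentenceTransformer(modules=[word_model, pooling])

    examples = [InputExample(texts=[a, p, n]) for a, p, n in triplets]
    loader = DataLoader(examples, shuffle=True, batch_size=batch_size)
    # With three texts per example, the third is used as a hard negative.
    loss = losses.MultipleNegativesRankingLoss(model)
    model.fit(train_objectives=[(loader, loss)], epochs=epochs, warmup_steps=100)
    return model
```

`flatten_triplets` emits one triplet per (positive, negative) combination, so a record with many negatives contributes many training examples; capping the number per query may be worth considering for large datasets.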

pritamdeka

Owner May 16, 2023

That's great, @nickprock. Do not hesitate to get in touch if you have any questions, and do let me know how your experiments turn out. Thanks.

nickprock changed discussion status to closed May 17, 2023