general vs specific dataset

#1
by nickprock - opened

Hi @pritamdeka,
What is the advantage of training on MS-MARCO, which contains general-domain text, rather than on a dataset of scientific texts?
Thanks in advance.

Nico

Hi @nickprock, thanks for your interest in the model. To answer your question: training on a domain-specific scientific dataset would result in better embeddings. However, to date there is no scientific text dataset comparable to MS-MARCO, which is the dataset predominantly used for training information retrieval models. That is why I trained this model on MS-MARCO. If a similar dataset for scientific or medical information retrieval becomes available, this model can be trained further on that data to get better embeddings. Research has shown that when only a small amount of in-domain data is available, intermediate training works better: the model is first trained on a large related dataset and then fine-tuned on the smaller target dataset. So in this case, since the model has already learnt from MS-MARCO, it can be trained further to improve performance. I have experimented with the SCIFACT dataset myself and have published those results. In subsequent research, I found that the approach also works with NLI datasets. Below is a link to one paper on intermediate training. Feel free to reach out if you have any questions, always happy to help.

https://arxiv.org/abs/1811.01088
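
The "train further on in-domain data" step described above could look roughly like the sketch below, written against the sentence-transformers library. It is only an illustration, not the exact recipe used for this model: the record fields `query` and `passage`, the helper names, and the choice of `MultipleNegativesRankingLoss` are all assumptions, and `model_name` stands for whichever MS-MARCO-trained checkpoint you start from.

```python
from typing import Dict, List, Tuple


def to_pairs(records: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Keep (query, relevant passage) pairs, skipping incomplete records.

    The field names "query" and "passage" are hypothetical.
    """
    return [
        (r["query"].strip(), r["passage"].strip())
        for r in records
        if r.get("query") and r.get("passage")
    ]


def continue_training(model_name: str, pairs: List[Tuple[str, str]],
                      epochs: int = 1, batch_size: int = 16):
    """Fine-tune an already trained sentence encoder on in-domain pairs."""
    # Imported here so that to_pairs() can be used without these libraries.
    from sentence_transformers import InputExample, SentenceTransformer, losses
    from torch.utils.data import DataLoader

    # Start from the intermediate checkpoint (e.g. one trained on MS-MARCO).
    model = SentenceTransformer(model_name)
    examples = [InputExample(texts=[query, passage]) for query, passage in pairs]
    loader = DataLoader(examples, shuffle=True, batch_size=batch_size)
    # Assumed loss: treats the other passages in a batch as negatives.
    loss = losses.MultipleNegativesRankingLoss(model)
    model.fit(train_objectives=[(loader, loss)], epochs=epochs, warmup_steps=10)
    return model


# Made-up records, only to show the expected input shape.
records = [
    {"query": " does aspirin reduce fever? ", "passage": "Aspirin is an antipyretic."},
    {"query": "", "passage": "A passage with no query is skipped."},
]
pairs = to_pairs(records)
```

Calling `continue_training("<your-checkpoint>", pairs)` would then run the second training stage; with a small dataset, a low epoch count is a reasonable starting point.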

Thank you for your reply, @pritamdeka.
I am doing some experiments with sentence encoders; I fine-tuned one for Italian on the STSB dataset. Now I would like to try microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext, but with data in the QQP_triplets format.
Your answer was very helpful.
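
For readers unfamiliar with triplet-style data, here is a minimal sketch of how such records can be flattened into (anchor, positive, negative) training triplets. It assumes each record holds one query, a list of positives, and a list of negatives under the keys `query`, `pos`, and `neg`; that layout and the sample record are assumptions for illustration, so check the actual dataset card before relying on them.

```python
from itertools import product
from typing import Dict, Iterator, List, Tuple, Union

Record = Dict[str, Union[str, List[str]]]


def expand_triplets(record: Record) -> Iterator[Tuple[str, str, str]]:
    """Yield one (anchor, positive, negative) triplet per pos/neg combination."""
    query = record["query"]
    for pos, neg in product(record["pos"], record["neg"]):
        yield (query, pos, neg)


# Made-up record, only to show the assumed layout.
sample = {
    "query": "What causes type 2 diabetes?",
    "pos": ["What are the causes of type 2 diabetes?"],
    "neg": [
        "What is the treatment for type 1 diabetes?",
        "How is blood pressure measured?",
    ],
}

triplets = list(expand_triplets(sample))
for anchor, positive, negative in triplets:
    print(anchor, "|", positive, "|", negative)
```

Each triplet can then be fed to a triplet-style loss, which pulls the anchor towards the positive and pushes it away from the negative.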

That's great, @nickprock. Do not hesitate to contact me if you have any questions, and do let me know how your experiments turn out. Thanks.

nickprock changed discussion status to closed
