Sample ratio for training data

#25
by coverwhole - opened

Hi,
I was very impressed by your work and I'd like to reproduce it, starting from curating the dataset. I read that the training data is composed of various datasets, and that you give a sample ratio for some of the "sub-datasets", e.g. Quora Duplicate Questions (sample ratio 0.1).
I wonder what "sample ratio" means. I was confused because the sample ratios do not sum to 1 and some datasets have no explicit sample ratio.
How should I interpret this so that I can reproduce it?

Thank you

By Quora Duplicate Questions (sample ratio 0.1), we mean that we sample 10% of its training data. For datasets without an explicit sample ratio, we simply add all of the training data without sampling.

Hope that clarifies our setting.
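
For anyone reproducing this, the rule above can be sketched roughly as follows. This is only a minimal illustration, not the authors' actual pipeline: the function name `build_training_mix`, the dataset names, and the toy sizes are hypothetical, and it assumes the subset is drawn once, uniformly at random and without replacement, which the thread does not specify.

```python
import random


def build_training_mix(datasets, sample_ratios, seed=0):
    """Combine datasets, subsampling only those with an explicit ratio.

    Assumption: sampling is uniform, without replacement, and done once.
    """
    rng = random.Random(seed)
    mix = []
    for name, examples in datasets.items():
        # No ratio listed for this dataset: keep all of its training data.
        ratio = sample_ratios.get(name, 1.0)
        if ratio < 1.0:
            k = round(len(examples) * ratio)
            examples = rng.sample(examples, k)
        mix.extend(examples)
    rng.shuffle(mix)
    return mix


# Toy stand-ins for the real corpora (names and sizes are made up).
datasets = {
    "quora_duplicates": [("quora", i) for i in range(1000)],
    "other_dataset": [("other", i) for i in range(200)],
}
sample_ratios = {"quora_duplicates": 0.1}

mix = build_training_mix(datasets, sample_ratios)
print(len(mix))
```

Because each ratio applies to its own dataset rather than to the final mixture, the ratios are independent of each other and have no reason to sum to 1.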

Thank you for your kind response @intfloat !
At first glance, it seems natural to use all of the training data to get better performance, so it is interesting that a sampled subset of some datasets is enough.
Is there any insight that led you to use only a subset of the training data for those datasets?

The main reason is that the evaluation setting is (mostly) zero-shot. Many evaluation tasks are not seen during training, so we do not want the model to overfit to any particular training dataset.

Similar observations appear in the paper "Large Dual Encoders Are Generalizable Retrievers": using 10% of the labeled data leads to a better model on the BEIR zero-shot retrieval benchmark.

coverwhole changed discussion status to closed
