Sample ratio for training data

#25

by coverwhole - opened Jan 30, 2024

Jan 30, 2024

Hi,
I was very impressed by your work and I'd like to reproduce it, starting from dataset curation. I read that the training data is composed of various datasets, and that you give a sample ratio for some of the sub-datasets, e.g. Quora Duplicate Questions (sample ratio 0.1).
I wonder what "sample ratio" means. I was somewhat confused because the sample ratios do not sum to 1 and some datasets have no explicit sample ratio.
How should I interpret this and reproduce it?

Thank you

intfloat

Owner Jan 30, 2024

By Quora Duplicate Questions (sample ratio 0.1), we mean that we sample 10% of its training data. For datasets without an explicit sample ratio, we simply add all the training data without sampling.
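To make the rule concrete, here is a minimal sketch of that mixing step. It is not the actual data pipeline: the function name `build_training_mix`, the toy dataset names and sizes, uniform sampling without replacement, and the fixed seed are all assumptions made for illustration. Each ratio is a per-dataset fraction rather than a mixture weight, which is why the ratios need not sum to 1.

```python
import random

def build_training_mix(datasets, sample_ratios, seed=0):
    """Combine datasets, subsampling only those with an explicit ratio.

    Hypothetical helper: uniform sampling without replacement is an
    assumption, since the thread does not say how the 10% is drawn.
    """
    rng = random.Random(seed)
    mixed = []
    for name, examples in datasets.items():
        # No explicit ratio means the whole training set is kept.
        ratio = sample_ratios.get(name, 1.0)
        if ratio < 1.0:
            k = round(len(examples) * ratio)
            examples = rng.sample(examples, k)
        mixed.extend(examples)
    return mixed

# Toy stand-ins for the real datasets (hypothetical names and sizes).
datasets = {
    "quora_duplicates": [f"quora-{i}" for i in range(1000)],
    "other_dataset": [f"other-{i}" for i in range(200)],
}
sample_ratios = {"quora_duplicates": 0.1}

train = build_training_mix(datasets, sample_ratios)
print(len(train))  # 100 sampled Quora examples plus all 200 from the other dataset
```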

Hope that clarifies our setting.

coverwhole

Jan 30, 2024

Thank you for your kind response, @intfloat!
At first glance, it seems natural to use all the training data to get better performance, so it is interesting that a sampled subset of some datasets is enough.
Is there any insight that led you to use only a subset of the training data for those datasets?

intfloat

Owner Jan 30, 2024

The main reason is that the evaluation setting is (mostly) zero-shot. Many evaluation tasks are not seen during training, so we do not want the model to overfit to any particular training dataset.

Similar observations appear in the paper "Large Dual Encoders Are Generalizable Retrievers": using 10% of the labeled data leads to a better model on the BEIR zero-shot retrieval benchmark.

coverwhole changed discussion status to closed Jan 31, 2024
