Can you tell me the details about negative sampling for the benchmark data?

#19
by lightapple - opened

Hi, I am really interested in your research on using LLMs for dense retrieval.

I am trying to replicate your training method without the GPT-generated synthetic data, but I am having trouble following your negative sampling method.
As you mentioned in your paper, the training sets of the following benchmark datasets were used during training:
ELI5, HotpotQA, FEVER, MIRACL, MS MARCO passage ranking and document ranking, NQ, NLI, SQuAD, TriviaQA, Quora Duplicate Questions, Mr. TyDi, DuReader, and T2Ranking

On the other hand, you wrote that for datasets without hard negatives, you used the top 100 results from mE5 for negative sampling instead.
I am wondering:

  1. Did you use 100 negative samples for the datasets without hard negatives?
  2. For which benchmark datasets did you use mE5 for negative sampling? My guess is all the datasets except MS MARCO, Mr. TyDi, and T2Ranking, but I am not certain.

Thanks for your great research and for sharing it.

Hi @lightapple ,

  1. Did you use 100 negative samples for the datasets without hard negatives?
    We use multilingual-e5-base to retrieve the top 100 candidates from that dataset's document pool, and then randomly sample 1 of these 100 candidates as the hard negative during training (see the sketch after this list).

  2. For which benchmark datasets did you use mE5 for negative sampling?
    If I remember correctly, NLI / Mr. TyDi / DuReader / T2Ranking / MIRACL already provide hard negatives, so there was no need to mine them ourselves. For MS MARCO passage ranking and NQ, we re-use the hard negatives from the SimLM paper. For the remaining datasets, we use multilingual-e5-base to mine hard negatives.
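
Here is a minimal sketch of that mining-and-sampling setup, assuming a sentence-transformers environment. The toy data, variable names, and the step that skips the gold passage are illustrative assumptions, not our exact pipeline:

```python
import random

import torch
from sentence_transformers import SentenceTransformer

# Toy stand-ins for a dataset without provided hard negatives;
# replace with the actual queries and document pool.
raw_queries = ["what is dense retrieval?"]
document_pool = [f"passage text {i}" for i in range(200)]
gold_passage_idx = [0]  # index of the positive passage for each query

model = SentenceTransformer("intfloat/multilingual-e5-base")

# mE5 models expect "query: " / "passage: " input prefixes.
q_emb = model.encode(["query: " + q for q in raw_queries],
                     normalize_embeddings=True, convert_to_tensor=True)
p_emb = model.encode(["passage: " + p for p in document_pool],
                     normalize_embeddings=True, convert_to_tensor=True)

# Cosine similarity reduces to a dot product on normalized embeddings;
# keep the top 100 candidates per query.
top100 = torch.topk(q_emb @ p_emb.T, k=100, dim=1).indices

def sample_hard_negative(query_idx: int) -> int:
    """Randomly pick 1 of the query's 100 candidates (skipping the gold
    passage here is an assumption, not something stated above)."""
    candidates = [i for i in top100[query_idx].tolist()
                  if i != gold_passage_idx[query_idx]]
    return random.choice(candidates)

# Call once per query per training step so each step sees a fresh negative.
neg_idx = sample_hard_negative(0)
```

Note that the top-100 retrieval runs once offline; only the 1-of-100 sampling happens inside the training loop.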

Thanks! Your kind answer helps me a lot!
