Dataset publication & questions

#3 opened by tomaarsen

Hello!

I have a few questions:

  1. Do you have plans to release the 500k synthetic pairs generated by GPT-3.5/4 (presumably they're pairs?) for supervised finetuning?
  2. Do you apply these 500k pairs on top of the ~1.6M pairs that were already being used for mE5-small/base/large? I.e. you now finetune on roughly 2.1M pairs?
  3. What is the cross-encoder that you use in knowledge distillation in the Supervised Finetuning phase? Also, is this directly used to help train the model, or used to filter the training data to improve the training data quality?
  4. I see that you're using a batch size of 512 for fine-tuning. Is this still InfoNCE with in-batch negatives? If so, have you experimented with using a larger batch size via https://arxiv.org/pdf/2101.06983.pdf? For example, https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/CachedMultipleNegativesRankingLoss.py gives the benefit of in-batch negatives while only embedding small mini-batches at a time, so you could easily use a 64k batch size if you really wanted to (see the usage sketch after this post).
  • Tom Aarsen
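
For readers who want to try the cached loss referenced in question 4, below is a minimal sketch using the sentence-transformers `fit` API. The checkpoint name, example pairs, and batch sizes are placeholders for illustration, not the mE5 training configuration.

```python
# Minimal sketch of CachedMultipleNegativesRankingLoss (GradCache-style caching):
# a large *effective* contrastive batch, embedded in small mini-batches.
# Checkpoint, data, and batch sizes below are illustrative placeholders.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("intfloat/multilingual-e5-base")

train_examples = [
    InputExample(texts=["query: how do I cook rice?", "passage: Rinse the rice, add water, ..."]),
    InputExample(texts=["query: capital of France", "passage: Paris is the capital of France."]),
    # ... many more (query, positive passage) pairs
]

# batch_size is the effective contrastive batch (all other positives act as in-batch
# negatives); mini_batch_size bounds how many texts are embedded per forward pass,
# so GPU memory stays roughly constant even for very large effective batches.
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=4096)
train_loss = losses.CachedMultipleNegativesRankingLoss(model, mini_batch_size=32)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```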

Thanks for the questions!

  1. Do you have plans to release the 500k synthetic pairs generated by GPT-3.5/4 (presumably they're pairs?) for supervised finetuning?

Unfortunately, we do not have plans to release them for now.

  2. Do you apply these 500k pairs on top of the ~1.6M pairs that were already being used for mE5-small/base/large? I.e. you now finetune on roughly 2.1M pairs?

Not exactly. For multilingual-e5-large-instruct, we use the new data mixture from the E5-mistral paper, which has roughly 1.8M pairs. The main difference is that we sub-sample the MS-MARCO passage ranking data and add the T2Ranking and synthetic data.

  3. What is the cross-encoder that you use in knowledge distillation in the Supervised Finetuning phase? Also, is this directly used to help train the model, or used to filter the training data to improve the training data quality?

It is similar to the English E5 paper, except that we train a cross-encoder initialized from xlm-roberta-large. This trained cross-encoder serves as the teacher for KL-divergence distillation: briefly, we compute the KL divergence between the cross-encoder teacher's score distribution and the bi-encoder student's. This leads to better results and a simpler procedure than filtering the training data with a cross-encoder.
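
To make the distillation step concrete, here is a hedged PyTorch sketch of a KL-divergence objective between a cross-encoder teacher and a bi-encoder student; the tensor shapes, temperature, and score computation are assumptions for illustration, not the exact training code.

```python
# Sketch of KL-divergence distillation from a cross-encoder teacher to a
# bi-encoder student. Shapes and the temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def kd_kl_loss(student_scores: torch.Tensor,
               teacher_scores: torch.Tensor,
               temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over each query's candidate passages.

    student_scores: (batch, n_candidates) bi-encoder similarities, e.g. cosine
                    similarity between the query embedding and each candidate.
    teacher_scores: (batch, n_candidates) cross-encoder relevance scores for the
                    same candidate lists, typically computed offline.
    """
    student_log_probs = F.log_softmax(student_scores / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_scores / temperature, dim=-1)
    # F.kl_div expects log-probabilities as input and probabilities as target;
    # "batchmean" matches the mathematical definition of KL divergence.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# Example: 2 queries, each scored against 1 positive + 3 hard negatives.
student = torch.randn(2, 4, requires_grad=True)
teacher = torch.randn(2, 4)
kd_kl_loss(student, teacher).backward()
```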

  4. I see that you're using a batch size of 512 for fine-tuning. Is this still InfoNCE with in-batch negatives? If so, have you experimented with using a larger batch size via https://arxiv.org/pdf/2101.06983.pdf?

Yes, we use the InfoNCE loss with both in-batch negatives and mined hard negatives. Thanks for the pointer to the GradCache technique!
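
For completeness, here is a minimal PyTorch sketch of an InfoNCE loss that combines in-batch negatives with mined hard negatives; the temperature, tensor shapes, and normalization are illustrative assumptions, not the released training code.

```python
# Sketch of InfoNCE with in-batch negatives plus per-query mined hard negatives.
# Temperature and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def infonce_with_hard_negatives(q: torch.Tensor,        # (B, d) query embeddings
                                pos: torch.Tensor,      # (B, d) positive passage embeddings
                                hard_neg: torch.Tensor, # (B, K, d) mined hard negatives
                                temperature: float = 0.05) -> torch.Tensor:
    q, pos, hard_neg = (F.normalize(t, dim=-1) for t in (q, pos, hard_neg))

    # In-batch scores: every other query's positive serves as a negative.
    in_batch = q @ pos.t()                          # (B, B); diagonal entries are the positives
    # Scores against each query's own mined hard negatives.
    hard = torch.einsum("bd,bkd->bk", q, hard_neg)  # (B, K)

    logits = torch.cat([in_batch, hard], dim=1) / temperature  # (B, B + K)
    labels = torch.arange(q.size(0), device=q.device)          # positive for query i is column i
    return F.cross_entropy(logits, labels)

# Example: batch of 8, embedding dim 32, 3 hard negatives per query.
B, d, K = 8, 32, 3
loss = infonce_with_hard_negatives(torch.randn(B, d), torch.randn(B, d), torch.randn(B, K, d))
```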

Hi @intfloat, would it be possible to release the code or prompts used to generate the dataset? I understand not wanting to release data generated from GPT-4, but it would be great for the community to be able to reproduce the dataset. I have tried with the prompts in the [Wang 2023] paper, but the paper omits a lot of details, so it's difficult to reproduce. Some code or prompts in the unilm repo would go a long way toward helping with this.
