Could you please disclose the full list of training data for embedding supervised finetuning?

#32
by kwang2049 - opened

Although there is a general mention of it in the paper:

"Dataset with annotated negatives: We have prepared retrieval datasets, such as MSMarco [Bajaj et al., 2016] and Natural Questions
(NQ) [Kwiatkowski et al., 2019], in addition to multiple non-retrieval datasets like the Natural Language Inference (NLI) dataset [Bowman et al.,2015]. "

Could you please disclose the full list of dataset names? This is very important for research that uses Jina or builds on it. Thanks in advance.

Jina AI org

Hi @kwang2049, yes, we used:

  1. SNLI data from SimCSE (https://github.com/princeton-nlp/SimCSE#training), with 1 hard negative plus random negatives.
  2. MS MARCO, NQ, Quora-QA, HotpotQA, and FEVER with mined hard negatives.
  3. CC News title-description pairs with random negatives (https://huggingface.co/datasets/cc_news); see the sketch after this list.
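
As a rough illustration of point 3, here is a minimal sketch of how title-description pairs with random negatives could be assembled from the `cc_news` dataset on the Hugging Face Hub. The field names (`title`, `description`), the split slice, and the sampling logic are assumptions for illustration, not the exact Jina pipeline.

```python
# Minimal sketch: build (anchor, positive, 15 negatives) rows from CC News
# title-description pairs with *random* negatives. Field names and the
# sampling strategy are assumptions; this is not the exact Jina pipeline.
import random
from datasets import load_dataset

NUM_NEGATIVES = 15  # 1 anchor + 1 positive + 15 negatives = 17 items per row

def build_ccnews_rows(num_rows=1000, seed=42):
    rng = random.Random(seed)
    # A small slice keeps the sketch cheap; drop the slice for the full corpus.
    ds = load_dataset("cc_news", split="train[:20000]")
    ds = ds.filter(lambda x: x["title"] and x["description"])

    rows, n = [], len(ds)
    for i in range(min(num_rows, n)):
        anchor, positive = ds[i]["title"], ds[i]["description"]
        # Random negatives: descriptions drawn from *other* articles.
        neg_ids = set()
        while len(neg_ids) < NUM_NEGATIVES:
            j = rng.randrange(n)
            if j != i:
                neg_ids.add(j)
        negatives = [ds[j]["description"] for j in neg_ids]
        rows.append([anchor, positive, *negatives])
    return rows
```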

Each row consists of 17 items: 1 anchor, 1 positive, and 15 negatives.
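
For context, here is a minimal sketch (my own PyTorch illustration, not the actual training code) of how such a 17-item row typically feeds an InfoNCE-style contrastive loss. `encode` is a placeholder for any sentence encoder that returns one embedding per input text, and the temperature value is an assumption.

```python
# Minimal sketch of an InfoNCE-style loss over one row of
# (1 anchor, 1 positive, 15 negatives); not the actual Jina training code.
import torch
import torch.nn.functional as F

def info_nce_loss(encode, row, temperature=0.05):
    anchor_text, positive_text, *negative_texts = row   # 1 + 1 + 15 items

    anchor = encode([anchor_text])                       # (1, dim)
    candidates = encode([positive_text] + negative_texts)  # (16, dim)

    # Cosine similarity between the anchor and each candidate.
    anchor = F.normalize(anchor, dim=-1)
    candidates = F.normalize(candidates, dim=-1)
    logits = anchor @ candidates.T / temperature         # (1, 16)

    # The positive always sits at index 0; negatives fill the other slots.
    labels = torch.zeros(1, dtype=torch.long)
    return F.cross_entropy(logits, labels)
```

Training would then iterate over rows like the ones above and minimize this loss, pulling the anchor toward its positive and away from the 15 negatives.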

Thanks❤️!

kwang2049 changed discussion status to closed
