Hi, dear community,
My team and I are very grateful for the multilingual models of Sentence Transformers. We have been using some of them in our work and would like to cite them properly.
Hence, I would like to know which training dataset was used for the MPNet model/module in this SBERT-like model architecture.
In the paper --- Reimers, Nils; Gurevych, Iryna (2020): Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation. arXiv. Available online at https://arxiv.org/pdf/2004.09813 ---
there are only the following passages that hint at the dataset used in a similar setup:
"XLM-R ← SBERT-paraphrases: We train XLM-R to imitate SBERT-paraphrases, a RoBERTa model trained on more than 50 Million English paraphrase pairs."
"... Even though SBERT-nli-stsb was trained on the STSbenchmark train set, we observe the best performance by SBERT-paraphrase, which was not trained with any STS dataset. Instead, it was trained on a large and broad paraphrase corpus, mainly derived from Wikipedia, which generalizes well to various topics."
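For context, the distillation setup described in these passages trains the student (e.g. XLM-R) to reproduce the teacher's sentence embeddings via a mean-squared-error objective. A minimal sketch of that objective, with toy vectors standing in for real model outputs (the function name and the example values are hypothetical, not from the paper), might look like:

```python
import numpy as np

def distillation_mse(teacher_emb, student_emb):
    """Mean squared error between teacher and student sentence embeddings,
    the training signal used in the knowledge-distillation setup."""
    teacher = np.asarray(teacher_emb, dtype=float)
    student = np.asarray(student_emb, dtype=float)
    return float(np.mean((teacher - student) ** 2))

# Toy 4-dimensional "embeddings" standing in for real model outputs.
teacher = [[0.1, 0.2, 0.3, 0.4]]
student = [[0.1, 0.2, 0.3, 0.0]]
loss = distillation_mse(teacher, student)
```

In training, the student would be updated to minimize this loss over pairs of parallel sentences, so the teacher's training data (the paraphrase corpus in question) determines the embedding space the student learns.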
Looking further at the model card https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2 is not of much help currently.
However, while looking for similar models, I found that a similar question was asked about the model sentence-transformers/paraphrase-MiniLM-L6-v2, at this link: https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2/discussions/2
There, @nreimers kindly points to the documentation: https://www.sbert.net/examples/training/paraphrases/README.html#pre-trained-models
That indeed answers the question of the user CSHorten. However, on that page there is no reference to sentence-transformers/paraphrase-multilingual-mpnet-base-v2.
Hence, I wonder whether the training data for the teacher model in this case is similar to that of paraphrase-MiniLM-L6-v2, or to that of SBERT-paraphrases (from the paper). In the case of the latter, I wonder which dataset that exactly might be, because on the documentation page https://www.sbert.net/examples/training/paraphrases/README.html#datasets I could not find any reference to a dataset "mainly derived from Wikipedia" with "more than 50 million English paraphrase pairs".
If you are reading this question, N. Reimers, thank you for your amazing work.
Did you get any answer?