intfloat/multilingual-e5-large · Finetuning Embeddings

First, I want to thank you for this fantastic model! I use its embeddings with an approximate nearest neighbor search to cluster similar documents.

In your paper, you suggest fine-tuning on data in the MS MARCO and NQ formats. My data, however, consists of document pairs annotated with cosine similarity scores.

Is that format okay, or should I look into something else for generating rich embeddings?
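
To make the format concrete, here is a minimal sketch of what I have in mind. The example pairs, the `toy_embed` lookup, and the `cosine_mse_loss` function are hypothetical placeholders made up for illustration (the real embeddings would come from the model). The mean-squared-error objective is only my assumption about how such scores could be used for training, not something taken from the paper.

```python
import math

# Hypothetical toy data: (document_a, document_b, annotated cosine similarity).
pairs = [
    ("first document text", "a closely related document", 0.9),
    ("first document text", "an unrelated document", 0.1),
]

# Placeholder embeddings standing in for the model's output vectors.
toy_vectors = {
    "first document text": [1.0, 0.0],
    "a closely related document": [1.0, 0.0],
    "an unrelated document": [0.0, 1.0],
}


def toy_embed(text):
    """Look up a fixed vector; a real setup would call the encoder here."""
    return toy_vectors[text]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_mse_loss(embed, labeled_pairs):
    """Mean squared error between predicted and annotated cosine similarity."""
    errors = [
        (cosine(embed(a), embed(b)) - score) ** 2
        for a, b, score in labeled_pairs
    ]
    return sum(errors) / len(errors)


loss = cosine_mse_loss(toy_embed, pairs)
print(f"loss on toy pairs: {loss:.4f}")
```

In an actual fine-tuning run, the loss would be computed on batches of model embeddings and minimized with gradient descent; the plain-Python version above only shows what the labels mean.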

Also, my domain is fairly specialized and labeled data is difficult to obtain, but I do have a few million unlabeled domain-specific sentences.

Is there an unsupervised pretraining or fine-tuning approach you can suggest?