Finetuning Embeddings
#26, opened by HuggySSO
First, I want to thank you for this fantastic model! I use its embeddings with an approximate nearest neighbor search to cluster similar documents.
In your paper, you suggest fine-tuning on data in the MS MARCO and NQ formats. My data, however, consists of document pairs annotated with cosine similarity scores.
Is that format okay, or should I look into something else for generating rich embeddings?
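For concreteness, here is a minimal sketch of the objective that scored pairs would suggest: regress the cosine similarity of the two embeddings onto the annotated score with a mean squared error. This is only an illustration of the data format, not the training recipe from the paper. The function names and the toy embeddings are hypothetical placeholders, and in practice the embeddings would come from the model.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two (n, d) arrays."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)


def cosine_regression_loss(emb_a: np.ndarray, emb_b: np.ndarray,
                           target_scores: np.ndarray) -> float:
    """Mean squared error between predicted and annotated cosine similarity."""
    predicted = cosine_similarity(emb_a, emb_b)
    return float(np.mean((predicted - target_scores) ** 2))


# Hypothetical toy data: two document pairs with 2-dimensional embeddings.
# Real embeddings would be produced by the model for each document.
emb_a = np.array([[1.0, 0.0], [0.0, 1.0]])
emb_b = np.array([[1.0, 0.0], [1.0, 0.0]])
target_scores = np.array([1.0, 0.0])  # annotated similarity per pair

loss = cosine_regression_loss(emb_a, emb_b, target_scores)
```

During fine-tuning, a loss of this shape would be minimized over batches of annotated pairs, so each training example is simply two documents plus one score.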
Also, my domain is quite specialized and labeled data is difficult to obtain, but I have a few million unlabeled domain-specific sentences.
Is there an unsupervised pretraining or fine-tuning approach you can suggest?