Best instructions for clustering and semantic similarity

#29
by rmilliere - opened

The model card gives an example instruction for retrieval.

What are the recommended instructions to get embeddings optimized for either clustering or sentence similarity instead of retrieval?

NVIDIA org

Thank you for the question. All example instruction prefixes (including those for clustering, STS, and classification) are listed in Table 7 of our NV-Embed paper: https://arxiv.org/pdf/2405.17428

Thanks, I missed that in the appendix.
If anyone else is looking for this information, here are the relevant instructions:

  • STS: "Retrieve semantically similar text."
  • Clustering (adjusted for a generic task): "Identify the topic or theme of X" (e.g., "Identify the topic or theme of the given sentences" for a corpus of sentences)
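In case it helps, here is a minimal sketch of turning those two instructions into prefix strings. The `Instruct: ...\nQuery: ` wrapper is my assumption, modeled on the retrieval example in the model card, so please check it against the card before relying on it. The helper names (`build_prefix`, `clustering_instruction`) are made up for illustration and are not part of any library.

```python
# Instructions quoted above (taken from Table 7 of the NV-Embed paper).
STS_INSTRUCTION = "Retrieve semantically similar text."
# "X" in the clustering instruction is replaced by a description of the corpus.
CLUSTERING_TEMPLATE = "Identify the topic or theme of {target}"


def clustering_instruction(target: str = "the given sentences") -> str:
    """Fill in the generic clustering instruction for a specific corpus."""
    return CLUSTERING_TEMPLATE.format(target=target)


def build_prefix(instruction: str) -> str:
    """Wrap an instruction in the prefix format.

    ASSUMPTION: the "Instruct: <instruction>\\nQuery: " layout mirrors the
    retrieval example in the model card; verify it there.
    """
    return f"Instruct: {instruction}\nQuery: "


sts_prefix = build_prefix(STS_INSTRUCTION)
clustering_prefix = build_prefix(clustering_instruction())

print(repr(sts_prefix))
print(repr(clustering_prefix))
```

How the prefix is then passed to the model depends on how you load it, so follow the model card's retrieval example for that part and swap in the prefix for your task.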
