Temperature-scaled cosine similarity function?
In the code example, you measure the similarity between two sentences as the inner product of their normalized embeddings. However, in Section 3.2 of your technical report, you write "In this paper, we adopt the temperature-scaled cosine similarity function as follows".
My use case involves two sets of documents, denoted A and B. The task is to find the top-k documents in B that are most semantically similar to those in A (by computing the average distance between each document in B and all documents in A; see the sketch at the end of this post). Which distance measure between two embeddings do you recommend: plain cosine distance, or the temperature-scaled cosine similarity function?
As a bonus question, do you think it would be better to prepend the instructions to the documents in A in addition to those in B, i.e. to documents as well as queries? After all, this is a symmetric task, and I suppose preserving that symmetry would help.
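For reference, here is roughly what I have in mind. The model name and instruction text are placeholders, and I am using the normalized-embedding inner product from your code example:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder checkpoint; substitute whichever E5 model you are using.
model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

task = "Retrieve documents that are semantically similar to the given document"

def with_instruction(text: str) -> str:
    # Template from the README; currently applied to documents in B only.
    return f"Instruct: {task}\nQuery: {text}"

docs_a = ["first reference document", "second reference document"]
docs_b = ["candidate one", "candidate two", "candidate three"]

# Normalized embeddings, so the inner product below is the cosine similarity.
emb_a = model.encode(docs_a, normalize_embeddings=True)
emb_b = model.encode([with_instruction(d) for d in docs_b], normalize_embeddings=True)

# Average similarity of each document in B against all documents in A,
# then take the top-k documents in B.
scores = (emb_b @ emb_a.T).mean(axis=1)
top_k = 2
top_indices = np.argsort(-scores)[:top_k]
print([docs_b[i] for i in top_indices])
```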
Hi @nalzok ,
The inner product between normalized embeddings is mathematically equivalent to the cosine similarity function, so they are the same thing. The temperature only divides that score by a positive constant, which does not change the ranking, so for your top-k task you can simply use plain cosine similarity.
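A quick sanity check with toy vectors (the temperature value here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=8)
docs = rng.normal(size=(5, 8))

# Inner product of L2-normalized vectors is exactly the cosine similarity.
q_n = q / np.linalg.norm(q)
docs_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
cos_scores = docs_n @ q_n

# Dividing by a positive temperature rescales the scores but cannot
# change their relative order, so top-k retrieval is unaffected.
tau = 0.01
scaled_scores = cos_scores / tau
assert np.array_equal(np.argsort(-cos_scores), np.argsort(-scaled_scores))
```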
About the instructions: yes, we add instructions to both sides for symmetric tasks such as STS (see https://github.com/microsoft/unilm/blob/78b3a48de27c388a0212cfee49fd6dc470c9ecb5/e5/mteb_except_retrieval_eval.py#L68).
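For your A/B setup that would look roughly like this (hypothetical instruction text; the prefix format follows the README template):

```python
# For a symmetric task, the same instruction prefix goes on both sides.
task = "Retrieve documents that are semantically similar to the given document"
prefix = f"Instruct: {task}\nQuery: "

docs_a = ["first reference document", "second reference document"]
docs_b = ["candidate one", "candidate two", "candidate three"]

inputs_a = [prefix + d for d in docs_a]  # instruction on side A ...
inputs_b = [prefix + d for d in docs_b]  # ... and on side B as well
```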
Thanks for the reply! As a follow-up question: what's the recommended way to process long texts, particularly those spanning multiple lines? I'm asking because your template `f'Instruct: {task_description}\nQuery: {query}'` includes a `\n` character. Would the newline characters in `query` interfere with the template?
No, it is okay to include `\n` in either the query or the documents.
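For example (the task description below is only a placeholder), a query containing newlines simply becomes part of the prompt text:

```python
task_description = "Given a web search query, retrieve relevant passages"  # placeholder
query = "first line of a long document\nsecond line of the same document"

prompt = f"Instruct: {task_description}\nQuery: {query}"
print(prompt)
# Instruct: Given a web search query, retrieve relevant passages
# Query: first line of a long document
# second line of the same document
```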