Unclear query and passage prefix instructions

#34
by Avditvs - opened

The instructions for the "query: " and "passage: " prefixes are not clear. I understand using both for asymmetric tasks.
But I don't get why "query: " should be used on both sides for semantic similarity. I do get it when the texts are like Quora duplicates.

But the paper "Text Embeddings by Weakly-Supervised Contrastive Pre-training" says:
"For the Quora duplicate retrieval task in the BEIR benchmark, we add prefix “query: ” to all the
questions. For other retrieval tasks, we use “query: ” and “passage: ” prefixes correspondingly."

Shouldn't I use the "passage: " prefix for regular passages?

Thank you for your help

Commenting because I'm curious too!

I currently have my text documents embedded with the "passage: " prefix, and it would be an inefficient use of resources to store an entirely new set of vectors whose only difference is the "query: " prefix.

I'm also curious: why not use two embeddings that were both prefixed with "passage: " for symmetric tasks instead? Why must it be "query: "?

It's an empirical observation that using "query: " prefix for symmetric tasks performs slightly better than the "passage: " prefix.
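As a rough illustration of the convention discussed in this thread, the sketch below prepends the prefixes by task type. The helper name `add_e5_prefixes` and its `symmetric` flag are hypothetical and not part of any library; only the "query: " and "passage: " strings come from the paper quote above.

```python
def add_e5_prefixes(left, right, symmetric):
    """Return (left, right) text lists with E5-style prefixes prepended.

    Hypothetical helper, for illustration only.
    """
    left_prefix = "query: "
    # Symmetric tasks (semantic similarity, duplicate detection) use
    # "query: " on both sides; asymmetric retrieval uses "passage: "
    # for the documents being searched.
    right_prefix = "query: " if symmetric else "passage: "
    return (
        [left_prefix + text for text in left],
        [right_prefix + text for text in right],
    )


# Asymmetric retrieval: a question against a document.
questions, documents = add_e5_prefixes(
    ["how do plants make food"],
    ["Photosynthesis converts light into chemical energy."],
    symmetric=False,
)

# Symmetric similarity: two sentences of the same kind.
first, second = add_e5_prefixes(
    ["A man is playing a guitar."],
    ["Someone plays the guitar."],
    symmetric=True,
)
```

The prefixed strings are what would then be passed to the embedding model.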

Thanks for your answer. How much better does it perform? Do we have metrics to support this claim?
I guess we could run the MTEB benchmark script with the "passage: " prefix to measure it.

Also, this means that if I want to perform both symmetric and asymmetric tasks with maximum performance, I would have to store vectors with both the "query: " and "passage: " prefixes.

Hey, I just ran the numbers on 10 semantic textual similarity tasks with multilingual-e5-large. The results are as follows:

Prefix                 BIOSSES  SICK-R   STS12   STS13   STS14   STS15   STS16   STS17   STS22   STS-B Average
w/ "query: " prefix      82.49   80.23   80.02   81.55   77.72   89.31   85.78   88.11   63.04   87.30   81.55
w/ "passage: " prefix    82.66   77.76   80.56   79.84   78.42   89.27   84.67   87.50   67.24   85.01   81.29
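To make the per-task picture easier to read, the scores reported above can be compared directly. This is only arithmetic on the posted numbers; the variable names are illustrative.

```python
tasks = ["BIOSSES", "SICK-R", "STS12", "STS13", "STS14",
         "STS15", "STS16", "STS17", "STS22", "STS-B"]
query_scores = [82.49, 80.23, 80.02, 81.55, 77.72,
                89.31, 85.78, 88.11, 63.04, 87.3]
passage_scores = [82.66, 77.76, 80.56, 79.84, 78.42,
                  89.27, 84.67, 87.5, 67.24, 85.01]

# Positive delta: "query: " scored higher; negative: "passage: " did.
deltas = {
    task: round(q - p, 2)
    for task, q, p in zip(tasks, query_scores, passage_scores)
}
query_wins = [task for task, d in deltas.items() if d > 0]
passage_wins = [task for task, d in deltas.items() if d < 0]

for task, delta in deltas.items():
    print(f"{task:8s} {delta:+.2f}")
print(f'"query: " higher on {len(query_wins)} tasks, '
      f'"passage: " higher on {len(passage_wins)} tasks')
```

On these numbers "query: " is higher on 6 tasks and "passage: " on 4, with the largest gaps on STS22 (4.20 in favour of "passage: ") and SICK-R (2.47 in favour of "query: ").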

There is a difference, but it is small. If you have a validation set, it is best to test on your own data.
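A minimal sketch of such a check, assuming a validation set of sentence pairs with gold similarity scores: it scores each prefix by the Spearman correlation between cosine similarities and the gold labels, a common metric for STS-style tasks. The `embed` argument is a hypothetical callable mapping a string to a vector (in practice a wrapper around the model), and `toy_embed` below is only a stand-in so the example runs without a model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0


def compare_prefixes(embed, pairs, gold, prefixes=("query: ", "passage: ")):
    """Return {prefix: Spearman score} with the same prefix on both sides."""
    scores = {}
    for prefix in prefixes:
        sims = [cosine(embed(prefix + a), embed(prefix + b)) for a, b in pairs]
        scores[prefix] = spearman(sims, gold)
    return scores


def toy_embed(text):
    # Stand-in for a real model: letter-count vector. Assumption for demo only.
    return [text.count(c) for c in "abcdefghijklmnopqrstuvwxyz"]


pairs = [("a cat sleeps", "a cat naps"), ("a cat sleeps", "stock prices fell")]
gold = [4.0, 0.5]
print(compare_prefixes(toy_embed, pairs, gold))
```

With a real model, whichever prefix gives the higher score on your own pairs is the one worth storing.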
