Unclear query and passage prefix instructions

#34

by Avditvs - opened Mar 28

Mar 28

The instructions for the query and passage prefixes are not clear. I understand using "query: " and "passage: " for asymmetric tasks.
But I don't get why "query: " should be used on both sides for semantic similarity. It makes sense to me only when the texts are questions, as in Quora duplicates.

But the paper "Text Embeddings by Weakly-Supervised Contrastive Pre-training" says:
"For the Quora duplicate retrieval task in the BEIR benchmark, we add prefix “query: ” to all the
questions. For other retrieval tasks, we use “query: ” and “passage: ” prefixes correspondingly."

Shouldn't I use the "passage: " prefix for regular passages?

Thank you for your help

jing-yi

11 days ago

•

edited 11 days ago

Commenting because I'm curious too!

I currently have my text documents embedded with the "passage: " prefix, and it would be an inefficient use of resources to store an entirely new set of vectors that differ only in the prefix, i.e. "query: ".

I'm also curious: why not use two embeddings that were both prefixed with "passage: " for symmetric tasks? Why must it be "query: "?

intfloat

Owner 11 days ago

It's an empirical observation that using "query: " prefix for symmetric tasks performs slightly better than the "passage: " prefix.
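As a rough illustration of the convention described here (not code from the model card), the prefixes could be applied as below. The helper names `add_prefix`, `prepare_asymmetric`, and `prepare_symmetric` are hypothetical; only the prefix strings "query: " and "passage: " come from this thread.

```python
def add_prefix(texts, prefix):
    """Prepend a prefix such as "query: " or "passage: " to each text."""
    return [prefix + t for t in texts]


def prepare_asymmetric(queries, passages):
    """Retrieval-style input: queries and passages get different prefixes."""
    return add_prefix(queries, "query: "), add_prefix(passages, "passage: ")


def prepare_symmetric(texts_a, texts_b, prefix="query: "):
    """Similarity-style input: both sides share one prefix.

    The default follows the empirical observation above; pass
    prefix="passage: " to try the alternative.
    """
    return add_prefix(texts_a, prefix), add_prefix(texts_b, prefix)


if __name__ == "__main__":
    q, p = prepare_asymmetric(["what is e5?"], ["E5 is a text embedding model."])
    print(q, p)
    a, b = prepare_symmetric(["A man is eating."], ["Someone eats food."])
    print(a, b)
```

The prefixed strings would then be passed to whatever encoder you use; the encoding step is left out because it depends on your setup.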

Avditvs

11 days ago

Thanks for your answer. How much better does it perform? Do we have metrics to support this claim?
I guess we can run the MTEB benchmark script using a passage: prefix to measure it.

Also, this means that if I want to perform both symmetric and asymmetric tasks with maximum performance, I would have to store the vectors with both the query and the passage prefix.
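A minimal sketch of what storing both sets of vectors might look like, assuming an `embed` callable that maps a list of strings to a list of vectors. Both `embed` and the function names are hypothetical placeholders, not part of any library.

```python
def embed_both_prefixes(docs, embed):
    """Embed every document twice, once per prefix.

    `embed` is a hypothetical callable: list[str] -> list of vectors.
    The result doubles the storage cost compared to a single prefix.
    """
    return {
        "query": embed(["query: " + d for d in docs]),
        "passage": embed(["passage: " + d for d in docs]),
    }


def vectors_for_task(index, task):
    """Pick the stored vectors that match the task type."""
    if task == "symmetric":
        return index["query"]
    if task == "asymmetric":
        return index["passage"]
    raise ValueError(f"unknown task type: {task!r}")


if __name__ == "__main__":
    # Stand-in embedder for demonstration only: one number per text.
    def fake_embed(texts):
        return [[float(len(t))] for t in texts]

    index = embed_both_prefixes(["hello", "hi"], fake_embed)
    print(vectors_for_task(index, "symmetric"))
    print(vectors_for_task(index, "asymmetric"))
```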

intfloat

Owner 11 days ago

Hey, I just ran the numbers on 10 semantic textual similarity tasks with multilingual-e5-large. The results are as follows:

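| Prefix | BIOSSES | SICK-R | STS12 | STS13 | STS14 | STS15 | STS16 | STS17 | STS22 | STS-B | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| w/ "query: " prefix | 82.49 | 80.23 | 80.02 | 81.55 | 77.72 | 89.31 | 85.78 | 88.11 | 63.04 | 87.3 | 81.55 |
| w/ "passage: " prefix | 82.66 | 77.76 | 80.56 | 79.84 | 78.42 | 89.27 | 84.67 | 87.5 | 67.24 | 85.01 | 81.29 |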

The difference is there, but it is small: about 0.26 points on average, and which prefix wins varies by task. If you have a validation set, it is best to test on your own data.
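One way to run that check on your own validation set is sketched below, under a few assumptions: `embed` is a hypothetical callable mapping a list of strings to a list of vectors, the data is a list of sentence pairs with gold similarity scores, and Spearman correlation of cosine similarities is used as the metric (a common choice for STS-style data). This is not the script used for the numbers above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def score_prefix(pairs, gold, embed, prefix):
    """Spearman correlation between cosine similarities and gold scores.

    `embed` is a hypothetical callable: list[str] -> list of vectors.
    """
    sims = []
    for a, b in pairs:
        va, vb = embed([prefix + a, prefix + b])
        sims.append(cosine(va, vb))
    return spearman(sims, gold)
```

Calling `score_prefix` once with `"query: "` and once with `"passage: "` on the same pairs gives two numbers you can compare directly. With a small validation set, differences of the size seen above may not be meaningful.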
