Instruction / Query document embedding question.
Hi there!
I wanted to ask a question because I'm a bit unclear on how to optimize the embeddings of documents.
I understand the README example, where you embed 2 queries and 2 documents and then check the distance between them.
My question is: if I only embed documents, I guess I don't need to add any Instruct or Query prefix. Is that right?
When I look in utils, I see "Given a news summary, retrieve other semantically similar summaries".
Now, assume I have 100k news summaries. Do I embed them without any Instruct and Query, just the document? Or is it better to prepend an instruction to the actual summary to explain what it is for, something like "Instruct: I have the following news summary, based on it I want you to retrieve similar summaries. Query: {summary}"?
I guess the confusion stems from the fact that you don't have any Instruct for the documents in the demo, but in the README you say Instruct is needed, otherwise performance is degraded.
You should add instructions to the query side only.
1. Do I need to add instructions to the query?
Yes, this is how the model is trained, otherwise you will see a performance degradation.
The README passage above is written in the context of "add instructions to the query", so it does not apply to the documents.
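To make the query-only rule concrete, here is a minimal sketch of how the two sides could be prepared before encoding. The helper names format_query and format_document are hypothetical and not part of the repo, the example texts are made up, and the newline between the Instruct and Query parts is an assumption based on the format quoted in the question, so check the README for the exact template.

```python
def format_query(task_description: str, query: str) -> str:
    """Prefix a query with a one-sentence task instruction (assumed template)."""
    return f"Instruct: {task_description}\nQuery: {query}"


def format_document(document: str) -> str:
    """Documents are embedded as-is, with no Instruct/Query prefix."""
    return document


# Task description taken from the thread; the texts below are placeholders.
task = "Given a news summary, retrieve other semantically similar summaries"

corpus = [
    "Central bank raises interest rates by a quarter point.",
    "Local team wins the championship after extra time.",
]

# Index side: the 100k summaries would go through format_document unchanged.
corpus_texts = [format_document(doc) for doc in corpus]

# Search side: only the text used as the query gets the instruction.
query_text = format_query(task, "Interest rates go up again this quarter.")

print(query_text)
```

The strings in corpus_texts and query_text would then be passed to the model's encode step in the usual way; only query_text carries the instruction.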
@intfloat
What chunker would you recommend using with this model?
Currently I am using TokenChunker with ~1K tokens per chunk and 50 tokens of overlap.
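For reference, the setup described here (about 1K tokens per chunk, 50 tokens of overlap) behaves roughly like the sliding window below. This is a generic sketch and not the actual TokenChunker implementation; chunk_tokens is a hypothetical helper, and plain integers stand in for the token ids a real tokenizer would produce.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[int], chunk_size: int = 1000, overlap: int = 50
) -> List[List[int]]:
    """Split a token sequence into windows that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks: List[List[int]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once a window reaches the end, to avoid a tiny trailing chunk
        # that only repeats the overlap region.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Stand-in for a tokenized document of 2500 tokens.
token_ids = list(range(2500))
chunks = chunk_tokens(token_ids)
print([len(c) for c in chunks])
```

Each chunk would be embedded as a document, so without any instruction prefix.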