intfloat/e5-mistral-7b-instruct · Instruction / Query document embedding question.

eek

Jan 31, 2024

Hi there!

I wanted to ask a question because I'm a bit unclear on how to optimize the embeddings of documents.

I understand the README task, where you embed 2 queries and 2 documents and then check the distance between them.

My question is: if I just embed documents, I guess I don't need to add any Instruct or Query. Is that true?

When I look in utils, I see "Given a news summary, retrieve other semantically similar summaries."

Now, assume I have 100k news summaries. Do I embed them without any Instruct and Query, just the document? Or is it better to prepend an instruction to the actual summary to explain what it is for, something like: Instruct: I have the following news summary, based on it I want you to retrieve similar summaries. Query: {summary}?

I guess the confusion stems from the fact that you don't have any Instruct for the documents in the DEMO but in the README you say Instruct is needed otherwise performance is degraded.

intfloat

Owner Feb 2, 2024

You should add instructions to the query side only.

1. Do I need to add instructions to the query?

Yes, this is how the model is trained, otherwise you will see a performance degradation.

The README passage quoted above is written in the context of "add instructions to the query"; it does not apply to the document side.
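For anyone reading along, here is a minimal sketch of what "instructions on the query side only" looks like when preparing inputs. The `Instruct: ...\nQuery: ...` template follows the model's README; the task string is the one quoted from utils earlier in this thread, and the sample summaries are made-up placeholders, not real data. This only builds the input strings and does not load the model.

```python
def get_detailed_instruct(task_description: str, query: str) -> str:
    # Template used for queries: a one-sentence task description, then the query text.
    return f"Instruct: {task_description}\nQuery: {query}"


# Task description quoted earlier in the thread.
task = "Given a news summary, retrieve other semantically similar summaries."

# Placeholder texts for illustration only.
queries = ["Central bank raises interest rates by a quarter point."]
documents = [
    "The central bank announced a 0.25% rate increase on Tuesday.",
    "Local team wins the championship after extra time.",
]

# Queries get the instruction prefix.
query_inputs = [get_detailed_instruct(task, q) for q in queries]

# Documents are embedded as-is, with no "Instruct:" or "Query:" prefix.
document_inputs = list(documents)

# Combined list in the same order the README example uses (queries first).
input_texts = query_inputs + document_inputs
```

`input_texts` would then go through the tokenizer and model as in the README example; only the first `len(queries)` entries carry the instruction.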

mendesgeo

Mar 1, 2024

edited Mar 1, 2024

@intfloat which chunker would you recommend using with this model?
Currently I am using TokenChunker with chunks of ~1K tokens and 50 tokens of overlap.
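To make the setup concrete, here is a rough sketch of fixed-size token chunking with overlap. It is not TokenChunker's actual implementation; `chunk_tokens` is a hypothetical helper that works on any list of token ids, and the 1000/50 defaults only mirror the numbers mentioned above.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[int], chunk_size: int = 1000, overlap: int = 50) -> List[List[int]]:
    # Hypothetical helper: slide a window of `chunk_size` tokens over the input,
    # moving forward by `chunk_size - overlap` so consecutive chunks share `overlap` tokens.
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: List[List[int]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once the current window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
        start += step
    return chunks


# Small illustration with stand-in token ids.
example_chunks = chunk_tokens(list(range(10)), chunk_size=4, overlap=1)
```

In practice the token ids would come from the model's own tokenizer, and each chunk would be decoded back to text before embedding.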

aleyfin

May 2

@eek, what did you end up using for just embedding your texts? I am also wondering whether I should add some text before each text. Currently, I am encoding the texts only, and my goal is just to create an embedding space for topic modelling.