Questions about the training data
I really appreciate your work!
I have some questions about the specifics of the training data. Regarding the sentence in the paper: "Given a pretrained LLM, we append an [EOS] token to the end of the query and document, and then feed them into the LLM to obtain the query and document embeddings by taking the last layer [EOS] vector."
What is the format of each training datapoint during training? Is it like "Instruct: {task_definition} \n Query: {q+}[EOS]{doc}[EOS]"? Or are the query and document fed into the model separately? Are the queries in a batch the same? Do you have corresponding code on your GitHub that I can refer to for this part? I couldn't find it.
It follows a bi-encoder architecture: the query follows the "Instruct: {task_definition} \n Query: {q+} [EOS]" format, and the document follows the "{doc} [EOS]" format. The queries are randomly sampled from the training set; they are not the same within a batch.
We do not have public training code for this model, but you can refer to https://github.com/texttron/tevatron for training embedding models.
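For illustration, here is a minimal sketch of how the format described above could be encoded with Hugging Face transformers. This is not the released code; the model name is a hypothetical choice and the `encode` helper is an assumption, but the pooling follows the paper's description of taking the last-layer [EOS] vector.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Hypothetical checkpoint for illustration; swap in the model you are using.
model_name = "intfloat/e5-mistral-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "right"  # so the last non-pad token is the appended [EOS]
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

def encode(texts):
    # Append the EOS token to each text, as described in the paper.
    texts = [t + tokenizer.eos_token for t in texts]
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    hidden = out.last_hidden_state                         # (B, T, H)
    # Index of the last non-padding token per sequence = the [EOS] position.
    last = batch["attention_mask"].sum(dim=1) - 1          # (B,)
    emb = hidden[torch.arange(hidden.size(0)), last]       # (B, H)
    return F.normalize(emb, dim=-1)

task = "Given a web search query, retrieve relevant passages"
q_emb = encode([f"Instruct: {task} \n Query: how do dense retrievers work?"])
d_emb = encode(["Dense retrievers encode queries and documents into vectors."])
score = q_emb @ d_emb.T  # cosine similarity after normalization
```

Query and document are encoded as separate forward passes through the same model (bi-encoder), not concatenated into one sequence.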
Thanks very much for your reply!!!
Is the bi-encoder actually a single mistral-7b model (a Siamese network), or two separate models with different parameters? Also, what is the specific format of a single training datapoint? In my understanding, it should contain a query, a positive doc, and a hard negative doc, with additional negative sampling performed during training?
Yes, it is a Siamese network with shared parameters.
As you said, a training datapoint has a query, a positive doc and a hard negative doc. Other documents from the same batch are used as in-batch negatives.
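For reference, the setup described above (one positive, one hard negative, plus in-batch negatives) corresponds to the standard InfoNCE-style contrastive loss. Here is a minimal PyTorch sketch under that assumption; the temperature value is illustrative, not the one used in the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, pos, hard_neg, temperature=0.02):
    """q, pos, hard_neg: (B, H) L2-normalized embeddings from the shared encoder."""
    docs = torch.cat([pos, hard_neg], dim=0)   # (2B, H): positives + hard negatives
    scores = q @ docs.T / temperature          # (B, 2B) similarity matrix
    # For query i, column i is its positive; every other column
    # (other queries' positives and all hard negatives) acts as a negative.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(scores, labels)

# Toy usage with random embeddings standing in for encoder outputs.
B, H = 4, 8
q = F.normalize(torch.randn(B, H), dim=-1)
pos = F.normalize(torch.randn(B, H), dim=-1)
neg = F.normalize(torch.randn(B, H), dim=-1)
print(contrastive_loss(q, pos, neg))
```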
Got it. Thanks!