How do you suggest using ColBERT vectors?

#16
by EquinoxElahin - opened

Because of the ColBERT-vectors concept, we get one vector for each token in our sentences.
So if my question is n tokens long, I get vectors of size (n+1, 1024).
And let's say I have M documents: then I get M matrices of size (?, 1024), where ? depends on the number of tokens in each document.
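
For context, here is roughly how I am getting the per-token vectors. This is only a sketch, assuming the `BGEM3FlagModel` API shown in the FlagEmbedding README:

```python
from FlagEmbedding import BGEM3FlagModel

# Sketch: encode with the multi-vector (ColBERT) output enabled.
model = BGEM3FlagModel("BAAI/bge-m3")

out = model.encode(
    ["How do you suggest using ColBERT vectors?"],
    return_dense=True,
    return_colbert_vecs=True,
)

print(out["dense_vecs"].shape)       # (1, 1024): one vector per sentence
print(out["colbert_vecs"][0].shape)  # (?, 1024): one vector per token
```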

Thinking quickly, I would have said that I should mean-pool my vectors to get a question embedding of size (1, 1024) and, for my documents, an (M, 1024) matrix. Then I would do a cosine-similarity search or the like.

But I read on your GitHub: "Different from other embedding models using mean pooling, BGE uses the last hidden state of [cls] as the sentence embedding: sentence_embeddings = model_output[0][:, 0]. If you use mean pooling, there will be a significant decrease in performance. Therefore, make sure to use the correct method to obtain sentence vectors. You can refer to the usage method we provide."
For the dense vector I understand, but for the ColBERT vectors I don't know how to compute it. Could you explain?
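
For the dense part, my understanding of the quoted [CLS] pooling is something like this (a sketch with plain transformers, assuming BAAI/bge-m3 loads as a standard encoder):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = AutoModel.from_pretrained("BAAI/bge-m3")

inputs = tokenizer(["What is BGE M3?"], padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    model_output = model(**inputs)

# [CLS] pooling as quoted above: take the last hidden state of the
# first token, not the mean over all tokens.
sentence_embeddings = model_output[0][:, 0]
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, dim=-1)
```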

Thanks,

Beijing Academy of Artificial Intelligence org

The ColBERT score is different from the dense score. You can refer to https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/bge_m3.py#L90 for the method used to compute the ColBERT score.
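
For anyone landing here later: the linked `colbert_score` is a late-interaction (MaxSim) score, not a single cosine between pooled vectors. Each query token is matched against its best passage token, and those maxima are averaged over the query tokens, so the token vectors are never mean-pooled. Below is a minimal sketch with a hypothetical helper name, assuming the per-token vectors are already L2-normalized:

```python
import torch

def colbert_maxsim(q_vecs: torch.Tensor, p_vecs: torch.Tensor) -> float:
    # q_vecs: (n_query_tokens, 1024) per-token query vectors
    # p_vecs: (n_passage_tokens, 1024) per-token passage vectors
    # Vectors are assumed L2-normalized, so a dot product is a cosine.
    token_scores = q_vecs @ p_vecs.T                       # (n_q, n_p) similarity matrix
    best_per_query_token = token_scores.max(dim=1).values  # MaxSim per query token
    return best_per_query_token.mean().item()              # average over query tokens
```

In practice, the FlagEmbedding README also shows a `model.colbert_score(...)` helper on `BGEM3FlagModel`, so re-implementing this is only needed if you want custom batching or indexing.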
