Cosine Similarity is always HIGH

#7
by artificialgenerations4gsdfg - opened

You rank top of the MTEB leaderboard for STS tasks (nice work, thank you!), so I wanted to see if I could improve a current use case we have for document search and finding relevant passages. But I noticed that, locally, any two embeddings share a cos_sim of 0.75 or more! I would expect two random sentences to be roughly orthogonal in embedding space, so I thought I was doing something wrong.

But alas, playing with the HF space https://huggingface.co/spaces/aruntruminds/thenlper-gte-large I noticed the same results.

Ex:

Query: The person tells there name.

| Sim  | Sentences                      |
|:-----|--------------------------------|
| 0.84 | My name is Edward.             |
| 0.89 | What is your name?             |
| 0.79 | rvjxme eogh38ahjf eogaljg ads. |
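
For concreteness, a minimal sketch of how such scores might be reproduced with sentence-transformers (the pooling and normalization choices here are assumptions, not necessarily what the HF space does under the hood):

```python
# Minimal reproduction sketch (assumes the sentence-transformers package is installed).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("thenlper/gte-large")

query = "The person tells there name."
sentences = [
    "My name is Edward.",
    "What is your name?",
    "rvjxme eogh38ahjf eogaljg ads.",
]

query_emb = model.encode(query, normalize_embeddings=True)
sent_embs = model.encode(sentences, normalize_embeddings=True)

# With normalized embeddings, cosine similarity is just a dot product; every
# pair tends to land in a narrow high band (roughly 0.75-0.9 in this thread).
for sent, score in zip(sentences, util.cos_sim(query_emb, sent_embs)[0]):
    print(f"{score.item():.2f}  {sent}")
```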

What could be going wrong? I've actually noticed this behavior in other embedding models as well, and I have never seen a model give a negative similarity score, which is also counterintuitive since two sentences could actually be semantically opposite.

A little guidance on improving accuracy please?

I've had very good success with this model on retrieval tasks, so I didn't initially believe you. I was very surprised to reproduce your example and get the same results. I too would be interested to hear more.

@mattgwwalker It makes me wonder if they were biased in their training toward positive examples over orthogonal (and negative) ones. Ideally, at the same time you train similarity detection, you've got to be training dissimilarity too (or so one would think). And if you're using cosine sim in the score, I should think you'd generate large gradients if orthogonal sentences were rated at 0.79 when they should have been 0.

I haven't found a model that doesn't behave this way, though, so I must be missing something.

The embedding model is trained with a contrastive-learning objective. The scores it produces are meant to distinguish the partial order relationship between relevant and irrelevant documents; they cannot be used as a strong reference for determining relevance in an absolute sense. Discussions for other models cover this in more detail; see, for example:

https://huggingface.co/intfloat/multilingual-e5-large/discussions/10#64ca669c38837b12d5eed6a4
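
To make that point concrete, here is a small sketch (not the actual GTE training code, just an assumed InfoNCE-style contrastive loss): because the softmax inside cross-entropy is invariant to adding the same constant to every logit, the loss only constrains the *gap* between positive and negative similarities, so absolute cosine values can sit high everywhere without any training penalty.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(sims: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """Cross-entropy over one row of similarities; index 0 is the positive."""
    logits = (sims / temperature).unsqueeze(0)   # shape (1, num_docs)
    target = torch.zeros(1, dtype=torch.long)    # positive document is at index 0
    return F.cross_entropy(logits, target)

# Query-vs-document cosine similarities: positive first, then two negatives.
sims = torch.tensor([0.89, 0.84, 0.79])

# Shifting every similarity by the same constant changes nothing, so nothing
# pushes "irrelevant" pairs toward 0 (or negative) cosine similarity.
print(info_nce_loss(sims))        # ~0.41
print(info_nce_loss(sims + 0.7))  # identical value: only relative gaps matter
```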

Since both sentences are in English, the similarity is still high. What do you say?

> The embedding model is trained with a contrastive-learning objective. The scores it produces are meant to distinguish the partial order relationship between relevant and irrelevant documents; they cannot be used as a strong reference for determining relevance in an absolute sense. Discussions for other models cover this in more detail; see, for example: https://huggingface.co/intfloat/multilingual-e5-large/discussions/10#64ca669c38837b12d5eed6a4

I'm not sure I understand this. How does it both distinguish the partial order relationship between relevant and irrelevant documents and not provide a reference for determining relevance? Can't we just do probability calibration?

Anyways, it seems to work well, afaict.
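
On the calibration question, a minimal sketch of what that could look like (not something from this thread): fit a Platt-style logistic regression mapping raw cosine scores to relevance probabilities, assuming you have a small set of (query, passage) pairs labeled relevant/irrelevant.

```python
# Hypothetical sketch of Platt-style calibration on raw cosine similarities,
# assuming labeled pairs are available; all numbers below are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

sims = np.array([[0.92], [0.89], [0.84], [0.81], [0.79], [0.76]])  # raw cos_sim
labels = np.array([1, 1, 1, 0, 0, 0])                              # 1 = relevant

calibrator = LogisticRegression()
calibrator.fit(sims, labels)

# Calibrated probabilities spread across (0, 1) even though the raw
# similarities are squeezed into a narrow 0.76-0.92 band.
print(calibrator.predict_proba(np.array([[0.84], [0.79]]))[:, 1])
```

Note that such a monotone mapping only rescales the scores for thresholding; it leaves the ranking, and hence retrieval accuracy, unchanged.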

thenlper changed discussion status to closed
