Embeddings and Cosine Similarity

#17
by karmiq - opened

Hello @intfloat, I have a question related to using the embeddings generated by the model to compute cosine similarity.

I am using sklearn.metrics.pairwise.cosine_similarity like this to demonstrate the concept of similarity in a talk:

import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

model_e5 = SentenceTransformer("intfloat/multilingual-e5-base")

words = ["cat", "mouse", "house", "clock"]

df_emb_e5 = pd.DataFrame(
    [[model_e5.encode("query: " + word)] for word in words],
    index=words, columns=["embeddings"])

def compute_similarities(query_emb, df_emb):
    similarities = {}

    for word, word_emb in df_emb["embeddings"].items():
        similarities[word] = cosine_similarity(
            normalize([query_emb]),
            normalize([word_emb]),
        )[0][0]

    return similarities

query = "cat"
query_emb = model_e5.encode("query: " + query)

similarities = compute_similarities(query_emb, df_emb_e5)

# => Word         Similarity
# => cat ⇔ cat    1.0000
# => cat ⇔ mouse  0.8972
# => cat ⇔ house  0.8827
# => cat ⇔ clock  0.8717
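
(Side note: cosine_similarity already L2-normalizes its inputs, so the explicit normalize() calls above are redundant, and the same scores can be computed in one call on the stacked embeddings. A minimal sketch; word_matrix and similarities_vectorized are just names picked for this example.)

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Stack the per-word embeddings into a (num_words, dim) matrix and compute
# all similarities against the query in a single call; cosine_similarity
# L2-normalizes internally, so no explicit normalize() is needed.
word_matrix = np.vstack(df_emb_e5["embeddings"].tolist())
scores = cosine_similarity(query_emb.reshape(1, -1), word_matrix)[0]
similarities_vectorized = dict(zip(words, scores))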

The returned values are really close to each other, whereas I would intuitively expect a much bigger difference.

When I try the same with the sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model, I'm getting a much bigger difference:

model_lm = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

df_emb_lm = pd.DataFrame(
    [[model_lm.encode(word)] for word in words],
    index=words, columns=["embeddings"])

query = "cat"
query_emb = model_lm.encode(query)

similarities = compute_similarities(query_emb, df_emb_lm)

# => Word         Similarity
# => cat ⇔ cat    1.0000
# => cat ⇔ mouse  0.3129
# => cat ⇔ house  0.2636
# => cat ⇔ clock  0.0392

This corresponds much better to my intuitive understanding. I've tried using sklearn.preprocessing.normalize and reducing the dimensions with PCA, but without much success. I have also used the passage: prefix, again without much success (roughly as sketched below).
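
(For reference, the passage: attempt looked roughly like this; a sketch only, reusing the variables from above. E5 expects the query: prefix on the query side and the passage: prefix on the document side, so only the document encoding changes.)

# Sketch of the "passage:" attempt: documents get the "passage: " prefix,
# while the query keeps the "query: " prefix. Reuses model_e5, words, query
# and compute_similarities from above.
df_emb_e5_passage = pd.DataFrame(
    [[model_e5.encode("passage: " + word)] for word in words],
    index=words, columns=["embeddings"])

query_emb_e5 = model_e5.encode("query: " + query)
similarities_passage = compute_similarities(query_emb_e5, df_emb_e5_passage)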

I have found the GitHub issue where you state:

For tasks like text retrieval or semantic similarity, what matters is the relative order of the scores instead of the absolute values, so this is generally not an issue.

However, I'm curious why the results differ so much between these two models. I would be very grateful for any advice or pointers on how the cosine similarity computation could be improved.

When experimenting further, I've found that scaling the scores with sklearn.preprocessing.MinMaxScaler makes them much more aligned with the intuitive understanding of the differences. Does this approach make sense, in your opinion?

import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# Rescale the scores to the [0, 1] range; the relative order is preserved.
normalized_scores = scaler.fit_transform(
    np.array(list(similarities.values())).reshape(-1, 1)
).flatten()

normalized_similarities = {key: normalized_scores[i] for i, key in enumerate(similarities)}

# => Word         Similarity
# => cat ⇔ cat    1.0000
# => cat ⇔ mouse  0.1981
# => cat ⇔ house  0.0857
# => cat ⇔ clock  0.0000
intfloat (Owner)

The underlying reason is that we use a much lower temperature (0.01) for the InfoNCE loss, while sentence-transformers models use 0.05 by default. A lower temperature pushes the cosine scores to concentrate in a narrow range, but based on our preliminary experiments the ranking performance improves.
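
(To make the effect concrete: in InfoNCE the cosine similarities are divided by the temperature before the softmax, so at 0.01 even a cosine gap of 0.02 already lets the positive dominate, and training has little pressure to spread the absolute scores further apart. A minimal sketch with made-up cosine values; infonce_softmax is just an illustrative helper, not part of the model code.)

import numpy as np

def infonce_softmax(cos_scores, temperature):
    """Softmax over cosine scores divided by the temperature, as in InfoNCE."""
    logits = np.asarray(cos_scores) / temperature
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()

# Hypothetical cosines for one positive and two negatives, in the narrow
# range the E5 models produce.
cos_scores = [0.90, 0.88, 0.87]

print(infonce_softmax(cos_scores, temperature=0.05))  # ≈ [0.45, 0.30, 0.25]
print(infonce_softmax(cos_scores, temperature=0.01))  # ≈ [0.84, 0.11, 0.04]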

Many thanks for the explanation, @intfloat! In the end, I've decided to keep both the "original" and the normalized similarities on the slide, to prevent any confusion. The output matches the intuitive understanding quite well, I think:

Word          Similarity
cat ⇔ cat     1.0000
cat ⇔ kitten  0.7378
cat ⇔ mouse   0.3186
cat ⇔ house   0.0781
cat ⇔ clock   0.0000
karmiq changed discussion status to closed
