Embeddings and Cosine Similarity
Hello @intfloat, I have a question related to using the embeddings generated by the model to compute cosine similarity. I am using sklearn.metrics.pairwise.cosine_similarity like this, to demonstrate the concept of similarity in a talk:
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

model_e5 = SentenceTransformer("intfloat/multilingual-e5-base")

words = ["cat", "mouse", "house", "clock"]
df_emb_e5 = pd.DataFrame(
    [[model_e5.encode("query: " + word)] for word in words],
    index=words, columns=["embeddings"])

def compute_similarities(query_emb, df_emb):
    similarities = {}
    for word, word_emb in df_emb["embeddings"].items():
        similarities[word] = cosine_similarity(
            normalize([query_emb]),
            normalize([word_emb]),
        )[0][0]
    return similarities
query = "cat"
query_emb = model_e5.encode("query: " + query)
similarities = compute_similarities(query_emb, df_emb_e5)
# => Word Similarity
# => cat ⇔ cat 1.0000
# => cat ⇔ mouse 0.8972
# => cat ⇔ house 0.8827
# => cat ⇔ clock 0.8717
The returned values are really close to each other, whereas I would intuitively expect a much bigger difference. When I try the same with the sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model, I get a much bigger difference:
model_lm = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
df_emb_lm = pd.DataFrame(
    [[model_lm.encode(word)] for word in words],
    index=words, columns=["embeddings"])
query = "cat"
query_emb = model_lm.encode(query)
similarities = compute_similarities(query_emb, df_emb_lm)
# => Word Similarity
# => cat ⇔ cat 1.0000
# => cat ⇔ mouse 0.3129
# => cat ⇔ house 0.2636
# => cat ⇔ clock 0.0392
This corresponds much better to my intuitive understanding. I've tried applying sklearn.preprocessing.normalize and reducing the dimensions with PCA, but without much success. I have also used the passage: prefix, again without much success.
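For reference, the PCA attempt looked roughly like this (just a sketch; the number of components is arbitrary here and limited by the tiny word list):

import numpy as np
from sklearn.decomposition import PCA

# Project the query and word embeddings to a lower dimension before comparing.
all_embs = np.vstack([query_emb] + df_emb_e5["embeddings"].to_list())
reduced = PCA(n_components=3).fit_transform(all_embs)
query_red, word_reds = reduced[0], reduced[1:]

similarities_pca = {
    word: cosine_similarity([word_red], [query_red])[0][0]
    for word, word_red in zip(words, word_reds)
}

The idea was to drop directions shared by all embeddings, but it did not noticeably widen the gaps.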
I have found the GitHub issue where you state:
For tasks like text retrieval or semantic similarity, what matters is the relative order of the scores instead of the absolute values, so this is generally not an issue.
However, I'm curious why the results differ so much between these two models. I would be very grateful for any advice or pointers on how the cosine similarity computation could be improved.
When experimenting further, I've found that scaling the scores with sklearn.preprocessing.MinMaxScaler makes them much more aligned with the intuitive understanding of the differences. Does this approach make sense, in your opinion?
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Rescale the cosine scores to the [0, 1] range.
scaler = MinMaxScaler()
normalized_scores = scaler.fit_transform(
    np.array(list(similarities.values())).reshape(-1, 1)
).flatten()
normalized_similarities = {
    key: normalized_scores[i] for i, key in enumerate(similarities)
}
# => Word Similarity
# => cat ⇔ cat 1.0000
# => cat ⇔ mouse 0.1981
# => cat ⇔ house 0.0857
# => cat ⇔ clock 0.0000
The underlying reason is that we use a much lower temperature (0.01) for the InfoNCE loss, whereas sentence-transformers models use 0.05 by default. A lower temperature pushes the cosine scores to concentrate in a narrow range, but based on our preliminary experiments the ranking performance improves.
Many thanks for the explanation, @intfloat! In the end, I've decided to keep both the "original" and the normalized similarities on the slide, to prevent any confusion. The output matches the intuitive understanding quite well, I think:
Word Similarity
cat ⇔ cat 1.0000
cat ⇔ kitten 0.7378
cat ⇔ mouse 0.3186
cat ⇔ house 0.0781
cat ⇔ clock 0.0000
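For completeness, the two columns for the slide can be assembled along these lines (a sketch reusing the similarities dict from above; the column names are just for display):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

raw = pd.Series(similarities, name="Cosine similarity")
scaled = pd.Series(
    MinMaxScaler().fit_transform(raw.to_numpy().reshape(-1, 1)).flatten(),
    index=raw.index, name="Normalized",
)
print(pd.concat([raw, scaled], axis=1).sort_values("Normalized", ascending=False))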