Cosine Similarity is very high with Japanese Sentences

#18
by Kizaburo - opened

I'm using the model as shown in the code below.

'''
from sentence_transformers import SentenceTransformer
from torch import nn
import torch
import numpy as np

st = SentenceTransformer("xlm-roberta-base")

# a: "It's a great product. I'll buy it again and again." (positive review)
a = "ęœ€é«˜ć®å•†å“ć§ć™ć­ć€‚ä½•å›žć‚‚ćƒŖćƒ”ćƒ¼ćƒˆć—ć¾ć™ć€‚"
# b: "Hard to use! I will never buy it again." (negative review)
b = "ä½æ恄恄悉恄ļ¼äŗŒåŗ¦ćØč²·ć„ć¾ć›ć‚“ć€‚"

# encode() returns numpy arrays; convert them to float32 torch tensors
x = st.encode(a)
y = st.encode(b)
x = torch.from_numpy(x.astype(np.float32)).clone()
y = torch.from_numpy(y.astype(np.float32)).clone()

cos = nn.CosineSimilarity(dim=0, eps=1e-6)
print(cos(x, y))
'''

These sentences (a and b) are not similar, but the cosine similarity score is 0.9980.
I'm very new to NLP, so I don't know how to resolve this problem.
I hope someone can help me.
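For what it's worth, sentence-transformers can return torch tensors directly and also ships a cosine-similarity helper, so the numpy round-trip above isn't required. A minimal sketch, assuming a sentence-transformers version that provides util.cos_sim and the convert_to_tensor flag:

'''
from sentence_transformers import SentenceTransformer, util

st = SentenceTransformer("xlm-roberta-base")

# Same two sentences as above: a positive review and a negative review.
a = "ęœ€é«˜ć®å•†å“ć§ć™ć­ć€‚ä½•å›žć‚‚ćƒŖćƒ”ćƒ¼ćƒˆć—ć¾ć™ć€‚"
b = "ä½æ恄恄悉恄ļ¼äŗŒåŗ¦ćØč²·ć„ć¾ć›ć‚“ć€‚"

# Return torch tensors directly instead of numpy arrays.
x = st.encode(a, convert_to_tensor=True)
y = st.encode(b, convert_to_tensor=True)

# util.cos_sim returns a 1x1 tensor holding the cosine similarity.
print(util.cos_sim(x, y))
'''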

Since you are using SentenceTransformer, try asking at https://github.com/UKPLab/sentence-transformers/issues
I used SentenceTransformer('paraphrase-multilingual-mpnet-base-v2') and got a cosine similarity of 0.43 for the two sentences in your example.
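A minimal sketch of that check, reusing the sentence pair from the original post (the exact score may vary slightly across library versions):

'''
from sentence_transformers import SentenceTransformer, util

# Same sentence pair as in the original post, but with the model suggested above.
st = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

x = st.encode("ęœ€é«˜ć®å•†å“ć§ć™ć­ć€‚ä½•å›žć‚‚ćƒŖćƒ”ćƒ¼ćƒˆć—ć¾ć™ć€‚", convert_to_tensor=True)
y = st.encode("ä½æ恄恄悉恄ļ¼äŗŒåŗ¦ćØč²·ć„ć¾ć›ć‚“ć€‚", convert_to_tensor=True)

# Reported above as roughly 0.43, i.e. the sentences are correctly seen as dissimilar.
print(util.cos_sim(x, y))
'''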

I am having a similar experience comparing Russian and English sentences. A pair of sentences that are completely different in meaning yields very high cosine similarity with this particular model (xlm-roberta-base), but as astyperand points out, with a different model the approach works as expected, i.e. high cosine similarity for sentences with similar meaning and low for those with dissimilar meaning. I would like to know whether this is a reflection of the model or whether I am not using it in the way intended.
