Embeddings not working with Langchain HuggingFaceEmbeddings
Hi,
I am running the embeddings locally and found that the retriever does not work correctly with the Jina embeddings.
I set up my retriever as follows:
```python
retriever = vectordb.as_retriever(
    search_kwargs={'k': 3, 'score_threshold': 0.75, 'sorted': True},
    search_type="similarity_score_threshold",
)
```
With all other embedding models I am using, this computes cosine similarity. However, with jina-embeddings-v2-base-de I only get results back if I lower the threshold to a low value like 0.2, and sometimes the scores are even negative. Could these be cosine distances instead of similarities?
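For reference, this is how I look at the raw scores (just a sketch; `query` stands in for one of my test questions, and `vectordb` is the FAISS store from the replication code below):

```python
# Compare FAISS's raw scores (distances by default) with the
# relevance scores LangChain derives from them.
query = "Wie ist das Wetter heute?"  # placeholder test question

for doc, dist in vectordb.similarity_search_with_score(query, k=3):
    print(f"raw score: {dist:.4f}  |  {doc.page_content[:60]!r}")

for doc, score in vectordb.similarity_search_with_relevance_scores(query, k=3):
    print(f"relevance score: {score:.4f}  |  {doc.page_content[:60]!r}")
```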
Here is my code for replication:
```python
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = DirectoryLoader(directory_path,
                         glob='*.pdf',
                         loader_cls=PyPDFLoader)
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size,
                                               chunk_overlap=chunk_overlap,
                                               # length_function=len
                                               )
texts = text_splitter.split_documents(documents)
embeddings = HuggingFaceEmbeddings(model_name=embedding_name,
                                   model_kwargs={'device': 'mps', 'trust_remote_code': True},
                                   encode_kwargs={'device': 'mps', 'normalize_embeddings': True})
vectordb = FAISS.from_documents(texts, embeddings)
retriever = vectordb.as_retriever(search_kwargs={'k': 3, 'score_threshold': 0.75, 'sorted': True},
                                  search_type="similarity_score_threshold")
```
I tried to replicate your problem of negative similarity scores using our embedding model but did not succeed. However, it seems that the embeddings are in fact not normalized when I run your code above. Here is a small working example to illustrate:
```python
import numpy as np
from langchain_community.embeddings import HuggingFaceEmbeddings

texts = [
    'How is the weather today?',
    'Wie ist das Wetter heute?',
    'Wie geht es dir?',
    'How are you doing?'
]
model = HuggingFaceEmbeddings(model_name="jinaai/jina-embeddings-v2-base-de",
                              model_kwargs={'device': 'mps', 'trust_remote_code': True},
                              encode_kwargs={'device': 'mps', 'normalize_embeddings': True})
embeddings = model.client.encode(texts)

# Compute the L2 norm of each embedding vector
l2_norms = np.linalg.norm(embeddings, axis=1)
print(l2_norms)

# To check all embeddings systematically, you can use:
are_normalized = np.allclose(l2_norms, 1, atol=1e-6)
print(f"All embeddings normalized: {are_normalized}")  # --> this prints False
```
This shows that the embeddings are not normalized, even though we set `normalize_embeddings=True` in the `encode_kwargs`. I suspect these kwargs may not be passed along correctly. If I change the encode call to the following, the embeddings are actually normalized:
```python
embeddings = model.client.encode(texts, normalize_embeddings=True)
```
So it seems the parameter is not passed along correctly via `encode_kwargs`. I will do some more debugging, but as far as I can see this is not an issue on our embedding model's side, but rather with the LangChain pipeline.
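In the meantime, a possible workaround (just a sketch on my side, not part of any official integration; the class name `NormalizedHFEmbeddings` is my own) is to L2-normalize the vectors yourself before they go into FAISS, for example with a small wrapper around `HuggingFaceEmbeddings`:

```python
import numpy as np
from langchain_community.embeddings import HuggingFaceEmbeddings

class NormalizedHFEmbeddings(HuggingFaceEmbeddings):
    """Hypothetical wrapper that L2-normalizes whatever the base class returns."""

    def embed_documents(self, texts):
        vecs = np.asarray(super().embed_documents(texts), dtype=np.float32)
        vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
        return vecs.tolist()

    def embed_query(self, text):
        vec = np.asarray(super().embed_query(text), dtype=np.float32)
        return (vec / np.linalg.norm(vec)).tolist()
```

If you build the vector store with this class in place of `HuggingFaceEmbeddings`, the stored vectors have unit norm regardless of whether `encode_kwargs` are honored.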