Embeddings not working with Langchain HuggingFaceEmbeddings

#11
by weissenbacherpwc - opened

Hi,

I am using the embeddings locally and found that the retriever does not work correctly with the Jina embeddings.
I set up my retriever as follows:

retriever = vectordb.as_retriever(search_kwargs={'k': 3, 'score_threshold': 0.75, 'sorted': True}, search_type="similarity_score_threshold")

With all other embedding models I am using, this computes cosine similarity. With jina-embeddings-v2-base-de, however, I only get results back if I lower the threshold to something like 0.2, and sometimes the scores are even negative. Could this maybe be cosine distance instead?
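For context, for unit-length vectors cosine similarity s and L2 distance d are related by d^2 = 2 * (1 - s), so if the retriever is actually thresholding on a distance (or an unnormalized score) rather than cosine similarity, a 0.75 cutoff filters out almost everything. A tiny numpy sketch with made-up vectors to illustrate the relation:

import numpy as np

# Two unit-length vectors with a moderate angle between them (illustrative values only)
a = np.array([1.0, 0.0])
b = np.array([0.6, 0.8])

cos_sim = float(a @ b)                  # cosine similarity: 0.6
l2_dist = float(np.linalg.norm(a - b))  # L2 distance: ~0.894

# For unit vectors: ||a - b||^2 == 2 * (1 - cos_sim)
print(cos_sim, l2_dist, np.isclose(l2_dist ** 2, 2 * (1 - cos_sim)))  # --> 0.6 0.894... True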

Here is my code for replication:
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Load all PDFs from the directory and split them into chunks
loader = DirectoryLoader(directory_path,
                         glob='*.pdf',
                         loader_cls=PyPDFLoader)
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size,
                                               chunk_overlap=chunk_overlap)
texts = text_splitter.split_documents(documents)

# Embed the chunks and build the FAISS index
embeddings = HuggingFaceEmbeddings(model_name=embedding_name,
                                   model_kwargs={'device': 'mps', 'trust_remote_code': True},
                                   encode_kwargs={'device': 'mps', 'normalize_embeddings': True})
vectordb = FAISS.from_documents(texts, embeddings)
retriever = vectordb.as_retriever(search_kwargs={'k': 3, 'score_threshold': 0.75, 'sorted': True},
                                  search_type="similarity_score_threshold")
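
To see the raw scores behind the retriever, something like the following can help (a sketch only; similarity_search_with_score on the FAISS store returns the store's underlying score for each hit, and the query string is just an example):

# Example query to inspect the raw scores the retriever filters on
docs_and_scores = vectordb.similarity_search_with_score("Wie ist das Wetter heute?", k=3)
for doc, score in docs_and_scores:
    print(score, doc.page_content[:80])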

Jina AI org
edited Mar 4

Hi @weissenbacherpwc

I tried to replicate your problem of negative similarity scores using our embedding model but did not succeed. However, it seems that the embeddings are not in fact normalized when I run your above code. Here is a small working example to illustrate:

import numpy as np
from langchain_community.embeddings import HuggingFaceEmbeddings

texts = [
    'How is the weather today?',
    'Wie ist das Wetter heute?',
    'Wie geht es dir?',
    'How are you doing?'
]

model = HuggingFaceEmbeddings(model_name="jinaai/jina-embeddings-v2-base-de",
                              model_kwargs={'device': 'mps', 'trust_remote_code': True},
                              encode_kwargs={'device': 'mps', 'normalize_embeddings': True})
embeddings = model.client.encode(texts)

# Compute the L2 norm of each embedding vector
l2_norms = np.linalg.norm(embeddings, axis=1)

print(l2_norms)
# To check all embeddings systematically, you can use:
are_normalized = np.allclose(l2_norms, 1, atol=1e-6)
print(f"All embeddings normalized: {are_normalized}")  # --> this prints False

This shows that the embeddings are not normalized, even though we set normalize_embeddings=True in encode_kwargs. I suspect these kwargs are not being passed along correctly. If I change the encode call to the following, the embeddings are actually normalized:

embeddings = model.client.encode(texts, normalize_embeddings=True)
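
Re-running the same norm check on these embeddings then confirms they are unit length:

l2_norms = np.linalg.norm(embeddings, axis=1)
print(np.allclose(l2_norms, 1, atol=1e-6))  # --> True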

So it seems the parameter is not passed through correctly via encode_kwargs. I will do some more debugging, but as far as I can see, this is not an issue on our embedding model's side, but rather with the LangChain pipeline.
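
As a possible interim workaround (a sketch only, assuming the standard LangChain Embeddings interface with embed_documents/embed_query; the wrapper class name is made up), you could normalize the vectors yourself before they reach the vector store:

import numpy as np
from langchain_community.embeddings import HuggingFaceEmbeddings

class NormalizedHuggingFaceEmbeddings(HuggingFaceEmbeddings):
    """Hypothetical wrapper that L2-normalizes the vectors returned by the base class."""

    def embed_documents(self, texts):
        vectors = np.array(super().embed_documents(texts))
        vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
        return vectors.tolist()

    def embed_query(self, text):
        vector = np.array(super().embed_query(text))
        return (vector / np.linalg.norm(vector)).tolist()

With this, cosine-style relevance scores should land in the expected range regardless of whether encode_kwargs is forwarded correctly.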
