Saving embeddings

#1
by Lue-C - opened

Hi there,

I was trying out the model using the code below:

embeddings = SentenceTransformerEmbeddings(cache_folder=embedding_folder)
chroma_client = chromadb.PersistentClient(path=db_path)
db = Chroma(persist_directory=db_path, embedding_function=embeddings, client=chroma_client)
results_scores = db.similarity_search_with_relevance_scores(query, k=5)

but I get negative scores such as -9.47 from the similarity search. The vector DB was created beforehand and stored at db_path. The model was downloaded and stored at embedding_folder.
I presume the negative scores are due to different normalizations, but I am not sure. Because of this I tried the procedure described in the model card, and it works fine. Not only are the scores as expected, but the ordering of the results is different too. I could use this approach, but it would mean computing the vector embeddings for my document chunks (there are many of them) on the fly, which is not feasible in my case.

Is there a way to calculate the vector embeddings of my text chunks, store them and calculate the similarity afterwards?
Or is there something I am missing in the code above?

Regards

Dell Research Harvard org

Hi!

Thanks for engaging with our model!
First of all, it is absolutely possible to build an offline index and use it for similarity search. As for your actual code, I have never used Chroma in any of our workflows, so the API is a bit unfamiliar to me. However, from a cursory look, it appears that the default distance metric could be Euclidean. A Euclidean distance itself is never negative, but a relevance score derived from it can be, especially if the embeddings are not normalized.
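As a rough illustration of how a distance-based score can go negative, here is a minimal sketch in plain Python. The conversion `1 - d / sqrt(2)` is an assumption made for this example (one way a wrapper might turn a Euclidean distance into a relevance score); it has not been confirmed in this thread as what Chroma or its wrapper actually does. The vectors and function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def relevance_from_distance(d):
    # ASSUMPTION: hypothetical distance-to-score conversion, for illustration only
    return 1.0 - d / math.sqrt(2)

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Two made-up, unnormalized "embeddings"
a = [3.0, 4.0]
b = [-6.0, 8.0]

# Large vector norms give a large distance, which drives the score below zero
raw_score = relevance_from_distance(euclidean(a, b))

# After normalization the distance is at most 2, so the score stays in a narrow range
unit_score = relevance_from_distance(euclidean(normalize(a), normalize(b)))

print(raw_score, unit_score)
```

If the stored vectors and the query vector are normalized differently, both the magnitude of the scores and the ranking can change, which would be consistent with the different ordering described in the question.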

Please think of this model as a sentence transformer model, so it can be used in any similarity search framework. We prefer FAISS (which forms the backbone of most high-level packages for vector search). I would encourage you to check out the library and migrate your code to it if you prefer it: https://github.com/facebookresearch/faiss/tree/main. The switch is not very costly. In fact, our package https://linktransformer.github.io/ wraps SentenceTransformers and FAISS to provide a dataframe-based approach to similarity search (but on the fly, which is not your use case). FAISS supports both GPU and CPU search, so do shop around for the best solution for you.

In case you do want to switch to the framework we work with:

Just to give you a sense of how this would work (please treat this as pseudocode; I didn't run it, so it is not guaranteed to work off the shelf and you may need to tinker with it).
Note the use of IndexFlatIP (IP stands for inner product). Once the embeddings are normalized, the inner product equals the cosine similarity.
'Flat' means exact search. FAISS also has approximate search indices, so feel free to experiment. The code would need modification to do GPU search (refer to the FAISS docs).
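To see why normalization makes the inner product behave like cosine similarity, here is a small sketch in plain Python with no FAISS dependency. The vectors and helper names are made up for illustration.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Return a copy of v scaled to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity computed directly from its definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [1.0, 2.0, 2.0]
b = [2.0, 0.0, 1.0]

ip_raw = dot(a, b)                                      # unbounded, depends on vector length
ip_normalized = dot(l2_normalize(a), l2_normalize(b))   # always within [-1, 1]
cos = cosine(a, b)

print(ip_raw, ip_normalized, cos)
```

The raw inner product depends on the lengths of the vectors, while the normalized one matches the cosine similarity, so scores from an inner-product index over normalized vectors stay between -1 and 1.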

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Initialize the model
model = SentenceTransformer('dell-research-harvard/lt-wikidata-comp-multi')

# Sample corpus
corpus = ["Apple Inc.", "Google", "Microsoft"]

# Encode the corpus
corpus_embeddings = model.encode(corpus)

# Normalize the embeddings in place so that the inner product equals cosine similarity, which is bounded by 1
faiss.normalize_L2(corpus_embeddings)

# Dimension of our vectors
d = corpus_embeddings.shape[1]

# Creating a FAISS index for inner product
index = faiss.IndexFlatIP(d)  # Use IndexFlatIP to search with inner product

# Adding normalized corpus embeddings to the index
index.add(corpus_embeddings)

# Save the index to disk. This is what persists the embeddings; skip it if you compute everything on the fly
faiss.write_index(index, 'corpus.index')

Later, load the saved index and run queries against it:

# Load the index
index = faiss.read_index('corpus.index')

# Sample query
query = "Apple"  # For several queries, pass a list of strings to model.encode instead

# Encode the query
query_embedding = model.encode([query])

# Normalize the query embedding
faiss.normalize_L2(query_embedding)

# Search in the index using inner product
k = 2  # Number of top records to retrieve
distances, indices = index.search(query_embedding, k)

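# Print the results, or use them as needed
print("Query:", query)
print("Top results from the corpus:")
# With normalized vectors and IndexFlatIP, the returned "distances" are cosine similarities.
# The loop variable is named corpus_idx so it does not shadow the FAISS `index` object.
for rank, (score, corpus_idx) in enumerate(zip(distances[0], indices[0]), start=1):
    print(f"{rank}: {corpus[corpus_idx]} (Similarity: {score})")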
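One detail for the offline use case: a flat FAISS index stores only the vectors, and the loop above maps the returned positions back to text through the `corpus` list. The list therefore has to be saved too, in the same order in which the embeddings were added. A minimal sketch using a JSON file follows; the file name, helper functions, and example positions are hypothetical.

```python
import json
import os
import tempfile

corpus = ["Apple Inc.", "Google", "Microsoft"]

def save_corpus(texts, path):
    """Write the texts in the same order as their embeddings were added to the index."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(texts, f, ensure_ascii=False)

def load_corpus(path):
    """Read the texts back; JSON lists preserve order."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Hypothetical location; in practice keep this file next to 'corpus.index'
corpus_path = os.path.join(tempfile.mkdtemp(), "corpus.json")
save_corpus(corpus, corpus_path)

# In a later session: reload the texts and map search positions back to them
restored = load_corpus(corpus_path)
example_indices = [0, 2]  # stand-in for one row of the positions returned by index.search
hits = [restored[i] for i in example_indices]
print(hits)
```

Keeping the two files together means the index and the text lookup can both be reloaded without re-encoding any document chunks.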
96abhishekarora changed discussion status to closed
