Semantic Search

#14
by dilolo - opened

First of all, I want to thank you for such a great product.
I want to use your model for semantic search inside text documents. I have 50,000 text documents and I want to use your model to find the document that best matches the query.
My search led me to Elasticsearch.
Is it possible to integrate Elasticsearch with your model so that the sentences matching a query can be found among these documents?

Hi @dilolo ,

Recent versions of Elasticsearch support embedding-based retrieval. All you need to do is store the embeddings generated by a multilingual-e5-* model in an Elasticsearch index and search against them.

You might be interested in this article: https://www.elastic.co/blog/introducing-approximate-nearest-neighbor-search-in-elasticsearch-8-0
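To make the setup above concrete, here is a minimal sketch of the index mapping and kNN search body for Elasticsearch 8.x. The field name `embedding` and the 768 dimensions (multilingual-e5-base) are assumptions; adjust them for your model variant.

```python
def make_mapping(dims: int = 768) -> dict:
    """Index mapping with a dense_vector field for approximate kNN search.
    768 dims assumes multilingual-e5-base; e5-large uses 1024."""
    return {
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "embedding": {
                    "type": "dense_vector",
                    "dims": dims,
                    "index": True,
                    "similarity": "cosine",
                },
            }
        }
    }

def make_knn_query(query_vector: list, k: int = 10) -> dict:
    """Search body using the top-level knn section (Elasticsearch 8.x)."""
    return {
        "knn": {
            "field": "embedding",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 100,
        },
        "_source": ["text"],
    }
```

You would pass these dicts to the `elasticsearch` Python client's `indices.create` and `search` calls respectively; one E5-specific detail worth noting is that queries should be embedded with a `"query: "` prefix, per the model card.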

How can I fine-tune multilingual-e5 on my dataset? I have triples in the format: anchor, positive, negative.

@dilolo , also check this article (and notebook) from Elastic, which has a full implementation, using the E5 model: https://www.elastic.co/search-labs/blog/articles/multilingual-vector-search-e5-embedding-model

Note that the article uses an internal instance of the model, i.e. running directly in the cluster, with an inference pipeline to create embeddings automatically when indexing documents. This is not strictly necessary: you can create embeddings before you index the document and pass the payload to Elasticsearch. Here's an example implementation: https://nb.karmi.cz/semantic-search-with-elasticsearch/#Indexing-the-Data. For search, you would use the knn search type, as @intfloat suggests.
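A minimal sketch of that client-side approach, generating embeddings before indexing. The action format matches the `elasticsearch.helpers.bulk` helper; the `embed` callable, the index name, and the `embedding` field name are assumptions:

```python
from typing import Callable, Iterable, Iterator, List

def build_bulk_actions(
    docs: Iterable[str],
    embed: Callable[[str], List[float]],
    index: str = "documents",
) -> Iterator[dict]:
    """Embed each document client-side and yield bulk-indexing actions.

    `embed` would typically wrap a SentenceTransformer; E5 models expect
    documents to carry a "passage: " prefix, which is added here.
    """
    for i, text in enumerate(docs):
        yield {
            "_index": index,
            "_id": i,
            "_source": {
                "text": text,
                "embedding": embed("passage: " + text),
            },
        }
```

You would then pass the generator to `elasticsearch.helpers.bulk(client, build_bulk_actions(...))`; since the embeddings arrive pre-computed, no ingest pipeline is needed in the cluster.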

Also note that you need to split your documents into chunks so they fit into the model's context window. High-level libraries like LlamaIndex provide components for that; see e.g. https://docs.llamaindex.ai/en/stable/optimizing/basic_strategies/basic_strategies.html#chunk-sizes
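If you'd rather avoid a dependency, a simple overlapping word-based chunker is often enough as a starting point. This is a rough sketch: word counts only approximate the model's 512-token limit, and the sizes here are arbitrary assumptions.

```python
from typing import List

def chunk_words(text: str, size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into overlapping chunks of roughly `size` words.

    Overlap keeps sentences that straddle a boundary retrievable from at
    least one chunk. Tune `size` so chunks stay under the model's
    512-token context window (words are a crude proxy for tokens).
    """
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks
```

Each chunk would then be embedded and indexed as its own document (keeping a reference to the parent document), so a query matches the most relevant passage rather than the whole file.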

@karmiq Thank you very much

How can I fine-tune multilingual-e5 on my dataset? I have triples in the format: anchor, positive, negative.

I didn't do any fine-tuning; I just used this model as-is to vectorize my documents.
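For anyone who does want to fine-tune on (anchor, positive, negative) triples, the sentence-transformers library supports this format directly via `TripletLoss`. A sketch under stated assumptions: the example triplet, batch size, and output path are made up, and the `query:`/`passage:` prefixes follow the E5 model card.

```python
from typing import List, Tuple

def to_e5_texts(triplets: List[Tuple[str, str, str]]) -> List[List[str]]:
    """Add the query:/passage: prefixes that E5 models expect."""
    return [
        [f"query: {anchor}", f"passage: {pos}", f"passage: {neg}"]
        for anchor, pos, neg in triplets
    ]

def finetune(triplets, model_name="intfloat/multilingual-e5-base", epochs=1):
    """Fine-tune an E5 model on triplets with sentence-transformers' TripletLoss.

    Imports are deferred so the rest of this sketch works without the
    library installed; calling this requires sentence-transformers + torch.
    """
    from sentence_transformers import SentenceTransformer, InputExample, losses
    from torch.utils.data import DataLoader

    model = SentenceTransformer(model_name)
    examples = [InputExample(texts=t) for t in to_e5_texts(triplets)]
    loader = DataLoader(examples, shuffle=True, batch_size=16)
    loss = losses.TripletLoss(model)
    model.fit(train_objectives=[(loader, loss)], epochs=epochs)
    return model
```

Usage would be e.g. `finetune([("how do I reset my password?", "To reset your password, ...", "Shipping takes 3-5 days.")])`, after which the returned model can be saved with `model.save(...)` and used for embedding as before.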
