Spaces: Tonic/e5

Showcase sentence similarity or another downstream task

by osanseviero - opened Jan 16, 2024

Jan 16, 2024

This is cool! Maybe you could do something like sentence similarity? E.g. allow users to write 3-4 inputs and then compute distances. That might be easier to understand and test out than outputting the embeddings. See the widget in https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 as an example
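A minimal sketch of the "write 3-4 inputs and compute distances" idea, in plain Python. The three-dimensional vectors below are hand-made stand-ins, not real model output; in the Space they would come from the embedding model. The helper names `cosine_similarity` and `pairwise_similarities` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pairwise_similarities(embeddings):
    """Return an n x n matrix of cosine similarities."""
    n = len(embeddings)
    return [
        [cosine_similarity(embeddings[i], embeddings[j]) for j in range(n)]
        for i in range(n)
    ]

# ASSUMPTION: toy vectors standing in for real sentence embeddings.
toy_embeddings = {
    "How do I bake bread?": [0.9, 0.1, 0.0],
    "What is a good bread recipe?": [0.8, 0.2, 0.1],
    "How do I fix a flat tire?": [0.0, 0.2, 0.9],
}

sentences = list(toy_embeddings)
matrix = pairwise_similarities([toy_embeddings[s] for s in sentences])

# Print one row of similarities per input sentence.
for sentence, row in zip(sentences, matrix):
    print(sentence, [round(value, 3) for value in row])
```

Showing a small matrix like this (or 1 minus each value, as a distance) is probably easier for visitors to interpret than a raw embedding vector.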

osanseviero changed discussion title from Sentence similarity to Showcase sentence similarity or another downstream task Jan 16, 2024

tomaarsen

Jan 17, 2024

That would be cool!
Another interesting idea might be to take a corpus of sentences (e.g. 10k texts from quora questions), embed them all and store them in a HF dataset. Then, you can perform semantic search on the embedding from the user & the corpus of embeddings, and show the user which e.g. 5 sentences are most similar to the input text according to the embedding model.

What do you think?
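The corpus idea above could look roughly like this: embed the corpus once, keep the normalized vectors around, and rank them against each query. This sketch uses only the standard library and random vectors in place of real embeddings; the corpus size, `DIM`, and the helpers `normalize` and `top_k` are assumptions for illustration, and storing the vectors in a HF dataset is left out.

```python
import heapq
import math
import random

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query, corpus, k=5):
    """Return (corpus_index, score) pairs for the k most similar corpus vectors."""
    q = normalize(query)
    scores = (
        (sum(a * b for a, b in zip(q, doc)), idx)
        for idx, doc in enumerate(corpus)
    )
    return [(idx, score) for score, idx in heapq.nlargest(k, scores)]

random.seed(0)
DIM = 16  # ASSUMPTION: tiny dimension to keep the sketch fast

# ASSUMPTION: random vectors stand in for precomputed corpus embeddings.
corpus = [normalize([random.gauss(0, 1) for _ in range(DIM)]) for _ in range(1000)]

# A query that is a slightly noisy copy of corpus entry 42.
query = [x + random.gauss(0, 0.01) for x in corpus[42]]

results = top_k(query, corpus, k=5)
for idx, score in results:
    print(idx, round(score, 4))
```

In a real demo the indices would map back to the original texts, so the user sees the five most similar sentences rather than row numbers.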

Tonic

Owner Jan 17, 2024

i like both the preloading and the sentence similarity ideas, so i'll do both and add them as functions here :-) thanks for the great ideas folks, help is more than welcome 🤗

currently i'm building connectors for this endpoint in a RAG app here: https://github.com/tonic-ai/scitonic, so i think i need a connector for Qdrant and AutoGen ... hmm, it seems this demo will be done before that app is done though xD

Tonic

Owner Jan 18, 2024

ok, @osanseviero's issue is pushed + "resolved" (a bit of formatting and some examples might make it nicer) :-)

@tomaarsen, i really want to start on your issue now. i was going to try using Chroma in memory as a neat addition, but the coolest would be if we could get the Hugging Face dataloader and maybe OAuth working somehow ... do you have time, thoughts or pointers about this? i want to do a good job here, that's why i'm asking. maybe i should work on batch processing...

tomaarsen

Jan 18, 2024

It depends on how big you want to make it - but I would personally just take 10k texts, turn them into embeddings & just keep them in memory as a (10000, 4096) torch tensor. Then you can use semantic_search from Sentence Transformers, which directly works on embedding tensors.

I'm not 100% sure, but I think that'll be sufficiently fast.

tomaarsen

Jan 18, 2024

Locally I can embed a sentence & get the top 5 embeddings out of a corpus of 100k embeddings (that are all just in memory) in 0.091 seconds on CPU. I don't think you strictly have to use vector databases for this example. I used this example: https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/semantic-search/semantic_search_quora_pytorch.py

Tonic

Owner Jan 18, 2024

hey, thank you for this. now i really want to make loaders for Quora and more. check out the current one: a JSON loader U_U ... but it's useful if you build with it, which is nice 🤗

any more tips, ideas, sources, or cool datasets you know of that we could show off with this one? the embeddings refer to certain types of tasks, so now i want to make an "intention mapper" that will select the correct type of embedding based on the user query or something (not for each text chunk... although... :-) ) 🚀

Tonic

Owner Jan 21, 2024

hey @tomaarsen, i'm starting the task of preloading the Quora questions. turns out i could find all the datasets e5 was trained on, so i will process them to fuse some parts of the answers, then i will serve each of them in its own tab, meaning folks can try querying without preloading anything at all. that's a cool demo :-)
