Showcase sentence similarity or another downstream task

#1
by osanseviero HF staff - opened

This is cool! Maybe you could do something like sentence similarity? E.g. allow users to write 3-4 inputs and then compute distances. That might be easier to understand and test out than outputting the embeddings. See the widget in https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 as an example
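For illustration, the "write 3-4 inputs and compute distances" idea can be sketched as below. This is a minimal sketch, not the Space's actual code: the tiny 3-dimensional vectors are hypothetical stand-ins for real model embeddings, and `rank_by_similarity` is a made-up helper name.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(source, candidates):
    """Return (index, score) pairs for the candidates, most similar first."""
    scores = [(i, cosine_similarity(source, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical embeddings; a real demo would get these from the embedding model.
source_embedding = [1.0, 0.0, 1.0]
candidate_embeddings = [
    [1.0, 0.1, 0.9],    # points almost the same way as the source
    [0.0, 1.0, 0.0],    # orthogonal to the source
    [-1.0, 0.0, -1.0],  # points the opposite way
]

ranking = rank_by_similarity(source_embedding, candidate_embeddings)
for index, score in ranking:
    print(f"candidate {index}: {score:.3f}")
```

Showing the ranked scores instead of the raw vectors is what makes the output easy to check by eye.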

osanseviero changed discussion title from Sentence similarity to Showcase sentence similarity or another downstream task

That would be cool!
Another interesting idea might be to take a corpus of sentences (e.g. 10k texts from Quora questions), embed them all, and store them in a HF dataset. Then you can run semantic search between the embedding of the user's input and the corpus embeddings, and show the user which sentences (e.g. the top 5) are most similar to the input text according to the embedding model.

What do you think?

Owner

i like both the preloading and the sentence similarity ideas, so i'll do both and add them as fn here :-) thanks for the great ideas folks, help is more than welcome 🤗

currently i'm building connectors for this endpoint in a cool RAG app here: https://github.com/tonic-ai/scitonic , so i think i need a connector for qdrant and autogen ... hmm, it seems this demo will be done before that app is done though xD

Owner

ok, @osanseviero's issue is pushed + "resolved" (a bit of formatting and some examples might make it nicer) :-)

@tomaarsen, i really want to start on your issue now. i was going to try chroma in memory as a neat addition, but the coolest would be if we could get the huggingface dataloader and maybe oauth somehow ... do you have time, thoughts or pointers about this? i wanna do a good job here, that's why. maybe i should work on batch processing...

It depends on how big you want to make it - but I would personally just take 10k texts, turn them into embeddings & just keep them in memory as a (10000, 4096) torch tensor. Then you can use semantic_search from Sentence Transformers, which directly works on embedding tensors.

I'm not 100% sure, but I think that'll be sufficiently fast.

Locally I can embed a sentence & get the top 5 embeddings out of a corpus of 100k embeddings (that are all just in memory) in 0.091 seconds on CPU. I don't think you strictly have to use vector databases for this example. I used this example: https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/semantic-search/semantic_search_quora_pytorch.py
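As a rough sketch of the in-memory approach described above: normalize the corpus once, score every corpus vector against the query, and keep the top k. This is a pure-Python stand-in for the brute-force idea, not the Sentence Transformers implementation. The 2-dimensional toy vectors and the names `normalize` and `top_k` are hypothetical, and the footprint estimate assumes float32 storage.

```python
import heapq
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query, normalized_corpus, k=5):
    """Brute-force search: score every corpus vector, return the k best (index, score)."""
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, doc)), i)
        for i, doc in enumerate(normalized_corpus)
    )
    return [(i, score) for score, i in heapq.nlargest(k, scored)]

# Hypothetical toy corpus; a real one would hold the 10k precomputed embeddings.
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
normalized_corpus = [normalize(v) for v in corpus]  # done once, kept in memory

hits = top_k([2.0, 0.1], normalized_corpus, k=2)
print(hits)

# Memory needed for a (10000, 4096) matrix, assuming 4 bytes per float32 value.
bytes_needed = 10_000 * 4096 * 4
print(f"{bytes_needed / 1e6:.0f} MB")
```

At roughly 164 MB for 10k float32 embeddings of size 4096, the whole corpus fits in RAM, which is why a vector database is optional at this scale.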

Owner

hey, thank you for this, now i really want to make loaders for like quora and more. check out the current one: json loader U_U ... but it's useful if you build with it, which is nice 🤗

any more tips, ideas, sources, or cool datasets you know we could show off with this one? the embeddings refer to some types of things, so now i want to make an "intention mapper" that will select the correct type of embedding based on the user query or something (not for each text chunk... although... :-) ) 🚀

Owner

hey @tomaarsen, i'm starting the task on preloading quora questions. turns out i could find all the datasets e5 was trained on, so i will process them to fuse some of the answer's parts, then serve each of them in a tab, meaning folks can try querying without preloading anything at all. that's a cool demo :-)
