Fine-tuning CLIP model for image-image search

#23
by AFRF

Hi all, I've been working on image-image search tasks and CLIP has worked really well for me. Now I want to push the performance of my approach further, and I was thinking of fine-tuning the CLIP model for this task. Currently I just generate embeddings for my images, store them in a vector index, and then compute the cosine similarity between the embedding of my query image and all the embeddings in the index. I'm not using any zero-shot application or image-text comparison, yet all the fine-tuning approaches for CLIP I've read about use text-image pairs. I don't understand how I should fine-tune the model to improve my application: should I use text-image pairs, or should I only fine-tune the visual encoder? If it's the latter, does anyone have examples of how to do it?
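
For reference, here is a minimal sketch of the retrieval pipeline described above (image embeddings plus cosine similarity), assuming the `openai/clip-vit-base-patch32` checkpoint and the `transformers` library; the image paths and the tiny in-memory "index" are placeholders, not part of the original setup.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def embed_images(paths):
    """Return L2-normalized CLIP image embeddings for a list of image paths."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# Build the "vector index" as a matrix of embeddings for the gallery images.
gallery_paths = ["img_001.jpg", "img_002.jpg", "img_003.jpg"]  # placeholder paths
gallery = embed_images(gallery_paths)

# Cosine similarity between the query embedding and every indexed embedding
# reduces to a dot product because the vectors are normalized.
query = embed_images(["query.jpg"])  # placeholder path
scores = query @ gallery.T
best = scores.argmax(dim=-1).item()
print(f"Closest match: {gallery_paths[best]} (score {scores[0, best]:.3f})")
```

In practice the gallery matrix would live in a proper vector index (e.g. FAISS) rather than a single tensor, but the similarity computation is the same.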

Why do you use this model?

Hey AFRF, can you tell me how I can use this model to compare the similarity between an image and a text (sentence), rather than providing a bunch of classes?
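
A hedged sketch of one way to do this with CLIP, assuming the `openai/clip-vit-base-patch32` checkpoint and the `transformers` library; the file name and the sentence are placeholders. Instead of passing a list of class prompts, you embed a single image and a single sentence and compute their cosine similarity directly.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

image = Image.open("photo.jpg").convert("RGB")  # placeholder image
sentence = "a dog playing in the snow"          # placeholder sentence

inputs = processor(text=[sentence], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# Normalize both embeddings, then the dot product is the cosine similarity;
# no list of classes is needed.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarity = (image_emb @ text_emb.T).item()
print(f"Image-text similarity: {similarity:.3f}")
```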
