patrickjohncyh/fashion-clip · Fine Tuning fashionCLIP for image-image search

AFRF

Nov 29, 2023

Hi all, I've been working on image-to-image search tasks and fashionCLIP has worked really well for me. I now want to push the performance of my approach further, and I was thinking of fine-tuning the fashionCLIP model for this task. My current pipeline is simple: I generate the embeddings of the images, store them in a vector index, and then compute the cosine similarity between the embedding of my search image and all the embeddings in the index. I'm not using any zero-shot application or image-text comparison, yet all the fine-tuning approaches for CLIP models that I have read about use text-image pairs. I don't understand how I should fine-tune the model to improve my application. Should I use text-image pairs? Or should I fine-tune only the visual encoder of the model? If that's the case, does anyone have examples of how I can do it?
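
For readers following along, the retrieval step described above can be sketched roughly as follows. This is only a minimal illustration, not the poster's actual code: the function names `l2_normalize` and `cosine_top_k` are hypothetical, the "vector index" is assumed to be a plain NumPy array searched by brute force, and random vectors with a placeholder dimension of 512 stand in for real image embeddings.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def cosine_top_k(query, index, k=5):
    """Return (row ids, cosine scores) of the k index rows most similar to query.

    Hypothetical helper: `index` has shape (n_items, dim), `query` has shape (dim,).
    """
    q = l2_normalize(np.asarray(query, dtype=np.float32))
    idx = l2_normalize(np.asarray(index, dtype=np.float32))
    # For unit-length vectors the dot product equals the cosine similarity.
    scores = idx @ q
    k = min(k, scores.shape[0])
    order = np.argsort(-scores)[:k]  # highest similarity first
    return order, scores[order]

# Stand-in data: random vectors in place of real image embeddings (assumption).
rng = np.random.default_rng(0)
catalog = rng.normal(size=(1000, 512))
# A query that is a slightly perturbed copy of catalog item 42.
query = catalog[42] + 0.01 * rng.normal(size=512)

ids, sims = cosine_top_k(query, catalog, k=3)
print(ids, sims)
```

In a real setup the rows of `catalog` would come from the model's image encoder, and a dedicated vector index library would likely replace the brute-force matrix product once the catalog grows large.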

jamie0725

Dec 5, 2023

If you are only doing image search, why don't you just use an image-only transformer model, without a text encoder?

txhno

May 29, 2024

Did you find another approach? I too am working on image-to-image search for fashion items.