CLIP Performance

#8
by marcusinthesky - opened

CLIP models show impressive zero-shot generalization across numerous image classification and retrieval tasks. I wonder if anyone has run these benchmarks on CLIP models (the text encoder part) to determine how well the CLIP training paradigm does on common text-embedding tasks like STS.
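For anyone wanting to try this, here is a minimal sketch of pulling sentence embeddings out of CLIP's text encoder, assuming the Hugging Face transformers CLIP classes (the checkpoint name and sentences are just examples):

```python
# Sketch: extract sentence embeddings from CLIP's text encoder
# using transformers' CLIPModel / CLIPTokenizer (illustrative, not a benchmark setup).
import torch
from transformers import CLIPModel, CLIPTokenizer

model_name = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_name)
tokenizer = CLIPTokenizer.from_pretrained(model_name)

sentences = ["A man is playing a guitar.", "Someone plays an acoustic guitar."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    # get_text_features returns the pooled, projected text embeddings CLIP uses for image-text matching.
    text_emb = model.get_text_features(**inputs)

# Normalize so the dot product below is cosine similarity, as in STS-style scoring.
text_emb = torch.nn.functional.normalize(text_emb, dim=-1)
print((text_emb[0] @ text_emb[1]).item())
```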

marcusinthesky changed discussion title from CLIP to CLIP Performance

@RunwayMarcus Do you think we can add the model to the leaderboard? It doesn't seem that difficult to me; the model is open source: https://github.com/openai/CLIP

@RunwayMarcus I ran some tests with the CLIP model and it didn't give good performance... better to use sentence-transformer embeddings (CLIP scored low on all the leaderboard metrics).
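For reference, a comparison like this could be run with the MTEB package and sentence-transformers. This is only a sketch; the model names and task choice here are illustrative, not the exact setup behind the numbers above:

```python
# Sketch: compare CLIP text embeddings against a sentence-transformer on an STS task via MTEB.
from sentence_transformers import SentenceTransformer
from mteb import MTEB

# clip-ViT-B-32 wraps CLIP in the sentence-transformers interface; the second model is a text-only baseline.
for model_name in ["clip-ViT-B-32", "all-MiniLM-L6-v2"]:
    model = SentenceTransformer(model_name)
    evaluation = MTEB(tasks=["STSBenchmark"])
    evaluation.run(model, output_folder=f"results/{model_name}")
```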

From a research standpoint that is very interesting. I wonder whether that's because of CLIP's token-length limit, or whether it's isolated to QA and retrieval tasks, because the prevailing argument at the moment is that multi-modality is crucial to unlocking unimodal task performance.
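On the token-length point: CLIP's text encoder has a hard 77-token context, so longer passages get truncated before encoding, which would hurt retrieval more than short-sentence STS. A quick check, assuming the transformers CLIP tokenizer:

```python
# Check CLIP's context limit: anything past 77 tokens is truncated,
# so long retrieval passages lose most of their content before encoding.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
print(tokenizer.model_max_length)  # 77 for the standard OpenAI CLIP checkpoints

long_text = "word " * 200
ids = tokenizer(long_text, truncation=True)["input_ids"]
print(len(ids))  # capped at 77
```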
