CLIP Performance

#8
by marcusinthesky - opened

CLIP models show incredible zero-shot generalization across numerous image classification and retrieval tasks. I wonder if anyone has run these benchmarks on CLIP models (the text encoder part) to determine how well the CLIP training paradigm does on common text-embedding tasks like STS.
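For anyone who wants to try this, here is a minimal sketch (not from this thread) of pulling sentence embeddings out of CLIP's text encoder with Hugging Face transformers and scoring an STS-style sentence pair by cosine similarity; the checkpoint name and the choice of the projected pooled output are my assumptions.

```python
# Sketch: CLIP text embeddings for an STS-style similarity check (assumed setup)
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

model_name = "openai/clip-vit-base-patch32"  # assumed checkpoint
tokenizer = CLIPTokenizer.from_pretrained(model_name)
model = CLIPTextModelWithProjection.from_pretrained(model_name).eval()

sentences = ["A man is playing a guitar.", "Someone is strumming a guitar."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    # text_embeds are the projected pooled features CLIP uses for image-text matching
    embeds = model(**inputs).text_embeds
embeds = torch.nn.functional.normalize(embeds, dim=-1)

# Cosine similarity, the score typically compared against human ratings in STS tasks
print((embeds[0] @ embeds[1]).item())
```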

marcusinthesky changed discussion title from CLIP to CLIP Performance

@RunwayMarcus Do you think we can add the model to the leaderboard? It doesn't seem that difficult to me; the model is open source: https://github.com/openai/CLIP

@RunwayMarcus I did some tests with the CLIP model and it didn't perform well ... better to use sentence-transformer embeddings (CLIP scored low on all the metrics on the leaderboard).
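A minimal sketch of how such a run might look with the mteb package, assuming the sentence-transformers CLIP wrapper (which exposes the text encoder through the usual `encode()` interface); the checkpoint and task choice here are illustrative, not what was actually run above.

```python
# Sketch: evaluating a CLIP text encoder on an MTEB task (illustrative choices)
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# sentence-transformers wraps CLIP so that .encode() works on plain text
model = SentenceTransformer("sentence-transformers/clip-ViT-B-32")

# One STS task as an example; note CLIP truncates inputs to its short context window
evaluation = MTEB(tasks=["STSBenchmark"])
evaluation.run(model, output_folder="results/clip-ViT-B-32")
```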

From a research standpoint that is very interesting. I wonder whether that's because of CLIP's limited token length (77 tokens), or whether it's isolated to QA and retrieval tasks, because I think the overwhelming argument being made at the moment is that multi-modality is crucial to unlocking unimodal task performance.

I think there is a need for a benchmark to evaluate image retrieval from natural text, and more generally multimodal retrieval and embeddings, but I think it should have a dedicated leaderboard.

Massive Text Embedding Benchmark org

We are working on that, stay tuned! cc @gowitheflow
