Softmax of similarity scores?

#10
by vishal1278 - opened

In the code example on the Model Card page, the logits_per_image variable is described as the image-text similarity score. If these are image-text similarity scores, how does applying the softmax() function to them make sense?

Consider the possibility where we have an image that contains both a cat and a dog. And we use the following text:

text = ['a photo of a cat', 'a photo of a dog']

Now, the logits_per_image variable would (or should) return high values for both text prompts. Let's assume the two scores are similar. If we apply the softmax() function to them, the resulting probabilities (probs) would be around 50% each. This wouldn't make sense, because we would expect both probabilities to be high.
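To make this concrete, here is a minimal sketch in plain Python. The logit values are made up for illustration and are not actual model outputs, and the softmax helper is written by hand rather than taken from the Model Card code.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one image against the two prompts
# ['a photo of a cat', 'a photo of a dog']. Both are "high" and close together.
logits_per_image = [24.9, 25.1]

probs = softmax(logits_per_image)
print(probs)  # roughly 0.45 and 0.55
```

Even though both raw scores are high, the two probabilities end up near 50% each, because softmax only compares the scores against one another.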

In the Model Card example, the two text prompts are chosen so that only one of them is correct for the sample image, so it makes sense that after applying softmax, the probability for the cat prompt is substantially higher than the one for the dog prompt. But this wouldn't work if we are trying to find several objects (such as a shoe, a dress, and glasses) in a single image. Each object can be present independently of the others, so the true probabilities need not sum to 1.0, whereas softmax forces its outputs to sum to 1.0. Hence, applying softmax() to these scores would not make sense.
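A small sketch of why this happens, again with made-up numbers rather than real model outputs: softmax depends only on the differences between the scores, so an image that matches all three labels strongly gets the same probabilities as one that matches all three weakly. The label list and score values below are assumptions chosen for illustration.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["shoe", "dress", "glasses"]

# Hypothetical similarity scores for two different images.
weak_match = [18.0, 18.5, 17.5]                # all three labels match weakly
strong_match = [s + 8.0 for s in weak_match]   # all three labels match strongly

probs_weak = softmax(weak_match)
probs_strong = softmax(strong_match)

for label, pw, ps in zip(labels, probs_weak, probs_strong):
    print(f"{label}: weak={pw:.3f} strong={ps:.3f}")
```

Both images produce the same three probabilities, and in both cases they sum to 1.0, so the output cannot express "all three objects are present".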
