Clarification of model outputs
Dear authors,
Can you please clarify whether the order of scores, when given a positive and a negative prompt, is [negative_prob, positive_prob] or [positive_prob, negative_prob]?
Also, can you please expand on the similarity of an image and caption, output['sim']? Does a low value mean that the caption and image are similar? What values should I expect? I would like to understand whether the values are bounded and whether they can be negative or positive.
Many thanks,
George
Hi George,
For zero-shot it is [negative_prob, positive_prob]. You can see the code here: https://huggingface.co/paige-ai/Prism/blob/main/modeling_prism.py#L309-L310.
For the meaning of the similarity matrix, please refer to the CLIP paper and to contrastive learning in general. In short, these are dot products between image and language features, scaled by temperature.
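Roughly, the computation looks like this (just a sketch of the math, not the actual Prism code; the embedding size and temperature value below are made up, and I'm assuming L2-normalized features as in CLIP):

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings: 1 image and 2 prompts, ordered [negative, positive].
img_emb = F.normalize(torch.randn(1, 512), dim=-1)   # (n_images, dim)
txt_emb = F.normalize(torch.randn(2, 512), dim=-1)   # (n_prompts, dim)
temperature = 0.07                                    # placeholder value

# Similarity scores: dot products between image and language features,
# scaled by temperature. With L2-normalized features these fall roughly in
# [-1/temperature, 1/temperature]; higher means more similar.
sim = img_emb @ txt_emb.t() / temperature             # (n_images, n_prompts)

# Zero-shot probabilities follow the prompt order, so with prompts given as
# [negative, positive] you get [negative_prob, positive_prob].
probs = sim.softmax(dim=-1)
```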
Best,
George
Hi George,
Thank you for the prompt reply and pointer to the modelling code!
So, for two images and one caption, does this mean that the image with the higher similarity (larger temperature-scaled dot product) aligns better with the caption?
Best,
George
Yes, and you can softmax these scores to get probabilities. This is what the contrastive objective does: softmax and then a cross-entropy loss on the resulting probabilities (details in the CLIP paper).
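A minimal sketch of that objective (not the actual training code; the batch size, embedding size, and temperature below are placeholders):

```python
import torch
import torch.nn.functional as F

# Placeholder batch of N paired image/caption embeddings, L2-normalized.
N, dim = 8, 512
img_emb = F.normalize(torch.randn(N, dim), dim=-1)
txt_emb = F.normalize(torch.randn(N, dim), dim=-1)
temperature = 0.07  # placeholder

logits = img_emb @ txt_emb.t() / temperature  # (N, N) similarity matrix
targets = torch.arange(N)                     # matching pairs lie on the diagonal

# Softmax + cross-entropy in both directions: for each image, which caption
# matches it, and for each caption, which image matches it.
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```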
I had only considered applying softmax over the positive vs. negative prompts for a single image. But for multiple images and a single caption, we would be modelling which of the images goes with the caption. I'll check the CLIP paper, but is my intuition correct?
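In other words, something like this (made-up scores, just to check my understanding):

```python
import torch

# Made-up similarity scores (before softmax).
sim_one_image = torch.tensor([[0.2, 1.3]])              # 1 image x [negative, positive] prompts
sim_one_caption = torch.tensor([[0.5], [2.1], [-0.4]])  # 3 images x 1 caption

# Single image, positive vs. negative prompt: softmax over the prompt axis
# gives [negative_prob, positive_prob].
probs_prompts = sim_one_image.softmax(dim=-1)

# Multiple images, single caption: softmax over the image axis answers
# "which of these images goes with the caption".
probs_images = sim_one_caption.softmax(dim=0)
```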
Thanks! It all makes sense now.