Clarification of model outputs

#3
by GeorgeBatch - opened

Dear authors,

Can you please clarify if the order of scores when given a positive and a negative prompt is [negative_prob, positive_prob] or [positive_prob, negative_prob]?

Also, can you please expand on the image-caption similarity, output['sim']? Does a low value mean that the caption and image are similar? What values should I expect? I would like to understand whether the values are bounded and whether they can be negative or positive.

Many thanks,
George

Hi George,

For zero-shot it is [negative_prob, positive_prob]. You can see the code here: https://huggingface.co/paige-ai/Prism/blob/main/modeling_prism.py#L309-L310.
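To make the ordering concrete, here is a minimal sketch (not the actual Prism code) of softmaxing a pair of prompt logits, assuming index 0 is the negative prompt and index 1 the positive prompt, as in the linked lines:

```python
import numpy as np

def zero_shot_probs(logits):
    """Softmax over a pair of prompt logits.

    Assumption (from the linked modeling_prism.py lines):
    index 0 = negative prompt, index 1 = positive prompt.
    """
    z = logits - logits.max()          # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return {"negative_prob": float(p[0]), "positive_prob": float(p[1])}

# Example: a higher positive-prompt logit yields a higher positive_prob.
probs = zero_shot_probs(np.array([1.0, 3.0]))
```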

For the meaning of the similarity matrix, please refer to the CLIP paper and the method of contrastive learning in general. In short, these are dot products between image and language features, scaled by temperature, so a higher value means the image and caption are more similar.

Best,
George

Hi George,

Thank you for the prompt reply and pointer to the modelling code!

So, for two images and one caption, does it mean that the image with the higher similarity to the caption (bigger temperature-scaled dot product) aligns better with the caption?

Best,
George

Yes, and you can softmax these scores to get probabilities. This is what the contrastive objective does: softmax followed by a cross-entropy loss on the resulting probabilities (details in the CLIP paper).

I had only considered applying softmax over positive vs. negative prompts for a single image. But for multiple images and a single caption, we would instead model which of the images matches the caption. I'll check the CLIP paper, but is my intuition correct?

From the CLIP paper you can see that the CE loss is applied over images per label and over labels per image, and the two terms are averaged. So label-to-image works the same way as image-to-label.
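The symmetric objective described above can be sketched as follows (a minimal illustration of the CLIP-style loss, assuming image i is paired with caption i; this is not the Prism training code):

```python
import numpy as np

def symmetric_clip_loss(sim):
    """Symmetric contrastive loss over a square similarity matrix.

    sim: [n, n] temperature-scaled similarities where image i is the
    true match for caption i. Cross-entropy is taken over captions per
    image and over images per caption, then averaged.
    """
    def ce(logits):
        z = logits - logits.max(axis=1, keepdims=True)
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        idx = np.arange(len(logits))
        return -logp[idx, idx].mean()   # log-prob of the matched pair

    return 0.5 * (ce(sim) + ce(sim.T))
```

A well-aligned matrix (large values on the diagonal) yields a loss near zero, while an uninformative matrix yields a loss near log(n).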

Thanks! It all makes sense now.

GeorgeBatch changed discussion status to closed
