Confidence scores for image captioning?

#13
by acmidev - opened

Hi there,

I was wondering how to generate confidence scores when generating image captions with the sample code.

Best, Simon.

from PIL import Image
import requests
from transformers import Blip2Processor, Blip2ForConditionalGeneration
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
)
model.to(device)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt").to(device, torch.float16)

generated_ids = model.generate(**inputs)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(generated_text)
two cats laying on a couch

Hi,

You can obtain confidence scores by passing output_scores=True and return_dict_in_generate=True to the generate() method.

outputs = model.generate(**inputs, output_scores=True, return_dict_in_generate=True)
scores = outputs.scores

According to the docs:

In the case of greedy decoding, this contains the processed prediction scores of the language modeling head (scores for each vocabulary token before SoftMax) at each generation step. It is a tuple of torch.FloatTensor with up to max_new_tokens elements (one element for each generated token), with each tensor of shape (batch_size, config.vocab_size).

To calculate a probability for the entire sequence, you could do the following:

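# get the probability of the top-1 token at each generation step
# (cast to float32 first, since the model runs in float16)
topks = [s.float().softmax(-1).topk(1) for s in outputs.scores]

probs = []
for tk in topks:
    # first (and only) sequence in the batch
    probs.append(tk.values.view(-1)[0].item())

# multiply the per-token probabilities to get the sequence probability
sequence_prob = torch.tensor(probs).prod()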
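
A variant of the same calculation, as a minimal sketch: it works in log space (so long captions do not underflow) and looks up the probability of the token that was actually generated instead of assuming it is always the top-1 token. The helper name sequence_log_prob and the toy tensors are hypothetical and only stand in for outputs.scores; with the real model, the assumption is that the generated ids can be taken as outputs.sequences[:, -len(outputs.scores):].

```python
import torch


def sequence_log_prob(scores, token_ids):
    """Sum of per-token log-probabilities for the chosen tokens.

    scores: tuple of (batch_size, vocab_size) tensors, one per generation step
    token_ids: (batch_size, num_steps) tensor with the token chosen at each step
    """
    # Stack to (batch_size, num_steps, vocab_size); float32 for numerical stability
    logits = torch.stack(scores, dim=1).float()
    log_probs = logits.log_softmax(dim=-1)
    # Pick the log-probability of the chosen token at every step
    token_log_probs = log_probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    return token_log_probs.sum(dim=-1), token_log_probs


# Toy stand-in for outputs.scores (hypothetical values): 2 steps, batch 1, vocab 3
toy_scores = (torch.tensor([[2.0, 0.0, 0.0]]), torch.tensor([[0.0, 3.0, 0.0]]))
# Greedy decoding picks the argmax at each step
toy_ids = torch.stack([s.argmax(-1) for s in toy_scores], dim=1)

total_log_prob, per_token = sequence_log_prob(toy_scores, toy_ids)
print(total_log_prob.exp())  # sequence probability
print(per_token.exp())       # per-token probabilities
```

Recent versions of transformers also provide model.compute_transition_scores(outputs.sequences, outputs.scores, normalize_logits=True), which should return the same per-token log-probabilities without a hand-written helper.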

@nielsr Thanks so much for the code. After adding it back into the sample code (via this Google Colab), the output is:

Prediction: two cats laying on a couch
Confidence: 0.012353635393083096

So does that suggest the confidence level is 1.2%?

Have you solved this problem? I obtained the same result using the code provided above. If you have solved this problem, I hope you can share your correct code with me. Thank you very much! Good luck to you!

@shams123321 I haven't heard back from @nielsr about it yet, and we haven't resolved it unfortunately.
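
On the 1.2% question: assuming greedy decoding, the product computed above is the model's probability of producing that exact token sequence, so it shrinks with every extra token and a small value does not by itself mean the caption is poor. One common way to get a number that is comparable across captions of different lengths is the geometric mean of the per-token probabilities. The sketch below is only illustrative: per_token_confidence is a hypothetical helper, and the token count of 7 is an assumption (in practice it would be len(outputs.scores)).

```python
def per_token_confidence(sequence_prob: float, num_tokens: int) -> float:
    """Geometric mean of the per-token probabilities behind a sequence probability."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return sequence_prob ** (1.0 / num_tokens)


sequence_prob = 0.012353635393083096  # value reported in this thread
num_tokens = 7  # ASSUMPTION: hypothetical token count; use len(outputs.scores) instead
print(per_token_confidence(sequence_prob, num_tokens))
```

Neither number is a calibrated confidence; both only describe how likely the model considered its own output.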
