Reproducing paper results

#7
by j4kn - opened

In Figure 4 of the BLIP-2 paper, the authors give a selection of examples with the following caption: "Selected examples of instructed zero-shot image-to-text generation using a BLIP-2 model w/ ViT-G and FlanT5 XXL (...)".

So far, I have not been able to reproduce those selected examples with the HF demo. I understand the model served is the right one (BLIP-2 ViT-G + FlanT5 XXL), but despite playing with different decoding strategies and parameters, I cannot obtain the reported outputs for the corresponding images.

For instance, consider this image:

[Screenshot 2023-02-13 at 7.28.55 PM.png]

The paper reports a long and informative answer from the model, while the model actually running in the demo only produces short, less informative texts. There are small discrepancies as well, such as the answers starting with a capital letter in the paper but not in the demo. Adding prompts like "Question: {} Answer: {}" or "Question: {} Long answer: {}" does not change the outcome.
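For anyone who wants to poke at this outside the demo, here is a minimal sketch using the `transformers` API (assuming the `Salesforce/blip2-flan-t5-xxl` checkpoint on the Hub; the image path and prompt are placeholders):

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl", torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("your_image.jpg").convert("RGB")  # placeholder image
prompt = "Question: what is in the picture? Long answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)

# Beam search, as in the paper; max_new_tokens bounds the answer length.
out = model.generate(**inputs, num_beams=5, max_new_tokens=100)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```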

Am I missing something?

[Screenshot 2023-02-13 at 7.31.02 PM.png]

[Screenshot 2023-02-13 at 7.31.45 PM.png]

The currently running demo backend has max_txt_len=30. The examples shown in the paper use a larger max_txt_len=100.

The demo does not allow changing the max text length for now. We'll rework it to include this option.
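In the meantime, running the model locally lets you raise the limit yourself. A minimal sketch with LAVIS (assuming the `blip2_t5` / `pretrain_flant5xxl` model names and a `generate` call that accepts `max_length`, as in the LAVIS examples; the image path and prompt are placeholders):

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load BLIP-2 ViT-G + FlanT5 XXL, the variant used in Figure 4 of the paper.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xxl", is_eval=True, device=device
)

raw_image = Image.open("your_image.jpg").convert("RGB")  # placeholder image
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# max_length=100 mirrors the paper setting; the demo backend caps it at 30.
answer = model.generate(
    {"image": image, "prompt": "Question: what is shown here? Long answer:"},
    num_beams=5,
    max_length=100,
)
print(answer)
```

With the same beam search settings, raising the length cap should let the model produce the longer, more informative answers shown in the figure.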

That makes sense. Thanks a lot for the feedback.

Let me close this thread.

j4kn changed discussion status to closed
