
How does the attention_mask contribute to the projector performance?

#45
by lucasjin - opened

LLaVA doesn't pass an attention_mask to the projector at all, and neither does Qwen-VL, etc.

Why did you decide to add an attention mask to it?

Could you specify what you have in mind? You mean that the tokens for the images attend to each other (non-causal attention)?

@HugoLaurencon Hi, let me clarify a little bit.

In the idefics2 Perceiver implementation, I saw an attention_mask added as an input to the Perceiver forward; this parameter doesn't exist in the idefics1 implementation.

I reimplemented another Perceiver by copying your implementation, but I can't figure out what the input attention_mask should be, since I only have the ViT output features.

So I set it to None, but the performance is far from what I expected.

Could you tell me why?

HuggingFaceM4 org

Ok I think I see, you're talking about the image_attention_mask then?

It's because we use the NaViT strategy for the images: we don't resize them or change their dimensions (we only downscale them if a side is larger than 980 pixels).
As a result, each image has a different number of patches, as opposed to LLaVA or Qwen-VL, which resize all images to a fixed-size square.
We then need to pad in order to pack several images together, and that's why you have an image_attention_mask.
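
To illustrate the point (this is only a minimal sketch with made-up dimensions, not idefics2's actual code, and pack_patch_sequences is a hypothetical helper): when ViT outputs of different lengths are packed into one batch, the padding positions are exactly what the image attention mask marks.

```python
import torch

def pack_patch_sequences(patch_seqs):
    """patch_seqs: list of tensors of shape (num_patches_i, hidden_dim)."""
    max_len = max(seq.shape[0] for seq in patch_seqs)
    hidden = patch_seqs[0].shape[1]
    batch = torch.zeros(len(patch_seqs), max_len, hidden)
    mask = torch.zeros(len(patch_seqs), max_len, dtype=torch.bool)
    for i, seq in enumerate(patch_seqs):
        batch[i, : seq.shape[0]] = seq    # real patches
        mask[i, : seq.shape[0]] = True    # True = real patch, False = padding
    return batch, mask

# Two images with different aspect ratios produce different numbers of patches.
feats_a = torch.randn(1200, 1152)   # hypothetical ViT output for image A
feats_b = torch.randn(700, 1152)    # hypothetical ViT output for image B
packed, image_attention_mask = pack_patch_sequences([feats_a, feats_b])
print(packed.shape, image_attention_mask.shape)  # (2, 1200, 1152) (2, 1200)
```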

@HugoLaurencon Hi, yes, NaViT does this.

But the question is: from the idefics2 preprocessing, I saw that it uses 980x980 inputs, which means you either resize or pad. If you pad, how are the raw images preprocessed, and what is the padding value?

And when sending them to the ViT, how is the padding information passed in?

lucasjin changed discussion status to closed
lucasjin changed discussion status to open
HuggingFaceM4 org

It's in fact 980x980 at most, but that's not always the case. An image is only resized if one of its sides is > 980 (or < a minimum size, in which case we upscale it).
You can have a look at https://github.com/huggingface/transformers/blob/main/src/transformers/models/idefics2/image_processing_idefics2.py to see how the processing is done. For the padding you could use an empty image, for example, or just black pixels.
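
As a rough illustration of that idea (a simplified sketch, not a drop-in replacement for the linked image processor; pad_to_canvas and the sizes are made up): the image is placed on a larger canvas filled with black pixels, and a per-pixel mask records which pixels are real.

```python
import torch

def pad_to_canvas(image, canvas_size=980, pad_value=0.0):
    """image: float tensor of shape (3, H, W) with H, W <= canvas_size."""
    _, h, w = image.shape
    padded = torch.full((3, canvas_size, canvas_size), pad_value)
    pixel_mask = torch.zeros(canvas_size, canvas_size, dtype=torch.bool)
    padded[:, :h, :w] = image   # place the real image in a corner of the canvas
    pixel_mask[:h, :w] = True   # True = real pixel, False = black padding
    return padded, pixel_mask

img = torch.rand(3, 640, 480)   # an already-resized example image
padded_img, pixel_attention_mask = pad_to_canvas(img)
print(padded_img.shape, pixel_attention_mask.shape)  # (3, 980, 980) (980, 980)
```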

Hi, I noticed that you moved the MLP ahead of the Perceiver, which makes the features going into the Perceiver denser in the last dimension (the same size as the LLM).

Have you run experiments on whether the MLP works better ahead of the Perceiver or after it?

I found that most models put the MLP after the Perceiver or Resampler.

HuggingFaceM4 org

Yes, exactly, it was done on purpose and found to work a bit better.
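
For anyone following along, here is a toy sketch of the two orderings being discussed (the sizes are invented and `resample` is a hypothetical stand-in, not the real idefics2 Perceiver): the only difference is whether the vision features are projected to the LLM width before or after the resampler.

```python
import torch
import torch.nn as nn

vit_dim, llm_dim, num_latents = 1152, 4096, 64   # made-up sizes

def resample(feats, dim):
    """Toy stand-in for the Perceiver: learned latents cross-attend to the features."""
    latents = torch.randn(feats.shape[0], num_latents, dim)
    attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
    return attn(latents, feats, feats)[0]

feats = torch.randn(2, 1200, vit_dim)            # pretend ViT output

# Option A (idefics2-style): MLP first, so the resampler runs at LLM width.
out_a = resample(nn.Linear(vit_dim, llm_dim)(feats), llm_dim)

# Option B (more common elsewhere): resampler first, MLP after.
out_b = nn.Linear(vit_dim, llm_dim)(resample(feats, vit_dim))

print(out_a.shape, out_b.shape)                  # both (2, 64, 4096)
```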

I experimented with moving the MLP ahead and making the Perceiver deeper (6 layers, mostly in idefics1's fashion). I found that with the MLP after, the loss hardly converges, while with the MLP ahead and a deeper Perceiver the training is smoother (I had already tuned the lr). And the results with the deeper Perceiver are much better.

Do you have any thoughts on this?

HuggingFaceM4 org

No, for us a deeper Perceiver gave the same scores.

How did you compare the Perceiver and the Resampler? Which one is better?

HuggingFaceM4 org

hi @lucasjin
can you expand on "Perceiver and Resampler"?
we just call it a perceiver resampler, i.e. to us, it is the same thing

Haha, sorry for the misleading wording. From my perspective, the Perceiver is, of course, idefics2's fashion, which is similar to Flamingo.

However, the Resampler is another approach, represented by Qwen-VL and MiniCPM: they simply use a single attention layer (without autoregressive masking).

The latter is simpler, but I don't know whether it is as effective.
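
To make the contrast concrete, here is a toy sketch (not the real Qwen-VL, MiniCPM, or idefics2 code; class names, depth, and sizes are invented): the Flamingo/idefics2-style design stacks several cross-attention + MLP blocks over learned latents, while the single-layer design is just one cross-attention over learned queries.

```python
import torch
import torch.nn as nn

class CrossAttnBlock(nn.Module):
    """One Flamingo-style block: cross-attention plus an MLP, with residuals."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, latents, feats):
        latents = latents + self.attn(self.norm1(latents), feats, feats)[0]
        return latents + self.mlp(self.norm2(latents))

class PerceiverStyle(nn.Module):
    """Several stacked blocks of learned latents attending to the image features."""
    def __init__(self, dim, num_latents=64, depth=3):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.blocks = nn.ModuleList([CrossAttnBlock(dim) for _ in range(depth)])

    def forward(self, feats):
        latents = self.latents.expand(feats.shape[0], -1, -1)
        for block in self.blocks:
            latents = block(latents, feats)
        return latents

class SingleLayerResampler(nn.Module):
    """Single cross-attention layer mapping image features onto learned queries."""
    def __init__(self, dim, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats):
        q = self.queries.expand(feats.shape[0], -1, -1)
        return self.attn(q, feats, feats)[0]

feats = torch.randn(2, 1200, 1024)                 # made-up image features
print(PerceiverStyle(1024)(feats).shape)           # (2, 64, 1024)
print(SingleLayerResampler(1024)(feats).shape)     # (2, 64, 1024)
```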
