
How does the attention_mask contribute to the projector performance?

#45
by lucasjin - opened

LLaVA doesn't pass an attention_mask to the projector at all, and neither does Qwen-VL, etc.

Why did you decide to add an attention mask to it?

Could you specify what you have in mind? You mean that the tokens for the images attend to each other (non-causal attention)?

@HugoLaurencon Hi, let me clarify a little bit.

In the idefics2 Perceiver implementation, I saw an attention_mask added as an input to the Perceiver forward; this parameter doesn't exist in the idefics1 implementation.

I reimplemented another Perceiver by copying your implementation, but I can't figure out what the input attention_mask should be, since I only have the ViT output features.

So I set it to None, but the performance is far from what I expected.

Could you tell me why?

HuggingFaceM4 org

Ok I think I see, you're talking about the image_attention_mask then?

It's because we use the NaViT strategy for the images: we don't resize them or change their dimensions (we only downscale them if a side is larger than 980 pixels).
As a result, each image has a different number of patches, as opposed to LLaVA or Qwen-VL, which resize all images to a fixed-size square.
We then need to pad in order to pack several images together, and that's why you have an image_attention_mask.
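
To illustrate the point (this is only a minimal sketch with made-up dimensions, not idefics2's actual code, and pack_patch_sequences is a hypothetical helper): when ViT outputs of different lengths are packed into one batch, the padding positions are exactly what the image attention mask marks.

```python
import torch

def pack_patch_sequences(patch_seqs):
    """patch_seqs: list of tensors of shape (num_patches_i, hidden_dim)."""
    max_len = max(seq.shape[0] for seq in patch_seqs)
    hidden = patch_seqs[0].shape[1]
    batch = torch.zeros(len(patch_seqs), max_len, hidden)
    mask = torch.zeros(len(patch_seqs), max_len, dtype=torch.bool)
    for i, seq in enumerate(patch_seqs):
        batch[i, : seq.shape[0]] = seq    # real patches
        mask[i, : seq.shape[0]] = True    # True = real patch, False = padding
    return batch, mask

# Two images with different aspect ratios produce different numbers of patches.
feats_a = torch.randn(1200, 1152)   # hypothetical ViT output for image A
feats_b = torch.randn(700, 1152)    # hypothetical ViT output for image B
packed, image_attention_mask = pack_patch_sequences([feats_a, feats_b])
print(packed.shape, image_attention_mask.shape)  # (2, 1200, 1152) (2, 1200)
```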

@HugoLaurencon Hi, yes, NaViT does this.

But the question is: from the idefics2 preprocessing, I saw that it uses 980x980 inputs, which means you either resize or pad. If you pad, how are the raw images preprocessed, and what is the padding value?

And when sending them to the ViT, how is the padding information passed in?

lucasjin changed discussion status to closed
lucasjin changed discussion status to open
HuggingFaceM4 org

It's in fact 980x980 at most, but that's not always the case. An image is only resized if one of its sides is > 980 (or < a minimum size, in which case we upscale it).
You can have a look at https://github.com/huggingface/transformers/blob/main/src/transformers/models/idefics2/image_processing_idefics2.py to see how the processing is done. For the padding you could use an empty image, for example, or just black pixels.
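
As a rough illustration of that idea (a simplified sketch, not a drop-in replacement for the linked image processor; pad_to_canvas and the sizes are made up): the image is placed on a larger canvas filled with black pixels, and a per-pixel mask records which pixels are real.

```python
import torch

def pad_to_canvas(image, canvas_size=980, pad_value=0.0):
    """image: float tensor of shape (3, H, W) with H, W <= canvas_size."""
    _, h, w = image.shape
    padded = torch.full((3, canvas_size, canvas_size), pad_value)
    pixel_mask = torch.zeros(canvas_size, canvas_size, dtype=torch.bool)
    padded[:, :h, :w] = image   # place the real image in a corner of the canvas
    pixel_mask[:h, :w] = True   # True = real pixel, False = black padding
    return padded, pixel_mask

img = torch.rand(3, 640, 480)   # an already-resized example image
padded_img, pixel_attention_mask = pad_to_canvas(img)
print(padded_img.shape, pixel_attention_mask.shape)  # (3, 980, 980) (980, 980)
```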

Hi, I noticed that you moved the MLP ahead of the Perceiver, which makes the features going into the Perceiver denser in the last dimension (the same size as the LLM).

Have you run experiments on whether the MLP works better ahead of the Perceiver or after it?

I found that most models put the MLP after the Perceiver or Resampler.

HuggingFaceM4 org

Yes, exactly, it was done on purpose and found to work a bit better.
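
For anyone following along, here is a toy sketch of the two orderings being discussed (the sizes are invented and `resample` is a hypothetical stand-in, not the real idefics2 Perceiver): the only difference is whether the vision features are projected to the LLM width before or after the resampler.

```python
import torch
import torch.nn as nn

vit_dim, llm_dim, num_latents = 1152, 4096, 64   # made-up sizes

def resample(feats, dim):
    """Toy stand-in for the Perceiver: learned latents cross-attend to the features."""
    latents = torch.randn(feats.shape[0], num_latents, dim)
    attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
    return attn(latents, feats, feats)[0]

feats = torch.randn(2, 1200, vit_dim)            # pretend ViT output

# Option A (idefics2-style): MLP first, so the resampler runs at LLM width.
out_a = resample(nn.Linear(vit_dim, llm_dim)(feats), llm_dim)

# Option B (more common elsewhere): resampler first, MLP after.
out_b = nn.Linear(vit_dim, llm_dim)(resample(feats, vit_dim))

print(out_a.shape, out_b.shape)                  # both (2, 64, 4096)
```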

I experimented with moving the MLP ahead and making the Perceiver deeper (6 layers, mostly in idefics1's fashion). I found that with the MLP after, the loss hardly converges, while with the MLP ahead and a deeper Perceiver the training is smoother (I had already tuned the lr). And the results with the deeper Perceiver are much better.

Do you have any thoughts on this?

HuggingFaceM4 org

No, for us a deeper Perceiver gave the same scores.

How did you compare the Perceiver and the Resampler? Which one is better?

HuggingFaceM4 org

hi @lucasjin
can you expand on "Perceiver and Resampler"?
we just call it a perceiver resampler, i.e. to us, it is the same thing

Haha, sorry for the misleading wording. From my perspective, the Perceiver is, of course, idefics2's fashion, which is similar to Flamingo.

However, the Resampler is another approach, represented by Qwen-VL and MiniCPM: they simply use a single attention layer (without autoregressive masking).

The latter is simpler, but I don't know whether it is as effective.
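
To make the contrast concrete, here is a toy sketch (not the real Qwen-VL, MiniCPM, or idefics2 code; class names, depth, and sizes are invented): the Flamingo/idefics2-style design stacks several cross-attention + MLP blocks over learned latents, while the single-layer design is just one cross-attention over learned queries.

```python
import torch
import torch.nn as nn

class CrossAttnBlock(nn.Module):
    """One Flamingo-style block: cross-attention plus an MLP, with residuals."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, latents, feats):
        latents = latents + self.attn(self.norm1(latents), feats, feats)[0]
        return latents + self.mlp(self.norm2(latents))

class PerceiverStyle(nn.Module):
    """Several stacked blocks of learned latents attending to the image features."""
    def __init__(self, dim, num_latents=64, depth=3):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.blocks = nn.ModuleList([CrossAttnBlock(dim) for _ in range(depth)])

    def forward(self, feats):
        latents = self.latents.expand(feats.shape[0], -1, -1)
        for block in self.blocks:
            latents = block(latents, feats)
        return latents

class SingleLayerResampler(nn.Module):
    """Single cross-attention layer mapping image features onto learned queries."""
    def __init__(self, dim, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats):
        q = self.queries.expand(feats.shape[0], -1, -1)
        return self.attn(q, feats, feats)[0]

feats = torch.randn(2, 1200, 1024)                 # made-up image features
print(PerceiverStyle(1024)(feats).shape)           # (2, 64, 1024)
print(SingleLayerResampler(1024)(feats).shape)     # (2, 64, 1024)
```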
