add brief explanation of do_image_splitting to model card?

#3 by MoritzLaurer - opened

The sub-image splitting functionality for passing very large images is very cool! Maybe you could make this feature more explicit in the model card by mentioning somewhere that it's activated with do_image_splitting=True?
Do I understand correctly that this essentially allows users to pass images of any resolution?

Hi @MoritzLaurer, I opened a PR (https://huggingface.co/HuggingFaceM4/idefics2-8b/discussions/4); let me know whether it makes this clearer for you.

Essentially, do_image_splitting is set to True by default since it gives a big boost on OCR tasks like TextVQA or DocVQA, and presumably on any task involving high-resolution images containing a lot of text.
It can, however, be set to False for other tasks with no impact on performance, as shown in the sketch below.
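
For reference, here is a minimal sketch of how the flag can be toggled when loading the processor with the standard transformers AutoProcessor API (kwargs are forwarded to the underlying image processor):

```python
from transformers import AutoProcessor

# do_image_splitting defaults to True for this checkpoint; it can be
# overridden when loading the processor.
processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    do_image_splitting=False,  # disable the 4-crop splitting
)

# The setting can also be inspected or flipped afterwards:
print(processor.image_processor.do_image_splitting)  # False
processor.image_processor.do_image_splitting = True
```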

Images are always resized so that their resolution is at most 980x980 (while keeping the aspect ratio), regardless of whether do_image_splitting is True or False.
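
To illustrate, here is a minimal sketch of aspect-ratio-preserving resizing; this is not the exact processor code, and rounding and minimum-size details may differ:

```python
def fit_within(width: int, height: int, max_side: int = 980) -> tuple[int, int]:
    """Scale (width, height) so the longer side is at most max_side,
    preserving the aspect ratio; smaller images are left untouched here."""
    scale = min(max_side / max(width, height), 1.0)
    return round(width * scale), round(height * scale)

print(fit_within(3000, 1500))  # (980, 490)
print(fit_within(640, 480))    # (640, 480)
```
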
do_image_splitting=True artificially increases the effective resolution of the image: it takes 4 crops of it and feeds each of them, as well as the original image itself, through the vision encoder + modality projection + perceiver resampler.
The output of this operation is 4*64 + 64 = 320 image tokens, which are all stacked together to encode the image.
In contrast, do_image_splitting=False just passes the original image through the vision encoder + modality projection + perceiver resampler, without the crops, resulting in only 64 visual tokens.
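
You can see the effect of the flag directly in the processor output; the sketch below assumes the current transformers Idefics2 image processor interface, and the exact output shapes may vary across versions:

```python
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
image = Image.new("RGB", (2000, 1200))  # dummy high-resolution image

# With splitting: one input image becomes 5 sub-images
# (4 crops + the original), i.e. 5 * 64 = 320 visual tokens downstream.
processor.image_processor.do_image_splitting = True
split = processor.image_processor([image], return_tensors="pt")
print(split["pixel_values"].shape)   # expected num_images dim: 5

# Without splitting: just the original image -> 64 visual tokens.
processor.image_processor.do_image_splitting = False
plain = processor.image_processor([image], return_tensors="pt")
print(plain["pixel_values"].shape)   # expected num_images dim: 1
```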

It's hard to go into such details in the model card while still keeping it digestible to read; I hope the paper will bring clarity on these specific points.


Thanks for the explanation @HugoLaurencon!

MoritzLaurer changed discussion status to closed
