add brief explanation of do_image_splitting to model card?
The sub-image splitting functionality for passing very large images is very cool! Maybe you could make this feature a bit more explicit in the model card by saying somewhere that it's activated with `do_image_splitting=True`?
Do I understand correctly that this essentially allows users to pass images of any resolution?
Hi @MoritzLaurer , I opened a PR https://huggingface.co/HuggingFaceM4/idefics2-8b/discussions/4; let me know if it makes this clearer for you.
Essentially, `do_image_splitting` is set to `True` by default since it gives a big boost on OCR tasks like TextVQA or DocVQA, or, I imagine, any task involving high-resolution images containing a lot of text. We can however set it to `False` for other tasks with no impact on performance.
The resolution of the images is always up to 980x980 (while keeping the aspect ratio), regardless of whether `do_image_splitting` is `True` or `False`.
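As a rough sketch of that resizing rule (a hypothetical helper, not the actual Idefics2 image processor, whose resizing logic is more involved):

```python
def fit_within(width, height, max_side=980):
    """Downscale (width, height) so the longest side is at most
    max_side, preserving the aspect ratio; never upscale.
    Toy illustration of 'up to 980x980 while keeping the aspect ratio'."""
    scale = min(1.0, max_side / max(width, height))
    return round(width * scale), round(height * scale)
```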
`do_image_splitting=True` artificially increases the resolution of the image by considering 4 crops of it and feeding each of them, as well as the original image itself, to the vision encoder + modality projection + perceiver resampler. The output of this operation gives 4*64 + 64 = 320 image tokens that are all stacked together to encode the new image.
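The token arithmetic can be sketched as follows (a toy illustration of the counts described above, not the library code):

```python
TOKENS_PER_IMAGE = 64  # tokens produced by the perceiver resampler per image

def visual_token_count(do_image_splitting: bool) -> int:
    """Number of visual tokens used to encode one input image."""
    if do_image_splitting:
        # 4 crops + the original image, each resampled to 64 tokens
        return 4 * TOKENS_PER_IMAGE + TOKENS_PER_IMAGE
    return TOKENS_PER_IMAGE
```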
In contrast, `do_image_splitting=False` just passes the original image to the vision encoder + modality projection + perceiver resampler, without the crops, resulting in only 64 visual tokens.
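A minimal sketch of the 4-crops-plus-original idea (assuming a simple quadrant split; the actual Idefics2 preprocessing may differ in its details):

```python
def quadrant_crops(width, height):
    """Return the 4 quadrant crop boxes as (left, top, right, bottom),
    plus the box covering the full image. Each of the 5 resulting
    images would be encoded independently, and their tokens stacked."""
    mw, mh = width // 2, height // 2
    crops = [
        (0, 0, mw, mh),           # top-left
        (mw, 0, width, mh),       # top-right
        (0, mh, mw, height),      # bottom-left
        (mw, mh, width, height),  # bottom-right
    ]
    return crops + [(0, 0, width, height)]  # the original image last
```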
It's hard to go into such details in the model card while still keeping it digestible to read; I hope the paper will bring clarity on these specific points.
Thanks for the explanation @HugoLaurencon !