OCR on dense images by increasing resampler_n_latents

#22
by tonibert - opened

First of all, great work with idefics2! I love the data mixture and the architecture; the perceiver in particular is a great idea!
I have a use case with large, dense screenshots of websites and wanted to use idefics for this. Some websites hold up to 2k tokens of text, and the model needs to be able to OCR all of it with coordinates.
I want to do a full finetune of an idefics model with a custom configuration and a 70B LLM decoder. To save VRAM, my idea was to disable image splitting but increase resampler_n_latents to something like 1K, to be able to transport the necessary amount of information to the decoder.
In the code I saw the comment that the number of learnable queries is normally <128, and wanted to ask whether you happened to run any ablations on this, or whether there is a general problem with a larger number of latents?
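(For reference, a rough sketch of what this kind of custom configuration could look like with the transformers Idefics2 classes. The checkpoint name is just an example, and the processor attributes touched below, do_image_splitting and image_seq_len, are assumptions about the current transformers API rather than anything confirmed in this thread.)

```python
# Sketch only: enlarge the perceiver's latent bank and disable image splitting.
from transformers import AutoProcessor, Idefics2Config

ckpt = "HuggingFaceM4/idefics2-8b"  # example checkpoint, assumed

config = Idefics2Config.from_pretrained(ckpt)
config.perceiver_config.resampler_n_latents = 1024  # default is 64

processor = AutoProcessor.from_pretrained(ckpt)
processor.image_processor.do_image_splitting = False  # one crop per image to save VRAM

# The text side inserts one <image> placeholder token per latent, so the processor's
# image sequence length should follow the new latent count (attribute name assumed).
processor.image_seq_len = config.perceiver_config.resampler_n_latents
```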

HuggingFaceM4 org

Hi @tonibert,

glad you are considering idefics2!

> I have a use case with large, dense screenshots of websites and wanted to use idefics for this. Some websites hold up to 2k tokens of text, and the model needs to be able to OCR all of it with coordinates.
> I want to do a full finetune of an idefics model with a custom configuration and a 70B LLM decoder. To save VRAM, my idea was to disable image splitting but increase resampler_n_latents to something like 1K, to be able to transport the necessary amount of information to the decoder.

1K latents for 2K tokens to generate does sound aggressive, and like a lot. That would mean one vector of size (let's say) 768 for every 2 tokens. Knowing very little about your use case, that sounds unnecessarily generous 🤷‍♂️
Also note that replacing the LLM with a bigger one is not equivalent to fine-tuning idefics2, since you would lose the multimodal pre-training.

> In the code I saw the comment that the number of learnable queries is normally <128, and wanted to ask whether you happened to run any ablations on this, or whether there is a general problem with a larger number of latents?

Nope, using more latents is totally possible. At this point, how many latents are required is an experimental question (depending on your use case).
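(Again just a sketch, not something confirmed in this thread: one way to combine the pretrained checkpoint with a larger latent count is to load it with a modified config and let the mismatched latent tensor be freshly initialized. The class names and the ignore_mismatched_sizes flag are the standard transformers API, but treat the details as assumptions.)

```python
# Sketch only: load the pretrained checkpoint into an enlarged-latent config.
import torch
from transformers import Idefics2Config, Idefics2ForConditionalGeneration

ckpt = "HuggingFaceM4/idefics2-8b"  # example checkpoint, assumed
config = Idefics2Config.from_pretrained(ckpt)
config.perceiver_config.resampler_n_latents = 1024

model = Idefics2ForConditionalGeneration.from_pretrained(
    ckpt,
    config=config,
    torch_dtype=torch.bfloat16,
    ignore_mismatched_sizes=True,  # a 64-latent tensor cannot be loaded into a 1024-latent one
)
# The enlarged latent parameter is left at its fresh initialization and has to be trained.
```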

Great, thank you for the insight!
My intuition was that, as we project the latents into the space of the language model, we would be forced to come closer to the one-token-per-vector information density of the decoder. But you are totally right: if I finetune enough, the decoder should be able to pick that up!
Yes, I will rerun at least part of the multimodal pretraining and only keep the pretrained encoder, and maybe parts of the perceiver and MLP with some weight replication (sketched below).
Thanks for the quick reply :)
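(A rough sketch of the weight-replication idea mentioned above: tile the 64 pretrained latents to fill the enlarged latent bank, so the resampler starts from the pretrained queries rather than from random init. The attribute path to the resampler latents is an assumption about the transformers Idefics2 implementation; adjust it to whatever the model actually exposes.)

```python
# Sketch only: initialize an enlarged latent bank from copies of the pretrained latents.
import torch
from transformers import Idefics2Config, Idefics2ForConditionalGeneration

ckpt = "HuggingFaceM4/idefics2-8b"  # example checkpoint, assumed

# Pretrained model with the original (64) latents.
base = Idefics2ForConditionalGeneration.from_pretrained(ckpt, torch_dtype=torch.bfloat16)
old_latents = base.model.connector.perceiver_resampler.latents.data  # shape (64, hidden), path assumed

# Enlarged model with freshly initialized 1024 latents (as in the previous sketch).
config = Idefics2Config.from_pretrained(ckpt)
config.perceiver_config.resampler_n_latents = 1024
model = Idefics2ForConditionalGeneration.from_pretrained(
    ckpt, config=config, torch_dtype=torch.bfloat16, ignore_mismatched_sizes=True
)

# Tile the pretrained latents 1024 / 64 = 16 times and add a little noise so the
# copies can diverge during fine-tuning.
reps = config.perceiver_config.resampler_n_latents // old_latents.shape[0]
new_latents = old_latents.repeat(reps, 1)
new_latents = new_latents + 0.01 * new_latents.std() * torch.randn_like(new_latents)

with torch.no_grad():
    model.model.connector.perceiver_resampler.latents.copy_(new_latents)
```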

HuggingFaceM4 org

For sure!
I'll close this discussion. Feel free to reopen if necessary!

VictorSanh changed discussion status to closed
