Fine-tuning for multiple tasks strategy
I would like to fine-tune this model on a specific set of images, combining 2 different tasks used in cascade.
The idea is that once the model receives the input image, it should first perform the image captioning task (MORE_DETAILED_CAPTION) to describe the image, and then use CAPTION_TO_PHRASE_GROUNDING to get a 'visual perspective' of what it has described (a sort of Grad-CAM for the text).
What should I do in this case? Fine-tune the model twice, starting with the image captioning task, and then use the resulting model as the starting point for training on the second task?
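For what it's worth, the cascade itself needs no special architecture at inference time, since both tasks are selected by a prompt token on the same weights. Below is a minimal sketch, assuming the Hugging Face `microsoft/Florence-2-large` checkpoint and the prompt tokens from its model card (the image path is a placeholder):

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"  # assumed checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

def run_task(task, image, text=""):
    # Florence-2 selects the task via a prompt token; extra text
    # (e.g. the caption to be grounded) is appended to the prompt.
    inputs = processor(text=task + text, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(
        generated_text, task=task, image_size=(image.width, image.height)
    )

image = Image.open("example.png").convert("RGB")  # placeholder path

# Step 1: describe the image.
caption = run_task("<MORE_DETAILED_CAPTION>", image)["<MORE_DETAILED_CAPTION>"]

# Step 2: ground the phrases of that caption back onto the image.
grounding = run_task("<CAPTION_TO_PHRASE_GROUNDING>", image, text=caption)
print(grounding["<CAPTION_TO_PHRASE_GROUNDING>"])  # boxes + phrase labels
```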
Same here, I am working on Chart Question Answering and would like to fine-tune this model on multiple tasks (Visual Question Answering and Object Detection). Of course, I don't want to fine-tune the model twice.
Have you found a way to do that?
I am also looking for a multi-task learning strategy. My understanding from the paper is that we need to design a dataset that mixes examples from the different tasks, since Florence-2 is trained with a single loss over all of them.
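Since every Florence-2 task is text-to-text under the same token-level cross-entropy loss, one way to fine-tune once is to interleave examples from both tasks in a single dataset, with each example carrying its own task prompt. A minimal sketch of that idea follows; the `<VQA>` prompt token, the field names, and the sample layout are assumptions for illustration, not Florence-2's official training code:

```python
import random
from torch.utils.data import Dataset

class MultiTaskDataset(Dataset):
    def __init__(self, vqa_samples, det_samples):
        # vqa_samples: list of (image, question, answer) tuples
        # det_samples: list of (image, target_string) tuples, where
        # target_string encodes boxes in Florence-2's location-token format
        self.samples = (
            [("<VQA>", img, q, a) for img, q, a in vqa_samples]       # hypothetical task token
            + [("<OD>", img, "", tgt) for img, tgt in det_samples]
        )
        random.shuffle(self.samples)  # interleave the two tasks

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        task, image, extra_text, target = self.samples[idx]
        # The prompt selects the task; the target is always plain text,
        # so the same language-modeling loss covers answers and boxes alike.
        return {"prompt": task + extra_text, "image": image, "target": target}
```

In the training loop you would then tokenize `prompt` (with the image) as the encoder input and `target` as the labels, and run a standard seq2seq fine-tune; no second training pass should be needed.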