Collections of multimodal (image+text) instruction finetuning datasets tailored for visual language models like LlaVA, Fuyu, or IDEFICS.