AnyModal/VLM_Cartoon_Caption
The current demos in AnyModal cover vision + text tasks. We plan to add demos for other modalities, such as audio, in the future. Our goal is to make it easy for anyone to build a multimodal LLM from any input-modality tokenizer + LLM combination (hence the name AnyModal)!
Looks great! I am currently working on simplifying the training and fine-tuning of multimodal LLMs in PyTorch: https://github.com/ritabratamaiti/AnyModal
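
For anyone curious about the pattern being described, here is a minimal sketch of the general idea in PyTorch. This is my own illustration, not AnyModal's actual API: a modality encoder produces feature vectors, a small projection layer maps them into the LLM's embedding space, and the projected "tokens" are prepended to the text embeddings before the LLM runs. All class and parameter names below are hypothetical, and the toy encoder/LLM modules are stand-ins for real pretrained models.

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Linear map from the encoder's feature space into the LLM embedding space (hypothetical name)."""
    def __init__(self, encoder_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(encoder_dim, llm_dim)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, num_modality_tokens, encoder_dim)
        return self.proj(features)

class MultimodalLM(nn.Module):
    """Glues any modality encoder to any embedding-level LLM (hypothetical composition)."""
    def __init__(self, encoder: nn.Module, projector: ModalityProjector,
                 text_embedding: nn.Embedding, llm_body: nn.Module, lm_head: nn.Module):
        super().__init__()
        self.encoder = encoder
        self.projector = projector
        self.text_embedding = text_embedding
        self.llm_body = llm_body  # any module mapping (batch, seq, dim) -> (batch, seq, dim)
        self.lm_head = lm_head

    def forward(self, modality_input: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        modality_tokens = self.projector(self.encoder(modality_input))    # (B, M, llm_dim)
        text_embeds = self.text_embedding(input_ids)                      # (B, T, llm_dim)
        inputs_embeds = torch.cat([modality_tokens, text_embeds], dim=1)  # prepend modality tokens
        hidden = self.llm_body(inputs_embeds)
        return self.lm_head(hidden)                                       # (B, M+T, vocab)

# Toy instantiation: a dummy "image" encoder and a tiny transformer standing in for the LLM.
if __name__ == "__main__":
    vocab, llm_dim, enc_dim = 1000, 64, 32
    encoder = nn.Sequential(nn.Flatten(start_dim=2), nn.Linear(16, enc_dim))  # (B, M, 4, 4) -> (B, M, enc_dim)
    llm_body = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(llm_dim, nhead=4, batch_first=True), num_layers=2)
    model = MultimodalLM(encoder, ModalityProjector(enc_dim, llm_dim),
                         nn.Embedding(vocab, llm_dim), llm_body, nn.Linear(llm_dim, vocab))
    logits = model(torch.randn(2, 8, 4, 4), torch.randint(0, vocab, (2, 10)))
    print(logits.shape)  # torch.Size([2, 18, 1000])
```

Because the encoder, projector, and LLM are plain modules, any of the three can be swapped out independently, which is presumably what "any input-modality tokenizer + LLM combination" refers to; a common training recipe is to freeze the pretrained encoder and LLM and train only the projector.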