LEOPARD : A Vision Language Model For Text-Rich Multi-Image Tasks Paper • 2410.01744 • Published Oct 2 • 25
TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models Paper • 2109.10282 • Published Sep 21, 2021 • 6
Medical Multimodal Datasets Collection Datasets that can be used to train and/or evaluate medical multimodal models. • 3 items • Updated Dec 9, 2023 • 2
StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation Paper • 2405.01434 • Published May 2 • 52
Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model Paper • 2402.07827 • Published Feb 12 • 45
VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning Paper • 2309.15091 • Published Sep 26, 2023 • 32