- Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs
  Paper • 2403.12596 • Published • 10
- Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
  Paper • 2404.13013 • Published • 32
- PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
  Paper • 2404.16994 • Published • 37
- AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability
  Paper • 2405.14129 • Published • 14
Collections
Collections including paper arxiv:2407.15841
- Video as the New Language for Real-World Decision Making
  Paper • 2402.17139 • Published • 22
- Learning and Leveraging World Models in Visual Representation Learning
  Paper • 2403.00504 • Published • 34
- MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies
  Paper • 2403.01422 • Published • 30
- VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models
  Paper • 2403.05438 • Published • 22

- Adapting Large Language Models via Reading Comprehension
  Paper • 2309.09530 • Published • 78
- An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models
  Paper • 2309.09958 • Published • 19
- Noise-Aware Training of Layout-Aware Language Models
  Paper • 2404.00488 • Published • 10
- Streaming Dense Video Captioning
  Paper • 2404.01297 • Published • 13