Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs Paper • 2403.12596 • Published Mar 19 • 9
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models Paper • 2404.13013 • Published Apr 19 • 30
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning Paper • 2404.16994 • Published Apr 25 • 35
AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability Paper • 2405.14129 • Published May 23 • 12
Merlin:Empowering Multimodal LLMs with Foresight Minds Paper • 2312.00589 • Published Nov 30, 2023 • 24
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding Paper • 2407.15754 • Published Jul 22 • 19
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models Paper • 2407.15841 • Published Jul 22 • 39
Efficient Inference of Vision Instruction-Following Models with Elastic Cache Paper • 2407.18121 • Published Jul 25 • 16
VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges Paper • 2409.01071 • Published Sep 2 • 26
LongVLM: Efficient Long Video Understanding via Large Language Models Paper • 2404.03384 • Published Apr 4
Visual Context Window Extension: A New Perspective for Long Video Understanding Paper • 2409.20018 • Published Sep 30 • 9
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents Paper • 2410.10594 • Published Oct 14 • 22