- ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance (arXiv:2412.06673, published 12 days ago)
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling (arXiv:2412.05271, published 15 days ago)
- WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model (arXiv:2411.17459, published 25 days ago)
- Open-Sora Plan: Open-Source Large Video Generation Model (arXiv:2412.00131, published 23 days ago)
- ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation (arXiv:2406.18522, published Jun 26)
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions (arXiv:2406.04325, published Jun 6)
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models (arXiv:2401.15947, published Jan 29)
- Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models (arXiv:2311.16103, published Nov 27, 2023)
- LanguageBind 1.5 model collection: LanguageBind models trained on VIDAL-45M (2 items, updated May 23)
- LanguageBind 1.0 model collection: LanguageBind models trained on VIDAL-10M (9 items, updated Jan 28)
- LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment (arXiv:2310.01852, published Oct 3, 2023)
- Video-LLaVA: Learning United Visual Representation by Alignment Before Projection (arXiv:2311.10122, published Nov 16, 2023)