Collections
Collections including paper arxiv:2404.03413

- MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
  Paper • 2404.03413 • Published • 25
- Scaling Data-Constrained Language Models
  Paper • 2305.16264 • Published • 17
- Emergence of Hidden Capabilities: Exploring Learning Dynamics in Concept Space
  Paper • 2406.19370 • Published • 1

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 25
- MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
  Paper • 2404.03413 • Published • 25
- openai/clip-vit-large-patch14-336
  Zero-Shot Image Classification • Updated • 5.56M • 197
- openai/clip-vit-base-patch32
  Zero-Shot Image Classification • Updated • 28.5M • 518

- VideoAgent: Long-form Video Understanding with Large Language Model as Agent
  Paper • 2403.10517 • Published • 31
- VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
  Paper • 2403.11481 • Published • 12
- VideoMamba: State Space Model for Efficient Video Understanding
  Paper • 2403.06977 • Published • 27
- MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies
  Paper • 2403.01422 • Published • 26

- How Far Are We from Intelligent Visual Deductive Reasoning?
  Paper • 2403.04732 • Published • 18
- MoAI: Mixture of All Intelligence for Large Language and Vision Models
  Paper • 2403.07508 • Published • 75
- DragAnything: Motion Control for Anything using Entity Representation
  Paper • 2403.07420 • Published • 13
- Learning and Leveraging World Models in Visual Representation Learning
  Paper • 2403.00504 • Published • 31

- Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding
  Paper • 2403.09626 • Published • 13
- VideoAgent: Long-form Video Understanding with Large Language Model as Agent
  Paper • 2403.10517 • Published • 31
- VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis
  Paper • 2403.13501 • Published • 9
- LITA: Language Instructed Temporal-Localization Assistant
  Paper • 2403.19046 • Published • 18

- MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies
  Paper • 2403.01422 • Published • 26
- VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding
  Paper • 2403.09530 • Published • 8
- VidToMe: Video Token Merging for Zero-Shot Video Editing
  Paper • 2312.10656 • Published • 10
- TC4D: Trajectory-Conditioned Text-to-4D Generation
  Paper • 2403.17920 • Published • 16

- Video as the New Language for Real-World Decision Making
  Paper • 2402.17139 • Published • 18
- VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
  Paper • 2310.19512 • Published • 15
- VideoMamba: State Space Model for Efficient Video Understanding
  Paper • 2403.06977 • Published • 27
- VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
  Paper • 2401.09047 • Published • 13