SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
Abstract
We propose SlowFast-LLaVA (SF-LLaVA for short), a training-free video large language model (LLM) that jointly captures detailed spatial semantics and long-range temporal context without exceeding the token budget of commonly used LLMs. This is realized with a two-stream SlowFast design of inputs for Video LLMs that aggregates features from sampled video frames in an effective way. Specifically, the Slow pathway extracts features at a low frame rate while keeping as many spatial details as possible (e.g., with 24x24 tokens), and the Fast pathway operates at a high frame rate but uses a larger spatial pooling stride (e.g., 6x downsampling) to focus on motion cues. This design allows us to adequately capture both spatial and temporal features that are beneficial for understanding details throughout the video. Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks. On some benchmarks, it achieves performance comparable to or even better than state-of-the-art Video LLMs that are fine-tuned on video datasets.
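For readers who want a concrete picture of the two-stream input aggregation, below is a minimal, hypothetical PyTorch sketch. The function name, frame count, sampling stride, and pooling kernels are illustrative choices of ours; only the 24x24 encoder grid, the 6x Fast downsampling (from the abstract), and the 12x24 Slow resolution (from the community discussion below) come from the paper, and the authors' actual implementation may differ.

```python
import torch
import torch.nn.functional as F

def slowfast_visual_tokens(frame_feats, slow_stride=6, slow_pool=(2, 1), fast_pool=(6, 6)):
    """Hypothetical sketch of the two-stream (Slow/Fast) token aggregation.

    frame_feats: (T, C, H, W) per-frame features from the vision encoder,
                 e.g. H = W = 24 as mentioned in the abstract. The frame
                 count T and the default strides here are illustrative.
    """
    T, C, H, W = frame_feats.shape

    # Slow pathway: low frame rate, light pooling to preserve spatial detail
    # (e.g. 24x24 -> 12x24, as discussed in the community thread below).
    slow = frame_feats[::slow_stride]
    slow = F.avg_pool2d(slow, kernel_size=slow_pool, stride=slow_pool)

    # Fast pathway: every sampled frame, aggressive pooling (e.g. 6x)
    # so that motion cues stay cheap in tokens.
    fast = F.avg_pool2d(frame_feats, kernel_size=fast_pool, stride=fast_pool)

    # Flatten both streams into token sequences and concatenate them
    # before the projector / LLM.
    slow_tokens = slow.flatten(2).transpose(1, 2).reshape(-1, C)
    fast_tokens = fast.flatten(2).transpose(1, 2).reshape(-1, C)
    return torch.cat([slow_tokens, fast_tokens], dim=0)  # (N_slow + N_fast, C)
```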
Community
Thanks to all the authors for such a comprehensive and insightful research paper.
I have a question that I hope you can clarify. In the section discussing the slow pathway, it is mentioned that it outputs 24×24 tokens. However, in the subsequent calculation, it is shown as 12×24 tokens. Could you please clarify this?
Additionally, is there a plan to release the models or code to facilitate the reproduction of the results and application to downstream tasks?
Thank you for your interest in our paper. The idea of the Slow pathway is to use a low frame rate but keep more tokens per frame. There are multiple design choices to achieve this, and 24x24 tokens is the original output of the vision encoder. However, using 24x24 tokens leads to out-of-memory (OOM) issues, and we found that applying a proper pooling operation does not decrease performance. Thus, we use 12x24 tokens by default for SlowFast-LLaVA to keep as many spatial details as possible. All numbers in the main results are based on 12x24 tokens. We will clarify this in the revision.
We are still working on the code release. Stay tuned!
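To see why the 24x24 to 12x24 pooling matters for the LLM's context budget, here is a back-of-the-envelope token count. The Slow/Fast frame counts below are assumptions chosen purely for illustration, not numbers from the paper or this thread:

```python
# Back-of-the-envelope visual-token budget (frame counts are illustrative).
SLOW_FRAMES, FAST_FRAMES = 8, 48       # assumed sampling, not from the thread

full_res = 24 * 24                     # 576 tokens/frame straight from the encoder
slow_res = 12 * 24                     # 288 tokens/frame after the 2x pooling
fast_res = (24 // 6) * (24 // 6)       # 16 tokens/frame after 6x downsampling

print(SLOW_FRAMES * full_res + FAST_FRAMES * fast_res)  # 5376 tokens without pooling
print(SLOW_FRAMES * slow_res + FAST_FRAMES * fast_res)  # 3072 tokens with 12x24 Slow
```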
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs (2024)
- VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding (2024)
- OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding (2024)
- video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models (2024)
- Goldfish: Vision-Language Understanding of Arbitrarily Long Videos (2024)