arxiv:2406.07476

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Published on Jun 11

Submitted by lixin4ever on Jun 13


Authors: Zesen Cheng, Sicong Leng, Xin Li, Yongxin Zhu, Ziyang Luo, Lidong Bing

Abstract

In this paper, we present VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data. Additionally, we integrate an Audio Branch into the model through joint training, thereby enriching the multimodal understanding capabilities of the model by seamlessly incorporating audio cues. Comprehensive evaluations on multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) tasks demonstrate that VideoLLaMA 2 consistently achieves competitive results among open-source models and even gets close to some proprietary models on several benchmarks. Furthermore, VideoLLaMA 2 exhibits reasonable improvements in audio-only and audio-video question-answering (AQA & OE-AVQA) benchmarks over existing models. These advancements underline VideoLLaMA 2's superior performance in multimodal comprehension, setting a new standard for intelligent video analysis systems. All models are public to facilitate further research.


Community

fblgit

Jun 14

You are basically modifying LLaVA, but the paper gives a totally different impression. Inside the code, it's all LLaVA.

    if "videollama" in model_name.lower():
        # Load LLaVA model

I mean, the predecessor of this is clearly LLaVA, and IMHO you're missing some important details in the paper.

lixin4ever

Paper author Jun 14

Thanks for pointing this out.

Yes, the codebase of VideoLLaMA2 is adapted from LLaVA. We have mentioned this and given credit to LLaVA in several places (e.g., videollama2_arch.py, videollama2_mistral.py, train.py, project page). We will make this clearer in the next version of our technical report.


Models citing this paper 4

Datasets citing this paper 1

Spaces citing this paper 2

Collections including this paper 8