File size: 21,075 Bytes
8fb958b
3660ce4
f239efc
8fb958b
f239efc
8fb958b
f239efc
8fb958b
 
 
 
f239efc
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
---
title: Plava 13b Demo
emoji: 👁
colorFrom: blue
colorTo: yellow
sdk: gradio
sdk_version: 4.27.0
app_file: app.py
pinned: false
---

<div align="center">

<h2><a href="https://pllava.github.io/">PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning</a></h2>

[Lin Xu](https://scholar.google.com/citations?user=_Gu69coAAAAJ), [Yilin Zhao](https://ermu2001.github.io/me.io/), [Daquan Zhou](https://scholar.google.com/citations?user=DdCAbWwAAAAJ), [Zhijie Lin](https://scholar.google.com/citations?user=xXMj6_EAAAAJ), [See-Kiong Ng](https://scholar.google.com/citations?user=_wsommYAAAAJ), [Jiashi Feng](https://scholar.google.com.sg/citations?user=Q8iay0gAAAAJ&hl=en)

</div>

<!-- [![Paper](https://img.shields.io/badge/cs.CV-2311.17005-b31b1b?logo=arxiv&logoColor=red)](https://arxiv.org/abs/2311.17005) -->

**Project Page: [PLLaVA](https://pllava.github.io/)**

[![arXiv](https://img.shields.io/badge/arXiv-2404.16994-b31b1b.svg)](https://arxiv.org/abs/2404.16994)
[![YouTube Video](https://img.shields.io/badge/YouTube-Video-red)](https://www.youtube.com/watch?v=nAEje8tu18U)
[![Model on HF](https://huggingface.co/datasets/huggingface/badges/resolve/main/model-on-hf-sm-dark.svg)](https://huggingface.co/ermu2001/pllava-34b)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/pllava-parameter-free-llava-extension-from-1/zeroshot-video-question-answer-on-activitynet)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-activitynet?p=pllava-parameter-free-llava-extension-from-1)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/pllava-parameter-free-llava-extension-from-1/zeroshot-video-question-answer-on-msrvtt-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msrvtt-qa?p=pllava-parameter-free-llava-extension-from-1)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/pllava-parameter-free-llava-extension-from-1/zeroshot-video-question-answer-on-msvd-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msvd-qa?p=pllava-parameter-free-llava-extension-from-1)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/pllava-parameter-free-llava-extension-from-1/video-question-answering-on-mvbench)](https://paperswithcode.com/sota/video-question-answering-on-mvbench?p=pllava-parameter-free-llava-extension-from-1)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/pllava-parameter-free-llava-extension-from-1/zeroshot-video-question-answer-on-tgif-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-tgif-qa?p=pllava-parameter-free-llava-extension-from-1)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/pllava-parameter-free-llava-extension-from-1/video-based-generative-performance-4)](https://paperswithcode.com/sota/video-based-generative-performance-4?p=pllava-parameter-free-llava-extension-from-1)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/pllava-parameter-free-llava-extension-from-1/video-based-generative-performance-3)](https://paperswithcode.com/sota/video-based-generative-performance-3?p=pllava-parameter-free-llava-extension-from-1)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/pllava-parameter-free-llava-extension-from-1/video-based-generative-performance)](https://paperswithcode.com/sota/video-based-generative-performance?p=pllava-parameter-free-llava-extension-from-1)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/pllava-parameter-free-llava-extension-from-1/video-based-generative-performance-2)](https://paperswithcode.com/sota/video-based-generative-performance-2?p=pllava-parameter-free-llava-extension-from-1)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/pllava-parameter-free-llava-extension-from-1/video-based-generative-performance-1)](https://paperswithcode.com/sota/video-based-generative-performance-1?p=pllava-parameter-free-llava-extension-from-1)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/pllava-parameter-free-llava-extension-from-1/video-based-generative-performance-5)](https://paperswithcode.com/sota/video-based-generative-performance-5?p=pllava-parameter-free-llava-extension-from-1)







![]()
<div align="center">
  <a href="https://pllava.github.io">
    <img src="assert/logo.png">
  </a>
</div>

<div align="center">
  <video src="https://github.com/magic-research/PLLaVA/assets/55656210/a6619702-12d3-489d-bfcc-0ef7105544b2" width="100%">
</div>






## Overview

Welcome to PLLAVA!

The primary purpose of this repository is to support research and the development of prototype models. It is designed to facilitate ease of experimentation and enable a clear overview of results. Please note that this repo is currently undergoing development and reconstruction.

It's important to mention that we have not optimized the response speed of the application or the frontend logic. Our goal is to maintain simplicity, clarity, and ease of development, making it accessible for both researchers and students. If you have suggestions or want to enhance the application's performance, please feel free to contact us or contribute to the project.


We've briefly introduce our work in section [PLLAVA](#%EF%B8%8F-pllava). For more details, feel free to read our paper. Check out section [Usage](#hammer-usage) to start using this repo. If you felt our works interesting, please star us, your support is all we want. If you find our work helpful, feel free to [cite](#page_facing_up-citation) us directly.

## :fire: Updates

- **2024/4/24**: Release:
  - We are releasing our code/models/datasets.

## 🏖️ PLLAVA
<div align="center">
  <a href="https://www.youtube.com/embed/nAEje8tu18U?si=GXxjgP93j77FzDbw">
    <img src="assert/teaser.jpg">
  </a>
</div>


### Abstract

Vision-language pre-training (VLP) has significantly elevated performance across a range of vision-language applications. Yet, the pre-training process for video-related tasks demands an exceptionally high degree of computational and data resources. This paper investigates a straightforward, highly efficient, and resource-light approach to adapting an existing image-language pre-training model for video data. Our preliminary experiments reveal that directly fine-tuning pre-trained image-language models with multiple frames on video datasets leads to performance saturation or even a drop in caption-related tasks. Besides, it is also vulnerable to prompts and tends to provide short descriptions. We conducted a deep analysis and observed that the performance saturation and the vulnerability might be related to the dominant patches that exist in some single video patches. We then propose a simple pooling strategy to smooth the feature distribution along the temporal dimension and thus reduce the dominant impacts from some extreme tokens. The new model is termed Pooling LLaVA, or PLLaVA in short. With the proposed pooling strategy, we achieve new state-of-the-art performance on all evaluated datasets. Notably, on the recent popular Video ChatGPT benchmark, PLLaVA achieves a score of 3.48 out of 5 on average of five evaluated dimensions, which is the new state-of-the-art score on the leaderboard and is 0.31 higher than the previous SOTA results from GPT4V (IG-VLM). On the latest multi-choice benchmark MVBench, PLLaVA achieves 58.1% accuracy on average across 20 sub-tasks, which is the new state-of-the-art result and is 14.5% higher than GPT4V (IG-VLM).

<div align="center"><img src="assert/module.png"></div>


### SEARCHING FOR OPTIMAL POOLING STRATEGY
There are two dimensions for the pooling strategy: the spatial dimension and the temporal dimension. We empirically found that reducing the spatial dimension with a larger temporal dimension could lead to better model performance, compared to reducing the temporal dimension directly.

<div align="center"><img src="assert/zeroshot.png"></div>


### STATE-OF-THE-ART PERFORMANCE
We compare the performance of PLLAVA with recent popular methods over both question-answer and captioning datasets. The results are shown below.

<div align="center"><img src="assert/performance.png"></div>

## :hammer: Usage

This section provides guidance on how to run, train, and evaluate our models.

### Install
First, you will need to set up the environment and download some pre-trained weights. 

This repo is built up using [transformers](https://github.com/huggingface/transformers) for model construction along with [accelerate](https://github.com/huggingface/accelerate) for distributed training. Follow the instructions to install the needed environment.

1. Above all, the following environment set up is for python 3.10. If you choose to use conda for environment setup, we recommend creating the virtual environment with:
```bash
conda create -n pllava python=3.10
```

1. Firstly, install [pytorch](https://pytorch.org/) from the official website. The code runs on torch 2.2.1, cu118 or cu122. Select the version that suits your drive version.

```
torch                       2.2.1+cu118
torchaudio                  2.2.1+cu118
torchvision                 0.17.1+cu118
```

If your driver version is higher than cu121, you could probably try installing with the following scripts:
```bash
pip install -r requirements.txt
```

Otherwise, you would need to install a torch for your server first, then install the other packages:
```bash
pip install -r requirements.torch.txt # decide your own requirements, (this is for cu11), or install torch directly following the official website.
pip install -r requirements.no_torch.txt # install the following
```

1. Prepare the model.
We prefer to have huggingface models explicitly downloaded to a MODELS directory. However, if you are familiar with huggingface-hub usage, feel free to organize the model yourself.
```
python python_scripts/hf.py
```

Here are some detailed information of the obtained models:


| Model      | Link                                                                                                                                                  | Initialized From                                                                                              |
| ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------- |
| pllava-7b  | [![Model on HF](https://huggingface.co/datasets/huggingface/badges/resolve/main/model-on-hf-sm-dark.svg)](https://huggingface.co/ermu2001/pllava-7b)  | [llava-hf/llava-v1.6-vicuna-7b-hf · Hugging Face](https://huggingface.co/llava-hf/llava-v1.6-vicuna-7b-hf)   |
| pllava-13b | [![Model on HF](https://huggingface.co/datasets/huggingface/badges/resolve/main/model-on-hf-sm-dark.svg)](https://huggingface.co/ermu2001/pllava-13b) | [llava-hf/llava-v1.6-vicuna-13b-hf · Hugging Face](https://huggingface.co/llava-hf/llava-v1.6-vicuna-13b-hf) |
| pllava-34b | [![Model on HF](https://huggingface.co/datasets/huggingface/badges/resolve/main/model-on-hf-sm-dark.svg)](https://huggingface.co/ermu2001/pllava-34b) | [llava-hf/llava-v1.6-34b-hf · Hugging Face](https://huggingface.co/llava-hf/llava-v1.6-34b-hf)               |

The model directory should look like this, where you would only need the corresponding model's weights and directory.

```
$ tree MODELS
MODELS
|-- pllava-13b
|   |-- added_tokens.json
|   |-- config.json
|   |-- generation_config.json
|   |-- model-00001-of-00006.safetensors
|   |-- model-00002-of-00006.safetensors
|   |-- model-00003-of-00006.safetensors
|   |-- model-00004-of-00006.safetensors
|   |-- model-00005-of-00006.safetensors
|   |-- model-00006-of-00006.safetensors
|   |-- model.safetensors.index.json
|   |-- preprocessor_config.json
|   |-- processor_config.json
|   |-- special_tokens_map.json
|   |-- tokenizer.json
|   |-- tokenizer.model
|   `-- tokenizer_config.json
|-- pllava-34b
|   |-- added_tokens.json
|   |-- config.json
|   |-- generation_config.json
|   |-- model-00001-of-00015.safetensors
|   |-- model-00002-of-00015.safetensors
|   |-- model-00003-of-00015.safetensors
|   |-- model-00004-of-00015.safetensors
|   |-- model-00005-of-00015.safetensors
|   |-- model-00006-of-00015.safetensors
|   |-- model-00007-of-00015.safetensors
|   |-- model-00008-of-00015.safetensors
|   |-- model-00009-of-00015.safetensors
|   |-- model-00010-of-00015.safetensors
|   |-- model-00011-of-00015.safetensors
|   |-- model-00012-of-00015.safetensors
|   |-- model-00013-of-00015.safetensors
|   |-- model-00014-of-00015.safetensors
|   |-- model-00015-of-00015.safetensors
|   |-- model.safetensors-deprecated
|   |-- model.safetensors.index.json
|   |-- preprocessor_config.json
|   |-- processor_config.json
|   |-- special_tokens_map.json
|   |-- tokenizer.json
|   |-- tokenizer.model
|   `-- tokenizer_config.json
|-- pllava-7b
    |-- added_tokens.json
    |-- config.json
    |-- generation_config.json
    |-- model-00001-of-00003.safetensors
    |-- model-00002-of-00003.safetensors
    |-- model-00003-of-00003.safetensors
    |-- model.safetensors.index.json
    |-- preprocessor_config.json
    |-- processor_config.json
    |-- special_tokens_map.json
    |-- tokenizer.json
    |-- tokenizer.model
    `-- tokenizer_config.json
```

With the above steps, you should be able to proceed on with the following usages.

### Run Application

To run our models, make sure you have downloaded a model pretrained weights from the huggingface spaces. Then, run the following scripts with the corresponding path input. Since we are only training with lora and the projector, the model to be run are determined with:

- **model_dir**: model directory, one with config.json as compatible with transformers. This refers to the base model's directory, for example "llava-hf/llava-v1.6-vicuna-7b-hf"/"ermu2001/pllava-7b"/"MODELS/pllava-7b". (default to: MODELS/plave-7b)
- **weights_dir**: your weights directory. could be the same as model_dir, but if you have a weights directory for the lora weights, you should set this weights_dir to that directory to load the lora weights. This directory should be local. Also, it would need to contain a config.json file within. (default to: ${model_dir}).

```bash
model_dir="model directory"
weights_dir="weights directory"
bash scripts/demo.sh ${model_dir} ${weights_dir}
```

Now check out the application demo and try play with PLLAVA!

### Train

Follow the following steps to reproduce our results or train your own variant:

#### 1. Data Preparation

To train our model from a starting Image-aligned Vision LLM, you would need to download the data first. Our data set up is mainly based on the original Videochat2's training data. Check out [Instruction Data](./DATA.md) to prepare the instruction training data. Ideally, setting up a root data directory and alter the code [here](./tasks/train/instruction_data.py#L6) would accomodate the data for training most smoothly.

#### 2. Start Training

Now you're only a few step away from starting the training. Follow the instructions:

##### Setup Accelerator

Customize a accelerate training config. For example, a simple config using multiple gpus with no distribution strategy (only torch DDP) would look like:

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

Check out out the [Accelerate](https://huggingface.co/docs/accelerate/index) documents for more details.

##### Overwatch the training configuration

Next, you should go over a basic training configuration of the training process in [here](tasks/train/config_pllava_nframe.py). Then passing this file as the first arg to the [training script](tasks/train/train_pllava_nframe_accel.py) would utilize every arguments in the file. You can customize some of the hyper parameters for your own training process by passing them in the format of "key" "value" pair in the following arguments. A example training scripts could be find [here](scripts/train_pllava.sh).

We recommand customize a [configuration](tasks/train/config_pllava_nframe.py) to set up a customized training!

With the above steps, you would be able to start the training process. The output would be well organized in the output directory, each a qualified model directory to pass in to demo as weights_dir, since we are only saveing the lora weights and projector weights to avoide redundancy.

### Evaluation

This section mainly introduce how to reproduce the evaluation or evaluate your own model.

#### Set up Evaluation Data

Make sure you set up the "DATAS" directory as in [DATA.md](DATA.md), then you would be able to run the inference with fortune! The evaluation data directory of DATAS would look like:

```
DATAS/:
DATAS/VideoQA:
DATAS/VideoQA/TGIF_QA:
                     test_a.json
                     test_q.json
DATAS/VideoQA/TGIF_QA/videos:
                            tumblr_m4387mGrlc1r6m5e8o1_250.gif
                            ...
DATAS/VideoQA/TGIF_QA/videos_mp4:
                                tumblr_m4387mGrlc1r6m5e8o1_250.mp4
                                ...
DATAS/VideoQA/TGIF_QA/video_gif:
                               tumblr_m4387mGrlc1r6m5e8o1_250.gif
                               ...
DATAS/VideoQA/MSVD_Zero_Shot_QA:
                               test_a.json
                               test_q.json
DATAS/VideoQA/MSVD_Zero_Shot_QA/videos:
                                      -4wsuPCjDBc_5_15.avi
DATAS/VideoQA/MSVD_Zero_Shot_QA/msvd_qa:
DATAS/VideoQA/ActivityNet:
                         test_a.json
                         test_q.json
DATAS/VideoQA/ActivityNet/all_test:
                                  v_--tFD65KaK4.mp4
                                  ...
DATAS/VideoQA/MSRVTT_Zero_Shot_QA:
                                 test_a.json
                                 test_q.json
DATAS/VideoQA/MSRVTT_Zero_Shot_QA/videos:
DATAS/VideoQA/MSRVTT_Zero_Shot_QA/videos/all:
                                            video0.mp4
                                            ...

DATAS/MVBench:
             ...

DATAS/Recaption/Inter4K:
                       annotations.json
DATAS/Recaption/Inter4K/60fps:
DATAS/Recaption/Inter4K/60fps/UHD:
                                 1.mp4
                                 ...

```

#### Start Evaluate

Once you have construted the evaluation data, you can start the evaluation as in [here](scripts/eval.sh). This script is for evaluating 7B/13B models. As pllava-34b model uses a slightly different prompting, it is evaluated with this [script](scripts/eval_yiprompt.sh).

```
bash scripts/eval.sh
```

Same as running the demo, you would need to determine the model_dir and weights_dir to evaluate the model. Feel free to comment out some commands and produce partial evaluation.

#### Overwatch the Results

The evaluation results would be shown to you with our results gallery demo:

```bash
bash scripts/gallery.sh 
```

Feel free to use the compare version to compare differnt models' results or use the single gallery version to check out one model's results. They are basically the same. Check out this [script](scripts/gallery.sh) for more details

#### For Captioning and Recaptioning
Follow instructions at [DATA.md](DATA.md#extending-reacptioning) and you can extend the recaptioning data with a few steps.

Feel free to point out high quality dataset of videos, we would proceed on doing captioning on those datasets.


# :page_facing_up: Citation

If you find this project useful in your research, please consider cite:

```BibTeX
@misc{xu2024pllava,
      title={PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning}, 
      author={Lin Xu and Yilin Zhao and Daquan Zhou and Zhijie Lin and See Kiong Ng and Jiashi Feng},
      year={2024},
      eprint={2404.16994},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

# :dizzy: Acknowledgement

This code base is mainly built upon [Videochat2](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2). SALUTE.

We would also like to recognize and commend the following open source projects, thank you for your great contribution to the open source community:

- [LLaVA](https://github.com/haotian-liu/LLaVA): Fantastic Open Source Image LLM Model.
- [VideoChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT/tree/main): Great Evaluation Benchmarking Framework.
- [VideoLlava](https://github.com/PKU-YuanGroup/Video-LLaVA/tree/main/videollava):Video LLM repo with helpful resources.