---
library_name: transformers
license: mit
pipeline_tag: video-text-to-text
---
<a href='https://arxiv.org/abs/2501.13919v2'><img src='https://img.shields.io/badge/arXiv-paper-red'></a><a href='https://ruili33.github.io/tpo_website/'><img src='https://img.shields.io/badge/project-TPO-blue'></a>
<a href='https://huggingface.co/collections/ruili0/temporal-preference-optimization-67874b451f65db189fa35e10'><img src='https://img.shields.io/badge/model-checkpoints-yellow'></a>
<a href='https://github.com/ruili33/TPO'><img src='https://img.shields.io/badge/github-repository-purple'></a>
<img src="cvpr_figure_TPO.png"></img>
# LLaVA-Video-7B-Qwen2-TPO
LLaVA-Video-7B-Qwen2-TPO, introduced in the paper [Temporal Preference Optimization for Long-form Video Understanding](https://huggingface.co/papers/2501.13919v1), is built on LLaVA-Video-7B-Qwen2 and fine-tuned with temporal preference optimization. It establishes state-of-the-art performance across a range of benchmarks, with an average improvement of 1.5% over LLaVA-Video-7B.
Notably, it is the leading 7B-parameter model on the Video-MME benchmark.
Project page: https://ruili33.github.io/tpo_website/
Code: https://github.com/ruili33/TPO
## Evaluation Results
| **Model** | **Size** | **LongVideoBench** | **MLVU** | **VideoMME (w/o sub / w/ sub)** |
|-------------------------------------|----------|---------------------|----------|----------------------|
| **NVILA [1]** | 7B | 57.7 | 70.1 | 64.2/70.0 |
| **LLaVA-Video-7B [2]** | 7B | 58.2 | 70.8 | 63.3/69.7 |
| **LLaVA-Video-7B-Qwen2-TPO** | 7B | **60.1** | **71.1** | **65.6/71.5** |
## Get Started
Use the code below to get started with the model. For more information, please refer to our [GitHub repository](https://github.com/ruili33/TPO).
```python
# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from PIL import Image
import requests
import copy
import torch
import sys
import warnings
from decord import VideoReader, cpu
import numpy as np
warnings.filterwarnings("ignore")
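# Uniformly sample up to max_frames_num frames from a video; returns the frames,
# a string of their timestamps, and the total video duration in seconds.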
def load_video(video_path, max_frames_num, fps=1, force_sample=False):
    if max_frames_num == 0:
        return np.zeros((1, 336, 336, 3))
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    video_time = total_frame_num / vr.get_avg_fps()
    fps = round(vr.get_avg_fps() / fps)
    frame_idx = [i for i in range(0, len(vr), fps)]
    frame_time = [i / fps for i in frame_idx]
    if len(frame_idx) > max_frames_num or force_sample:
        # Too many frames at the requested fps: fall back to uniformly sampling exactly max_frames_num frames.
        sample_fps = max_frames_num
        uniform_sampled_frames = np.linspace(0, total_frame_num - 1, sample_fps, dtype=int)
        frame_idx = uniform_sampled_frames.tolist()
        frame_time = [i / vr.get_avg_fps() for i in frame_idx]
    frame_time = ",".join([f"{i:.2f}s" for i in frame_time])
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames, frame_time, video_time
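# Load the TPO checkpoint, tokenizer, and image processor from the Hugging Face Hub.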
pretrained = "ruili0/LLaVA-Video-7B-TPO"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map)  # Pass any additional llava_model_args here
model.eval()
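# Sample up to 64 frames from the demo video and preprocess them for the vision tower.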
video_path = "local_demo/assets/dc_demo.mp4"
max_frames_num = 64
video, frame_time, video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().bfloat16()
video = [video]
conv_template = "qwen_1_5" # Make sure you use correct chat template for different models
time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. These frames are located at {frame_time}. Please answer the following questions related to this video."
question = DEFAULT_IMAGE_TOKEN + f"{time_instruction}\nPlease describe this video in detail."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
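# Greedy decoding (do_sample=False) conditioned on the sampled video frames.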
cont = model.generate(
input_ids,
images=video,
modalities= ["video"],
do_sample=False,
temperature=0,
max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()
print(text_outputs)
```
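Once the snippet above has run, the loaded model and preprocessed frames can be reused for further questions. The sketch below is a minimal, illustrative follow-up (the question text is made up; it assumes `tokenizer`, `model`, `video`, `video_time`, `frame_time`, `conv_template`, and `device` from the snippet above are still in scope):
```python
# Illustrative follow-up question about the same video, reusing the variables
# defined in the snippet above (assumed to still be in scope).
follow_up = (
    DEFAULT_IMAGE_TOKEN
    + f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. "
    + f"These frames are located at {frame_time}. Please answer the following questions related to this video."
    + "\nWhat happens at the beginning of the video?"
)
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], follow_up)
conv.append_message(conv.roles[1], None)
input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        images=video,
        modalities=["video"],
        do_sample=False,
        temperature=0,
        max_new_tokens=512,
    )
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```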
## License
This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses for base language models (Qwen2 license). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.
## Citation
**BibTeX:**
```bibtex
@article{li2025temporal,
title={Temporal Preference Optimization for Long-Form Video Understanding},
author={Li, Rui and Wang, Xiaohan and Zhang, Yuhui and Wang, Zeyu and Yeung-Levy, Serena},
journal={arXiv preprint arXiv:2501.13919},
year={2025}
}
```
**References:**
[1]. Liu, Z., Zhu, L., Shi, B., Zhang, Z., Lou, Y., Yang, S., ... & Lu, Y. (2024). NVILA: Efficient Frontier Visual Language Models. arXiv preprint arXiv:2412.04468.
[2]. Zhang, Y., Wu, J., Li, W., Li, B., Ma, Z., Liu, Z., & Li, C. (2024). Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713. |