language:
- en
license: llama2
pipeline_tag: image-text-to-text
LLaVA-NeXT-Video Model Card
Below is the model card of LLaVa-NeXT-Video model 7b, which is copied from the original Llava model card that you can find here.
Check out also the Google Colab demo to run Llava on a free-tier Google Colab instance:
Disclaimer: The team releasing LLaVa-NeXT-Video did not write a model card for this model so this model card has been written by the Hugging Face team.
π Model details
Model type:
LLaVA-Next-Video is an open-source chatbot trained by fine-tuning LLM on multimodal instruction-following data. The model is buit on top of LLaVa-NeXT by tuning on a mix of video and image data. The videos were sampled uniformly to be 32 frames per clip.
Base LLM: lmsys/vicuna-7b-v1.5
Model date:
LLaVA-Next-Video-7B was trained in April 2024.
Paper or resources for more information:
https://github.com/LLaVA-VL/LLaVA-NeXT
π Training dataset
Image
- 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
- 158K GPT-generated multimodal instruction-following data.
- 500K academic-task-oriented VQA data mixture.
- 50K GPT-4V data mixture.
- 40K ShareGPT data.
Video
- 100K VideoChatGPT-Instruct.
π Evaluation dataset
A collection of 4 benchmarks, including 3 academic VQA benchmarks and 1 captioning benchmark.
π How to use the model
First, make sure to have transformers >= 4.42.0
.
The model supports multi-visual and multi-prompt generation. Meaning that you can pass multiple images/videos in your prompt. Make sure also to follow the correct prompt template (USER: xxx\nASSISTANT:
) and add the token <image>
or <video>
to the location where you want to query images/videos:
Below is an example script to run generation in float16
precision on a GPU device:
import av
import torch
from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration
model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
).to(0)
processor = LlavaNextVideoProcessor.from_pretrained(model_id)
def read_video_pyav(container, indices):
'''
Decode the video with PyAV decoder.
Args:
container (`av.container.input.InputContainer`): PyAV container.
indices (`List[int]`): List of frame indices to decode.
Returns:
result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
'''
frames = []
container.seek(0)
start_index = indices[0]
end_index = indices[-1]
for i, frame in enumerate(container.decode(video=0)):
if i > end_index:
break
if i >= start_index and i in indices:
frames.append(frame)
return np.stack([x.to_ndarray(format="rgb24") for x in frames])
prompt = "USER: <video>\nWhy is this video funny? ASSISTANT:"
video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
container = av.open(video_path)
# sample uniformly 8 frames from the video, can sample more for longer videos
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)
inputs_video = processor(text=prompt, videos=clip, padding=True, return_tensors="pt").to(model.device)
output = model.generate(**inputs_video, max_new_tokens=100, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
Inference with images as inputs
To generate from images use the below code after loading the model as shown above:
import requests
from PIL import Image
prompt = "USER: <image>\nWhat are these?\nASSISTANT:"
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs_image = processor(prompt, images=raw_image, return_tensors='pt').to(0, torch.float16)
output = model.generate(**inputs_video, max_new_tokens=100, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
Inference with images and videos as inputs
To generate from images and videos in one generate use the below code after loading the model as shown above:
prompts = [
"USER: <image>\nWhat's the content of the image? ASSISTANT:",
"USER: <video>\nWhy is this video funny? ASSISTANT:"
]
inputs = processor(text=prompts, images=image, videos=clip, padding=True, return_tensors="pt").to(model.device)
# Generate
generate_ids = model.generate(**inputs, max_new_tokens=100)
out = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(out)
Model optimization
4-bit quantization through bitsandbytes
library
First make sure to install bitsandbytes
, pip install bitsandbytes
and make sure to have access to a CUDA compatible GPU device. Simply change the snippet above with:
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
+ load_in_4bit=True
)
Use Flash-Attention 2 to further speed-up generation
First make sure to install flash-attn
. Refer to the original repository of Flash Attention regarding that package installation. Simply change the snippet above with:
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
+ use_flash_attention_2=True
).to(0)
π License
Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
π― Intended use
Primary intended uses:
The primary use of LLaVA is research on large multimodal models and chatbots.
Primary intended users:
The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.
βοΈ Citation
If you find our paper and code useful in your research:
@misc{zhang2024llavanextvideo,
title={LLaVA-NeXT: A Strong Zero-shot Video Understanding Model},
url={https://llava-vl.github.io/blog/2024-04-30-llava-next-video/},
author={Zhang, Yuanhan and Li, Bo and Liu, haotian and Lee, Yong jae and Gui, Liangke and Fu, Di and Feng, Jiashi and Liu, Ziwei and Li, Chunyuan},
month={April},
year={2024}
}
@misc{liu2024llavanext,
title={LLaVA-NeXT: Improved reasoning, OCR, and world knowledge},
url={https://llava-vl.github.io/blog/2024-01-30-llava-next/},
author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Li, Bo and Zhang, Yuanhan and Shen, Sheng and Lee, Yong Jae},
month={January},
year={2024}
}