Can InstructBLIP process videos?
I recently looked at the source of the blip2_vicuna-instruct7b model in the Salesforce/LAVIS repository and found code for handling videos. I don't know whether this made it into the Hugging Face InstructBLIP model. So I'm asking: can InstructBLIP handle videos, and if yes, how do I go about it?
Hi,
Thanks for your interest in InstructBLIP. Support for videos is not yet present in the Transformers library. Did the authors release any checkpoints trained on video?
I'm not aware of any at the moment; I'd have to check. What I saw was just code for handling videos with a low frame count.
also interested in processing videos
Can you share the snippet for handling videos from the original authors? It could probably be adapted to work with the Transformers model.
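As a starting point for such an adaptation, here is a minimal sketch of the usual frame-sampling step: pick evenly spaced frames from a clip, then feed each frame through the image processor as you would a single image. This mirrors the general idea in LAVIS (looping over the time dimension and processing frames independently); the function below is illustrative, not code from either repository.

```python
def sample_frame_indices(num_frames_total: int, num_frames_wanted: int) -> list[int]:
    """Return evenly spaced frame indices from a clip of
    `num_frames_total` frames. If the clip is shorter than the
    requested count, all frames are returned."""
    if num_frames_wanted >= num_frames_total:
        return list(range(num_frames_total))
    step = num_frames_total / num_frames_wanted
    return [int(i * step) for i in range(num_frames_wanted)]


# Example: pick 4 frames from a 100-frame clip.
indices = sample_frame_indices(100, 4)
print(indices)  # [0, 25, 50, 75]
```

Each sampled frame could then be passed to the InstructBLIP processor as a regular image; how the per-frame features are combined (e.g. concatenating Q-Former outputs, as LAVIS does) is the part that would need porting to the Transformers model.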
Hi,
I'm trying to run the demo at the end of https://huggingface.co/docs/transformers/main/en/model_doc/instructblip#transformers.InstructBlipForConditionalGeneration, but the model does not generate text; instead it gives this:
Loading checkpoint shards: 100%|██████████| 4/4 [00:24<00:00, 6.10s/it]
/home/tanya.kaintura/Project/myenv/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:412: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
? ? ? ? ? ? ? ? ? ? ? ? (the same "?" token repeated for the whole output)
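The warning itself is harmless and does not explain the "?" output: it only says that `top_p` has no effect while `do_sample=False`. A minimal sketch of silencing it, assuming the model, processor, and inputs are set up as in the linked docs (the `generate` call here is commented out because it needs the loaded model):

```python
# Either enable sampling so top_p takes effect, or drop top_p entirely.
generation_kwargs = dict(
    do_sample=True,      # enable nucleus sampling so top_p is actually used
    top_p=0.9,
    max_new_tokens=256,
)
# outputs = model.generate(**inputs, **generation_kwargs)
```

If the output is still a string of "?" tokens after that, the problem lies elsewhere (e.g. in how the checkpoint is loaded or the inputs are prepared), not in these sampling flags.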
Hi,
For videos I recommend taking a look at VideoBLIP: https://huggingface.co/models?other=video-to-text
PR is open for it now here: https://github.com/huggingface/transformers/pull/30182