ValueError: Number of image tokens in input_ids (0) different from num_images (1)
I get this error when I run:
output = model.generate(**inputs_video, max_new_tokens=100, do_sample=False)
My complete code:
import requests
import torch
from PIL import Image
from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration

model_id = "llava-hf/LLaVA-NeXT-Video-34B-DPO-hf"
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto"
)
processor = LlavaNextVideoProcessor.from_pretrained(model_id)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What are these?"},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs_image = processor(prompt, images=raw_image, return_tensors='pt').to(0, torch.float16)
output = model.generate(**inputs_image, max_new_tokens=100, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
If I load the processor with use_fast=False, the error goes away:
processor = LlavaNextVideoProcessor.from_pretrained(model_id, use_fast=False)
But when I test with a video, the same error appears again.
Code:
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)
inputs_video = processor(text=prompt, videos=clip, padding=True, return_tensors="pt").to("cuda", torch.bfloat16)
output = model.generate(**inputs_video, max_new_tokens=30, do_sample=False)
Pinging @RaushanTurganbay here
@Dilllllll from the provided code it's unclear whether you're trying to run inference with an image or a video. Please note that you need to specify the correct modality (image or video) in the conversation template.
Can you please share runnable code that reproduces the error, if it's not resolved yet?
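To illustrate the modality point, here is a minimal sketch of building the conversation so that the placeholder entry matches the input type. The helper name `build_conversation` is hypothetical and not part of transformers; it only produces the same list-of-dicts structure used in the snippets in this thread.

```python
def build_conversation(text, modality):
    """Return a single-turn user conversation with one media placeholder.

    `modality` must be "image" or "video" and should match the keyword
    later passed to the processor (images=... or videos=...).
    """
    if modality not in ("image", "video"):
        raise ValueError(f"unsupported modality: {modality!r}")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": text},
                {"type": modality},  # placeholder entry for the media input
            ],
        },
    ]


# Hypothetical usage: one conversation per input type.
image_conversation = build_conversation("What are these?", "image")
video_conversation = build_conversation("Why is this video funny?", "video")
```

The resulting list can then be passed to processor.apply_chat_template(..., add_generation_prompt=True) as in the code above.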
code:
import os
import av
import torch
from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration
import numpy as np
from huggingface_hub import hf_hub_download
import time
model_id = "./llava-hf/LLaVA-NeXT-Video-34B-DPO-hf"
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto"
).eval()
processor = LlavaNextVideoProcessor.from_pretrained(model_id)
def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.
    Args:
        container (av.container.input.InputContainer): PyAV container.
        indices (List[int]): List of frame indices to decode.
    Returns:
        result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Why is this video funny?"},
            {"type": "video"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
video_path = "./sample_demo_1.mp4"
container = av.open(video_path)
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 2).astype(int)
clip = read_video_pyav(container, indices)
inputs_video = processor(text=prompt, videos=clip, padding=True, return_tensors="pt").to("cuda", torch.float16)
output = model.generate(**inputs_video, max_new_tokens=30)
print(processor.decode(output[0][2:], skip_special_tokens=True))
@RaushanTurganbay, the error appears whether the input is an image or a video.
@Dilllllll thanks, I found the bug. The chat template was not correct. I updated the files on the Hub, so it should be working now.
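For anyone debugging a similar mismatch, a quick sanity check is to look at the rendered prompt string before calling the processor. The sketch below assumes the placeholders appear in the prompt as the literal strings "<image>" and "<video>"; that is an assumption about this model's template, and the function name `check_prompt` and the example prompts are purely illustrative, not actual template output.

```python
def check_prompt(prompt, num_images=0, num_videos=0,
                 image_token="<image>", video_token="<video>"):
    """Return a list of problems found in a rendered prompt string.

    The default token strings are assumptions; adjust them if the
    processor for your model uses different placeholders.
    """
    problems = []
    found_images = prompt.count(image_token)
    found_videos = prompt.count(video_token)
    if found_images != num_images:
        problems.append(
            f"expected {num_images} image placeholder(s), found {found_images}"
        )
    if found_videos != num_videos:
        problems.append(
            f"expected {num_videos} video placeholder(s), found {found_videos}"
        )
    return problems


# Illustrative prompts only; a real prompt comes from apply_chat_template.
broken_prompt = "USER: Why is this video funny? ASSISTANT:"
fixed_prompt = "USER: <video>\nWhy is this video funny? ASSISTANT:"

broken_report = check_prompt(broken_prompt, num_videos=1)
fixed_report = check_prompt(fixed_prompt, num_videos=1)
```

A non-empty report for a prompt produced by apply_chat_template would suggest that the template dropped the media placeholder, which is consistent with the "(0) different from num_images (1)" message in the error above.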
@RaushanTurganbay Thanks, the code works.