ValueError: Number of image tokens in input_ids (0) different from num_images (1)

by Dilllllll - opened Jun 28

Discussion

Dilllllll

Jun 28

When I run the following line, I get the error in the title:
output = model.generate(**inputs_video, max_new_tokens=100, do_sample=False)

Dilllllll

Jun 28

My complete code:
import requests
import torch
from PIL import Image
from transformers import LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor

model_id = "llava-hf/LLaVA-NeXT-Video-34B-DPO-hf"
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)
processor = LlavaNextVideoProcessor.from_pretrained(model_id)

# Single-turn conversation with one image placeholder
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What are these?"},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs_image = processor(prompt, images=raw_image, return_tensors='pt').to(0, torch.float16)
output = model.generate(**inputs_image, max_new_tokens=100, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))

Dilllllll

Jul 1

Loading the processor with use_fast=False fixes the error for the image example:
processor = LlavaNextVideoProcessor.from_pretrained(model_id, use_fast=False)

Dilllllll changed discussion status to closed Jul 1

Dilllllll changed discussion status to open Jul 1

Dilllllll

Jul 1

But when I test with a video, the same error appears again.

code:
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)
inputs_video = processor(text=prompt, videos=clip, padding=True, return_tensors="pt").to("cuda", torch.bfloat16)
output = model.generate(**inputs_video, max_new_tokens=30, do_sample=False)

Dilllllll changed discussion status to closed Jul 1

Dilllllll changed discussion status to open Jul 1

nielsr

Llava Hugging Face org Jul 1

Pinging @RaushanTurganbay here

RaushanTurganbay

Llava Hugging Face org Jul 8

@Dilllllll from the provided code it's unclear whether you're trying to run inference with an image or a video. Please note that you need to specify the correct modality (image or video) in the conversation template.

Can you please share runnable code for reproducing the error if it's not resolved yet?
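
To illustrate the point about modality, here is a minimal sketch. It assumes the conversation format used in the snippets in this thread, where the content list carries a {"type": "image"} or {"type": "video"} entry that should match what is later passed to the processor (images=... or videos=...). The helper name build_conversation is hypothetical and not part of transformers.

```python
def build_conversation(text, modality):
    """Build a single-turn user conversation with one visual placeholder.

    `modality` must be "image" or "video" and should match the keyword
    later passed to the processor (images=... or videos=...).
    This helper is hypothetical, written only for illustration.
    """
    if modality not in ("image", "video"):
        raise ValueError(f"modality must be 'image' or 'video', got {modality!r}")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": text},
                {"type": modality},  # placeholder entry for the visual input
            ],
        },
    ]


image_conversation = build_conversation("What are these?", "image")
video_conversation = build_conversation("Why is this video funny?", "video")
```

Either result can then be passed to processor.apply_chat_template(conversation, add_generation_prompt=True), as in the code posted in this thread.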

Dilllllll

Jul 8

code:
import os
import av
import torch
from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration
import numpy as np
from huggingface_hub import hf_hub_download
import time
model_id = "./llava-hf/LLaVA-NeXT-Video-34B-DPO-hf"

model = LlavaNextVideoForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
device_map="auto"
).eval()

processor = LlavaNextVideoProcessor.from_pretrained(model_id)
def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.
    Args:
        container (av.container.input.InputContainer): PyAV container.
        indices (List[int]): List of frame indices to decode.
    Returns:
        result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

conversation = [
{

    "role": "user",
    "content": [
        {"type": "text", "text": "Why is this video funny?"},
        {"type": "video"},
        ],
},

]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
video_path = "./sample_demo_1.mp4"
container = av.open(video_path)
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 2).astype(int)
clip = read_video_pyav(container, indices)
inputs_video = processor(text=prompt, videos=clip, padding=True, return_tensors="pt").to("cuda", torch.float16)
output = model.generate(**inputs_video, max_new_tokens=30)
print(processor.decode(output[0][2:], skip_special_tokens=True))

Dilllllll

Jul 8

@RaushanTurganbay, this error appears whether the input is an image or a video.

RaushanTurganbay

Llava Hugging Face org Jul 8

@Dilllllll thanks, I found the bug. The chat template was not correct. I updated the files on the hub, so it should be working now.
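
One way to spot this kind of template problem before calling generate is to inspect the prompt returned by apply_chat_template and count the visual placeholders in it. The sketch below assumes the placeholder strings are "<image>" and "<video>"; that is an assumption, so check the processor's tokenizer for the actual special tokens. The helper check_placeholders and the example prompt strings are hypothetical and do not reproduce the real chat template.

```python
def check_placeholders(prompt, num_images=0, num_videos=0,
                       image_token="<image>", video_token="<video>"):
    """Return a list of mismatch messages; an empty list means counts agree.

    The default token strings are assumptions made for illustration.
    """
    problems = []
    found_images = prompt.count(image_token)
    found_videos = prompt.count(video_token)
    if found_images != num_images:
        problems.append(
            f"image placeholders in prompt ({found_images}) != images passed ({num_images})"
        )
    if found_videos != num_videos:
        problems.append(
            f"video placeholders in prompt ({found_videos}) != videos passed ({num_videos})"
        )
    return problems


# Hypothetical prompt strings, for illustration only
good_prompt = "USER: <video>\nWhy is this video funny? ASSISTANT:"
bad_prompt = "USER: Why is this video funny? ASSISTANT:"  # placeholder missing

print(check_placeholders(good_prompt, num_videos=1))
print(check_placeholders(bad_prompt, num_videos=1))
```

A prompt with no placeholder at all, as in the second example, corresponds to the "(0) different from num_images (1)" situation in the title of this discussion.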

Dilllllll

Jul 9

@RaushanTurganbay Thanks, the code works.

Dilllllll changed discussion status to closed Jul 9
