I really like this model. Does it support video input and streaming output? Do you have any examples?

#1
by michaelj - opened

I really like this model. Does it support video input and streaming output? Do you have any examples?

BytedanceDouyinContent org
edited 3 days ago

Thanks for your question.

The model supports video input but does not support streaming I/O. Note that video data was used only in the pretraining stage; no video was added for SFT, so performance on video benchmarks is only moderate. We may release a video-augmented version later.

Video input works the same way as multi-image input. For video, please use max_num=1:

# multi-image / video conversation
pixel_values1 = load_image('./examples/image1.jpg', max_num=1).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=1).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
# the number of <image> tokens must match the number of images
question = '<image><image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
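
For a video, one way to apply the snippet above is to sample a fixed number of frames and put one <image> token per frame in the question. The following is a minimal sketch of that bookkeeping only, assuming uniform sampling; `sample_frame_indices` and `build_video_question` are hypothetical helpers for illustration and are not part of the model's API. Frame decoding is left out.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_frames: int) -> List[int]:
    """Pick frame indices spread evenly over the video (segment midpoints).

    Hypothetical helper: uniform sampling is an assumption, not something
    the model requires.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    # Never ask for more frames than the video contains.
    num_frames = min(num_frames, total_frames)
    segment = total_frames / num_frames
    return [int(segment * (i + 0.5)) for i in range(num_frames)]


def build_video_question(num_frames: int, prompt: str) -> str:
    """Prefix the prompt with one <image> token per sampled frame."""
    return "<image>" * num_frames + "\n" + prompt


if __name__ == "__main__":
    indices = sample_frame_indices(total_frames=100, num_frames=4)
    question = build_video_question(len(indices), "Describe the video in detail.")
    print(indices)
    print(question)
```

Each sampled frame would then be loaded with max_num=1 and concatenated as in the multi-image example, so that num_patches_list has one entry per frame.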
