BytedanceDouyinContent/SAIL-VL-2B · I really like this model. Does it support video input and streaming output? Do you have any examples?

Thanks for your question.

The model supports video input but does not support streaming I/O. Note that video data was used only in the pretraining stage and no video data was added for SFT, so performance on video benchmarks is only moderate. We may release a video-augmented version later.

Video input is used the same way as multi-image input, with each frame treated as one image. For video, please use max_num=1:

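# multi-image / video conversation
# assumes load_image, model, tokenizer and generation_config are already defined elsewhere
import torch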
pixel_values1 = load_image('./examples/image1.jpg', max_num=1).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=1).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
# the number of <image> tokens must match the number of images
question = '<image><image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
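
For an actual video, the same pattern needs a set of frames and one <image> token per frame. The following is only a rough sketch of that bookkeeping, not code from the model repository: sample_frame_indices and build_video_question are hypothetical helper names, and sampling the middle frame of equal-length segments is just one common choice that I am assuming here.

```python
def sample_frame_indices(total_frames: int, num_segments: int) -> list[int]:
    """Pick the middle frame of each of `num_segments` equal-length segments.

    Hypothetical helper: uniform sampling is an assumption, not something
    the model requires.
    """
    if total_frames <= 0 or num_segments <= 0:
        raise ValueError("total_frames and num_segments must be positive")
    # Never ask for more frames than the video contains.
    num_segments = min(num_segments, total_frames)
    seg_len = total_frames / num_segments
    return [int(seg_len * i + seg_len / 2) for i in range(num_segments)]


def build_video_question(num_frames: int, prompt: str) -> str:
    """Prefix the prompt with one <image> token per frame."""
    return "<image>" * num_frames + "\n" + prompt


if __name__ == "__main__":
    indices = sample_frame_indices(total_frames=100, num_segments=4)
    question = build_video_question(len(indices), "Describe the video in detail.")
    print(indices)
    print(question)
```

Once the chosen frames are saved as image files, each one can be loaded with load_image(path, max_num=1), the results concatenated with torch.cat along dim=0, and num_patches_list built from each tensor's size(0), exactly as in the two-image snippet above. How you decode the video into frames is up to you and is not covered here.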