Video-CCAM-4B-v1.2
Video-CCAM-4B-v1.2 is a lightweight Video-MLLM developed by the TencentQQ Multimedia Research Team, built upon Phi-3.5-mini-instruct and SigLIP SO400M. Compared to previous versions, it achieves better performance on public benchmarks and supports responses in Chinese.
Inference using Hugging Face Transformers on NVIDIA GPUs. Requirements tested on Python 3.9/3.10.
pip install -U pip torch transformers accelerate peft decord pysubs2 imageio
# flash attention support
pip install flash-attn --no-build-isolation
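flash-attn is optional and only builds on supported GPUs. Below is a minimal sketch (an assumption, not part of the original instructions) of picking the attention backend at load time and falling back to PyTorch SDPA when flash_attn is absent; you can pass the result as attn_implementation instead of the hardcoded value in the example that follows.

import importlib.util

# Assumption: use 'sdpa' when flash-attn is not installed; both values are
# standard Hugging Face Transformers attn_implementation options.
attn_implementation = (
    'flash_attention_2'
    if importlib.util.find_spec('flash_attn') is not None
    else 'sdpa'
)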
import os
import torch
from huggingface_hub import snapshot_download
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer
from eval import load_decord  # frame loader from the Video-CCAM repository (eval.py)
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
# if you have downloaded this model, just replace the following line with your local path
model_path = snapshot_download(repo_id='JaronTHU/Video-CCAM-4B-v1.2')
videoccam = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map='cuda:0',
    attn_implementation='flash_attention_2'
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
image_processor = AutoImageProcessor.from_pretrained(model_path)

messages = [
    [
        {
            'role': 'user',
            'content': '<image>\nDescribe this image in detail.'
        }
    ], [
        {
            'role': 'user',
            # Chinese prompt: "Please describe this video in detail."
            'content': '<video>\n请仔细描述这个视频。'
        }
    ]
]

images = [
    [Image.open('assets/example_image.jpg').convert('RGB')],
    load_decord('assets/example_video.mp4', sample_type='uniform', num_frames=32)
]

response = videoccam.chat(messages, images, tokenizer, image_processor, max_new_tokens=512, do_sample=False)
print(response)
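load_decord is defined in eval.py of the Video-CCAM repository. As a rough sketch of what such a uniform sampler typically does (an assumption, not the repository's exact implementation): it opens the clip with decord, picks num_frames evenly spaced frames, and returns them as PIL images.

import numpy as np
from decord import VideoReader, cpu
from PIL import Image

def uniform_sample_frames(video_path, num_frames=32):
    # Sketch only: decode evenly spaced frames across the whole clip.
    vr = VideoReader(video_path, ctx=cpu(0))
    indices = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    frames = vr.get_batch(indices).asnumpy()  # (num_frames, H, W, 3) uint8 array
    return [Image.fromarray(frame) for frame in frames]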
Please refer to Video-CCAM for more details.
| Benchmark | Video-CCAM-4B | Video-CCAM-4B-v1.1 | Video-CCAM-4B-v1.2 |
| --- | --- | --- | --- |
| MVBench (32 frames) | 57.43 | 62.80 | 66.28 |
| Video-MME (w/o sub, 96 frames) | 49.7 | 50.1 | 51.5 |
| Video-MME (w sub, 96 frames) | 52.8 | 51.2 | 54.5 |
| MLVU (M-Avg, 96 frames) | 57.3 | 56.5 | 61.0 |
| VideoVista (96 frames) | 68.09 | 70.82 | 73.44 |
The project is licensed under the Apache 2.0 License and is restricted to uses that comply with the license agreements of Phi-3.5-mini-instruct and SigLIP SO400M.