---
license: mit
---
# CaReBench: A Fine-grained Benchmark for Video Captioning and Retrieval
Yifan Xu, Xinhao Li, Yichun Yang, Desen Meng, Rui Huang, Limin Wang
## Introduction
This is the CaRe model trained after Stage-I and Stage-II. Refer to our paper for details.
## Usage
Loading from the Hugging Face remote path has not been tested. We recommend downloading this checkpoint to your local environment to avoid potential issues.
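If you have not downloaded the checkpoint yet, a minimal sketch using `huggingface_hub` is shown below; the `repo_id` is a placeholder, so substitute this model's actual repository id.

```python
from huggingface_hub import snapshot_download

# Download the full checkpoint to a local directory.
# NOTE: repo_id below is a placeholder; replace it with this model's actual repo id.
snapshot_download(
    repo_id='your-org/CaRe-7B',
    local_dir='path/to/checkpoints/CaRe-7B',
)
```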
### For Captioning Tasks
```python
from utils.video import read_frames_decord
from models.modeling_captioners import AutoCaptioner

# Load the captioner from the local checkpoint directory.
captioner = AutoCaptioner.from_pretrained('path/to/checkpoints/CaRe-7B')

# Sample 32 frames from the video; unsqueeze(0) adds the batch dimension.
frames = read_frames_decord(video_path='assets/demo.mp4', num_frames=32)
description = captioner.describe(frames.unsqueeze(0))
print(description[0])
```
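Note that `describe` takes a batched tensor of frames and returns one description per video, which is why the demo indexes `description[0]`.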
### For Retrieval Tasks
```python
from utils.video import read_frames_decord
from models.modeling_encoders import AutoEncoder
from torch.nn.functional import cosine_similarity

# Load the encoder from the local checkpoint directory.
encoder = AutoEncoder.from_pretrained('path/to/checkpoints/CaRe-7B')

# Sample 32 frames and encode both modalities into a shared embedding space.
frames = read_frames_decord(video_path='assets/demo.mp4', num_frames=32)
text = "This video features a man slicing tomatoes in the kitchen."
vision_emb = encoder.encode_vision(frames.unsqueeze(0))
text_emb = encoder.encode_text(text)

print(f'Vision embedding shape: {vision_emb.shape}')
print(f'Text embedding shape: {text_emb.shape}')
print(f'Cosine similarity: {cosine_similarity(vision_emb, text_emb)}')
```
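For actual retrieval you would rank a pool of candidate captions by similarity to the video embedding. Continuing from the snippet above, here is a minimal sketch under the assumption that `encode_text` accepts a list of strings and returns one embedding per row; if it only takes a single string, encode each caption in a loop and stack the results. The candidate captions are hypothetical.

```python
import torch
from torch.nn.functional import cosine_similarity

# Hypothetical candidate pool; in practice these come from your dataset.
candidates = [
    "This video features a man slicing tomatoes in the kitchen.",
    "A dog catches a frisbee in the park.",
    "A timelapse of clouds over a city skyline.",
]

# Assumes encode_text handles a list of strings; otherwise encode one by one and stack.
cand_embs = encoder.encode_text(candidates)

# Score every candidate against the single video embedding and rank descending.
scores = cosine_similarity(vision_emb, cand_embs)  # broadcasts the (1, D) video embedding
ranking = torch.argsort(scores, descending=True)

for idx in ranking.tolist():
    print(f'{scores[idx].item():.4f}  {candidates[idx]}')
```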