---
license: mit
---

CaReBench: A Fine-grained Benchmark for Video Captioning and Retrieval

Yifan Xu, Xinhao Li, Yichun Yang, Desen Meng, Rui Huang, Limin Wang

🤗 Model | 🤗 Data | 📑 Paper

## 📝 Introduction

This checkpoint is CaRe after both Stage-I and Stage-II training. Refer to [our paper](https://arxiv.org/pdf/2501.00513) for details.

## Usage

Loading directly from the Hugging Face remote path has not been tested. It is **recommended** to download this checkpoint to your local environment first to avoid potential bugs.

### For Captioning Tasks

```python
from utils.video import read_frames_decord
from models.modeling_captioners import AutoCaptioner

# Load the captioner from the locally downloaded checkpoint.
captioner = AutoCaptioner.from_pretrained('path/to/checkpoints/CaRe-7B')

# Sample 32 frames from the video.
frames = read_frames_decord(video_path='assets/demo.mp4', num_frames=32)

# Add a batch dimension and generate one description per video.
description = captioner.describe(frames.unsqueeze(0))
print(description[0])
```

### For Retrieval Tasks

```python
from utils.video import read_frames_decord
from models.modeling_encoders import AutoEncoder
from torch.nn.functional import cosine_similarity

# Load the encoder from the locally downloaded checkpoint.
encoder = AutoEncoder.from_pretrained('path/to/checkpoints/CaRe-7B')

# Sample 32 frames from the video.
frames = read_frames_decord(video_path='assets/demo.mp4', num_frames=32)
text = "This video features a man slicing tomatoes in the kitchen."

# Embed the video (with a batch dimension added) and the text query.
vision_emb = encoder.encode_vision(frames.unsqueeze(0))
text_emb = encoder.encode_text(text)

print(f'Vision embedding shape: {vision_emb.shape}')
print(f'Text embedding shape: {text_emb.shape}')
print(f'Cosine similarity: {cosine_similarity(vision_emb, text_emb)}')
```
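
The retrieval snippet above scores a single video against a single sentence. In a retrieval setting you would typically compare one video against several candidate captions and sort them by similarity. The following is a minimal, dependency-free sketch of that ranking step. It assumes the embeddings have already been converted to plain Python lists of floats, and it uses toy 3-dimensional vectors instead of real CaRe embeddings. The helper names `cosine` and `rank_captions` are hypothetical and not part of this repository.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen in this sketch for zero vectors
    return dot / (norm_a * norm_b)


def rank_captions(
    video_emb: Sequence[float],
    caption_embs: Sequence[Sequence[float]],
) -> List[Tuple[int, float]]:
    """Return (caption_index, score) pairs sorted from best to worst match."""
    scores = [(i, cosine(video_emb, emb)) for i, emb in enumerate(caption_embs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings (assumption for illustration).
video = [1.0, 0.0, 1.0]
captions = [
    [0.0, 1.0, 0.0],  # orthogonal to the video vector
    [2.0, 0.0, 2.0],  # same direction as the video vector
    [1.0, 1.0, 0.0],  # partially aligned with the video vector
]

ranking = rank_captions(video, captions)
for index, score in ranking:
    print(f"caption {index}: similarity {score:.3f}")
```

With real embeddings the same ranking is usually computed on batched tensors rather than Python lists; the list-based version here is only meant to make the logic explicit.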