---
license: mit
---
<div align="center">

<h1 style="margin: 0">
<img src="assets/logo.png" style="width:1.5em; vertical-align: middle; display: inline-block; margin: 0" alt="Logo">
<span style="vertical-align: middle; display: inline-block; margin: 0"><b>CaReBench: A Fine-grained Benchmark for Video Captioning and Retrieval</b></span>
</h1>

<p style="margin: 0">
Yifan Xu, <a href="https://scholar.google.com/citations?user=evR3uR0AAAAJ">Xinhao Li</a>, Yichun Yang, Desen Meng, Rui Huang, <a href="https://scholar.google.com/citations?user=HEuN8PcAAAAJ">Limin Wang</a>
</p>

<p align="center">
🤗 <a href="https://huggingface.co/MCG-NJU/CaRe-7B">Model</a> &nbsp;|&nbsp; 🤗 <a href="https://huggingface.co/datasets/MCG-NJU/CaReBench">Data</a> &nbsp;|&nbsp; 📄 <a href="https://arxiv.org/pdf/2501.00513">Paper</a>
</p>

</div>
## 📖 Introduction

This is the CaRe checkpoint trained through Stage-I and Stage-II. Refer to [our paper](https://arxiv.org/pdf/2501.00513) for details.
## Usage

Loading directly from the Hugging Face remote path is untested. We **recommend** downloading this checkpoint to your local environment first to avoid potential issues.
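For example, the checkpoint can be fetched with the standard `huggingface_hub` client (a minimal sketch; the local directory below is just a placeholder):

```python
from huggingface_hub import snapshot_download

# Download the full CaRe-7B repository to a local directory.
# 'path/to/checkpoints/CaRe-7B' is an example path; adjust to your setup.
snapshot_download(
    repo_id='MCG-NJU/CaRe-7B',
    local_dir='path/to/checkpoints/CaRe-7B',
)
```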
### For Captioning Tasks
|
```python
from utils.video import read_frames_decord
from models.modeling_captioners import AutoCaptioner

# Load the captioning model from a local checkpoint
captioner = AutoCaptioner.from_pretrained('path/to/checkpoints/CaRe-7B')

# Sample 32 frames from the demo video
frames = read_frames_decord(video_path='assets/demo.mp4', num_frames=32)

# Add a batch dimension and generate one caption per video in the batch
description = captioner.describe(frames.unsqueeze(0))
print(description[0])
```
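Since `describe` consumes a batched tensor, several videos can presumably be captioned in a single call. A sketch under that assumption (the second video path is hypothetical):

```python
import torch

from utils.video import read_frames_decord
from models.modeling_captioners import AutoCaptioner

captioner = AutoCaptioner.from_pretrained('path/to/checkpoints/CaRe-7B')

# Stack per-video frame tensors into one batch along dim 0
videos = ['assets/demo.mp4', 'assets/demo2.mp4']  # hypothetical paths
batch = torch.stack([read_frames_decord(video_path=p, num_frames=32) for p in videos])

# One caption per video, in input order
for path, caption in zip(videos, captioner.describe(batch)):
    print(f'{path}: {caption}')
```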
### For Retrieval Tasks
|
```python
from utils.video import read_frames_decord
from models.modeling_encoders import AutoEncoder
from torch.nn.functional import cosine_similarity

# Load the encoder from a local checkpoint
encoder = AutoEncoder.from_pretrained('path/to/checkpoints/CaRe-7B')

# Sample 32 frames from the demo video and write a candidate caption
frames = read_frames_decord(video_path='assets/demo.mp4', num_frames=32)
text = "This video features a man slicing tomatoes in the kitchen."

# Embed the video (with a batch dimension) and the text into a shared space
vision_emb = encoder.encode_vision(frames.unsqueeze(0))
text_emb = encoder.encode_text(text)

print(f'Vision embedding shape: {vision_emb.shape}')
print(f'Text embedding shape: {text_emb.shape}')
print(f'Cosine similarity: {cosine_similarity(vision_emb, text_emb)}')
```
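Retrieval then amounts to ranking candidates by this similarity. Below is a minimal sketch reusing only the calls shown above; the candidate captions are invented for illustration, and it assumes `encode_text` returns an embedding comparable against `vision_emb`:

```python
from utils.video import read_frames_decord
from models.modeling_encoders import AutoEncoder
from torch.nn.functional import cosine_similarity

encoder = AutoEncoder.from_pretrained('path/to/checkpoints/CaRe-7B')
frames = read_frames_decord(video_path='assets/demo.mp4', num_frames=32)
vision_emb = encoder.encode_vision(frames.unsqueeze(0))

# Hypothetical candidate captions; a real benchmark would supply these
candidates = [
    "This video features a man slicing tomatoes in the kitchen.",
    "A dog runs across a snowy field.",
    "Someone repairs a bicycle in a garage.",
]

# Score every caption against the video, then sort from best to worst
scores = [cosine_similarity(vision_emb, encoder.encode_text(t)).item() for t in candidates]
for score, caption in sorted(zip(scores, candidates), reverse=True):
    print(f'{score:.4f}  {caption}')
```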