---
license: mit
---

<div align="center">
  <h1 style="margin: 0">
    <img src="assets/logo.png" style="width:1.5em; vertical-align: middle; display: inline-block; margin: 0" alt="Logo">
    <span style="vertical-align: middle; display: inline-block; margin: 0"><b>CaReBench: A Fine-grained Benchmark for Video Captioning and Retrieval</b></span>
  </h1>
  
  <p style="margin: 0">
    Yifan Xu, <a href="https://scholar.google.com/citations?user=evR3uR0AAAAJ">Xinhao Li</a>, Yichun Yang, Desen Meng, Rui Huang, <a href="https://scholar.google.com/citations?user=HEuN8PcAAAAJ">Limin Wang</a>
  </p>
  
  <p align="center">
    🤗 <a href="https://huggingface.co/MCG-NJU/CaRe-7B">Model</a> &nbsp;&nbsp; | &nbsp;&nbsp; 🤗 <a href="https://huggingface.co/datasets/MCG-NJU/CaReBench">Data</a> &nbsp;&nbsp; | &nbsp;&nbsp; 📑 <a href="https://arxiv.org/pdf/2501.00513">Paper</a>
  </p>
</div>


## 📝 Introduction

This is the CaRe model trained through Stage-I and Stage-II. Refer to [our paper](https://arxiv.org/pdf/2501.00513) for details.

## Usage

Loading from the Hugging Face remote path has not been tested. It is **recommended** to download this checkpoint to a local directory to avoid potential bugs.
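
For example, the checkpoint can be fetched with `huggingface_hub` (a minimal sketch; the local directory is illustrative and should match the path used in the snippets below):

```python
from huggingface_hub import snapshot_download

# Download the full CaRe-7B checkpoint to a local directory
# (the target path is illustrative; adjust to your environment)
snapshot_download(repo_id='MCG-NJU/CaRe-7B', local_dir='path/to/checkpoints/CaRe-7B')
```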

### For Captioning Tasks
```python
from utils.video import read_frames_decord
from models.modeling_captioners import AutoCaptioner

# Load the captioner from the local checkpoint directory
captioner = AutoCaptioner.from_pretrained('path/to/checkpoints/CaRe-7B')
# Sample 32 frames from the video
frames = read_frames_decord(video_path='assets/demo.mp4', num_frames=32)
# Add a batch dimension and generate the description
description = captioner.describe(frames.unsqueeze(0))
print(description[0])
```
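
To caption several clips in one call, one option is to batch the sampled frames. A minimal sketch, assuming `describe` accepts a batch larger than one and that each clip yields a frame tensor of the same shape (the second video path is hypothetical):

```python
import torch

video_paths = ['assets/demo.mp4', 'assets/demo2.mp4']  # second path is hypothetical
# Stack per-video frame tensors into a batch of shape (batch, num_frames, C, H, W)
batch = torch.stack([read_frames_decord(video_path=p, num_frames=32) for p in video_paths])
descriptions = captioner.describe(batch)
for path, desc in zip(video_paths, descriptions):
    print(f'{path}: {desc}')
```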

### For Retrieval Tasks
```python
from utils.video import read_frames_decord
from models.modeling_encoders import AutoEncoder
from torch.nn.functional import cosine_similarity

# Load the encoder from the local checkpoint directory
encoder = AutoEncoder.from_pretrained('path/to/checkpoints/CaRe-7B')
# Sample 32 frames from the video
frames = read_frames_decord(video_path='assets/demo.mp4', num_frames=32)
text = "This video features a man slicing tomatoes in the kitchen."
# Embed the video (with a batch dimension) and the text
vision_emb = encoder.encode_vision(frames.unsqueeze(0))
text_emb = encoder.encode_text(text)
print(f'Vision embedding shape: {vision_emb.shape}')
print(f'Text embedding shape: {text_emb.shape}')
print(f'Cosine similarity: {cosine_similarity(vision_emb, text_emb)}')
```
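
The same embeddings can rank multiple candidate captions against a video. A minimal sketch reusing `encoder` and `vision_emb` from the snippet above; the distractor caption is hypothetical, and it assumes `encode_text` returns a `(1, dim)` tensor per string:

```python
import torch

candidates = [
    "This video features a man slicing tomatoes in the kitchen.",
    "A dog chases a ball across a park.",  # hypothetical distractor
]
# Encode each candidate caption and stack into a (num_candidates, dim) matrix
text_embs = torch.cat([encoder.encode_text(t) for t in candidates], dim=0)
# Cosine similarity of the video embedding against every candidate
scores = cosine_similarity(vision_emb, text_embs)
best = scores.argmax().item()
print(f'Best match: "{candidates[best]}" (score {scores[best]:.4f})')
```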