---
license: mit
---
<div align="center">

<h1 style="margin: 0">
<img src="assets/logo.png" style="width:1.5em; vertical-align: middle; display: inline-block; margin: 0" alt="Logo">
<span style="vertical-align: middle; display: inline-block; margin: 0"><b>CaReBench: A Fine-grained Benchmark for Video Captioning and Retrieval</b></span>
</h1>

<p style="margin: 0">
Yifan Xu, <a href="https://scholar.google.com/citations?user=evR3uR0AAAAJ">Xinhao Li</a>, Yichun Yang, Desen Meng, Rui Huang, <a href="https://scholar.google.com/citations?user=HEuN8PcAAAAJ">Limin Wang</a>
</p>

<p align="center">
🤗 <a href="https://huggingface.co/MCG-NJU/CaRe-7B">Model</a> &nbsp;|&nbsp; 🤗 <a href="https://huggingface.co/datasets/MCG-NJU/CaReBench">Data</a> &nbsp;|&nbsp; 📄 <a href="https://arxiv.org/pdf/2501.00513">Paper</a>
</p>

</div>
## 📖 Introduction

This is the CaRe checkpoint trained through Stage-I and Stage-II. Refer to [our paper](https://arxiv.org/pdf/2501.00513) for details.
## Usage

Loading directly from the Hugging Face remote path is untested. We **recommend** downloading this checkpoint to your local environment first to avoid potential issues.
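For example, the checkpoint can be fetched with the standard `huggingface_hub` client (a minimal sketch; the local directory below is just a placeholder):

```python
from huggingface_hub import snapshot_download

# Download the full CaRe-7B repository to a local directory.
# 'path/to/checkpoints/CaRe-7B' is an example path; adjust to your setup.
snapshot_download(
    repo_id='MCG-NJU/CaRe-7B',
    local_dir='path/to/checkpoints/CaRe-7B',
)
```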
### For Captioning Tasks
|
```python
from utils.video import read_frames_decord
from models.modeling_captioners import AutoCaptioner

# Load the captioning model from a local checkpoint
captioner = AutoCaptioner.from_pretrained('path/to/checkpoints/CaRe-7B')

# Sample 32 frames from the demo video
frames = read_frames_decord(video_path='assets/demo.mp4', num_frames=32)

# Add a batch dimension and generate one caption per video in the batch
description = captioner.describe(frames.unsqueeze(0))
print(description[0])
```
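Since `describe` consumes a batched tensor, several videos can presumably be captioned in a single call. A sketch under that assumption (the second video path is hypothetical):

```python
import torch

from utils.video import read_frames_decord
from models.modeling_captioners import AutoCaptioner

captioner = AutoCaptioner.from_pretrained('path/to/checkpoints/CaRe-7B')

# Stack per-video frame tensors into one batch along dim 0
videos = ['assets/demo.mp4', 'assets/demo2.mp4']  # hypothetical paths
batch = torch.stack([read_frames_decord(video_path=p, num_frames=32) for p in videos])

# One caption per video, in input order
for path, caption in zip(videos, captioner.describe(batch)):
    print(f'{path}: {caption}')
```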
### For Retrieval Tasks
|
```python
from utils.video import read_frames_decord
from models.modeling_encoders import AutoEncoder
from torch.nn.functional import cosine_similarity

# Load the encoder from a local checkpoint
encoder = AutoEncoder.from_pretrained('path/to/checkpoints/CaRe-7B')

# Sample 32 frames from the demo video and write a candidate caption
frames = read_frames_decord(video_path='assets/demo.mp4', num_frames=32)
text = "This video features a man slicing tomatoes in the kitchen."

# Embed the video (with a batch dimension) and the text into a shared space
vision_emb = encoder.encode_vision(frames.unsqueeze(0))
text_emb = encoder.encode_text(text)

print(f'Vision embedding shape: {vision_emb.shape}')
print(f'Text embedding shape: {text_emb.shape}')
print(f'Cosine similarity: {cosine_similarity(vision_emb, text_emb)}')
```
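Retrieval then amounts to ranking candidates by this similarity. Below is a minimal sketch reusing only the calls shown above; the candidate captions are invented for illustration, and it assumes `encode_text` returns an embedding comparable against `vision_emb`:

```python
from utils.video import read_frames_decord
from models.modeling_encoders import AutoEncoder
from torch.nn.functional import cosine_similarity

encoder = AutoEncoder.from_pretrained('path/to/checkpoints/CaRe-7B')
frames = read_frames_decord(video_path='assets/demo.mp4', num_frames=32)
vision_emb = encoder.encode_vision(frames.unsqueeze(0))

# Hypothetical candidate captions; a real benchmark would supply these
candidates = [
    "This video features a man slicing tomatoes in the kitchen.",
    "A dog runs across a snowy field.",
    "Someone repairs a bicycle in a garage.",
]

# Score every caption against the video, then sort from best to worst
scores = [cosine_similarity(vision_emb, encoder.encode_text(t)).item() for t in candidates]
for score, caption in sorted(zip(scores, candidates), reverse=True):
    print(f'{score:.4f}  {caption}')
```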